# Project

## Data Cleaning and Imputation

We begin by loading the dataset and displaying the first few matches to confirm that the data loaded correctly.

In [4]:
import pandas as pd

df = pd.read_csv("data.csv")
df.head(5)

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,2.0,0.0,0.0,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0.0,...,0.0,2.0,0.0,5.0,Southpaw,165.1,167.64,125.0,32.0,31.0
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,3.0,6.0,1.0,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,1.0,0.0,0.0,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,2.0,0.0,0.0,3.0,Southpaw,187.96,190.5,264.0,32.0,26.0


### Missing Stances and Referee

Some matches feature fighters with missing stance information or a missing referee. Since these account for a small share of the dataset (279 of 5,144 matches, about 5%), we drop them from consideration.

In [5]:
# Show number of matches in dataset before removing matches with missing stance or referee information
print('Number of matches prior to filtering: ' + str(len(df)))

# Remove matches with missing stance or referee information
filter1 = df[df['B_Stance'].notnull()]
filter2 = filter1[filter1['R_Stance'].notnull()]
filter3 = filter2[filter2['Referee'].notnull()]
df = filter3
print('Number of matches after filtering: ' + str(len(filter3)))

Number of matches prior to filtering: 5144
Number of matches after filtering: 4865
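
The three chained filters above can also be expressed as a single `dropna` call. The following is a minimal sketch on a made-up four-row frame; the name `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with one gap in each of the three columns
toy = pd.DataFrame({
    "B_Stance": ["Orthodox", None, "Southpaw", "Orthodox"],
    "R_Stance": ["Orthodox", "Southpaw", None, "Southpaw"],
    "Referee": ["Marc Goddard", "Robert Madrigal", "Dan Miragliotta", None],
})

# Keep only the rows where all three columns are present
required = ["B_Stance", "R_Stance", "Referee"]
cleaned = toy.dropna(subset=required)

print("Rows before:", len(toy))
print("Rows after: ", len(cleaned))
```

Listing the required columns in one place makes it easier to see exactly which fields a match must have to be kept.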


### Missing Numerical Data

Many rows are missing numerical data in certain columns. We fill each missing value with the median of its column.

First, we list the columns that contain missing values so that we can later verify that the imputation worked.

In [6]:
df.columns[df.isnull().any()]

Index(['B_avg_BODY_att', 'B_avg_BODY_landed', 'B_avg_CLINCH_att',
       'B_avg_CLINCH_landed', 'B_avg_DISTANCE_att', 'B_avg_DISTANCE_landed',
       'B_avg_GROUND_att', 'B_avg_GROUND_landed', 'B_avg_HEAD_att',
       'B_avg_HEAD_landed',
       ...
       'R_avg_opp_SUB_ATT', 'R_avg_opp_TD_att', 'R_avg_opp_TD_landed',
       'R_avg_opp_TD_pct', 'R_avg_opp_TOTAL_STR_att',
       'R_avg_opp_TOTAL_STR_landed', 'R_total_time_fought(seconds)',
       'R_Reach_cms', 'B_age', 'R_age'],
      dtype='object', length=104)

We select R_age for our demonstration. Now we find a few rows that have a missing R_age.

In [7]:
people = df[df['R_age'].isnull()].head(5)
people

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
4171,Per Eklund,Samy Schiavo,Leon Roberts,2008-10-18,"Birmingham, England, United Kingdom",Red,False,Lightweight,3,1.0,...,0.0,0.0,0.0,0.0,Orthodox,177.8,182.88,155.0,32.0,
4376,Jess Liaudin,Anthony Torres,Mario Yamasaki,2007-09-08,"London, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,1.0,0.0,1.0,Orthodox,175.26,182.88,170.0,29.0,
4438,Jess Liaudin,Dennis Siver,Steve Mazzagatti,2007-04-21,"Manchester, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,175.26,182.88,170.0,28.0,
4767,Keith Rockel,Chris Liguori,John McCarthy,2003-11-21,"Uncasville, Connecticut, USA",Red,False,Middleweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,182.88,,185.0,,
4908,Ben Earwood,Chris Lytle,Mario Yamasaki,2000-11-17,"Atlantic City, New Jersey, USA",Red,False,Welterweight,2,0.0,...,0.0,0.0,0.0,0.0,Orthodox,172.72,,170.0,26.0,


Now we find the median for that column.

In [8]:
df['R_age'].median()

29.0

Now we apply fillna to the entire dataset, and show that the column is filled in with the median.

In [9]:
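# Medians are computed over the numeric columns only; text columns are left untouched
df = df.fillna(df.median(numeric_only=True))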
df.loc[people.index.tolist()]

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
4171,Per Eklund,Samy Schiavo,Leon Roberts,2008-10-18,"Birmingham, England, United Kingdom",Red,False,Lightweight,3,1.0,...,0.0,0.0,0.0,0.0,Orthodox,177.8,182.88,155.0,32.0,29.0
4376,Jess Liaudin,Anthony Torres,Mario Yamasaki,2007-09-08,"London, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,1.0,0.0,1.0,Orthodox,175.26,182.88,170.0,29.0,29.0
4438,Jess Liaudin,Dennis Siver,Steve Mazzagatti,2007-04-21,"Manchester, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,175.26,182.88,170.0,28.0,29.0
4767,Keith Rockel,Chris Liguori,John McCarthy,2003-11-21,"Uncasville, Connecticut, USA",Red,False,Middleweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,182.88,185.42,185.0,29.0,29.0
4908,Ben Earwood,Chris Lytle,Mario Yamasaki,2000-11-17,"Atlantic City, New Jersey, USA",Red,False,Welterweight,2,0.0,...,0.0,0.0,0.0,0.0,Orthodox,172.72,185.42,170.0,26.0,29.0
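
Median imputation only applies to numeric columns. The sketch below shows the mechanics on a small made-up frame (the name `toy` and its values are assumptions for illustration). It passes `numeric_only=True` so that the text column is skipped when the medians are computed; recent pandas versions raise an error without it when the frame also holds text columns.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two numeric columns with gaps, one text column
toy = pd.DataFrame({
    "R_age": [32.0, np.nan, 29.0, 26.0],
    "R_Reach_cms": [162.56, 182.88, np.nan, 190.5],
    "R_Stance": ["Orthodox", "Southpaw", "Orthodox", "Southpaw"],
})

# Column-wise medians of the numeric columns; NaN values are skipped by default
medians = toy.median(numeric_only=True)

# Passing a Series to fillna fills each column with the value at the matching label
filled = toy.fillna(medians)

print(medians)
print(filled)
```

Because `fillna` matches on column labels, each column receives its own median rather than one global value.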


### Dimensionality Reduction: PCA

We apply principal component analysis (PCA) to obtain an ordered list of components that account for the largest variance in the dataset, with the ultimate goal of grouping similar fighters based on their fighting styles.

First, we group the matches by weight class.

In [38]:
df_group=df.groupby(['weight_class'])
print(df_group.size())


weight_class
Bantamweight             344
Catch Weight              37
Featherweight            417
Flyweight                184
Heavyweight              497
Light Heavyweight        493
Lightweight              931
Middleweight             709
Open Weight               78
Welterweight             936
Women's Bantamweight      93
Women's Featherweight      8
Women's Flyweight         31
Women's Strawweight      107
dtype: int64


In [39]:
df_group_bantamweight=df_group.get_group('Bantamweight')

In [43]:
df_group_Catch_Weight=df_group.get_group('Catch Weight')


In [44]:
df_group_Featherweight=df_group.get_group('Featherweight')


In [45]:
df_group_Flyweight=df_group.get_group('Flyweight')


In [46]:
df_group_Heavyweight=df_group.get_group('Heavyweight')


In [48]:
df_group_Light_Heavyweight=df_group.get_group('Light Heavyweight')


In [49]:
df_group_Lightweight=df_group.get_group('Lightweight')


In [None]:
df_group_Middleweight=df_group.get_group('Middleweight')


In [None]:
# The group key must match the label in the data exactly: 'Open Weight', with a space
df_group_Open_Weight=df_group.get_group('Open Weight')


In [50]:
df_group_Welterweight=df_group.get_group('Welterweight')


In [51]:
df_group_Womens_Bantamweight=df_group.get_group("Women's Bantamweight")


In [52]:
df_group_Womens_Featherweight=df_group.get_group("Women's Featherweight")


In [53]:
df_group_Womens_Flyweight=df_group.get_group("Women's Flyweight")


In [54]:
df_group_Womens_Strawweight=df_group.get_group("Women's Strawweight")
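
The per-class variables above could also be collected in a single dictionary, which avoids mistyped group labels because the keys come straight from the data. This is a minimal sketch on a made-up frame; `toy` and `frames_by_class` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the match table
toy = pd.DataFrame({
    "weight_class": ["Bantamweight", "Open Weight", "Bantamweight", "Women's Flyweight"],
    "no_of_rounds": [5, 3, 3, 5],
})

# One sub-frame per weight class, keyed by the exact label found in the data
frames_by_class = {name: group for name, group in toy.groupby("weight_class")}

for name, group in frames_by_class.items():
    print(name, len(group))
```

A class is then looked up with `frames_by_class["Open Weight"]` rather than through a separately named variable.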


In [82]:
#pd.set_option('display.max_columns', 999)
#pd.set_option('display.max_rows', 999)
from sklearn import decomposition

# Drop the identifier, categorical, and label columns, then project the
# Bantamweight matches onto 6 principal components
df_bantamweight_new=df_group_bantamweight.drop(columns=["R_fighter","B_fighter","Referee","date","location","Winner","title_bout","weight_class","B_Stance","R_Stance"])
pca = decomposition.PCA(n_components=6)
X_pca_bantamweight=pca.fit_transform(df_bantamweight_new)
print(len(X_pca_bantamweight))

344
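
To judge how many components are worth keeping, a fitted `PCA` object exposes `explained_variance_ratio_`, the fraction of total variance captured by each component. The sketch below uses synthetic data as a stand-in (the random seed, shape, and column scales are assumptions), so the numbers say nothing about the fight data itself.

```python
import numpy as np
from sklearn import decomposition

# Synthetic stand-in data (an assumption): 200 rows, 10 features with unequal spread
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.arange(1, 11)

pca = decomposition.PCA(n_components=6)
X_pca = pca.fit_transform(X)

# Fraction of total variance captured by each component, largest first
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative)):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
```

The cumulative column shows how much of the variance is retained when only the first few components are kept.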


In [None]:
# Scree plot: fraction of variance explained by each principal component
# of the most recent PCA fit (the 6-component Bantamweight fit above)
import numpy as np
import matplotlib.pyplot as plt

ratios = pca.explained_variance_ratio_
ind = np.arange(len(ratios))  # the x locations for the bars

fig, ax = plt.subplots(figsize=(10, 6), dpi=250)
ax.bar(ind, ratios, 0.35)

# Label each bar with its share of the variance, as a whole percentage
for i, r in zip(ind, ratios):
    ax.annotate("%d%%" % int(r * 100), (i, r), va="bottom", ha="center", fontsize=12)

ax.set_xticks(ind)
ax.set_xlabel("Principal Component", fontsize=12)
ax.set_ylabel("Fraction of Variance Explained", fontsize=12)
ax.set_title("Scree Plot for the Bantamweight PCA", fontsize=16)

In [77]:
#pd.set_option('display.max_columns', 999)
#pd.set_option('display.max_rows', 999)

df_Catch_Weight_new=df_group_Catch_Weight.drop(columns=["R_fighter","B_fighter","Referee","date","location","Winner","title_bout","weight_class","B_Stance","R_Stance"])
print(len(df_Catch_Weight_new.columns))
# There are only 37 Catch Weight matches, so after centering at most 36 components
# carry any variance; the last column of the output is numerically zero
pca = decomposition.PCA(n_components=37)
X_pca_Catch_Weight=pca.fit_transform(df_Catch_Weight_new)
X_pca_Catch_Weight


135


array([[ 2.89249512e+02, -2.32275466e+01, -5.45931429e+01, ...,
         6.96609118e-01, -1.86766329e-01,  3.51411872e-14],
       [ 6.06048679e+02,  7.99798506e+01,  9.73351333e+01, ...,
        -2.91270027e-01,  7.80279379e-02,  3.51411872e-14],
       [ 4.11763514e+02,  1.05604514e+02, -1.16587505e+02, ...,
        -9.31724746e-02,  1.07391449e-01,  3.51411872e-14],
       ...,
       [-1.88035006e+02, -3.84972635e+02,  9.83014706e+00, ...,
        -9.57200213e-01, -2.00616831e-01,  3.51411872e-14],
       [-5.29041302e+02,  5.80219242e+00,  2.80103230e+01, ...,
         2.77921935e+00,  5.95893375e-01,  3.51411872e-14],
       [-5.65555708e+02,  1.34973065e+01,  3.28074175e+01, ...,
         3.22375693e-01, -2.59958759e+00,  3.51411872e-14]])

In [69]:
df_group_Featherweight_new=df_group_Featherweight.drop(columns=["R_fighter","B_fighter","Referee","date","location","Winner","title_bout","weight_class","B_Stance","R_Stance"])
pca = decomposition.PCA(n_components=3)
X_pca_group_Featherweight=pca.fit_transform(df_group_Featherweight_new)
X_pca_group_Featherweight


array([[ -31.20330593, -134.28014541,  153.73280311],
       [-234.09732261,  282.34696332,   13.23141326],
       [  20.30165943,   48.8114748 ,   -8.47870559],
       ...,
       [ -65.85370714,  -65.23508821,  -63.66848371],
       [ -86.85773831,   21.14553429,  -46.23532435],
       [ -86.82044269,   21.29071877,  -46.13478159]])
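
The same drop-and-fit steps recur for each weight class, so they could be wrapped in one helper and applied in a loop. The following is a sketch only: `pca_for_group` is a hypothetical helper, the data is synthetic, and `select_dtypes(include="number")` is used here in place of the explicit list of dropped columns, on the assumption that all remaining columns are numeric.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

def pca_for_group(frame, n_components):
    """Run PCA on the numeric columns of `frame` (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    # After centering, at most n_samples - 1 components carry variance,
    # so the requested count is capped for small groups
    k = min(n_components, numeric.shape[0] - 1, numeric.shape[1])
    pca = decomposition.PCA(n_components=k)
    return pca.fit_transform(numeric), pca

# Synthetic stand-in data (an assumption): 12 rows, 5 numeric features, 2 classes
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(12, 5)), columns=list("abcde"))
toy["weight_class"] = ["Flyweight"] * 8 + ["Heavyweight"] * 4

projections = {}
for name, group in toy.groupby("weight_class"):
    X_group, _ = pca_for_group(group, n_components=4)
    projections[name] = X_group
    print(name, X_group.shape)
```

Capping the component count matters for small classes: with only a handful of matches, asking for as many components as rows yields trailing components that are numerically zero.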

In [57]:
df_new=df.drop(columns=["R_fighter","B_fighter","Referee","date","location","Winner","title_bout","weight_class","B_Stance","R_Stance"])

In [26]:
from sklearn import (cluster, datasets, decomposition, ensemble, manifold, random_projection)

pca = decomposition.PCA(n_components=4)
X_pca=pca.fit_transform(df_new)
X_pca

array([[ 3.78933703e+01,  2.35715566e+02,  5.96958272e+01,
        -7.12575349e+00],
       [ 5.77895484e+02,  2.80375709e+01,  2.25171526e+01,
         7.26009099e+01],
       [ 5.10230400e+01,  7.70617163e+00,  1.36288831e+02,
        -3.01217769e+01],
       ...,
       [ 6.91667931e+00, -1.86914222e+00, -3.04090404e+01,
         1.71373027e+00],
       [-2.14425266e+00, -1.17340570e+00, -4.73642654e+01,
         8.26590327e+00],
       [-9.06997567e+00, -1.91545598e-01, -6.01375664e+01,
         1.40030517e+01]])
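
Because PCA ranks directions by variance, columns measured on large scales can dominate the leading components. The notebook applies PCA to the unscaled columns; the sketch below shows, on synthetic data, how standardizing first changes the picture. The data, the seed, and the choice of `StandardScaler` are assumptions for illustration, not the method used above.

```python
import numpy as np
from sklearn import decomposition
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (an assumption): one column on a much larger scale than the rest
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 0] *= 1000.0

# PCA on the raw values: the large-scale column dominates the first component
raw = decomposition.PCA(n_components=2).fit(X)

# PCA after rescaling every column to zero mean and unit variance
scaled = make_pipeline(StandardScaler(), decomposition.PCA(n_components=2))
scaled.fit(X)
scaled_pca = scaled.named_steps["pca"]

print("unscaled:", raw.explained_variance_ratio_)
print("scaled:  ", scaled_pca.explained_variance_ratio_)
```

Whether to standardize depends on whether the original units should influence the components, so this is a choice to make deliberately rather than a required step.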

In [14]:
column_names=df.columns

In [12]:
# A DataFrame has no `data` attribute; to_numpy() returns the underlying values as an array
df.to_numpy()