# Project

## Data Cleaning and Imputation

We begin by loading the dataset and displaying the first few matches to confirm that it loaded correctly.

In [3]:
import pandas as pd

df = pd.read_csv("data.csv")
df.head(5)

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,2.0,0.0,0.0,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0.0,...,0.0,2.0,0.0,5.0,Southpaw,165.1,167.64,125.0,32.0,31.0
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,3.0,6.0,1.0,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,1.0,0.0,0.0,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,2.0,0.0,0.0,3.0,Southpaw,187.96,190.5,264.0,32.0,26.0


### Missing Stances and Referee

Some matches involve fighters with missing stance information, or have no referee recorded. Because these make up a small share of the dataset (279 of 5,144 matches, about 5%), we drop them from consideration.

In [4]:
# Show the number of matches before removing rows with missing stance or referee information
print('Number of matches prior to filtering: ' + str(len(df)))

# Remove matches with a missing stance (either fighter) or a missing referee
filter1 = df[df['B_Stance'].notnull()]
filter2 = filter1[filter1['R_Stance'].notnull()]
filter3 = filter2[filter2['Referee'].notnull()]
df = filter3
print('Number of matches after filtering: ' + str(len(filter3)))

Number of matches prior to filtering: 5144
Number of matches after filtering: 4865
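
The three chained filters can also be written as a single `dropna` call restricted to the relevant columns. The following is a minimal sketch on a made-up four-row frame; the referee names and ages are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame; all values are invented for illustration only.
toy = pd.DataFrame({
    "R_Stance": ["Orthodox", None, "Southpaw", "Orthodox"],
    "B_Stance": ["Southpaw", "Orthodox", None, "Orthodox"],
    "Referee": ["Ref A", "Ref B", "Ref C", None],
    "R_age": [30.0, None, 28.0, 31.0],
})

# Drop a row only if one of the listed columns is missing;
# gaps in other columns (here R_age) do not cause a drop.
required = ["B_Stance", "R_Stance", "Referee"]
kept = toy.dropna(subset=required)

print(len(toy), len(kept))
```

Restricting `subset` matters here: a plain `dropna()` would also discard every row with a missing numerical value, which the next section imputes instead.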


### Missing Numerical Data

Many rows contain missing values in some numerical columns. We fill each gap with the median of its column.

First, we list the columns that contain missing values, so that we can later demonstrate that the imputation worked.

In [5]:
df.columns[df.isnull().any()]

Index(['B_avg_BODY_att', 'B_avg_BODY_landed', 'B_avg_CLINCH_att',
       'B_avg_CLINCH_landed', 'B_avg_DISTANCE_att', 'B_avg_DISTANCE_landed',
       'B_avg_GROUND_att', 'B_avg_GROUND_landed', 'B_avg_HEAD_att',
       'B_avg_HEAD_landed',
       ...
       'R_avg_opp_SUB_ATT', 'R_avg_opp_TD_att', 'R_avg_opp_TD_landed',
       'R_avg_opp_TD_pct', 'R_avg_opp_TOTAL_STR_att',
       'R_avg_opp_TOTAL_STR_landed', 'R_total_time_fought(seconds)',
       'R_Reach_cms', 'B_age', 'R_age'],
      dtype='object', length=104)
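
Beyond listing the affected columns, it can help to see how many values each one is missing. A minimal sketch on an invented frame (the column names mirror the dataset, but the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "R_age": [32.0, np.nan, 29.0, np.nan],
    "B_age": [31.0, 26.0, np.nan, 28.0],
    "R_wins": [8.0, 5.0, 14.0, 6.0],
})

# Count missing values per column, keep only columns with at least one gap,
# and list the worst-affected columns first.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

print(missing_counts)
```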

We select R_age for our demonstration. Now we find a few rows that have a missing R_age.

In [6]:
people = df[df['R_age'].isnull()].head(5)
people

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
4171,Per Eklund,Samy Schiavo,Leon Roberts,2008-10-18,"Birmingham, England, United Kingdom",Red,False,Lightweight,3,1.0,...,0.0,0.0,0.0,0.0,Orthodox,177.8,182.88,155.0,32.0,
4376,Jess Liaudin,Anthony Torres,Mario Yamasaki,2007-09-08,"London, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,1.0,0.0,1.0,Orthodox,175.26,182.88,170.0,29.0,
4438,Jess Liaudin,Dennis Siver,Steve Mazzagatti,2007-04-21,"Manchester, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,175.26,182.88,170.0,28.0,
4767,Keith Rockel,Chris Liguori,John McCarthy,2003-11-21,"Uncasville, Connecticut, USA",Red,False,Middleweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,182.88,,185.0,,
4908,Ben Earwood,Chris Lytle,Mario Yamasaki,2000-11-17,"Atlantic City, New Jersey, USA",Red,False,Welterweight,2,0.0,...,0.0,0.0,0.0,0.0,Orthodox,172.72,,170.0,26.0,


Now we find the median for that column.

In [7]:
df['R_age'].median()

29.0

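Now we apply fillna to the entire dataset, filling each numeric column with its own median, and show that the missing R_age values have been replaced by the median.

In [8]:
# numeric_only=True skips text columns such as fighter names, which have no median
df = df.fillna(df.median(numeric_only=True))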
df.loc[people.index.tolist()]

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
4171,Per Eklund,Samy Schiavo,Leon Roberts,2008-10-18,"Birmingham, England, United Kingdom",Red,False,Lightweight,3,1.0,...,0.0,0.0,0.0,0.0,Orthodox,177.8,182.88,155.0,32.0,29.0
4376,Jess Liaudin,Anthony Torres,Mario Yamasaki,2007-09-08,"London, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,1.0,0.0,1.0,Orthodox,175.26,182.88,170.0,29.0,29.0
4438,Jess Liaudin,Dennis Siver,Steve Mazzagatti,2007-04-21,"Manchester, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,175.26,182.88,170.0,28.0,29.0
4767,Keith Rockel,Chris Liguori,John McCarthy,2003-11-21,"Uncasville, Connecticut, USA",Red,False,Middleweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,182.88,185.42,185.0,29.0,29.0
4908,Ben Earwood,Chris Lytle,Mario Yamasaki,2000-11-17,"Atlantic City, New Jersey, USA",Red,False,Welterweight,2,0.0,...,0.0,0.0,0.0,0.0,Orthodox,172.72,185.42,170.0,26.0,29.0
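
The same imputation can be checked on a small invented frame where the medians are easy to verify by hand. This is a sketch under the assumption that only numeric columns should be imputed; the fighter labels and values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "R_fighter": ["A", "B", "C", "D"],
    "R_Stance": ["Orthodox", "Southpaw", "Orthodox", "Orthodox"],
    "R_age": [26.0, np.nan, 29.0, 35.0],
    "R_Reach_cms": [180.0, 190.0, np.nan, np.nan],
})

# Medians are computed on numeric columns only and ignore NaN by default.
medians = toy.median(numeric_only=True)

# Passing a Series to fillna fills each column with the value under its own label.
filled = toy.fillna(medians)

print(filled)
```

Here the median of `R_age` is taken over 26, 29 and 35, and the median of `R_Reach_cms` over 180 and 190, so each column receives its own fill value rather than one global constant.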


### Dimensionality Reduction-PCA

We apply dimensionality reduction to obtain an ordered list of components that account for the largest variance in the dataset. The ultimate goal is to group similar fighters based on their fighting styles.
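
A minimal sketch of what such a step might look like, assuming scikit-learn's `PCA` applied to standardized features. The data below is synthetic and the choice of two components is arbitrary; neither comes from the fighter dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the fighter statistics: 200 rows, 4 features,
# where the last two features are noisy copies of the first two.
base = rng.normal(size=(200, 2))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 2))])

# Standardize first so no feature dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component, largest first.
print(pca.explained_variance_ratio_)
print(components.shape)
```

Because the four synthetic features carry only two independent signals, two components retain almost all of the variance; on real statistics the ratios would indicate how many components are worth keeping.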



First, we select the columns of interest (the fighter's name, weight class, and numeric statistics) and keep one row per fighter.

In [46]:
# Use only the B_fighter (usually the underdog) and the attacks that actually landed;
# missed attempts say little about which fighters to recommend.

view = df[["B_fighter", "weight_class", "R_current_lose_streak", "B_current_win_streak", "B_draw", "B_avg_BODY_landed",
           "B_avg_CLINCH_landed", "B_avg_DISTANCE_landed", "B_avg_GROUND_landed", "B_avg_HEAD_landed", "B_avg_KD", 
           "B_avg_LEG_landed", "B_avg_PASS", "B_avg_REV", "B_avg_SIG_STR_landed", "B_avg_SIG_STR_pct", "B_avg_SUB_ATT", 
           "B_avg_TD_landed", "B_avg_TD_pct", "B_avg_TOTAL_STR_landed", "B_longest_win_streak", "B_losses", 
           "B_avg_opp_BODY_landed", "B_avg_opp_CLINCH_landed", "B_avg_opp_DISTANCE_landed", "B_avg_opp_GROUND_landed", 
           "B_avg_opp_HEAD_landed", "B_avg_opp_KD", "B_avg_opp_LEG_landed", "B_avg_opp_PASS", "B_avg_opp_REV", 
           "B_avg_opp_SIG_STR_landed", "B_avg_opp_SIG_STR_pct", "B_avg_opp_TD_landed", "B_avg_opp_TD_pct",
           "B_avg_opp_TOTAL_STR_landed", "B_total_title_bouts", "B_win_by_Decision_Majority", "B_win_by_Decision_Split", 
           "B_win_by_Decision_Unanimous", "B_win_by_KO/TKO", "B_win_by_Submission", "B_win_by_TKO_Doctor_Stoppage", 
           "B_wins"]]



# Keep one row per fighter: the first row in which each B_fighter appears
newView = view.drop_duplicates(subset = "B_fighter", keep = "first")

newView.head()

Unnamed: 0,B_fighter,weight_class,R_current_lose_streak,B_current_win_streak,B_draw,B_avg_BODY_landed,B_avg_CLINCH_landed,B_avg_DISTANCE_landed,B_avg_GROUND_landed,B_avg_HEAD_landed,...,B_avg_opp_TD_pct,B_avg_opp_TOTAL_STR_landed,B_total_title_bouts,B_win_by_Decision_Majority,B_win_by_Decision_Split,B_win_by_Decision_Unanimous,B_win_by_KO/TKO,B_win_by_Submission,B_win_by_TKO_Doctor_Stoppage,B_wins
0,Marlon Moraes,Bantamweight,0.0,4.0,0.0,6.0,0.0,20.6,2.0,11.2,...,0.1,19.2,0.0,0.0,1.0,0.0,2.0,1.0,0.0,4.0
1,Jessica Eye,Women's Flyweight,0.0,3.0,0.0,9.1,7.3,42.1,1.9,32.0,...,0.231,75.4,0.0,0.0,2.0,1.0,0.0,0.0,1.0,4.0
2,Donald Cerrone,Lightweight,0.0,3.0,0.0,11.322581,4.387097,38.580645,3.806452,23.258065,...,0.063548,49.774194,1.0,0.0,0.0,7.0,10.0,6.0,0.0,23.0
3,Petr Yan,Bantamweight,1.0,4.0,0.0,14.0,11.0,48.75,10.5,53.75,...,0.0975,34.25,0.0,0.0,0.0,2.0,2.0,0.0,0.0,4.0
4,Blagoy Ivanov,Heavyweight,1.0,1.0,0.0,14.5,2.0,59.5,0.0,45.0,...,0.0,90.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [144]:
#changing the weight class from string to numeric: 





In [47]:
newView[['B_fighter', "B_wins"]].loc[newView['B_fighter'] == "Khabib Nurmagomedov"]


Unnamed: 0,B_fighter,B_wins
2478,Khabib Nurmagomedov,5.0


In [139]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

stats = newView.drop(columns = ["B_fighter", "R_current_lose_streak", "weight_class"])

In [140]:
stats.head()

Unnamed: 0,B_current_win_streak,B_draw,B_avg_BODY_landed,B_avg_CLINCH_landed,B_avg_DISTANCE_landed,B_avg_GROUND_landed,B_avg_HEAD_landed,B_avg_KD,B_avg_LEG_landed,B_avg_PASS,...,B_avg_opp_TD_pct,B_avg_opp_TOTAL_STR_landed,B_total_title_bouts,B_win_by_Decision_Majority,B_win_by_Decision_Split,B_win_by_Decision_Unanimous,B_win_by_KO/TKO,B_win_by_Submission,B_win_by_TKO_Doctor_Stoppage,B_wins
0,4.0,0.0,6.0,0.0,20.6,2.0,11.2,0.8,5.4,0.4,...,0.1,19.2,0.0,0.0,1.0,0.0,2.0,1.0,0.0,4.0
1,3.0,0.0,9.1,7.3,42.1,1.9,32.0,0.0,10.2,0.8,...,0.231,75.4,0.0,0.0,2.0,1.0,0.0,0.0,1.0,4.0
2,3.0,0.0,11.322581,4.387097,38.580645,3.806452,23.258065,0.645161,12.193548,0.935484,...,0.063548,49.774194,1.0,0.0,0.0,7.0,10.0,6.0,0.0,23.0
3,4.0,0.0,14.0,11.0,48.75,10.5,53.75,0.5,2.5,0.5,...,0.0975,34.25,0.0,0.0,0.0,2.0,2.0,0.0,0.0,4.0
4,1.0,0.0,14.5,2.0,59.5,0.0,45.0,0.0,2.0,0.0,...,0.0,90.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [97]:
# Standardize every statistic to zero mean and unit variance
scaler = StandardScaler()
x = scaler.fit_transform(stats)
display(x)


knn = NearestNeighbors(metric = "cosine", algorithm = "brute")
knn.fit(x)

array([[ 2.73854817,  0.        ,  0.07002636, ...,  0.4162804 ,
        -0.18312704,  0.46719495],
       [ 1.92651123,  0.        ,  0.78425064, ..., -0.46619157,
         4.80633435,  0.46719495],
       [ 1.92651123,  0.        ,  1.29632196, ...,  4.8286402 ,
        -0.18312704,  6.11538868],
       ...,
       [ 0.30243735,  0.        , -0.62115844, ..., -0.46619157,
        -0.18312704, -0.42462512],
       [-0.5095996 ,  0.        , -0.2010265 , ..., -0.46619157,
        -0.18312704, -0.72189847],
       [-0.5095996 ,  0.        , -0.2010265 , ..., -0.46619157,
        -0.18312704, -0.72189847]])

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
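
One detail of this setup is worth illustrating: when the fitted model is queried with its own training rows, each row's nearest neighbour is the row itself. The sketch below shows this on five invented rows of statistics; the numbers are made up and `n_neighbors=3` is chosen only to keep the output small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Invented statistics for five fighters (rows) over three features (columns).
raw = np.array([
    [10.0, 1.0, 0.2],
    [12.0, 1.5, 0.3],
    [50.0, 0.1, 0.9],
    [48.0, 0.2, 0.8],
    [30.0, 3.0, 0.5],
])

scaled = StandardScaler().fit_transform(raw)

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(scaled)

# Query with the training rows themselves, asking for 3 neighbours each.
distances, indices = knn.kneighbors(scaled, n_neighbors=3)

# Column 0 holds each row's own position (cosine distance of about 0),
# so the actual recommendations start at column 1.
print(indices)
```

To obtain k recommendations per fighter, one therefore requests k + 1 neighbours and skips the first.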

In [101]:
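# Ask for 6 neighbours per fighter: the closest match to each row is the row
# itself, which leaves 5 recommendations once it is skipped.
fighter_indices = knn.kneighbors(x, n_neighbors=6)[1]
len(fighter_indices)

1662

In [134]:
def get_index(name):
    # Index label in df of the first match where this fighter is the B_fighter
    return df[df['B_fighter'] == name].index.tolist()[0]

def recommend_me(player):
    print("5 Players similar to {} are : ".format(player))
    # x and fighter_indices are ordered by row position in newView, which keeps
    # df's index labels, so convert the label into a position first.
    position = newView.index.get_loc(get_index(player))
    # Skip the first neighbour, which is the fighter itself
    for i in fighter_indices[position][1:]:
        print(newView.iloc[i]['B_fighter'])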

In [137]:
get_index("Jorge Masvidal")

124

In [138]:
recommend_me("Jorge Masvidal")

5 Players similar to Jorge Masvidal are : 
Anthony Rocco Martin
Abdul Razak Alhassan
Alexander Gustafsson
Andre Fili
Anthony Smith
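
The neighbour indices returned by `kneighbors` are row positions in the matrix the model was fit on, so names have to be looked up in a frame with the same row order. The following is a minimal, self-contained sketch of that alignment. The fighters, statistics, and the `similar_fighters` helper are all hypothetical.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Invented bouts; "Fighter A" appears twice, as fighters do in the real data.
bouts = pd.DataFrame({
    "B_fighter": ["Fighter A", "Fighter B", "Fighter A", "Fighter C", "Fighter D"],
    "B_avg_HEAD_landed": [30.0, 28.0, 25.0, 5.0, 6.0],
    "B_avg_TD_landed": [0.5, 0.6, 0.4, 4.0, 3.5],
})

# One row per fighter, then renumber so that index label == row position.
fighters = bouts.drop_duplicates(subset="B_fighter", keep="first").reset_index(drop=True)

features = StandardScaler().fit_transform(fighters.drop(columns="B_fighter"))
model = NearestNeighbors(metric="cosine", algorithm="brute").fit(features)

def similar_fighters(name, n=1):
    """Return the n fighters closest to `name` (hypothetical helper)."""
    position = fighters.index[fighters["B_fighter"] == name][0]
    # n + 1 neighbours, because the closest match is the fighter itself.
    neighbours = model.kneighbors(features[[position]], n_neighbors=n + 1)[1][0]
    return [fighters.loc[i, "B_fighter"] for i in neighbours if i != position][:n]

print(similar_fighters("Fighter A"))
print(similar_fighters("Fighter C"))
```

Resetting the index after `drop_duplicates` is the key design choice: it keeps labels and positions identical, so a single number can be used both to query the model and to look up the fighter's name.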
