# Project

## Data Cleaning and Imputation

We begin by loading the dataset and displaying the first five matches to confirm that it loaded correctly.

In [1]:
import pandas as pd

df = pd.read_csv("data.csv")
df.head(5)

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,2.0,0.0,0.0,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0.0,...,0.0,2.0,0.0,5.0,Southpaw,165.1,167.64,125.0,32.0,31.0
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,3.0,6.0,1.0,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,1.0,0.0,0.0,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,2.0,0.0,0.0,3.0,Southpaw,187.96,190.5,264.0,32.0,26.0


### Missing Stances and Referee

Some of the matches feature fighters with missing stance information or have no referee recorded. These account for only 279 of the 5144 matches (about 5%), so we drop them from consideration.

In [2]:
# Show the number of matches before removing rows with missing stance or referee information
print('Number of matches prior to filtering: ' + str(len(df)))

# Remove matches with a missing stance (either corner) or a missing referee
filter1 = df[df['B_Stance'].notnull()]
filter2 = filter1[filter1['R_Stance'].notnull()]
filter3 = filter2[filter2['Referee'].notnull()]
df = filter3
print('Number of matches after filtering: ' + str(len(df)))

Number of matches prior to filtering: 5144
Number of matches after filtering: 4865
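
The filtering above can also be expressed as a single dropna call, and counting the missing entries per column first makes it easy to check how many rows are affected. The following is a minimal sketch on a toy frame: the rows are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only; the column names match the dataset.
toy = pd.DataFrame({
    "R_Stance": ["Orthodox", None, "Southpaw", "Orthodox"],
    "B_Stance": ["Southpaw", "Orthodox", None, "Orthodox"],
    "Referee": ["Marc Goddard", "Dan Miragliotta", "Kevin MacDonald", None],
})

required = ["B_Stance", "R_Stance", "Referee"]

# Count missing entries per required column before dropping anything.
missing_counts = toy[required].isnull().sum()

# Drop every row that is missing at least one of the required columns.
toy_clean = toy.dropna(subset=required)

print(missing_counts)
print(f"Rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

On the real data, the same call would be applied to df in place of toy.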


### Missing Numerical Data

Many rows are missing values in some of the numerical columns. We fill each missing value with the median of its column.

First, we list the columns that contain missing values so that we can later demonstrate that the imputation worked.

In [3]:
df.columns[df.isnull().any()]

Index(['B_avg_BODY_att', 'B_avg_BODY_landed', 'B_avg_CLINCH_att',
       'B_avg_CLINCH_landed', 'B_avg_DISTANCE_att', 'B_avg_DISTANCE_landed',
       'B_avg_GROUND_att', 'B_avg_GROUND_landed', 'B_avg_HEAD_att',
       'B_avg_HEAD_landed',
       ...
       'R_avg_opp_SUB_ATT', 'R_avg_opp_TD_att', 'R_avg_opp_TD_landed',
       'R_avg_opp_TD_pct', 'R_avg_opp_TOTAL_STR_att',
       'R_avg_opp_TOTAL_STR_landed', 'R_total_time_fought(seconds)',
       'R_Reach_cms', 'B_age', 'R_age'],
      dtype='object', length=104)
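
Listing the affected columns does not show how much of each one is missing. As a rough sketch, the counts and shares per column could be computed as follows. The toy values are invented for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names borrowed from the dataset.
toy = pd.DataFrame({
    "R_age": [32.0, np.nan, 35.0, np.nan],
    "B_age": [31.0, 32.0, np.nan, 26.0],
    "R_Weight_lbs": [135.0, 125.0, 155.0, 135.0],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Express the counts as a fraction of all rows.
missing_share = missing / len(toy)

print(missing)
print(missing_share)
```

Columns with a large missing share are the ones where median imputation changes the distribution the most, so this view is a useful sanity check before filling.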

We use R_age for the demonstration and select the first five rows in which it is missing.

In [4]:
people = df[df['R_age'].isnull()].head(5)
people

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
4171,Per Eklund,Samy Schiavo,Leon Roberts,2008-10-18,"Birmingham, England, United Kingdom",Red,False,Lightweight,3,1.0,...,0.0,0.0,0.0,0.0,Orthodox,177.8,182.88,155.0,32.0,
4376,Jess Liaudin,Anthony Torres,Mario Yamasaki,2007-09-08,"London, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,1.0,0.0,1.0,Orthodox,175.26,182.88,170.0,29.0,
4438,Jess Liaudin,Dennis Siver,Steve Mazzagatti,2007-04-21,"Manchester, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,175.26,182.88,170.0,28.0,
4767,Keith Rockel,Chris Liguori,John McCarthy,2003-11-21,"Uncasville, Connecticut, USA",Red,False,Middleweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,182.88,,185.0,,
4908,Ben Earwood,Chris Lytle,Mario Yamasaki,2000-11-17,"Atlantic City, New Jersey, USA",Red,False,Welterweight,2,0.0,...,0.0,0.0,0.0,0.0,Orthodox,172.72,,170.0,26.0,


Next, we compute the median of that column; pandas skips missing values by default when doing so.

In [5]:
df['R_age'].median()

29.0

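Now we apply fillna to the entire dataset and show that the missing R_age values in these rows have been replaced with the median, 29.0. The missing R_Reach_cms and B_age values in the same rows are filled with their own column medians as well.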

In [6]:
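# Fill missing values with each numeric column's median; numeric_only=True skips
# text columns, for which recent pandas versions raise a TypeError
df = df.fillna(df.median(numeric_only=True))
df.loc[people.index]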

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
4171,Per Eklund,Samy Schiavo,Leon Roberts,2008-10-18,"Birmingham, England, United Kingdom",Red,False,Lightweight,3,1.0,...,0.0,0.0,0.0,0.0,Orthodox,177.8,182.88,155.0,32.0,29.0
4376,Jess Liaudin,Anthony Torres,Mario Yamasaki,2007-09-08,"London, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,1.0,0.0,1.0,Orthodox,175.26,182.88,170.0,29.0,29.0
4438,Jess Liaudin,Dennis Siver,Steve Mazzagatti,2007-04-21,"Manchester, England, United Kingdom",Red,False,Welterweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,175.26,182.88,170.0,28.0,29.0
4767,Keith Rockel,Chris Liguori,John McCarthy,2003-11-21,"Uncasville, Connecticut, USA",Red,False,Middleweight,3,0.0,...,0.0,0.0,0.0,0.0,Orthodox,182.88,185.42,185.0,29.0,29.0
4908,Ben Earwood,Chris Lytle,Mario Yamasaki,2000-11-17,"Atlantic City, New Jersey, USA",Red,False,Welterweight,2,0.0,...,0.0,0.0,0.0,0.0,Orthodox,172.72,185.42,170.0,26.0,29.0
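
Spot-checking five rows shows the idea, but a whole-frame check is a stronger confirmation. The sketch below uses invented toy data with hypothetical fighter names. It restricts the medians to numeric columns with select_dtypes, which is one possible way to do this, and then counts any numeric values that are still missing.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; fighter names are hypothetical.
toy = pd.DataFrame({
    "R_fighter": ["A", "B", "C", "D"],
    "R_age": [28.0, np.nan, 30.0, 35.0],
    "R_Reach_cms": [180.0, 185.0, np.nan, 190.0],
})

# Compute medians for numeric columns only; text columns are left untouched.
numeric_cols = toy.select_dtypes(include="number").columns
medians = toy[numeric_cols].median()

# fillna with a Series fills each column using the value under its own label.
filled = toy.fillna(medians)

# Confirm that no numeric value is still missing.
remaining = int(filled[numeric_cols].isnull().sum().sum())
print(medians)
print(f"Missing numeric values after imputation: {remaining}")
```

On the real data, the same check would be run on df after the fillna step.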
