<h1>After Scraping: Cleaning and Feature Engineering</h1>

- Acknowledgements:
    - ufcstats for comprehensive data sets on past MMA bouts: http://ufcstats.com/
    - Rajeev Warrier for providing the groundwork for this prediction project: https://github.com/WarrierRajeev/UFC-Predictions

In [2437]:
import pandas as pd
import numpy as np

DATA_PATH ='./data'
df_fighters = pd.read_csv(DATA_PATH+'/fighter_details.csv')
df_fights = pd.read_csv(DATA_PATH+'/total_fight_data.csv', sep=';')

In [2438]:
df_fighters.head(3)

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB
0,Tom Aaron,,155 lbs.,,,"Jul 13, 1978"
1,Danny Abbadi,"5' 11""",155 lbs.,,Orthodox,"Jul 03, 1983"
2,David Abbott,"6' 0""",265 lbs.,,Switch,


In [2439]:
df_fights.head(3)

Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,R_SIG_STR_pct,B_SIG_STR_pct,R_TOTAL_STR.,B_TOTAL_STR.,...,B_GROUND,win_by,last_round,last_round_time,Format,Referee,date,location,Fight_type,Winner
0,Kevin Lee,Charles Oliveira,0,0,41 of 80,43 of 65,51%,66%,61 of 100,51 of 73,...,6 of 7,Submission,3,0:28,5 Rnd (5-5-5-5-5),Mike Beltran,"March 14, 2020","Brasilia, Distrito Federal, Brazil",Lightweight Bout,Charles Oliveira
1,Demian Maia,Gilbert Burns,0,1,4 of 7,13 of 16,57%,81%,4 of 7,14 of 17,...,8 of 9,KO/TKO,1,2:34,3 Rnd (5-5-5),Osiris Maia,"March 14, 2020","Brasilia, Distrito Federal, Brazil",Welterweight Bout,Gilbert Burns
2,Renato Moicano,Damir Hadzovic,0,0,1 of 2,1 of 5,50%,20%,4 of 5,1 of 5,...,0 of 0,Submission,1,0:44,3 Rnd (5-5-5),Eduardo Herdy,"March 14, 2020","Brasilia, Distrito Federal, Brazil",Lightweight Bout,Renato Moicano


<h3>Processing Fighter data set</h3> 

In [2441]:
df_fighters.isna().sum()

fighter_name       0
Height           257
Weight            74
Reach           1714
Stance           786
DOB              739
dtype: int64

- fighters with NaN Weight values have little to no useful data
    - therefore, these rows will be excluded

In [2442]:
df_fighters[pd.isnull(df_fighters['Weight'])].isna().sum()

fighter_name     0
Height          68
Weight          74
Reach           74
Stance          65
DOB             72
dtype: int64

In [2443]:
df_fighters = df_fighters[df_fighters['Weight'].notna()]

- to fill NaN values in bodily metrics, find:
    - average reach for each height increment
    - average height for each weight increment

In [2444]:
df_fighters['Weight'] = df_fighters['Weight'].apply(lambda x: x.split(' ')[0])
df_fighters['Weight'] = df_fighters['Weight'].astype(float)

In [2445]:
df_fighters['Height'] = df_fighters['Height'].fillna('0\' 0\"')
df_fighters['Height'] = df_fighters['Height'].apply(lambda x: int(x.split('\' ')[0])*12 + int(x.split('\' ')[1].replace('\"','')))
df_fighters['Height'] = df_fighters['Height'].replace(0, np.nan).astype(float)

In [2446]:
df_fighters['Height'] = df_fighters.groupby('Weight')['Height'].apply(lambda x: x.fillna(x.mean()))
df_fighters['Height'] = df_fighters['Height'].fillna(df_fighters['Height'].mean())

In [2447]:
df_fighters['Reach'] = df_fighters['Reach'].fillna('0')
df_fighters['Reach'] = df_fighters['Reach'].apply(lambda x: x.replace('\"',''))
df_fighters['Reach'] = df_fighters['Reach'].replace('0', np.nan).astype(float)

In [2448]:
df_fighters['Reach'] = df_fighters.groupby('Height')['Reach'].apply(lambda x: x.fillna(x.mean()))
df_fighters['Reach'] = df_fighters['Reach'].fillna(df_fighters['Reach'].mean())

In [2449]:
df_fighters['Stance'].value_counts()

Orthodox       2047
Southpaw        460
Switch          100
Open Stance       7
Sideways          3
Name: Stance, dtype: int64

<h3>Processing Fight data set</h3>

- split attack stats into attempts/landed numerical format

In [2451]:
df_fights.columns
attack_cols = ['R_SIG_STR.', 'B_SIG_STR.','R_TOTAL_STR.', 'B_TOTAL_STR.',
       'R_TD', 'B_TD', 'R_HEAD', 'B_HEAD', 'R_BODY',
       'B_BODY', 'R_LEG', 'B_LEG', 'R_DISTANCE', 'B_DISTANCE', 'R_CLINCH',
       'B_CLINCH', 'R_GROUND', 'B_GROUND']

In [2452]:
for col in attack_cols:
    df_fights[col+'_ATT'] = df_fights[col].apply(lambda x: int(x.split('of')[1]))
    df_fights[col+'_LANDED'] = df_fights[col].apply(lambda x: int(x.split('of')[0]))

In [2453]:
df_fights.drop(attack_cols, axis=1, inplace=True)

- check for NULL values

In [2454]:
for col in df_fights:
    if df_fights[col].isnull().sum()!=0:
        print(f'Null count in {col} = {df_fights[col].isnull().sum()}')

Null count in Referee = 25
Null count in Winner = 94


In [2455]:
df_fights[df_fights['Winner'].isnull()]['win_by'].value_counts()

Overturned              38
Decision - Majority     23
Could Not Continue      15
Decision - Split        11
Decision - Unanimous     5
Other                    2
Name: win_by, dtype: int64

In [2456]:
df_fights['Winner'].fillna('Draw', inplace=True)

- convert percentages to decimal values

In [2457]:
percentage_columns = ['R_SIG_STR_pct', 'B_SIG_STR_pct', 'R_TD_pct', 'B_TD_pct']

for col in percentage_columns:
    df_fights[col] = df_fights[col].apply(lambda x : float(x.replace('%',''))/100)

- isolating Title fights and weight classes

In [2458]:
df_fights['Fight_type'].value_counts()[df_fights['Fight_type'].value_counts() > 1].index

Index(['Lightweight Bout', 'Welterweight Bout', 'Middleweight Bout',
       'Light Heavyweight Bout', 'Heavyweight Bout', 'Featherweight Bout',
       'Bantamweight Bout', 'Flyweight Bout', 'Women's Strawweight Bout',
       'Women's Bantamweight Bout', 'Open Weight Bout',
       'Women's Flyweight Bout', 'UFC Light Heavyweight Title Bout',
       'UFC Welterweight Title Bout', 'UFC Heavyweight Title Bout',
       'UFC Middleweight Title Bout', 'UFC Lightweight Title Bout',
       'Catch Weight Bout', 'UFC Flyweight Title Bout',
       'UFC Women's Bantamweight Title Bout', 'UFC Featherweight Title Bout',
       'UFC Bantamweight Title Bout', 'UFC Women's Strawweight Title Bout',
       'Women's Featherweight Bout', 'UFC Interim Heavyweight Title Bout',
       'UFC Women's Flyweight Title Bout',
       'UFC Women's Featherweight Title Bout',
       'UFC Superfight Championship Bout',
       'UFC Interim Bantamweight Title Bout',
       'UFC Interim Middleweight Title Bout',
       'UFC

In [2459]:
df_fights['title_bout'] = df_fights['Fight_type'].apply(lambda x: 1 if 'Title Bout' in x else 0) 

In [2460]:
weight_classes = ['Women\'s Strawweight', 'Women\'s Bantamweight', 
                  'Women\'s Featherweight', 'Women\'s Flyweight', 'Lightweight', 
                  'Welterweight', 'Middleweight','Light Heavyweight', 
                  'Heavyweight', 'Featherweight','Bantamweight', 'Flyweight', 'Open Weight']

def make_weight_class(x):
    for weight_class in weight_classes:
        if weight_class in x:
            return weight_class
    if x == 'Catch Weight Bout' or 'Catchweight Bout':
        return 'Catch Weight'
    else:
        return 'Open Weight'

In [2461]:
df_fights['weight_class'] = df_fights['Fight_type'].apply(make_weight_class)

In [2462]:
df_fights['weight_class'].value_counts()

Lightweight              1043
Welterweight             1027
Middleweight              763
Heavyweight               539
Light Heavyweight         536
Featherweight             488
Bantamweight              422
Flyweight                 206
Women's Strawweight       165
Women's Bantamweight      130
Open Weight                93
Women's Flyweight          78
Catch Weight               39
Women's Featherweight      14
Name: weight_class, dtype: int64

- isolate total fight time (seconds)

In [2463]:
df_fights['Format'].value_counts()

3 Rnd (5-5-5)           4860
5 Rnd (5-5-5-5-5)        459
1 Rnd + OT (12-3)         80
No Time Limit             37
3 Rnd + OT (5-5-5-5)      22
1 Rnd (20)                21
1 Rnd + 2OT (15-3-3)      20
2 Rnd (5-5)               14
1 Rnd (15)                 8
1 Rnd (10)                 6
1 Rnd (12)                 4
1 Rnd + OT (30-5)          3
1 Rnd (18)                 2
1 Rnd + OT (15-3)          2
1 Rnd (30)                 1
1 Rnd + 2OT (24-3-3)       1
1 Rnd + OT (27-3)          1
1 Rnd + OT (30-3)          1
1 Rnd + OT (31-5)          1
Name: Format, dtype: int64

In [2464]:
time_in_first_round = {'3 Rnd (5-5-5)': 5*60, 
                       '5 Rnd (5-5-5-5-5)': 5*60, 
                       '1 Rnd + OT (12-3)': 12*60,
                       'No Time Limit': 1, 
                       '3 Rnd + OT (5-5-5-5)': 5*60, 
                       '1 Rnd (20)': 1*20,
                       '2 Rnd (5-5)': 5*60, 
                       '1 Rnd (15)': 15*60, 
                       '1 Rnd (10)': 10*60,
                       '1 Rnd (12)':12*60, 
                       '1 Rnd + OT (30-5)': 30*60, 
                       '1 Rnd (18)': 18*60, 
                       '1 Rnd + OT (15-3)': 15*60,
                       '1 Rnd (30)': 30*60, 
                       '1 Rnd + OT (31-5)': 31*5,
                       '1 Rnd + OT (27-3)': 27*60, 
                       '1 Rnd + OT (30-3)': 30*60}

exception_format_time = {'1 Rnd + 2OT (15-3-3)': [15*60, 3*60], 
                         '1 Rnd + 2OT (24-3-3)': [24*60, 3*60]}

# '1 Rnd + 2OT (15-3-3)' and '1 Rnd + 2OT (24-3-3)' is not included because it has 3 uneven timed rounds. 
# We'll have to deal with it separately

In [2465]:
# Converting to seconds
df_fights['last_round_time'] = df_fights['last_round_time'].apply(lambda x: int(x.split(':')[0])*60 + int(x.split(':')[1]))

In [2466]:
def get_total_time(row):
    if row['Format'] in time_in_first_round.keys():
        return (row['last_round'] - 1) * time_in_first_round[row['Format']] + row['last_round_time']
    elif row['Format'] in exception_format_time.keys():
        if (row['last_round'] - 1) >= 2:
            return exception_format_time[row['Format']][0] + (row['last_round'] - 2) * \
                    exception_format_time[row['Format']][1] + row['last_round_time']
        else:
            return (row['last_round'] - 1) * exception_format_time[row['Format']][0] + row['last_round_time']

In [2467]:
df_fights['total_time_fought(sec)'] = df_fights.apply(get_total_time, axis=1)

In [2468]:
def get_num_rounds(x):
    if x == 'No Time Limit':
        return 1
    else:
        return len((x.split('(')[1].replace(')','').split('-')))
    
df_fights['no_of_rounds'] = df_fights['Format'].apply(get_num_rounds)

- there are too many distinct locations
    - in order to create a more signifcant feature, location is adapted to a binary indicator of whether or not the fight took place in Las Vegas, Nevada (i.e. the most popular fight location)

In [2523]:
df_fights['location'].value_counts()

0    4279
1    1264
Name: location, dtype: int64

In [2522]:
df_fights['location']=df_fights['location'].apply(lambda x: 1 if str(x).find('Las Vegas')!=-1 else 0)

- change Date of Birth and fight date from string to datetime

In [2524]:
from datetime import datetime

month_code = {'Jan ': 'January ', 
      'Feb ': 'February ', 
      'Mar ': 'March ', 
      'Apr ': 'April ', 
      'May ': 'May ', 
      'Jun ': 'June ', 
      'Jul ': 'July ', 
      'Aug ': 'August ', 
      'Sep ': 'September ', 
      'Oct ': 'October ', 
      'Nov ': 'November ', 
      'Dec ': 'December '}

for k, v in month_code.items():
    df_fighters['DOB'] = df_fighters['DOB'].apply(lambda x: x.replace(k, v) if type(x) == str else x)

df_fighters['DOB'] = df_fighters['DOB'].apply(lambda row: datetime.strptime(row, '%B %d, %Y') if type(row) == str else row)
df_fights['date'] = df_fights['date'].apply(lambda row: datetime.strptime(row, '%B %d, %Y') if type(row) == str else row)

- recode winner column to binary and drop obsolete columns

In [2527]:
df_fights['R_win'] = df_fights.apply(lambda row: 1 if row['Winner'] == row['R_fighter'] else 0, axis=1)

df_fights.drop(columns = ['Format', 'Referee','Fight_type','Winner'], inplace=True)

KeyError: ('Winner', 'occurred at index 0')

- recode win_by feature into bins for Submission, KO, or Other

In [2528]:
df_fights['win_by'].value_counts()

Decision - Unanimous       1903
KO/TKO                     1763
Submission                 1136
Decision - Split            533
TKO - Doctor's Stoppage      74
Decision - Majority          62
Overturned                   38
DQ                           17
Could Not Continue           15
Other                         2
Name: win_by, dtype: int64

<h3>Consolidate red/blue corner stats to align them with the correct fighter</h3>

In [2473]:
df_red = df_fights[['R_fighter','R_KD', 'R_SIG_STR_pct',
       'R_TD_pct', 'R_SUB_ATT',
       'R_PASS', 'R_REV', 'win_by', 'last_round',
       'last_round_time', 'Format', 'Referee', 'date', 'location',
       'Fight_type', 'Winner', 'R_SIG_STR._ATT', 'R_SIG_STR._LANDED',
       'R_TOTAL_STR._ATT',
       'R_TOTAL_STR._LANDED',
       'R_TD_ATT', 'R_TD_LANDED', 'R_HEAD_ATT',
       'R_HEAD_LANDED', 'R_BODY_ATT',
       'R_BODY_LANDED',  'R_LEG_ATT',
       'R_LEG_LANDED',  'R_DISTANCE_ATT',
       'R_DISTANCE_LANDED', 
       'R_CLINCH_ATT', 'R_CLINCH_LANDED',
       'R_GROUND_ATT', 'R_GROUND_LANDED',
       'title_bout', 'weight_class', 'total_time_fought(sec)', 'no_of_rounds']]

df_blue = df_fights[['B_fighter',  'B_KD',
       'B_SIG_STR_pct','B_TD_pct', 'B_SUB_ATT',
       'B_PASS',  'B_REV', 'win_by', 'last_round',
       'last_round_time', 'Format', 'Referee', 'date', 'location',
       'Fight_type', 'Winner',
       'B_SIG_STR._ATT', 'B_SIG_STR._LANDED',
       'B_TOTAL_STR._ATT', 'B_TOTAL_STR._LANDED',
       'B_TD_ATT', 'B_TD_LANDED',
       'B_HEAD_ATT', 'B_HEAD_LANDED', 
       'B_BODY_ATT', 'B_BODY_LANDED', 
       'B_LEG_ATT', 'B_LEG_LANDED', 
       'B_DISTANCE_ATT', 'B_DISTANCE_LANDED',
       'B_CLINCH_ATT', 'B_CLINCH_LANDED',
       'B_GROUND_ATT', 'B_GROUND_LANDED',
       'title_bout', 'weight_class', 'total_time_fought(sec)', 'no_of_rounds']]

- get rid of red/blue corner prefixes in order to union fighter history

In [2474]:
def drop_prefix(self, prefix):
    self.columns = self.columns.str.replace('^'+prefix,'')
    return self

pd.core.frame.DataFrame.drop_prefix = drop_prefix

In [2475]:
union = pd.concat([df_red.drop_prefix('R_'), df_blue.drop_prefix('B_')])

- join this combined fight history DataFrame to the originial fighter DataFrame

In [2478]:
union[union['fighter']=='Daniel Cormier'].head(3)

Unnamed: 0,fighter,KD,SIG_STR_pct,TD_pct,SUB_ATT,PASS,REV,win_by,last_round,last_round_time,...,DISTANCE_ATT,DISTANCE_LANDED,CLINCH_ATT,CLINCH_LANDED,GROUND_ATT,GROUND_LANDED,title_bout,weight_class,total_time_fought(sec),no_of_rounds
280,Daniel Cormier,0,0.68,0.33,0,2,0,KO/TKO,4,249,...,209,139,27,21,27,21,1,Heavyweight,1149,5
710,Daniel Cormier,0,0.76,1.0,1,4,0,Submission,2,134,...,6,4,1,1,18,14,1,Heavyweight,434,5
1064,Daniel Cormier,0,0.52,0.66,1,3,0,KO/TKO,2,120,...,46,24,8,3,7,5,1,Light Heavyweight,420,5


In [2479]:
df_fighter_history = pd.merge(df_fighters, union, left_on='fighter_name', right_on='fighter', how='left', indicator=True)

- 1,330 fighters without any fight stats (in original fighter dataset)
    - However, every fighter involved in a historical bout is contained in the original fighter dataset
    - UPDATE: after analysis using the above 1,330 fighters, they will be dropped to ensure data quality and avoid "garbage in, garbage out

In [2480]:
df_fighter_history._merge.value_counts()

both          11076
left_only      1330
right_only        0
Name: _merge, dtype: int64

In [2481]:
df_fighter_history = df_fighter_history[df_fighter_history._merge != 'left_only']

In [2482]:
union.shape

(11086, 38)

In [2483]:
df_fighter_history.shape

(11076, 45)

- lack of depth in individual fight history presents a problem for forecasting fighter performance

In [2484]:
df_fighter_history['fighter_name'].value_counts()

Donald Cerrone        34
Jim Miller            34
Demian Maia           32
Jeremy Stephens       32
Diego Sanchez         31
                      ..
Alberta Cerra Leon     1
Neil Grove             1
Scott Fiedler          1
Chris Sanford          1
Wade Shipp             1
Name: fighter_name, Length: 2008, dtype: int64

<h3>Feature Engineering</h3>

In [2519]:
df_fights.head(5).T

Unnamed: 0,0,1,2,3,4
R_fighter,Kevin Lee,Demian Maia,Renato Moicano,Johnny Walker,Francisco Trinaldo
B_fighter,Charles Oliveira,Gilbert Burns,Damir Hadzovic,Nikita Krylov,John Makdessi
R_KD,0,0,0,0,0
B_KD,0,1,0,0,0
R_SIG_STR_pct,0.51,0.57,0.5,0.74,0.43
B_SIG_STR_pct,0.66,0.81,0.2,0.77,0.54
R_TD_pct,0.66,1,1,0,0
B_TD_pct,0,0,0,0.37,0
R_SUB_ATT,0,0,1,0,0
B_SUB_ATT,2,0,0,0,0


In [2485]:
df_fighter_history.head(6).T

Unnamed: 0,1,2,3,4,5,6
fighter_name,Danny Abbadi,Danny Abbadi,David Abbott,David Abbott,David Abbott,David Abbott
Height,71,71,72,72,72,72
Weight,155,155,265,265,265,265
Reach,72.6813,72.6813,73.75,73.75,73.75,73.75
Stance,Orthodox,Orthodox,Switch,Switch,Switch,Switch
DOB,"Jul 03, 1983","Jul 03, 1983",,,,
fighter,Danny Abbadi,Danny Abbadi,David Abbott,David Abbott,David Abbott,David Abbott
KD,0,0,0,0,1,0
SIG_STR_pct,0.38,0.33,0.68,0.41,0.52,0.44
TD_pct,0,0,0,0.75,1,0


In [2486]:
df_fighter_history.drop(columns = ['fighter','Format','Referee','Fight_type'], inplace=True)

- NaN values in fight stats columns were resulting from the left merge
    - remaining NaN values only exist from fighter data set

In [2489]:
df_fighter_history.isna().sum()

fighter_name                0
Height                      0
Weight                      0
Reach                       0
Stance                     83
DOB                       229
KD                          0
SIG_STR_pct                 0
TD_pct                      0
SUB_ATT                     0
PASS                        0
REV                         0
win_by                      0
last_round                  0
last_round_time             0
date                        0
location                    0
Winner                      0
SIG_STR._ATT                0
SIG_STR._LANDED             0
TOTAL_STR._ATT              0
TOTAL_STR._LANDED           0
TD_ATT                      0
TD_LANDED                   0
HEAD_ATT                    0
HEAD_LANDED                 0
BODY_ATT                    0
BODY_LANDED                 0
LEG_ATT                     0
LEG_LANDED                  0
DISTANCE_ATT                0
DISTANCE_LANDED             0
CLINCH_ATT                  0
CLINCH_LAN

- replacing NaN values:
    - numerical: column mean
    - categorica: column mode
    - date: column mean

In [2492]:
df_fighter_history.fillna(df_fighter_history.mean(), inplace=True)
df_fighter_history = df_fighter_history.apply(lambda x:x.fillna(x.value_counts().index[0]))
df_fighter_history['date'] = df_fighter_history['date'].apply(lambda row: datetime.strptime(row, '%B %d, %Y') if type(row) == str else row)

- creating age (at fight date) feature

In [2493]:
df_fighter_history['age'] = df_fighter_history['date'] - df_fighter_history['DOB']
df_fighter_history['age']=df_fighter_history['age']/np.timedelta64(1,'Y')
df_fighter_history['age']=df_fighter_history['age'].apply(lambda x: 25 if x <=18 else x)

In [2494]:
df_fighter_history.isna().sum()

fighter_name              0
Height                    0
Weight                    0
Reach                     0
Stance                    0
DOB                       0
KD                        0
SIG_STR_pct               0
TD_pct                    0
SUB_ATT                   0
PASS                      0
REV                       0
win_by                    0
last_round                0
last_round_time           0
date                      0
location                  0
SIG_STR._ATT              0
SIG_STR._LANDED           0
TOTAL_STR._ATT            0
TOTAL_STR._LANDED         0
TD_ATT                    0
TD_LANDED                 0
HEAD_ATT                  0
HEAD_LANDED               0
BODY_ATT                  0
BODY_LANDED               0
LEG_ATT                   0
LEG_LANDED                0
DISTANCE_ATT              0
DISTANCE_LANDED           0
CLINCH_ATT                0
CLINCH_LANDED             0
GROUND_ATT                0
GROUND_LANDED             0
title_bout          

In [2495]:
df_fighter_history.drop(columns='_merge', inplace=True)

In [2496]:
df_fighter_history.columns

Index(['fighter_name', 'Height', 'Weight', 'Reach', 'Stance', 'DOB', 'KD',
       'SIG_STR_pct', 'TD_pct', 'SUB_ATT', 'PASS', 'REV', 'win_by',
       'last_round', 'last_round_time', 'date', 'location', 'SIG_STR._ATT',
       'SIG_STR._LANDED', 'TOTAL_STR._ATT', 'TOTAL_STR._LANDED', 'TD_ATT',
       'TD_LANDED', 'HEAD_ATT', 'HEAD_LANDED', 'BODY_ATT', 'BODY_LANDED',
       'LEG_ATT', 'LEG_LANDED', 'DISTANCE_ATT', 'DISTANCE_LANDED',
       'CLINCH_ATT', 'CLINCH_LANDED', 'GROUND_ATT', 'GROUND_LANDED',
       'title_bout', 'weight_class', 'total_time_fought(sec)', 'no_of_rounds',
       'won', 'age'],
      dtype='object')

- create features for 1) # of fights they've been in, 2) what % they won, and 3) the ranked order of past fights

In [2497]:
df_fighter_history['num_fights'] = df_fighter_history['date'].groupby(df_fighter_history['fighter_name']).transform('count')

In [2498]:
df_fighter_history['num_wins'] = df_fighter_history['won'].groupby(df_fighter_history['fighter_name']).transform('sum')

In [2499]:
df_fighter_history['record'] = df_fighter_history['num_wins']/df_fighter_history['num_fights']

In [2500]:
df_fighter_history['title_bout']=df_fighter_history['title_bout'].apply(lambda x: 1 if x == 1 else 0)

In [2501]:
df_fighter_history['fight_rank']=df_fighter_history.groupby('fighter_name')['date'].rank(ascending=True, method='first')

In [2502]:
df_fighter_history[df_fighter_history['fighter_name']=='David Abbott'].T

Unnamed: 0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
fighter_name,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott,David Abbott
Height,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72
Weight,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265
Reach,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75,73.75
Stance,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch,Switch
DOB,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00,1984-09-26 00:00:00
KD,0,0,1,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0
SIG_STR_pct,0.68,0.41,0.52,0.44,0.88,1,0.66,0.58,0.42,0,0.2,0.33,0.22,0.4,0.46,0.42,0.5,0.37
TD_pct,0,0.75,1,0,1,1,1,0,0,0,0,0.5,0,0,0,0,0,0.66
SUB_ATT,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1


- define feature groups by broad data type

In [2503]:
num_features = ['fighter_name','KD', 'SIG_STR_pct', 'TD_pct', 'SUB_ATT', 'PASS', 'REV',
       'last_round', 'last_round_time','SIG_STR._ATT',
       'SIG_STR._LANDED', 'TOTAL_STR._ATT', 'TOTAL_STR._LANDED', 'TD_ATT',
       'TD_LANDED', 'HEAD_ATT', 'HEAD_LANDED', 'BODY_ATT', 'BODY_LANDED',
       'LEG_ATT', 'LEG_LANDED', 'DISTANCE_ATT', 'DISTANCE_LANDED',
       'CLINCH_ATT', 'CLINCH_LANDED', 'GROUND_ATT', 'GROUND_LANDED','num_fights','num_wins','record',
       'total_time_fought(sec)', 'no_of_rounds']

categorical_features = ['Stance','win_by',
       'last_round', 'last_round_time', 'location',
       'title_bout', 'weight_class']

date_features = ['DOB','date']

- creating data frame of estimated fighter stats
    - avg. from all fights
    - avg. from last 5 fights
    - values from last fight

In [2504]:
df_fighter_estimates = df_fighter_history[num_features].groupby('fighter_name',as_index=False).mean()

- removing unnecessary stat features
    - total_time_fought(sec) and no_of_rounds are not fighter-specific
    - num_wins is removed to prevent multicollinearity with num_fights and record

In [2505]:
df_fighter_estimates.drop(columns=['total_time_fought(sec)','no_of_rounds','num_wins'], inplace=True)

- Check top fighters by record (i.e. undefeated fighters)

In [2506]:
df_fighter_estimates[df_fighter_estimates['num_fights']>5].sort_values(by='record', ascending=False).head()

Unnamed: 0,fighter_name,KD,SIG_STR_pct,TD_pct,SUB_ATT,PASS,REV,last_round,last_round_time,SIG_STR._ATT,...,LEG_ATT,LEG_LANDED,DISTANCE_ATT,DISTANCE_LANDED,CLINCH_ATT,CLINCH_LANDED,GROUND_ATT,GROUND_LANDED,num_fights,record
155,Arnold Allen,0.285714,0.384286,0.321429,0.428571,1.0,0.0,3.0,251.714286,110.285714,...,5.285714,3.142857,100.285714,39.571429,4.0,1.571429,6.0,4.142857,7,1.0
1998,Zabit Magomedsharipov,0.0,0.54,0.496667,0.666667,3.0,0.5,2.666667,276.333333,130.666667,...,14.5,13.0,114.666667,51.333333,3.5,2.5,12.5,9.5,6,1.0
1070,Kamaru Usman,0.454545,0.54,0.356364,0.181818,3.727273,0.0,3.454545,263.363636,144.909091,...,8.181818,6.909091,96.545455,38.909091,16.272727,13.727273,32.090909,24.0,11,1.0
766,Israel Adesanya,1.25,0.4975,0.0,0.375,0.25,0.0,3.25,277.625,129.0,...,20.625,17.625,119.5,55.375,6.375,5.125,3.125,2.375,8,1.0
67,Alexander Volkanovski,0.5,0.5775,0.3425,0.375,1.75,0.125,2.875,272.125,149.5,...,30.0,24.625,107.5,55.25,14.875,11.5,27.125,18.75,8,1.0


In [2507]:
df_fighter_estimates.head()

Unnamed: 0,fighter_name,KD,SIG_STR_pct,TD_pct,SUB_ATT,PASS,REV,last_round,last_round_time,SIG_STR._ATT,...,LEG_ATT,LEG_LANDED,DISTANCE_ATT,DISTANCE_LANDED,CLINCH_ATT,CLINCH_LANDED,GROUND_ATT,GROUND_LANDED,num_fights,record
0,Aalon Cruz,0.0,0.16,0.0,0.0,0.0,0.0,1.0,85.0,12.0,...,4.0,0.0,12.0,2.0,0.0,0.0,0.0,0.0,1,0.0
1,Aaron Brink,0.0,0.0,0.0,0.0,0.0,0.0,1.0,55.0,5.0,...,0.0,0.0,3.0,0.0,2.0,0.0,0.0,0.0,1,0.0
2,Aaron Phillips,0.0,0.575,0.0,0.5,0.5,0.5,3.0,300.0,47.0,...,3.0,2.0,25.5,11.5,12.0,10.0,9.5,6.5,2,0.0
3,Aaron Riley,0.0,0.344444,0.252222,0.111111,0.777778,0.0,2.222222,269.111111,100.444444,...,12.111111,8.333333,79.333333,21.111111,17.777778,11.333333,3.333333,2.444444,9,0.333333
4,Aaron Rosa,0.0,0.33,0.0,0.0,0.0,0.0,2.333333,171.333333,96.666667,...,3.333333,3.0,75.666667,26.666667,21.0,17.333333,0.0,0.0,3,0.333333


In [2508]:
df_fighter_history.head()

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB,KD,SIG_STR_pct,TD_pct,SUB_ATT,...,title_bout,weight_class,total_time_fought(sec),no_of_rounds,won,age,num_fights,num_wins,record,fight_rank
1,Danny Abbadi,71.0,155.0,72.68125,Orthodox,1983-07-03,0.0,0.38,0.0,0.0,...,0,Lightweight,900.0,3.0,0,23.225665,2,0,0.0,2.0
2,Danny Abbadi,71.0,155.0,72.68125,Orthodox,1983-07-03,0.0,0.33,0.0,0.0,...,0,Middleweight,176.0,3.0,0,22.976516,2,0,0.0,1.0
3,David Abbott,72.0,265.0,73.75,Switch,1984-09-26,0.0,0.68,0.0,0.0,...,0,Heavyweight,43.0,2.0,1,25.0,18,8,0.444444,14.0
4,David Abbott,72.0,265.0,73.75,Switch,1984-09-26,0.0,0.41,0.75,0.0,...,0,Heavyweight,900.0,2.0,1,25.0,18,8,0.444444,13.0
5,David Abbott,72.0,265.0,73.75,Switch,1984-09-26,1.0,0.52,1.0,0.0,...,0,Open Weight,63.0,2.0,1,25.0,18,8,0.444444,8.0


In [2517]:
df_fighter_estimates.drop(columns=['last_round_time','last_round'])
df_fighter_estimates.head(3).T #fighter_estimates is good to go

Unnamed: 0,0,1,2
fighter_name,Aalon Cruz,Aaron Brink,Aaron Phillips
KD,0,0,0
SIG_STR_pct,0.16,0,0.575
TD_pct,0,0,0
SUB_ATT,0,0,0.5
PASS,0,0,0.5
REV,0,0,0.5
last_round,1,1,3
last_round_time,85,55,300
SIG_STR._ATT,12,5,47


In [2518]:
df_fighter_history.head(3).T

Unnamed: 0,1,2,3
fighter_name,Danny Abbadi,Danny Abbadi,David Abbott
Height,71,71,72
Weight,155,155,265
Reach,72.6813,72.6813,73.75
Stance,Orthodox,Orthodox,Switch
DOB,1983-07-03 00:00:00,1983-07-03 00:00:00,1984-09-26 00:00:00
KD,0,0,0
SIG_STR_pct,0.38,0.33,0.68
TD_pct,0,0,0
SUB_ATT,0,0,0
