## Data description

This is a list of every UFC fight in the history of the organisation. Every row contains information about both fighters, fight details and the winner.

Each row compiles the stats of both fighters, who are identified by 'red' and 'blue' (for the red and blue corners). For the red fighter, for instance, the row holds the compiled average stats of all of that fighter's previous fights, excluding the current one. The stats include the damage done by the red fighter to opponents and the damage done by opponents to the fighter (marked by 'opp' in the column names) across all of those earlier fights; the current fight is excluded because, from the data's point of view, it has not occurred yet. The same information exists for the blue fighter. The target variable is 'Winner', which is the only column that tells you what happened.

#### Columns Values

* The R_ and B_ prefixes mark the red and blue corner fighter stats respectively
* Columns containing _opp_ hold the average damage done by the opponent to the fighter
* KD is number of knockdowns
* SIG_STR is no. of significant strikes 'landed of attempted'
* SIG_STR_pct is significant strikes percentage
* TOTAL_STR is total strikes 'landed of attempted'
* TD is number of takedowns
* TD_pct is takedown percentages
* SUB_ATT is no. of submission attempts
* PASS is the number of times the guard was passed
* REV is the number of Reversals landed
* HEAD is number of significant strikes to the head 'landed of attempted'
* BODY is number of significant strikes to the body 'landed of attempted'
* CLINCH is number of significant strikes in the clinch 'landed of attempted'
* GROUND is number of significant strikes on the ground 'landed of attempted'
* Win_by is method of win
* Last_round is last round of the fight (ex. if it was a KO in 1st, then this will be 1)
* Last_round_time is when the fight ended in the last round
* Format is the format of the fight (3 rounds, 5 rounds etc.)
* Referee is the name of the Ref
* Date is the date of the fight
* Location is the location in which the event took place
* Fight_type is which weight class and whether it's a title bout or not
* Winner is the winner of the fight
* Stance is the stance of the fighter (orthodox, southpaw, etc.)
* Height_cms is the height in centimeter
* Reach_cms is the reach of the fighter (arm span) in centimeter
* Weight_lbs is the weight of the fighter in pounds (lbs)
* Age is the age of the fighter
* Title_bout is a Boolean value indicating whether it is a title fight or not
* Weight_class is which weight class the fight is in (Bantamweight, heavyweight, Women's flyweight, etc.)
* No_of_rounds is the number of rounds the fight was scheduled for
* Current_lose_streak is the count of the fighter's current consecutive losses
* Current_win_streak is the count of the fighter's current consecutive wins
* Draw is the number of draws in the fighter's UFC career
* Wins is the number of wins in the fighter's UFC career
* Losses is the number of losses in the fighter's UFC career
* Total_rounds_fought is the average of total rounds fought by the fighter
* Total_time_fought(seconds) is the count of total time spent fighting in seconds
* Total_title_bouts is the total number of title bouts taken part in by the fighter
* Win_by_Decision_Majority is the number of wins by majority judges decision in the fighter's UFC career
* Win_by_Decision_Split is the number of wins by split judges decision in the fighter's UFC career
* Win_by_Decision_Unanimous is the number of wins by unanimous judges decision in the fighter's UFC career
* Win_by_KO/TKO is the number of wins by knockout in the fighter's UFC career
* Win_by_Submission is the number of wins by submission in the fighter's UFC career
* Win_by_TKO_Doctor_Stoppage is the number of wins by doctor stoppage in the fighter's UFC career
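
The naming convention above makes it easy to pick out groups of columns by prefix or substring. The following is a minimal sketch on a made-up toy frame (the values are invented; only the column naming pattern comes from the dataset):

```python
import pandas as pd

# Toy frame with a few columns named like the real dataset (values are made up).
toy = pd.DataFrame({
    "R_fighter": ["A"],
    "B_fighter": ["B"],
    "R_avg_KD": [0.5],
    "B_avg_KD": [0.0],
    "R_avg_opp_KD": [0.1],
    "B_avg_opp_KD": [0.3],
    "Winner": ["Red"],
})

# Columns describing the red-corner fighter.
red_cols = [c for c in toy.columns if c.startswith("R_")]

# Columns describing what opponents did to each fighter in earlier fights.
opp_cols = [c for c in toy.columns if "_opp_" in c]

print(red_cols)
print(opp_cols)
```

The same list comprehensions should work on `df1.columns` once the real file is loaded.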

#### New columns created

* Gender column
* Weight difference between the red and blue fighters
* Height difference between the red and blue fighters
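
As a rough illustration of how these three columns can be derived in one pass, here is a sketch on a toy frame. It assumes that every women's weight class name starts with "Women's", which holds for the classes listed in this notebook; the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy rows; the column names follow the dataset, the values are illustrative.
toy = pd.DataFrame({
    "weight_class": ["Bantamweight", "Women's Flyweight", "Heavyweight"],
    "R_Weight_lbs": [135.0, 125.0, 264.0],
    "B_Weight_lbs": [135.0, 125.0, 250.0],
    "R_Height_cms": [162.56, 165.10, 187.96],
    "B_Height_cms": [167.64, 167.64, 180.34],
})

# Gender is inferred from the weight class name (assumption stated above).
is_women = toy["weight_class"].str.startswith("Women's")
toy["gender"] = np.where(is_women, "F", "M")

# Positive values mean the red fighter is heavier / taller than the blue one.
toy["weight_diff_red-blue"] = toy["R_Weight_lbs"] - toy["B_Weight_lbs"]
toy["height_diff_red-blue"] = toy["R_Height_cms"] - toy["B_Height_cms"]

print(toy[["gender", "weight_diff_red-blue", "height_diff_red-blue"]])
```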

### The new datasets

* data_modeling.csv
* mens_fighters.csv
* women_fighters.csv
* fighters_description.csv
* final_dataset.csv

In [2]:
import pandas as pd
import numpy as np

### Loading the data

In [777]:
df1 = pd.read_csv('data/UFC.csv')

## Data Exploration

In [778]:
df1.head()

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,2.0,0.0,0.0,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0.0,...,0.0,2.0,0.0,5.0,Southpaw,165.1,167.64,125.0,32.0,31.0
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,3.0,6.0,1.0,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,1.0,0.0,0.0,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,2.0,0.0,0.0,3.0,Southpaw,187.96,190.5,264.0,32.0,26.0


In [779]:
list(df1.columns)

['R_fighter',
 'B_fighter',
 'Referee',
 'date',
 'location',
 'Winner',
 'title_bout',
 'weight_class',
 'no_of_rounds',
 'B_current_lose_streak',
 'B_current_win_streak',
 'B_draw',
 'B_avg_BODY_att',
 'B_avg_BODY_landed',
 'B_avg_CLINCH_att',
 'B_avg_CLINCH_landed',
 'B_avg_DISTANCE_att',
 'B_avg_DISTANCE_landed',
 'B_avg_GROUND_att',
 'B_avg_GROUND_landed',
 'B_avg_HEAD_att',
 'B_avg_HEAD_landed',
 'B_avg_KD',
 'B_avg_LEG_att',
 'B_avg_LEG_landed',
 'B_avg_PASS',
 'B_avg_REV',
 'B_avg_SIG_STR_att',
 'B_avg_SIG_STR_landed',
 'B_avg_SIG_STR_pct',
 'B_avg_SUB_ATT',
 'B_avg_TD_att',
 'B_avg_TD_landed',
 'B_avg_TD_pct',
 'B_avg_TOTAL_STR_att',
 'B_avg_TOTAL_STR_landed',
 'B_longest_win_streak',
 'B_losses',
 'B_avg_opp_BODY_att',
 'B_avg_opp_BODY_landed',
 'B_avg_opp_CLINCH_att',
 'B_avg_opp_CLINCH_landed',
 'B_avg_opp_DISTANCE_att',
 'B_avg_opp_DISTANCE_landed',
 'B_avg_opp_GROUND_att',
 'B_avg_opp_GROUND_landed',
 'B_avg_opp_HEAD_att',
 'B_avg_opp_HEAD_landed',
 'B_avg_opp_KD',
 'B_av

In [780]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5144 entries, 0 to 5143
Columns: 145 entries, R_fighter to R_age
dtypes: bool(1), float64(134), int64(1), object(9)
memory usage: 5.7+ MB


In [781]:
df1.shape

(5144, 145)

In [782]:
df1.isnull().sum().sort_values(ascending = True).head(60)

R_fighter                         0
R_total_title_bouts               0
R_total_rounds_fought             0
R_losses                          0
R_longest_win_streak              0
R_draw                            0
R_current_win_streak              0
R_current_lose_streak             0
B_total_title_bouts               0
B_wins                            0
B_win_by_Submission               0
B_win_by_KO/TKO                   0
B_win_by_Decision_Unanimous       0
B_win_by_Decision_Split           0
B_longest_win_streak              0
B_losses                          0
B_win_by_Decision_Majority        0
B_win_by_TKO_Doctor_Stoppage      0
R_win_by_Decision_Split           0
R_win_by_Decision_Majority        0
R_win_by_KO/TKO                   0
B_fighter                         0
date                              0
location                          0
Winner                            0
title_bout                        0
weight_class                      0
no_of_rounds                

## Dealing with the NaN values

In [783]:
(df1.isnull().sum() > 1).value_counts()

True     109
False     36
dtype: int64
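
A per-column summary can make it easier to decide which columns to fill and which to drop. The sketch below uses a toy frame with invented values; `summary` is a hypothetical name, and the threshold `> 0` keeps every column with at least one missing value (the cell above uses `> 1`).

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (invented data).
toy = pd.DataFrame({
    "R_age": [30.0, np.nan, 28.0, 35.0],
    "B_age": [np.nan, np.nan, 31.0, 29.0],
    "Winner": ["Red", "Blue", "Red", "Red"],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```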

In [784]:
df1['Referee'] = df1['Referee'].fillna('Noinfo')  # mark fights with no recorded referee

### We are going to fill the NaN values of the height columns with the mean height per weight class

In [785]:
df1[['weight_class', 'R_Height_cms', "B_Height_cms"]].sort_values(by="B_Height_cms" , ascending=False).sort_values(by="B_Height_cms", ascending=True)

Unnamed: 0,weight_class,R_Height_cms,B_Height_cms
1212,Women's Strawweight,157.48,152.40
1027,Women's Strawweight,165.10,152.40
1365,Women's Strawweight,170.18,152.40
2525,Women's Bantamweight,170.18,154.94
2807,Women's Bantamweight,167.64,154.94
...,...,...,...
4992,Middleweight,187.96,
5020,Heavyweight,175.26,
5029,Lightweight,,
5116,Open Weight,185.42,


In [786]:
# Use a single .loc call per assignment to avoid chained indexing (SettingWithCopyWarning)
df1.loc[103, 'R_Height_cms'] = 187.13
df1.loc[3607, 'R_Height_cms'] = 187.13
df1.loc[5029, 'R_Height_cms'] = 176.48
df1.loc[5132, 'R_Height_cms'] = 186.42


In [787]:
# Same fill for the blue corner, written as df1.loc[row, column] to avoid chained indexing
df1.loc[4992, 'B_Height_cms'] = 184.63
df1.loc[5020, 'B_Height_cms'] = 189.77
df1.loc[5029, 'B_Height_cms'] = 176.48
df1.loc[5116, 'B_Height_cms'] = 186.42
df1.loc[5134, 'B_Height_cms'] = 186.42
df1.loc[3402, 'B_Height_cms'] = 176.48
df1.loc[3518, 'B_Height_cms'] = 176.48
df1.loc[4978, 'B_Height_cms'] = 176.48
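
The row-by-row assignments above hard-code the fill values. A possible alternative, shown here only as a sketch on toy data, is to let pandas compute the per-class mean with `groupby(...).transform("mean")` and fill every gap at once. This assumes the hard-coded numbers were meant to be per-column means within each weight class; the notebook does not show how they were computed.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both height columns (invented values).
toy = pd.DataFrame({
    "weight_class": ["Lightweight", "Lightweight", "Lightweight",
                     "Heavyweight", "Heavyweight"],
    "R_Height_cms": [175.0, 178.0, np.nan, 190.0, np.nan],
    "B_Height_cms": [np.nan, 176.0, 180.0, 188.0, 192.0],
})

for col in ["R_Height_cms", "B_Height_cms"]:
    # Mean of the column within each weight class, aligned to the original rows.
    class_mean = toy.groupby("weight_class")[col].transform("mean")
    toy[col] = toy[col].fillna(class_mean)

print(toy)
```

The same loop could be reused for `R_Weight_lbs` and `B_Weight_lbs`.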

### We are going to fill the NaN values of the weight columns with the mean weight per weight class

In [795]:
df1[['weight_class', 'R_Weight_lbs', "B_Weight_lbs"]].sort_values(by="R_Weight_lbs" , ascending=False).sort_values(by="B_Weight_lbs", ascending=True)

Unnamed: 0,weight_class,R_Weight_lbs,B_Weight_lbs
2148,Women's Strawweight,115.0,115.0
1999,Women's Strawweight,115.0,115.0
176,Women's Strawweight,115.0,115.0
1859,Women's Strawweight,115.0,115.0
1930,Women's Strawweight,115.0,115.0
...,...,...,...
5061,Open Weight,190.0,350.0
5097,Open Weight,265.0,400.0
5073,Open Weight,219.0,410.0
5143,Open Weight,216.0,430.0


In [790]:
# Use df1.loc[row, column] instead of chained indexing
df1.loc[5029, 'R_Weight_lbs'] = 155
df1.loc[5132, 'R_Weight_lbs'] = 224
df1.loc[103, 'R_Weight_lbs'] = 202

In [789]:
# Use df1.loc[row, column] instead of chained indexing
df1.loc[4992, 'B_Weight_lbs'] = 184
df1.loc[5116, 'B_Weight_lbs'] = 224
df1.loc[4978, 'B_Weight_lbs'] = 155
df1.loc[5134, 'B_Weight_lbs'] = 224
df1.loc[5020, 'B_Weight_lbs'] = 245
df1.loc[5029, 'B_Weight_lbs'] = 155

### We fill the missing values in the R_Stance and B_Stance columns with the most common stance

In [792]:
df1['R_Stance'] = df1['R_Stance'].fillna('Orthodox')  # fill missing R_Stance values with the most common stance

In [791]:
df1['B_Stance'] = df1['B_Stance'].fillna('Orthodox')  # fill missing B_Stance values with the most common stance

### We fill the missing values in the R_age and B_age columns with the mean age of the column

In [793]:
df1['B_age'] = df1['B_age'].fillna(29)  # fill missing ages with 29, the mean age of the column

In [794]:
df1['R_age'] = df1['R_age'].fillna(29)  # same fill value for the red corner
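
The fill values `'Orthodox'` and `29` are written as literals above. As a hedged sketch, they could instead be computed from the data so the cell stays correct if the file changes. The toy frame below is invented, and rounding the mean to a whole year is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing stance and one missing age (invented values).
toy = pd.DataFrame({
    "R_Stance": ["Orthodox", "Southpaw", None, "Orthodox"],
    "R_age": [27.0, 31.0, np.nan, 30.0],
})

# Most frequent stance; mode() ignores missing values.
most_common_stance = toy["R_Stance"].mode().iloc[0]
toy["R_Stance"] = toy["R_Stance"].fillna(most_common_stance)

# Mean age rounded to a whole year (assumed rounding rule).
mean_age = round(toy["R_age"].mean())
toy["R_age"] = toy["R_age"].fillna(mean_age)

print(toy)
```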

## Changing the columns type

In [692]:
category = ['R_fighter', 'B_fighter', 'Referee', 'Winner','location', 'weight_class', 'B_Stance',
               'R_Stance']

for x in category:
    df1[x] = df1[x].astype('category')

In [693]:
cat = ['R_age', 'B_age',"R_wins","B_wins"]

for x in cat:
    df1[x] = df1[x].astype('int')

In [694]:
df1['date'] = pd.to_datetime(df1['date'])

In [537]:
df1.dtypes

R_fighter             category
B_fighter             category
Referee               category
date            datetime64[ns]
location              category
                     ...      
R_Height_cms           float64
R_Reach_cms            float64
R_Weight_lbs           float64
B_age                    int32
R_age                    int32
Length: 145, dtype: object

# We will drop the remaining NaN values in a new DataFrame for modeling purposes

In [799]:
data_modeling = df1.copy()  # independent copy, so df1 keeps its remaining NaN values

In [800]:
data_modeling.dropna(axis=0, how='any', inplace=True)

In [801]:
data_modeling.isnull().sum()

R_fighter       0
B_fighter       0
Referee         0
date            0
location        0
               ..
R_Height_cms    0
R_Reach_cms     0
R_Weight_lbs    0
B_age           0
R_age           0
Length: 145, dtype: int64

In [802]:
data_modeling.to_csv('data/data_modeling.csv',index=False)

Openweight is an unofficial weight class in combat sports and professional wrestling. It refers to bouts where there is no weight limit and fighters with a dramatic difference in size can compete against each other. It is different from catch weight, where competitors agree to weigh in at a certain amount without an official weight class.


### We will create the new CSV that contains the men's fighters

In [807]:
mens_fighters1 = df1.copy()  # independent copy; filling NaN below leaves df1 untouched

In [809]:
mens_fighters1.fillna(0,inplace=True)

In [812]:
women_classes = ["Women's Strawweight", "Women's Bantamweight", "Women's Flyweight", "Women's Featherweight"]
mens_fighters = mens_fighters1[~mens_fighters1.weight_class.isin(women_classes)].copy()  # keep only men's bouts

In [813]:
mens_fighters.shape

(4830, 145)

In [814]:
mens_fighters["gender"] = "M"

In [815]:
mens_fighters["weight_diff_red-blue"] = mens_fighters["R_Weight_lbs"] - mens_fighters["B_Weight_lbs"]  # positive when Red is heavier than Blue

In [816]:
mens_fighters["height_diff_red-blue"] = mens_fighters["R_Height_cms"] - mens_fighters["B_Height_cms"]  # positive when Red is taller than Blue

In [817]:
mens_fighters.shape

(4830, 148)

In [828]:
mens_fighters.head()

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age,gender,weight_diff_red-blue,height_diff_red-blue
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0,M,0.0,-5.08
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0,M,0.0,-5.08
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0,M,0.0,-7.62
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,3.0,Southpaw,187.96,190.5,264.0,32.0,26.0,M,14.0,7.62
6,Aljamain Sterling,Pedro Munhoz,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,False,Bantamweight,3,0.0,...,9.0,Orthodox,170.18,180.34,135.0,32.0,29.0,M,0.0,2.54


In [419]:
mens_fighters.to_csv('data/mens_fighters.csv',index=False)

### We will create the new CSV that contains the women fighters

In [819]:
women_fighters1 = df1.copy()  # independent copy; filling NaN below leaves df1 untouched

In [820]:
women_fighters1.fillna(0,inplace=True)

In [821]:
women_classes = ["Women's Strawweight", "Women's Bantamweight", "Women's Flyweight", "Women's Featherweight"]
women_fighters = women_fighters1[women_fighters1.weight_class.isin(women_classes)].copy()  # keep only women's bouts

In [823]:
women_fighters["gender"] = "F"

In [824]:
women_fighters["weight_diff_red-blue"] = women_fighters["R_Weight_lbs"] - women_fighters["B_Weight_lbs"]  # positive when Red is heavier than Blue

In [825]:
women_fighters["height_diff_red-blue"] = women_fighters["R_Height_cms"] - women_fighters["B_Height_cms"]  # positive when Red is taller than Blue

In [826]:
women_fighters.shape

(314, 148)

In [827]:
women_fighters.head()

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age,gender,weight_diff_red-blue,height_diff_red-blue
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0.0,...,5.0,Southpaw,165.1,167.64,125.0,32.0,31.0,F,0.0,-2.54
5,Tatiana Suarez,Nina Ansaroff,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,False,Women's Strawweight,3,0.0,...,4.0,Orthodox,165.1,167.64,115.0,33.0,28.0,F,0.0,0.0
7,Karolina Kowalkiewicz,Alexa Grasso,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Women's Strawweight,3,1.0,...,5.0,Orthodox,160.02,162.56,115.0,25.0,33.0,F,0.0,-5.08
9,Yan Xiaonan,Angela Hill,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,False,Women's Strawweight,3,0.0,...,3.0,Orthodox,165.1,160.02,115.0,34.0,29.0,F,0.0,5.08
12,Katlyn Chookagian,Joanne Calderwood,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Women's Flyweight,3,0.0,...,4.0,Orthodox,175.26,172.72,125.0,33.0,30.0,F,0.0,7.62


In [420]:
women_fighters.to_csv('data/women_fighters.csv',index=False)

### We will create the new CSV that contains the final dataset for visualization purposes

In [829]:
frames=[ mens_fighters, women_fighters]

In [830]:
final_dataset= pd.concat(frames)

In [831]:
pd.DataFrame(final_dataset)

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age,gender,weight_diff_red-blue,height_diff_red-blue
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0,M,0.0,-5.08
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0,M,0.0,-5.08
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0,M,0.0,-7.62
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,3.0,Southpaw,187.96,190.50,264.0,32.0,26.0,M,14.0,7.62
6,Aljamain Sterling,Pedro Munhoz,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,False,Bantamweight,3,0.0,...,9.0,Orthodox,170.18,180.34,135.0,32.0,29.0,M,0.0,2.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2813,Julie Kedzie,Germaine de Randamie,Randy Corley,2013-07-27,"Seattle, Washington, USA",Blue,False,Women's Bantamweight,3,0.0,...,0.0,Orthodox,165.10,162.56,135.0,29.0,32.0,F,0.0,-10.16
2830,Alexis Davis,Rosi Sexton,Herb Dean,2013-06-15,"Winnipeg, Manitoba, Canada",Red,False,Women's Bantamweight,3,0.0,...,0.0,Orthodox,167.64,172.72,125.0,35.0,28.0,F,-10.0,7.62
2882,Sara McMann,Sheila Gaff,Gasper Oliver,2013-04-27,"Newark, New Jersey, USA",Red,False,Women's Bantamweight,3,0.0,...,0.0,Orthodox,167.64,167.64,135.0,23.0,32.0,F,0.0,2.54
2900,Miesha Tate,Cat Zingano,Kim Winslow,2013-04-13,"Las Vegas, Nevada, USA",Blue,False,Women's Bantamweight,3,0.0,...,0.0,Orthodox,167.64,165.10,135.0,30.0,26.0,F,-10.0,0.00


In [832]:
final_dataset.shape

(5144, 148)

In [833]:
final_dataset.gender.value_counts()

M    4830
F     314
Name: gender, dtype: int64
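
Because `pd.concat` stacks the men's rows first and the women's rows after them, the combined frame is no longer in the original row order, although it keeps the original index labels. The sketch below, on invented toy frames, shows a quick sanity check on the row count and how `sort_index()` could restore the original order if that matters for later steps.

```python
import pandas as pd

# Two toy subsets that keep their original row labels (invented data).
men = pd.DataFrame({"R_fighter": ["A", "C"], "gender": ["M", "M"]}, index=[0, 2])
women = pd.DataFrame({"R_fighter": ["B"], "gender": ["F"]}, index=[1])

combined = pd.concat([men, women])

# No rows lost or duplicated by the split-and-recombine step.
assert len(combined) == len(men) + len(women)
assert combined.index.is_unique

# Restore the original row order using the preserved index labels.
restored = combined.sort_index()
print(restored)
```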

In [421]:
final_dataset.to_csv('data/final_dataset.csv',index=False)

### We will create the new CSV that contains the individual information about each fighter

In [834]:
fighters_description=pd.DataFrame(final_dataset[["R_fighter",'R_Weight_lbs','R_Height_cms','R_Stance']])

In [835]:
fighters_description= fighters_description.rename(columns={"R_fighter":"fighter_name","R_Weight_lbs":"weight_lbs","R_Height_cms":"height_cms","R_Stance":"stance"})

In [836]:
fighters_description.drop_duplicates(inplace=True)

In [837]:
fighters_description.head()

Unnamed: 0,fighter_name,weight_lbs,height_cms,stance
0,Henry Cejudo,135.0,162.56,Orthodox
2,Tony Ferguson,155.0,180.34,Orthodox
3,Jimmie Rivera,135.0,162.56,Orthodox
4,Tai Tuivasa,264.0,187.96,Southpaw
6,Aljamain Sterling,135.0,170.18,Orthodox


In [838]:
fighters_description.shape

(1334, 4)

In [422]:
fighters_description.to_csv('data/fighters_description.csv',index=False)
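
Note that the table above is built from the red-corner columns only, so a fighter who only ever appeared in the blue corner would not be listed. If that matters, one possible variation (a sketch on invented toy data, not the method used above) is to stack both corners before dropping duplicates. `corner_view` and `all_fighters` are hypothetical names.

```python
import pandas as pd

# Toy fights: fighter "B" only ever appears in the blue corner (invented data).
toy = pd.DataFrame({
    "R_fighter": ["A", "C"],
    "R_Weight_lbs": [135.0, 155.0],
    "R_Height_cms": [162.56, 180.34],
    "R_Stance": ["Orthodox", "Orthodox"],
    "B_fighter": ["B", "A"],
    "B_Weight_lbs": [135.0, 135.0],
    "B_Height_cms": [167.64, 162.56],
    "B_Stance": ["Southpaw", "Orthodox"],
})

new_names = ["fighter_name", "weight_lbs", "height_cms", "stance"]


def corner_view(df, prefix):
    """Return one corner's descriptive columns under corner-neutral names."""
    cols = [f"{prefix}_fighter", f"{prefix}_Weight_lbs",
            f"{prefix}_Height_cms", f"{prefix}_Stance"]
    return df[cols].rename(columns=dict(zip(cols, new_names)))


# Stack red and blue views, then keep one row per distinct description.
all_fighters = (
    pd.concat([corner_view(toy, "R"), corner_view(toy, "B")], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(all_fighters)
```

A fighter whose recorded weight or height differs between fights would still appear more than once, since duplicates are detected across all four columns.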