# UFC Fight Analysis


## Todo


1. Change landed and attempted data into percentages.
2. Rewrite models to account for the unbalanced data.
  1. Most fights are won by decision, so that should be accounted for


## Introduction

## Background

## Data

This data was forked from the [UFC Predictor and Notes](https://www.kaggle.com/calmdownkarm/ufc-predictor-and-notes) Kaggle kernel. The authors scraped the data using Beautiful Soup and a JavaScript API that pulled fighter data from ufc.com. The scripts pulled data from JSON objects and wrote them to a CSV. All data is from 2014 onwards and consists of fighter statistics merged with fight outcomes. They were unable to get data with the same level of detail for prior years, so all fighter records were reset to zero at the beginning of 2014 and built from there. It was an interesting project and I wanted to see if I could push it further.

In [52]:
# Imports and Helper Functions
# data Analysis
import pandas as pd
import numpy as np
import random as rng
from pprint import pprint

# Web Scraping
import json
import codecs
import csv
import datetime
from bs4 import BeautifulSoup
import requests
import cStringIO
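# pprint is already imported above ("from pprint import pprint"); importing the module here would shadow that function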

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#SciKit Learn Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier


from sklearn.metrics import accuracy_score, classification_report,confusion_matrix

from subprocess import check_output
print check_output(["ls", "data"]).decode("utf8")
data = pd.read_csv("data/data.csv")
# Any results you write to the current directory are saved as output.

# Noteboook Functionality
from IPython.core.interactiveshell import InteractiveShell # All statements are printed to output
InteractiveShell.ast_node_interactivity = "all"

CalledProcessError: Command '['ls', 'data']' returned non-zero exit status 1

In [None]:
data.info()

## Variables

There are 895 variables, so it might be worth looking through them to get a better sense of the data. Right now, I'm only aware of a handful of variables and have no idea how many missing values are in the dataset. At least listing the variable names will allow me to categorize them and possibly engineer new features.

In [None]:
var_list = data.columns.values.tolist()
# Print the column names in chunks of 100
for start in range(0, len(var_list), 100):
    pprint(var_list[start:start + 100])

## Description
From this, we can see that we have a total of 879 columns and one dependent variable. 
The columns break down into 4 integer columns (streaks, previous wins, etc.), 5 object columns (names, winner; basically strings and arrays) and 870 float columns. 
This does not give us a complete picture of our data, however, so we're using a few other pandas functions to get a better glimpse. 
We also had to engineer a few features that weren't available in the JSONs, as explained in the data description earlier in the project. 

In [None]:
data.describe()

In [None]:
data.describe(include=['O'])

In [None]:
data.describe(include=['int64'])  # np.int is deprecated in newer NumPy releases

In [None]:
# Use data directly here; loc_data is not created until the cleaning section below
count = data['B_Location'].dropna().str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

In [None]:
data['winby'].value_counts()

In [None]:
data['winner'].value_counts()

### Some Notes to observe
1. The red side seems to win slightly more than blue (867/1477 = 58.7%).
2. Donald Cerrone fights on the red side more than any other fighter, with 11 fights.
3. Tim Means fights on the blue side more than any other fighter, with 8.
4. Fighters making their debut are the most common case. This statistic could be skewed, however, by the fact that our dataset assumes every fighter debuted in 2013.
5. Most fights are won by decision, and 2015 had the most fights.
6. The most common hometown and training location for fighters is Rio de Janeiro in Brazil.

We also notice that 3 fighters don't have an age and 1 doesn't have a height. 

In [None]:
data.head()

In [None]:
data.tail()

## Data Cleaning

The first step in data cleaning is to remove obvious outliers and columns that will not contribute to the model. One starting point is narrowing down the fights to just wins and losses, excluding no contests and draws. No contests are nearly impossible to predict as are draws, so it doesn't make sense to account for them. Here's a list of ideas so far:

1. Draws or no contest ['winner']
2. Blue and red ID
3. Blue and red Name
4. Date
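
As a rough illustration of the clean-up list above, the sketch below filters a toy table down to decided fights and drops the identifier columns. The column names follow the notebook, but the `winner` labels other than `'no contest'` (`'red'`, `'blue'`, `'draw'`) are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the fight table; the 'winner' labels 'red', 'blue' and
# 'draw' are assumptions for illustration.
fights = pd.DataFrame({
    'B_ID': [1, 2, 3, 4],
    'R_ID': [5, 6, 7, 8],
    'B_Name': ['A', 'B', 'C', 'D'],
    'R_Name': ['E', 'F', 'G', 'H'],
    'Date': ['01/01/2015'] * 4,
    'winner': ['red', 'blue', 'draw', 'no contest'],
    'B_Age': [28.0, 31.0, 25.0, 33.0],
})

# Keep only fights with a clear winner
decided = fights[fights['winner'].isin(['red', 'blue'])]

# Drop identifier columns that should not carry predictive signal
id_cols = ['B_ID', 'R_ID', 'B_Name', 'R_Name', 'Date']
model_input = decided.drop(id_cols, axis=1)

print(model_input)
```

Filtering with `isin` on the outcomes to keep, rather than excluding known bad labels, means an unexpected label cannot slip through.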

The data is currently not in "tidy" format, so I may consider reshaping it. One obvious hint that this is the case is that every row has a round_4 and round_5 column, even though not all fights go to the last round. I'll see what I can do in terms of reshaping the data and look into whether it makes sense to do so. I wonder if there's a dplyr and tidyr package for Python? The R equivalents are pretty robust, so I'll probably start there. Here's an ongoing list of ideas that may be worth pursuing:

1. Rearrange data to get rid of empty Round4 and Round5 data
2. Separate city and country (Brazil vs. Rio de Janeiro, Brazil and USA vs. Stockton, California, USA may add more predictive power)
3. Change landed and attempted data into percentages. This may help make better comparisons across fighters making debuts vs fighters with established records.
4. Add a column for submission wins, KO wins, etc
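
Item 3 in the list above could look something like the following. This is only a sketch: `strikes_landed` and `strikes_attempted` are hypothetical column names, not the ones in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical landed/attempted columns; the real column names differ.
stats = pd.DataFrame({
    'strikes_landed':    [30, 0, 12],
    'strikes_attempted': [60, 0, 48],
})

# Accuracy = landed / attempted. A fighter with no attempts (for example a
# debut with no recorded history) gets NaN instead of a division error.
attempted = stats['strikes_attempted'].replace(0, np.nan)
stats['strikes_pct'] = stats['strikes_landed'] / attempted

print(stats)
```

Leaving the no-attempt case as NaN keeps "no history" distinct from "0% accuracy"; how to fill it afterwards is a separate modeling decision.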

I'm debating whether to turn the database into a database of fighters instead of a database of fights. One reason is that you have the accumulated statistics on each fighter's history going into each event. If you have Jon Jones matching up against DC, it may be useful to ask, "What is it about Jon Jones' record that makes him a favorite?" This may require a lot of rearranging, so it may be worth it to think this through a bit first.

If I were to go this route, I would reset each fighter's stats and only add stats that came prior to each fight. For example, if Frankie Edgar fought in 2013 against Anderson Silva and then fought again against Matt Hughes in 2014, I'd want their fighter record inputs to be different. I think this may be a more comprehensive way of looking at the data and would align more closely to real-world applications. Here's a rough sketch of what the process might look like:

1. Sort the dataframe by date
2. Add each fighter's stats cumulatively based on prior fights
  1. Figure out a way to accomplish this even though red is on one side and blue is on the other
  2. Maybe put fighters on just red or blue? Might be impossible..
  3. Maybe just account for this in the code? If B_Name == 'X' do this or if R_Name == 'X' do that
3. Add columns for kicks taken, punches taken, etc?
  1. One disadvantage is that this would mean nearly doubling the number of variables in a dataset that is already massive. This will hurt some machine learning models. Actually, this could lead to more variables than observations which would be no bueno, although it would still be possible to pare it down afterwards. Let's omit this for now and come back to it later if necessary.
4. Consider consolidating round data. Why do you need 5 rounds of data for each fighter? Consider building a granular model and an aggregated model.
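
Steps 1 and 2 above might be sketched as follows. The approach stacks the red and blue corners into one long table so that the corner no longer matters, then takes a running total that excludes the current fight. All column names and values here are made up for the example.

```python
import pandas as pd

# Toy fight table, one row per fight. Column names are illustrative only.
fights = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-01', '2014-06-01', '2015-01-01']),
    'R_Name': ['X', 'Y', 'X'],
    'B_Name': ['Y', 'X', 'Z'],
    'R_strikes': [50, 20, 70],
    'B_strikes': [30, 40, 10],
})

# Stack both corners into one table with one row per fighter per fight,
# so it no longer matters which corner a fighter was in.
red = fights[['Date', 'R_Name', 'R_strikes']].copy()
red.columns = ['Date', 'fighter', 'strikes']
blue = fights[['Date', 'B_Name', 'B_strikes']].copy()
blue.columns = ['Date', 'fighter', 'strikes']
per_fighter = pd.concat([red, blue], ignore_index=True)
per_fighter = per_fighter.sort_values('Date', kind='mergesort')

# Running total from earlier fights only: cumulative sum minus the current
# fight, so nothing from the fight being predicted leaks in.
per_fighter['prior_strikes'] = (
    per_fighter.groupby('fighter')['strikes'].cumsum() - per_fighter['strikes']
)

print(per_fighter.sort_values(['fighter', 'Date']))
```

The prior totals can then be merged back onto the fight table by fighter and date, once for the red corner and once for the blue corner.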

The head() and tail() functions give us a snapshot of the dataset's values. Not all rows and columns are represented, but it gives just enough to get a sense of how the data is organized. There are many NaN values and they need to be changed before passing the data on to the classifiers, but I want to be careful before making any changes. For instance, the winby column is extremely unbalanced and it may make sense to get rid of some of the outliers. There are a total of 16 draws and no contests out of 1477 fights. It makes no sense to include these since they are anomalies, so I will drop them.

### Extracting Country from B_Location and R_Location

I want to extract each fighter's country. Currently there are 438 unique locations in the dataset. This is too many categories to get a sense of where fighters are training. There may also be some correlation between a fighter's location and his or her record. The first thing I did was exclude any fighters with null values for their current location. Thankfully, there were only 8 such cases. Then I cleaned up some of the misspellings and duplicate entries (USA and United States were consolidated into just USA). Finally, I replaced all [City Country] values with [Country] values.

In [None]:
data['B_Location'].value_counts()[:10]

In [None]:
data = data[data['winby'].notnull()]
loc_data = data[(data['B_Location'].notnull()) & data['R_Location'].notnull()].copy()  # copy so later edits don't act on a view of data
data = data[data['winner'] != 'no contest']
#data = data.reset_index()

In [None]:
len(loc_data)
len(data)

In [None]:
locations = ['B_Location', 'R_Location']
countries = ['Japan','Singapore']
for location in locations:
    for country in countries:
        loc_data.loc[(loc_data[location] == country), location] = 'Unknown '+country

In [None]:
locations = ['B_Location', 'R_Location']
c_dict = {
    'United': 'USA',
    'Brasil': 'Brazil',
    'Englad': 'England',
    'Czech': 'CzechRepublic',
    'Moldova': 'Moldova',
}

post_c_dict = {
    'Africa': 'South Africa',
    'CzechRepublic': 'Czech Republic',
    'PAN': 'Panama',
    'Zealand': 'New Zealand'
}
countries = ['']
for location in locations:
    for k,v in c_dict.items():
        loc_data.loc[loc_data[location].str.contains(k), location] = v
new_cols = ['R_Country_Location', 'B_Country_Location']
for new_col, location in zip(new_cols, locations):
    loc_data[new_col] = loc_data[location].str.split().str[-1]
for col in new_cols:
    for k,v in post_c_dict.items():
        loc_data.loc[loc_data[col].str.contains(k), col] = v

for location, col in zip(locations, new_cols):
    loc_data[location] = loc_data[col]
    del loc_data[col]
sorted(set(loc_data['B_Location'].values.tolist()))

In [None]:
loc_data['B_Location'].value_counts()  # B_Country_Location was folded back into B_Location above

In [None]:
data.fillna(value=0,inplace=True)

In [None]:
data.tail()

In [None]:
dropdata = data.drop(['B_ID','B_Name','R_ID','R_Name','Date'],axis=1)
dropdata.rename(columns={'BPrev':'B__Prev',
                         'RPrev':'R__Prev',
                         'B_Age':'B__Age',
                         'B_Height':'B__Height',
                         'B_Weight':'B__Weight',
                         'R_Age':'R__Age',
                         'R_Height':'R__Height',
                         'R_Weight':'R__Weight',
                         'BStreak':'B__Streak',
                         'RStreak': 'R__Streak'},inplace=True)
dropdata.describe()

In [None]:
dropdata.describe(include=['O'])

In [None]:
dropdata.describe(include=['int64'])  # np.int is deprecated in newer NumPy releases

Next we need to convert our object-type columns into categorical columns. Categorical columns have a defined ordering of categories, and they can then be replaced with integer codes for the models.

In [None]:
objecttypes = list(dropdata.select_dtypes(include=['O']).columns)
for col in objecttypes:
    dropdata[col] = dropdata[col].astype('category')

In [None]:
cat_columns = dropdata.select_dtypes(['category']).columns
dropdata[cat_columns] = dropdata[cat_columns].apply(lambda x: x.cat.codes)
dropdata.info()
dropdata.tail()

## Data Correlation
While it would otherwise be normal practice to draw a heatmap or correlation matrix of the data to look for linear relationships, that is impractical here because of the sheer number of features. Instead we examine the n largest correlations with our dependent variable (winner). As the plot below shows, the relationships (if they exist) are highly non-linear, which suggests that the dataset needs to be restructured. 

Further, the Round 4 statistics for the red fighter appear to be the most strongly correlated with the outcome, which suggests that a split or delta version of the dataset should produce better results. 

In [None]:
# Basic Correlation Matrix
# corrmat = data.corr()
# f, ax = plt.subplots(figsize=(12, 9))
# sns.heatmap(corrmat, vmax=.8, square=True);

In [None]:
# Subset Correlation Matrix
k = 10 #number of variables for heatmap
corrmat = dropdata.corr()
cols = corrmat.nlargest(k, 'winner')['winner'].index
cm = np.corrcoef(dropdata[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
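
One way to read the "delta" idea mentioned above is to replace each red/blue pair with a single red-minus-blue difference. The sketch below assumes the notebook's `B__` / `R__` prefix convention; the stat names and values are invented for the example.

```python
import pandas as pd

# Toy frame using the 'B__' / 'R__' prefix convention; the stat names
# themselves are made up for the example.
df = pd.DataFrame({
    'B__Age': [28, 31], 'R__Age': [30, 27],
    'B__Height': [180, 175], 'R__Height': [178, 183],
    'winner': [1, 0],
})

# Pair columns by stat name rather than by position, so a missing or
# reordered column cannot silently mismatch red and blue.
stats = sorted(
    c[3:] for c in df.columns
    if c.startswith('B__') and 'R__' + c[3:] in df.columns
)

# One red-minus-blue column per shared stat
delta = pd.DataFrame({s + '_delta': df['R__' + s] - df['B__' + s] for s in stats})
delta['winner'] = df['winner']
print(delta)
```

Unlike a ratio, a difference needs no special handling when one side is zero.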

## Modeling
We're evaluating the following models:

1. Perceptron
2. Random Forests
3. Decision Trees Classifier
4. SGD Classifier
5. Linear SVC
6. Gaussian NB
7. KNN

I set each model's random_state where applicable and set class_weight to balanced for the decision tree, SGD classifier, and linear SVC to account for the unbalanced data. There's a lot that can be done in terms of tuning the hyperparameters. I may have to come back to this later to test different configurations.
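
To make the effect of `class_weight="balanced"` concrete, here is a minimal sketch of the formula scikit-learn documents for that mode, `n_samples / (n_classes * count_of_class)`, written in plain Python. The helper name `balanced_class_weights` and the toy labels are mine, not part of the notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / float(n_classes * count)
            for cls, count in counts.items()}

# Toy labels: 6 red wins (1) and 4 blue wins (0)
y = [1] * 6 + [0] * 4
weights = balanced_class_weights(y)
print(weights)
```

The minority class gets the larger weight, so mistakes on it cost more during training.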

In [None]:
# help(sklearn.ensemble.RandomForestClassifier())
# help(sklearn.naive_bayes.GaussianNB)
help(LinearSVC)  # imported above from sklearn.svm; the sklearn package itself is not imported
# import sklearn
# help(sklearn)

In [None]:
# We Store prediction of each model in our dict
# Helper Functions for our models. 

def percep(X_train,Y_train,X_test,Y_test,Models):
    perceptron = Perceptron(max_iter = 1000, tol = 0.001, random_state=42)
    perceptron.fit(X_train, Y_train)
    Y_pred = perceptron.predict(X_test)
    Models['Perceptron'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]
    return

def ranfor(X_train,Y_train,X_test,Y_test,Models):
    randomfor = RandomForestClassifier(max_features="sqrt",
                                       n_estimators = 700,
                                       max_depth = None,
                                       n_jobs=-1,
                                       random_state=42
                                      )
    randomfor.fit(X_train,Y_train)
    Y_pred = randomfor.predict(X_test)
    Models['Random Forests'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]
    return

def dec_tree(X_train,Y_train,X_test,Y_test,Models):
    decision_tree = DecisionTreeClassifier(class_weight="balanced",random_state=42)
    decision_tree.fit(X_train, Y_train)
    Y_pred = decision_tree.predict(X_test)
    Models['Decision Tree'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]
    return

def SGDClass(X_train,Y_train,X_test,Y_test,Models):
    sgd = SGDClassifier(max_iter = 1000, tol = 0.001, class_weight = "balanced", random_state=42)
    sgd.fit(X_train, Y_train)
    Y_pred = sgd.predict(X_test)
    Models['SGD Classifier'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]
    return

def linSVC(X_train,Y_train,X_test,Y_test,Models):
    linear_svc = LinearSVC(class_weight="balanced", random_state=42)
    linear_svc.fit(X_train, Y_train)
    Y_pred = linear_svc.predict(X_test)
    Models['SVM'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]
    return

def bayes(X_train,Y_train,X_test,Y_test,Models):
    gaussian = GaussianNB()
    gaussian.fit(X_train, Y_train)
    Y_pred = gaussian.predict(X_test)
    Models['Bayes'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]
    return

def Nearest(X_train,Y_train,X_test,Y_test,Models):
    knn = KNeighborsClassifier(n_neighbors = 3)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    Models['KNN'] = [accuracy_score(Y_test,Y_pred),confusion_matrix(Y_test,Y_pred)]

def run_all_and_Plot(df):
    Models = dict()
    from sklearn.model_selection import train_test_split
    X_all = df.drop(['winner'], axis=1)
    y_all = df['winner']
    X_train, X_test, Y_train, Y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
    percep(X_train,Y_train,X_test,Y_test,Models)
    ranfor(X_train,Y_train,X_test,Y_test,Models)
    dec_tree(X_train,Y_train,X_test,Y_test,Models)
    SGDClass(X_train,Y_train,X_test,Y_test,Models)
    linSVC(X_train,Y_train,X_test,Y_test,Models)
    bayes(X_train,Y_train,X_test,Y_test,Models)
    Nearest(X_train,Y_train,X_test,Y_test,Models)
    return Models


def plot_bar(models):
    # Bar chart of each model's accuracy (first element of its results list)
    labels = tuple(models.keys())
    y_pos = np.arange(len(labels))
    values = [models[n][0] for n in models]
    plt.bar(y_pos, values, align='center', alpha=0.5)
    plt.xticks(y_pos, labels,rotation='vertical')
    plt.ylabel('accuracy')
    plt.title('Accuracy of different models')
    plt.show()


def plot_cm(models):
    # One confusion-matrix panel per model (second element of its results list)
    count = 1
    fig = plt.figure(figsize=(10,10))
    for model in models:
        cm = models[model][1]
        labels = ['W','L']
        ax = fig.add_subplot(4,4,count)
        cax = ax.matshow(cm)
        plt.title(model,y=-0.8)
        fig.colorbar(cax)
        ax.set_xticklabels([''] + labels)
        ax.set_yticklabels([''] + labels)
        plt.xlabel('Predicted')
        plt.ylabel('True')
        # plt.subplot(2,2,count)
        count+=1
    plt.tight_layout()
    plt.show()

In [None]:
accuracies = run_all_and_Plot(dropdata)
CompareAll = dict()
CompareAll['Baseline'] = accuracies
for key,val in accuracies.items():
    print(str(key) +' '+ str(val[0]))
plot_bar(accuracies)
plot_cm(accuracies)

In theory, the random forest should give the best results, so the next step is to tune its hyperparameters with scikit-learn's GridSearchCV.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
#X_all = dropdata.drop(['winner'], axis=1)
#y_all = dropdata['winner']
#X_train, X_test, Y_train, Y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=23)
#rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True, max_depth=None) 
#param_grid = { 
#    'n_estimators': [200,700],
#    'max_features': ['auto', 'sqrt', 'log2']
#}

#CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
#CV_rfc.fit(X_train, Y_train)
#print(CV_rfc.best_params_)
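
For reference, the commented-out search above can be sketched as a small runnable example. It uses synthetic data from `make_classification` as a stand-in for the fight features and a much smaller grid so that it runs quickly; none of the values are tuned for this dataset. The `'auto'` option is left out because newer scikit-learn releases have deprecated it for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the fight features
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [10, 30],
    'max_features': ['sqrt', 'log2'],
}

# 3-fold cross-validated search over the grid, fit on the training split only
search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Scoring on the held-out split afterwards gives a less optimistic number than the cross-validation score used to pick the parameters.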

### Trying to improve results by dividing features

This block turns each pair of red and blue stats into a single blue-to-red ratio, which reduces the number of features from 895 to 450. For instance, it turns R_Round4_Strikes_Kicks_Landed and B_Round4_Strikes_Kicks_Landed into a single ratio of blue to red kicks landed. Every value is first incremented by 1 so that a zero in the red column cannot cause a division by zero. It's an interesting approach, and it appears to be part of the iteration process to see what will be most effective in this analysis.

In [None]:
dontchange = ['winner','Event_ID','Fight_ID','Max_round','Last_round','B__Age','R__Age']  # ages were renamed to B__Age / R__Age above
numeric_cols = [col for col in dropdata if col not in dontchange]
dropdata[numeric_cols] += 1 

In [None]:
newDF = dropdata.copy()
blue_cols = [col for col in dropdata.columns if 'B__' in col]
red_cols = [col for col in dropdata.columns if 'R__' in col]
for (blue,red) in zip(blue_cols,red_cols):
    newkey = ''.join(str(blue).split('_')[2:])
    dropdata[newkey] = dropdata[str(blue)]/dropdata[str(red)]
    del dropdata[str(blue)]
    del dropdata[str(red)]
newDF.head()
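
The reason for adding 1 before dividing is easier to see on a toy example. The sketch below uses an invented `Strikes` stat with the notebook's `B__` / `R__` prefixes.

```python
import pandas as pd

# Toy counts; a debuting fighter has zeros everywhere
df = pd.DataFrame({
    'B__Strikes': [10, 0, 5],
    'R__Strikes': [20, 0, 0],
})

# Adding 1 to both sides keeps the ratio finite when red has no strikes and
# maps two debutants (0 vs 0) to a neutral ratio of 1.0.
df['Strikes'] = (df['B__Strikes'] + 1.0) / (df['R__Strikes'] + 1.0)
print(df)
```

The trade-off is that the shift distorts small counts more than large ones, which may matter for fighters with only one or two recorded fights.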

In [None]:
accuracies = run_all_and_Plot(dropdata)
for key,val in accuracies.items():
    print(str(key) +' '+ str(val[0]))
CompareAll['Blue/Red'] = accuracies
plot_bar(accuracies)
plot_cm(accuracies)


#### Dropping Round 4 and Round 5 since most fights are 3 round Max.

This block drops round 4 and round 5 columns from the dataset but keeps fights that last 5 rounds. It may be an attempt to overcome the number of null values in the dataset. I'd go a different route since this doesn't seem very precise. I'd tidy and consolidate the data from all rounds into one so there wouldn't be a need to drop round 4 and 5 data. 

In [None]:
r4 = [col for col in dropdata.columns if "Round4" in col]
r5 = [col for col in dropdata.columns if "Round5" in col]
threerounds = dropdata.drop(r4+r5,axis = 1)
accuracies = run_all_and_Plot(threerounds)
for key,val in accuracies.items():
    print(str(key)+' '+str(val[0]))
CompareAll['DropR4&R5'] = accuracies
plot_bar(accuracies)
plot_cm(accuracies)
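
The tidying alternative mentioned above could start from something like this. pandas' `melt` plays roughly the role that tidyr's `gather` plays in R. The column names mimic the notebook's `B__RoundN_<stat>` pattern, but the frame itself is a made-up example.

```python
import pandas as pd

# Toy wide table: one row per fight, one column per round and stat.
wide = pd.DataFrame({
    'Fight_ID': [1, 2],
    'B__Round1_Strikes': [10.0, 7.0],
    'B__Round2_Strikes': [12.0, 9.0],
    'B__Round4_Strikes': [8.0, None],   # fight 2 never reached round 4
})

# Melt to long form: one row per (fight, column), then drop empty rounds
tidy = wide.melt(id_vars='Fight_ID', var_name='column', value_name='value')
tidy = tidy.dropna(subset=['value'])

# Split 'B__Round1_Strikes' into corner, round number, and stat name
parts = tidy['column'].str.extract(
    r'^(?P<corner>[BR])__Round(?P<round>\d+)_(?P<stat>.+)$')
tidy = pd.concat([tidy[['Fight_ID', 'value']], parts], axis=1)
tidy['round'] = tidy['round'].astype(int)
print(tidy)
```

In this shape a fight that ended in round 2 simply has no rows for rounds 3 to 5, so nothing has to be dropped or zero-filled.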

#### Dropping 5 round fights entirely

This block is meant to drop all five-round fights, which are mostly title fights and main events. I think this is a mistake for a couple of reasons. The dataset is small enough already (only 1477 observations), so dropping any records would probably hurt the machine learning algorithms. Also, there may be a better way to represent the data that would remove the need for this step.

In [None]:
foobar = threerounds.loc[threerounds['Max_round'] == 3]
# Build the model input from the filtered frame so five-round fights are actually excluded
bewb = foobar.drop(['Max_round','Last_round'],axis=1)
accuracies = run_all_and_Plot(bewb)
for key,val in accuracies.items():
    print(str(key)+' '+str(val[0]))
CompareAll['Drop5RoundFights'] = accuracies
plot_bar(accuracies)
plot_cm(accuracies)

### Dropping First Fights

This drops any fights in which a fighter has no previously recorded fights. I may use this in my model, but again I'm hesitant to do so because it involves a huge loss of data. Looking at the original CSV, there are 342 records for red and 499 records for blue in which the fighter has no previously recorded fight data. I can't afford to lose more than half the data.

This is actually a pretty interesting problem because I'm not sure how to deal with new fighters. By definition, their stats will show up as zeros if they don't have any fight data. It might be worth exploring how to get records prior to 2014 and going from there. Possible sources include:

1. FightMetric
2. Fight data from other fight organizations
3. Looking into what the original analyst meant when he said the previous data was not as granular

In [None]:
data[data.RPrev == False]

In [None]:
data[data.BPrev == False]

In [None]:
blahblah = bewb[bewb.Prev != 1]
accuracies = run_all_and_Plot(blahblah)
for key,val in accuracies.items():
    print(str(key)+' '+str(val[0]))
CompareAll['DroppingDebut'] = accuracies
plot_bar(accuracies)
plot_cm(accuracies)

In [None]:
blue_cols

### Aggregate Round Stats

This block sums the stats for each round into one value. For example, it combines round 1-5 strikes_landed into a single value. This is what I was thinking of doing. May have to borrow this code in my analysis.

In [None]:
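blue_cols
newDF.info()
# Stat names with the 10-character 'B__RoundN_' / 'R__RoundN_' prefix removed
b_feats = list(set([x[10:] for x in blue_cols if "Round" in x]))
r_feats = list(set([x[10:] for x in red_cols if "Round" in x]))

def sum_rounds(feats, cols, prefix):
    # Sum each per-round stat into one column per corner. The prefix keeps
    # the red totals from overwriting the blue totals of the same name.
    for x in feats:
        round_cols = [y for y in cols if "Round" in y and y[10:] == x]
        newDF[prefix + x] = newDF[round_cols].sum(axis=1)
        newDF.drop(round_cols, axis=1, inplace=True)

sum_rounds(b_feats, blue_cols, 'B__')
sum_rounds(r_feats, red_cols, 'R__')
newDF.info()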
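
As an alternative sketch of the same aggregation, the per-round columns can be grouped with a regular expression instead of a fixed slice offset. This assumes the `B__RoundN_<stat>` / `R__RoundN_<stat>` naming pattern; the frame below is a toy example.

```python
import re
import pandas as pd

# Toy frame following the 'B__RoundN_<stat>' naming pattern
df = pd.DataFrame({
    'B__Round1_Strikes_Landed': [5, 3],
    'B__Round2_Strikes_Landed': [4, 6],
    'R__Round1_Strikes_Landed': [2, 2],
    'R__Round2_Strikes_Landed': [1, 7],
    'winner': [0, 1],
})

round_pattern = re.compile(r'^([BR]__)Round\d+_(.+)$')

# Map each aggregated name (corner prefix + stat) to its per-round columns
groups = {}
for col in df.columns:
    match = round_pattern.match(col)
    if match:
        groups.setdefault(match.group(1) + match.group(2), []).append(col)

# Replace the per-round columns with one summed column per corner and stat
summed = df.drop([c for cols in groups.values() for c in cols], axis=1)
for name, cols in groups.items():
    summed[name] = df[cols].sum(axis=1)
print(summed)
```

Matching on the full pattern avoids accidental substring matches between stat names and keeps the red and blue totals in separate columns.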


In [None]:
newDF.describe()
accuracies = run_all_and_Plot(newDF)
for key,val in accuracies.items():
    print(str(key) +' '+ str(val[0]))
CompareAll['SumRounds'] = accuracies
plot_bar(accuracies)
plot_cm(accuracies)



### Comparing Red to Blue

This block compares blue stats to red stats. It creates a ratio of strikes landed by blue vs. strikes landed by red for each category. Very useful. I think I'm going to borrow this code as well since it's an idea I was thinking about implementing in my reanalysis.

In [None]:
blue_cols = [col for col in newDF.columns if 'B__' in col]
red_cols = [col for col in newDF.columns if 'R__' in col]
for (blue,red) in zip(blue_cols,red_cols):
    newkey = ''.join(str(blue).split('_')[2:])
    newDF[newkey] = newDF[str(blue)]/newDF[str(red)]
    del newDF[str(blue)]
    del newDF[str(red)]

In [None]:
accuracies = run_all_and_Plot(newDF)
for key,val in accuracies.items():
    print(str(key) +' '+ str(val[0]))
CompareAll['SumRoundsRatio'] = accuracies  # separate key so the earlier 'SumRounds' results are not overwritten
plot_bar(accuracies)
plot_cm(accuracies)

### Reducing Features

This block drops features that are seemingly arbitrary and may have little effect on the model. It drops weight, hometown, event location, event id, fight id, max round and last round. Interestingly, the accuracy scores dip slightly after this is done. I'm not sure I agree with the decisions made here. For instance, I still want to take a look at splitting the columns by city and country. I'm not sure about the others. I'll have to come back to this later down the road.

In [None]:
reduced_features = newDF.drop(["Weight","B_HomeTown","B_Location", "Event_ID", "Fight_ID", "Max_round", "Last_round", "R_HomeTown", "R_Location"],axis = 1)
accuracies = run_all_and_Plot(reduced_features)
for key,val in accuracies.items():
    print(str(key) +' '+ str(val[0]))
CompareAll['Reduced Features'] = accuracies
plot_bar(accuracies)
plot_cm(accuracies)


In [None]:
reduced_features.info()

In [None]:
sorted(reduced_features.columns.values.tolist())

## Conclusion
Our best model has an accuracy between 58% and 63% on average across runs. Although that accuracy is low, we believe it is close to the best achievable given the amount of available data and its inherent noise. 
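
For context on that range, a quick sanity check is the accuracy of always picking the red corner, using the 867/1477 count from the notes earlier. The figure is approximate because that total still includes the draws and no contests.

```python
# Counts taken from the notes earlier in the notebook: red won 867 of 1477 fights
red_wins = 867
total_fights = 1477

# Accuracy of a "model" that always predicts the red corner
baseline = red_wins / float(total_fights)
print(round(baseline, 3))
```

Any model accuracy is best read relative to this baseline rather than relative to 50%.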

## Stretch Goals

### Rewrite and re-run scraper to pull data from earlier years

It won't be as granular, but it may strengthen the analysis. There might not be enough records from 2014 onwards to model the data accurately. On the other hand, I'm not even sure the previous analyst modeled the data accurately. He took fight data from each fight and used that to predict the winner. That's not realistic. You don't have a round-by-round analysis before the fight happens! It would be much better to use the running totals to predict a fight's outcome. Looks like we're learning Beautiful Soup! At least you have a starting point.


## Scraper Rewrite

The following scraper retrieves data from ufc.com. It currently pulls a single fighter's data. I added functionality to convert that data into a JSON file, extract the list of fights, and write that to a CSV but it's still incomplete. Still need the following steps:

1. Pull from list of all fighters
2. Merge all fight data so opponent data shows up
3. Flatten data where appropriate
4. Format data appropriately

I'm scrapping this and building my own scraper. The data on FightMetric is more robust. The data I'm getting from ufc.com has a lot of missing values in general and it does not break down each fight as well. It lists strike percentages but does not give raw numbers. The FightMetric data lists raw numbers as well as percentages and may lead to a more complete picture of each fighter. It also does not require as much merging. Looks like we're starting fresh!

### Fight URL Scraper

This first script pulls the URLs of all fights from fightmetric.com and exports them as a CSV.

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib
import os
os.chdir('/Users/courtneyfergusonlee/kaggle-practice/ufc-scraper/MMA-scraper-master')

# Open the main page with events listed
sock = urllib.urlopen('http://www.fightmetric.com/statistics/events/completed?page=all')
page = sock.read()
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly, as the fight scraper does

# Scrape event URLs from the main page
event_urls = []
trs = soup.find_all('tr')
for tr in trs:
    for link in tr.find_all('a'):
        event_urls.append(link.get('href'))


# Pull Fight URLs from each Event URL
fight_urls = []
for event_url in event_urls: 
    print event_url
    try:
        sock = urllib.urlopen(event_url)
        event_html = sock.read()
        event_soup = BeautifulSoup(event_html, "html.parser")

        tds = event_soup.find_all('td')
        for td in tds:
            for link in td.find_all('a'):
                url = link.get('href')
                url_type = url.split('-')[0][-2:] # Fight vs. Fighter; last 2 letters
                if url_type == 'ht': # use er for fighter, ht for fight
                    #print url
                    fight_urls.append(url)
    
    except IOError:  # urllib.urlopen signals failed connections with IOError; urllib2 was never imported
        print "HTTP Error"

# Save fight URLs to a csv file
fight_urls = pd.DataFrame(fight_urls, columns=['link'])
fight_urls.to_csv('fight urls.csv', index=False)
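
The two-letter suffix check in the loop above can also be written as a small helper that looks at the URL path directly. The `/fight-details/` pattern appears in this notebook's output; the `/fighter-details/` pattern and the helper name `classify_url` are assumptions made for this sketch.

```python
def classify_url(url):
    """Return 'fight', 'fighter', or None for a fightmetric.com link."""
    if not url:
        return None                      # <a> tags without an href
    if '/fight-details/' in url:
        return 'fight'
    if '/fighter-details/' in url:       # assumed pattern, not shown in the notebook
        return 'fighter'
    return None

links = [
    'http://www.fightmetric.com/fight-details/24438a644b975751',
    'http://www.fightmetric.com/fighter-details/0000000000000000',  # made-up ID
    None,
]
print([classify_url(u) for u in links])
```

Handling a missing `href` explicitly avoids an `AttributeError` on links that have no target.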



### Fight Scraper

This scrapes all fight data from fightmetric.com/fight-details. It reads a list of URLs, pulls the fight statistics from each fight, and concatenates them onto a pandas dataframe. There are over 4000 URLs in the list, so it takes a while to run (over 3 hours).

In [36]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib
import os
os.chdir('/Users/courtneyfergusonlee/ufc_fight_analysis/MMA-scraper-master')

# Load URLs from CSV (created in fightmetric_scraper.py)
fight_urls = pd.read_csv('fight urls.csv', encoding='utf-8')['link'].values.tolist()

# Initialize an empty dataframe
fighter_df = pd.DataFrame(columns=['name_first', 'name_last', 'kd', 'sig_strikes', 'sig_attempts', 'strikes', 'strike_attempts', 
                                   'takedowns', 'td_attempts', 'sub_attempts', 'pass', 'reversals', 'head', 'head_attempts', 'body', 
                                   'body_attempts','leg', 'leg_attempts', 'distance', 'distance_attempts', 'clinch', 'fight_id',
                                   'clinch_attempts', 'ground', 'ground_attempts', 'win/loss', 'referee', 'round', 'method'])


# Iterate through the fight urls, and pull relevant variables/fields
for i in range(len(fight_urls)):
    if i==10:  # debugging limit: stop after the first 10 fights; remove this block for a full scrape
        print i, fight_urls[i]
        break
    
    sock = urllib.urlopen(fight_urls[i]) # specific URL for a fight
    fight_html = sock.read()
    fight_soup = BeautifulSoup(fight_html, "html.parser")
    trs = fight_soup.find_all('tr') # all the table rows in each fight page
    headers = fight_soup.find_all('i')
    bad_call = 0
    try: 
        referee = str(headers[24].get_text()).split()[1] + ' ' + str(headers[24].get_text()).split()[-1]
    except:
        referee = None
    try:
        rounds = str(headers[18].get_text()).split()[1]
    except:
        rounds = None
    try:
        method = str(headers[17].get_text()).split()[0]
    except:
        method = None
    try:
        tr1 = str(trs[1].get_text()).split()
        # Find the location of the 2nd table tr2 (it varies)
        j = 0
        while j < 10:
            if str(trs[j].get_text()).split()[6] == 'Head':
                #print j+1
                tr2 = str(trs[j+1].get_text()).split()
                j = 10
            else:
                j += 1
        #print tr1; #print tr2
        
        # Test for the end of names
        k = 0
        while k < len(tr1):
            try:
                int(tr1[k])
                break
            except:
                k += 1
                continue
        #print k
    except:
        if i%20 == 0:  # only report every 20th fight, and never print "None"
            print str(i) + ' bad call'
        bad_call += 1
        continue


    # Add each fighter's information to the dataframe
    fighter1 = pd.DataFrame({'name_first': tr1[:1], 'name_last': tr1[1:2], 'kd': tr1[k], 'sig_strikes': tr1[k+2],
    'sig_attempts': tr1[k+4], 'strikes': tr1[k+10], 'strike_attempts': tr1[k+12], 'takedowns': tr1[k+16],'td_attempts': tr1[k+18],
    'sub_attempts': tr1[k+24], 'pass': tr1[k+26], 'reversals': tr1[k+28], 'head': tr2[k+8], 'head_attempts': tr2[k+10],
    'body': tr2[k+14], 'body_attempts': tr2[k+16], 'leg': tr2[k+20], 'leg_attempts': tr2[k+22], 'distance': tr2[k+26],
    'distance_attempts': tr2[k+28], 'clinch': tr2[k+32], 'clinch_attempts': tr2[k+34], 'ground': tr2[k+38], 
    'ground_attempts': tr2[k+40], 'win/loss': 1, 'referee': referee, 'round': rounds, 'method': method, 'fight_id': i})

    fighter2 = pd.DataFrame({'name_first': tr1[2:3], 'name_last': tr1[3:4], 'kd': tr1[k+1], 'sig_strikes': tr1[k+5], 
    'sig_attempts': tr1[k+7], 'strikes': tr1[k+13], 'strike_attempts': tr1[k+15], 'takedowns': tr1[k+19],'td_attempts': tr1[k+21],
    'sub_attempts': tr1[k+25], 'pass': tr1[k+27], 'reversals': tr1[k+29], 'head': tr2[k+11], 'head_attempts': tr2[k+13],
    'body': tr2[k+17], 'body_attempts': tr2[k+19], 'leg': tr2[k+23], 'leg_attempts': tr2[k+25], 'distance': tr2[k+29],
    'distance_attempts': tr2[k+31], 'clinch': tr2[k+35], 'clinch_attempts': tr2[k+37], 'ground': tr2[k+41], 
    'ground_attempts': tr2[k+43], 'win/loss': 0, 'referee': referee, 'round': rounds, 'method': method, 'fight_id': i})
    
    fighter_df = pd.concat([fighter_df, fighter1, fighter2], axis=0, ignore_index=True)
    


10 http://www.fightmetric.com/fight-details/24438a644b975751


In [51]:
# Creates a copy of the dataframe for testing
fight_df_copy = fighter_df.copy()

# Rearranges the columns so the first name, last name and win/loss appear first
cols = fight_df_copy.columns.tolist()
cols = cols[6:7] + cols[15:17] + cols[28:29] + cols[:6] + cols[7:15] + cols[17:28]
fight_df_copy = fight_df_copy[cols]
cols

# Sets the index as the fight_id
fight_df_copy.set_index('fight_id', inplace=True)

fight_df_copy
# Keep the index when writing, otherwise the fight_id set above is dropped from the CSV
fight_df_copy.to_csv('fights.csv')
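
The column reordering above relies on positional slices, which silently break if the scraper adds or renames a column. A minimal sketch of the same idea by column name follows; `move_to_front` is a hypothetical helper and the toy frame only borrows a few column names from the scraper output.

```python
import pandas as pd


def move_to_front(frame, leading):
    # Return `frame` with the `leading` columns first and the rest in their original order.
    missing = [c for c in leading if c not in frame.columns]
    if missing:
        raise KeyError("columns not found: %s" % missing)
    rest = [c for c in frame.columns if c not in leading]
    return frame[list(leading) + rest]


# Toy data with made-up values
toy = pd.DataFrame({
    'body': [3, 1],
    'fight_id': [10, 10],
    'name_first': ['A', 'B'],
    'name_last': ['One', 'Two'],
    'win/loss': [1, 0],
})
ordered = move_to_front(toy, ['fight_id', 'name_first', 'name_last', 'win/loss'])
```

Naming the columns makes a missing or renamed column fail loudly with a `KeyError` instead of shifting the wrong columns to the front.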

### Fighter Scraper



In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib
import os
os.chdir('/Users/courtneyfergusonlee/kaggle-practice/ufc-scraper/MMA-scraper-master')

# Load URLs from CSV (created in fightmetric_scraper.py)
fighter_urls = pd.read_csv('fighter urls.csv', encoding='utf-8')

# Initialize an empty dataframe
fighter_bio_df = pd.DataFrame(columns=['name', 'height', 'reach', 'age', 'win_%'])

# Track names already added (to skip duplicates) and count failed parses.
# Both must live outside the loop, otherwise they are reset for every fighter.
name_dict = {}
bad_call = 0

# Iterate through fighter_urls and pull relevant information
for i in range(len(fighter_urls)):
    fighter_bio_dict = {}
    sock = urllib.urlopen(fighter_urls[i]) # specific URL for a fighter
    fight_html = sock.read()
    fight_soup = BeautifulSoup(fight_html)
    headers = fight_soup.find_all('li')
    names = fight_soup.find_all('span')
    
    # Each field is parsed separately so one malformed field does not discard the others.
    # Failures are counted in bad_call and reported for every 50th fighter.
    try:
        year = int(str(headers[13].get_text()).split()[-1])
    except Exception:
        year = 0
        bad_call += 1
        if i % 50 == 0:
            print str(i) + ' bad call'
    try:
        reach = float(str(headers[11].get_text()).split()[1][:2])
    except Exception:
        reach = None
        bad_call += 1
        if i % 50 == 0:
            print str(i) + ' bad call'
    try:
        # Height in feet: whole feet plus inches / 12
        height_parts = str(headers[9].get_text()).split()
        height = float(height_parts[1][0]) + float(height_parts[2][:-1])/12
    except Exception:
        height = None
        bad_call += 1
        if i % 50 == 0:
            print str(i) + ' bad call'
    try:
        # Win rate: wins / (wins + losses), taken from the record string
        record = str(names[1].get_text()).split()[1].split('-')
        win = float(record[0]) / (float(record[0]) + float(record[1]))
    except Exception:
        win = 0
        bad_call += 1
        if i % 50 == 0:
            print str(i) + ' bad call'
    try:
        name_parts = str(names[0].get_text()).split()
        name = name_parts[0] + ' ' + name_parts[1]
    except Exception:
        name = None
        bad_call += 1
        if i % 50 == 0:
            print str(i) + ' bad call'
        
    
    fighter_bio_dict = {'name': name, 'height': height, 'reach': reach, 'age': 2014 - year, 'win_%': win}
    
    # Add dictionary information to the dataframe
    if name not in name_dict:
        name_dict[name] = 0
        fighter_bio_df = fighter_bio_df.append(fighter_bio_dict, ignore_index=True)
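
The string handling in the cell above can be isolated into small functions that are easy to test without hitting the network. The sketch below is one way to do that. The function names are hypothetical, and the input formats (`Height: 5' 11"`, `Reach: 72"`, `Record: 20-5-0`) are assumptions inferred from the indexing used in the scraper, not confirmed page content. Unlike the scraper, these helpers return `None` rather than 0 when a record cannot be parsed.

```python
import re


def parse_height_feet(text):
    # Feet plus inches / 12, or None when no feet-and-inches pattern is found.
    match = re.search(r"(\d+)'\s*(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2)) / 12.0


def parse_reach_inches(text):
    # The number directly before an inch mark, or None when it is missing.
    match = re.search(r'(\d+(?:\.\d+)?)"', text)
    return float(match.group(1)) if match else None


def parse_win_rate(text):
    # wins / (wins + losses); draws are ignored, as in the scraper above.
    match = re.search(r"(\d+)-(\d+)", text)
    if match is None:
        return None
    wins, losses = int(match.group(1)), int(match.group(2))
    total = wins + losses
    return wins / float(total) if total else None
```

Keeping the parsing separate from the HTTP calls means a change in page layout only needs a new regular expression and a new test string.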


### Fight Aggregator

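This step turns the per-fight rows into cumulative statistics for each fighter, converts them into percentage and per-fight features, and then expresses each fighter's features relative to their opponent's.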
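
The core idea of the cell below is to pair a fighter's running totals through one fight with the outcome of their next fight. A minimal sketch of that pairing with `groupby` follows. It assumes rows sorted oldest first and uses made-up numbers, and it leaves the "no next fight yet" case as NaN, whereas the cell below uses the sentinels `2` and `9999`.

```python
import pandas as pd

# Toy per-fight rows in chronological order (oldest first); values are made up
fights = pd.DataFrame({
    'name':     ['A', 'B', 'A', 'C', 'A', 'B'],
    'fight_id': [0, 0, 1, 1, 2, 2],
    'strikes':  [10, 7, 4, 9, 12, 3],
    'win/loss': [1, 0, 0, 1, 1, 0],
})

grouped = fights.groupby('name')

features = pd.DataFrame({
    'name': fights['name'],
    # Running totals through the current fight
    'strikes_total': grouped['strikes'].cumsum(),
    'fight_count': grouped.cumcount() + 1,
    # Id and outcome of the fighter's next fight; NaN when there is none yet
    'next_fight_id': grouped['fight_id'].shift(-1),
    'next_win/loss': grouped['win/loss'].shift(-1),
})
features['strikes_per'] = features['strikes_total'] / features['fight_count']
```

Rows whose `next_fight_id` is NaN describe a fighter's most recent state and have no label yet, so they would be held out of training.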

In [None]:
import pandas as pd
import numpy as np
import os
os.chdir('/Users/courtneyfergusonlee/kaggle-practice/ufc-scraper/MMA-scraper-master')

df = pd.read_csv('FightMetric_Data_Master v2.csv')
df = df.drop_duplicates()
df['name'] = df.apply(lambda x: x['name_first'] + ' ' + x['name_last'], axis=1)
# Consecutive row pairs (0, 1), (2, 3), ... share one fight_id
df['fight_id'] = [i // 2 for i in df.index]

bio_stats = pd.read_csv('Fighter_Stats_Master v2.csv')
bio_stats = bio_stats.drop_duplicates(subset=['name'])
bio_stats['age'] = bio_stats['age'].apply(lambda x: bio_stats['age'][bio_stats['age'] != 2014].median() if x == 2014 else x)


# Create a dictionary of unique fighters, each mapped to the list of their row indices in df
fighter_dict = {}
for i in range(len(df['name'])):
    if df['name'][i] not in fighter_dict:
        fighter_dict[df['name'][i]] = [i]
    else:
        fighter_dict[df['name'][i]].append(i)

# Aggregate metrics
# Note: head, leg, and body strikes can be linearly combined to form sig_strike,
# so including all of them alongside sig_strike makes the features linearly dependent.
# Strikes is okay because it includes non-significant strikes, which are not counted in the location shots.

fighter_df = pd.DataFrame(columns=['name', 'body', 'body_att', 'clinch', 'clinch_att', 'distance', 'distance_att',
                                   'ground', 'ground_att', 'head', 'head_att', 'leg', 'leg_att', 'method', 'pass', 
                                   'referee', 'reversal', 'round', 'sig_strike', 'sig_strike_att', 'strike', 'strike_att', 
                                   'sub', 'sub_att', 'td', 'td_att', 'win/loss', 'fight_id', 'fight_count', 'td_rec', 'td_rec_att',
                                   'head_rec', 'head_rec_att', 'sub_rec', 'sub_rec_att', 'strike_rec', 'strike_rec_att', 
                                   'sig_strike_rec', 'sig_strike_rec_att', 'clinch_rec', 'clinch_rec_att', 'ground_rec', 
                                   'ground_rec_att', 'body_rec', 'body_rec_att', 'leg_rec', 'leg_rec_att', 'total_rounds'])


# Iterate through fighters, adding one row of cumulative statistics per fight
for key in fighter_dict: 
    add_df = {'name': key}
    for column in fighter_df:
        if column != 'name':
            add_df[column] = 0
    
    count = 0
    # For each fighter, iterate backwards through their rows, from earliest fight to most recent
    for i in range(len(fighter_dict[key])-1, -1, -1):
        count += 1
        data_row = df.loc[fighter_dict[key][i]]
        
        if i-1 >= 0:
            # i is the current fight, i+1 is the previous fight, i-1 is the next fight.
            # The cumulative stats through fight i are labeled with the result of fight i-1.
            result_row = df.loc[fighter_dict[key][i-1]] 
            add_df['win/loss'] = result_row['win/loss']
            add_df['fight_id'] = result_row['fight_id']
            add_df['referee'] = result_row['referee']
            add_df['round'] = result_row['round']
            add_df['total_rounds'] += result_row['round']
            add_df['method'] = result_row['method']
        else:
            add_df['win/loss'] = 2
            add_df['fight_id'] = 9999
            #add_df['fight_id'] 
            add_df['referee'] = 'None'
            add_df['round'] = 5
            add_df['total_rounds'] += 1
            add_df['method'] = 'None'

        # Find the opponent row
        if fighter_dict[key][i] < len(df)-1:
            if df.loc[fighter_dict[key][i]+1]['fight_id'] != data_row['fight_id']:
                opponent_row = df.loc[fighter_dict[key][i]-1]
            else:
                opponent_row = df.loc[fighter_dict[key][i]+1]
        else:
            opponent_row = df.loc[fighter_dict[key][i]-1]
            
        add_df['td_rec'] += opponent_row['takedowns']
        add_df['td_rec_att'] += opponent_row['td_attempts']
        add_df['head_rec'] += opponent_row['head']
        add_df['head_rec_att'] += opponent_row['head_attempts']
        add_df['sub_rec_att'] += opponent_row['sub_attempts']
        add_df['sig_strike_rec'] += opponent_row['sig_strikes']
        add_df['sig_strike_rec_att'] += opponent_row['sig_attempts']
        add_df['strike_rec'] += opponent_row['strikes']
        add_df['strike_rec_att'] += opponent_row['strike_attempts']
        add_df['clinch_rec'] += opponent_row['clinch']
        add_df['clinch_rec_att'] += opponent_row['clinch_attempts']
        add_df['ground_rec'] += opponent_row['ground']
        add_df['ground_rec_att'] += opponent_row['ground_attempts']
        add_df['body_rec'] += opponent_row['body']
        add_df['body_rec_att'] += opponent_row['body_attempts']
        add_df['leg_rec'] += opponent_row['leg']
        add_df['leg_rec_att'] += opponent_row['leg_attempts']
        
        if data_row['method'] == 'Submission' and data_row['win/loss'] == 1:
            add_df['sub'] += 1
        elif data_row['method'] == 'Submission' and data_row['win/loss'] == 0:
            add_df['sub_rec'] += 1
        
        add_df['body'] += data_row['body']
        add_df['body_att'] += data_row['body_attempts']
        add_df['clinch'] += data_row['clinch']
        add_df['clinch_att'] += data_row['clinch_attempts']
        add_df['distance'] += data_row['distance']
        add_df['distance_att'] += data_row['distance_attempts']
        add_df['ground'] += data_row['ground']
        add_df['ground_att'] += data_row['ground_attempts']
        add_df['head'] += data_row['head']
        add_df['head_att'] += data_row['head_attempts']
        add_df['leg'] += data_row['leg']
        add_df['leg_att'] += data_row['leg_attempts']
        add_df['pass'] += data_row['pass']
        add_df['reversal'] += data_row['reversals']
        add_df['sig_strike'] += data_row['sig_strikes']
        add_df['sig_strike_att'] += data_row['sig_attempts']
        add_df['strike'] += data_row['strikes']
        add_df['strike_att'] += data_row['strike_attempts']
        add_df['sub_att'] += data_row['sub_attempts']
        add_df['td'] += data_row['takedowns']
        add_df['td_att'] += data_row['td_attempts']            
        add_df['fight_count'] = count
        
        fighter_df = fighter_df.append(add_df, ignore_index=True)


# Define the variables that are not features
non_stats = ['name', 'win/loss', 'referee', 'round', 'method', 'fight_count', 'fight_id', 'distance_per', 'distance_%', 'pass_per',
             'reversal_per', 'total_rounds']

# Add the percentage and per fight columns
for column in fighter_df:
    if column not in non_stats:
        if column[-3:] == 'att':
            base = column[:-4]
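            if column[-7:-4] == 'rec':
                # Defense: share of the opponent's attempts that did not land (1 if none were faced)
                fighter_df[column[:-7] + 'def_%'] = fighter_df.apply(lambda x: 1 if x[column] == 0 else 1 - (x[base] / float(x[column])), axis=1)
            else:
                # Accuracy: share of the fighter's own attempts that landed (0 if none were attempted)
                fighter_df[base + '_%'] = fighter_df.apply(lambda x: 0 if x[column] == 0 else x[base] / float(x[column]), axis=1)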
        else:
            # Per-fight average; float() avoids integer division under Python 2
            fighter_df[column + '_per'] = fighter_df.apply(lambda x: x[column] / float(x['fight_count']), axis=1)
            
# Drop columns where necessary:
for column in fighter_df:
    if column not in non_stats:
        if column[-3:] != 'per' and column[-1:] != '%':
            fighter_df = fighter_df.drop(column, axis=1)


# Join fighter_df with the bio_df, fill missing reach and height with the median, and reset the index
fighter_df = fighter_df.merge(bio_stats, how='inner', on='name')
fighter_df = fighter_df.drop(labels=['win_%', 'referee'], axis=1)

# Series.median() skips NaN, whereas np.median() returns NaN when any value is missing
fighter_df['reach'] = fighter_df['reach'].fillna(fighter_df['reach'].median())
fighter_df['height'] = fighter_df['height'].fillna(fighter_df['height'].median())
fighter_df = fighter_df.reset_index(drop=True)


# Create a dictionary of fights, with values as the fighters in each fight
fight_dict = {}
for i in range(len(fighter_df['fight_id'])):
    if fighter_df['fight_id'][i] not in fight_dict:
        fight_dict[fighter_df['fight_id'][i]] = [i]
    else:
        fight_dict[fighter_df['fight_id'][i]].append(i)


# Subtract each fighter's stats against their corresponding opponent's matching stats
# i.e., fighter1 Head Strikes vs. fighter2 Head Strikes defended
count = 0
for fight_key in fight_dict:
    if len(fight_dict[fight_key]) < 2:
        count += 1
    else:
        current = fighter_df.loc[fight_dict[fight_key][0]]
        opponent = fighter_df.loc[fight_dict[fight_key][1]]
        for column in current.index:
            if column not in non_stats:
                if column[-1:] == '%' and column[-3:] != 'f_%':
                    base_column = column[:-1]
                    temp_current = current[column] - opponent[base_column+'def_%']
                    temp_opponent = opponent[column] - current[base_column+'def_%']
                    fighter_df.loc[fight_dict[fight_key][0], column] = temp_current
                    fighter_df.loc[fight_dict[fight_key][1], column] = temp_opponent
                elif column[-5:] == 'def_%':
                    base_column = column[:-5]
                    temp_current = current[column] - opponent[base_column+'%']
                    temp_opponent = opponent[column] - current[base_column+'%']
                    fighter_df.loc[fight_dict[fight_key][0], column] = temp_current
                    fighter_df.loc[fight_dict[fight_key][1], column] = temp_opponent
                elif column[-3:] == 'per' and column[-5:] != 'c_per':
                    base_column = column[:-3]
                    temp_current = current[column] - opponent[base_column+'rec_per']
                    temp_opponent = opponent[column] - current[base_column+'rec_per']
                    fighter_df.loc[fight_dict[fight_key][0], column] = temp_current
                    fighter_df.loc[fight_dict[fight_key][1], column] = temp_opponent
                elif column[-7:] == 'rec_per':
                    base_column = column[:-7]
                    temp_current = current[column] - opponent[base_column+'per']
                    temp_opponent = opponent[column] - current[base_column+'per']
                    fighter_df.loc[fight_dict[fight_key][0], column] = temp_current
                    fighter_df.loc[fight_dict[fight_key][1], column] = temp_opponent
                elif column == 'age' or column == 'height' or column == 'reach':
                    temp_current = current[column] - opponent[column]
                    temp_opponent = opponent[column] - current[column]
                    fighter_df.loc[fight_dict[fight_key][0], column] = temp_current
                    fighter_df.loc[fight_dict[fight_key][1], column] = temp_opponent

fighter_df.to_csv('aggregated fightmetric stats.csv')
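
The subtraction step above pairs each fighter with the other row that shares the same `fight_id`. A compact alternative, sketched here under the assumption that every fight has exactly two rows and that fighter names are unique within a fight, is a self-merge on `fight_id`. The column names `td_%` and `td_def_%` follow the cell above; the values and the `td_edge` column are made up for illustration.

```python
import pandas as pd

# Toy frame: two rows per fight, values chosen for illustration only
toy = pd.DataFrame({
    'fight_id': [0, 0, 1, 1],
    'name':     ['A', 'B', 'C', 'D'],
    'td_%':     [0.5, 0.25, 0.75, 0.0],
    'td_def_%': [0.5, 0.75, 1.0, 0.25],
})

# Join every row to every row of the same fight, then keep only fighter/opponent pairs
pairs = toy.merge(toy, on='fight_id', suffixes=('', '_opp'))
pairs = pairs[pairs['name'] != pairs['name_opp']].reset_index(drop=True)

# Fighter's offensive stat minus the opponent's matching defensive stat
pairs['td_edge'] = pairs['td_%'] - pairs['td_def_%_opp']
```

Because both sides are read from the unmodified `toy` frame, the result does not depend on the order in which rows are updated, and fights with only one row simply produce no pair.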
