# Machine Learning Project on MMA Dataset

## Introduction

Mixed Martial Arts is a very unpredictable sport, and analysts have a hard time picking eventual winners, especially in the UFC. Can Machine Learning models help those experts make more informed decisions?

In this project, I explore a large UFC database and utilise a range of models to find a model which accurately predicts fight outcomes.

Some data cleaning and feature engineering were required first, followed by experimentation with the models.

Information on the meaning of the variables is available in the Explaining the Variables section below.

I do not own this dataset. It was scraped using Python, and the scraper is available on GitHub:

https://github.com/remypereira99/UFC-Web-Scraping

### Explaining the Variables

For ufc_fights:
* event_id - Foreign key from ufc_events
* referee - Referee of the fight
* f_1 - Fighter 1
* f_2 - Fighter 2
* winner - Winner of the fight
* num_rounds - Number of rounds
* title_fight - Boolean for whether the fight is a title fight or not
* weight_class - Weight class of the fight
* gender - Male or female fight
* result - How the fight ended, e.g. decision, KO
* result_details - Specific details of how the fight ended, e.g. KO by elbows, split decision
* finish_round - Round in which the fight finished
* finish_time - Minute and second at which the fight finished in that round (m:ss)
* fight_url - URL used to scrape fight data from ufcstats.com

For ufc_stats:
* fight_id - Foreign key from ufc_fights
* fighter_id - Foreign key from ufc_fighters
* knockdowns - No. of knockdowns landed
* total_strikes_att - No. of strikes attempted
* total_strikes_succ - No. of successful strikes
* sig_strikes_att - No. of significant strikes attempted
* sig_strikes_succ - No. of successful significant strikes
* takedown_att - No. of takedown attempts
* takedown_succ - No. of successful takedowns
* submission_att - No. of submission attempts
* reversals - No. of reversals
* ctrl_time - Control time (m:ss)
* fighter_age - Age of the fighter
* winner - Boolean for whether the fighter won or lost

For ufc_fighters:
* fighter_id - Primary key for ufc_fighters, unique for each fighter
* fighter_f_name - Fighter first name
* fighter_l_name - Fighter last name
* fighter_nickname - Fighter nickname
* fighter_height_cm - Fighter height in cm
* fighter_weight_lbs - Fighter weight in lbs
* fighter_reach_cm - Fighter reach in cm
* fighter_stance - Fighter stance, e.g. southpaw, orthodox
* fighter_dob - Fighter date of birth
* fighter_w - No. of wins (at the time of scraping)
* fighter_l - No. of losses (at the time of scraping)
* fighter_d - No. of draws (at the time of scraping)
* fighter_nc_dq - No. of no contests or disqualifications (at the time of scraping)
* fighter_url - URL used to scrape fighter data from ufcstats.com

The rest of the variables will be explained throughout the notebook; a bit of intuition is needed to understand the abbreviations.

### Setup

In [None]:
# Load the required datasets with pandas (the data cleaning cell below operates on pandas DataFrames)
import pandas as pd

ufc_fights = pd.read_csv('/kaggle/input/mma-dataset-2023-ufc/ufc_fights.csv')
ufc_stats = pd.read_csv('/kaggle/input/mma-dataset-2023-ufc/ufc_fight_stats.csv')
ufc_fighters = pd.read_csv('/kaggle/input/mma-dataset-2023-ufc/ufc_fighters.csv')

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score

## Data

### Data Cleaning

This data was scraped from ufcstats.com. Unfortunately, it is not in the form needed for the models. Furthermore, all three datasets contain useful statistics, so they needed to be merged. For context, the ufc_stats data contains in-fight statistics including striking output and control times, the ufc_fights data contains information on the fight such as weight class and who won, and the ufc_fighters data contains information on the fighters such as their reach and records.

First, a full name variable was added for simplicity. Following on from this, the datasets were merged and the control/finish times were converted from strings to numbers.

In [None]:
# First, a full name column is created for simplicity
ufc_fighters['fighter_fullname'] = ufc_fighters['fighter_f_name'] + ' ' + ufc_fighters['fighter_l_name']
ufc_fighters = ufc_fighters.drop(columns=['fighter_f_name', 'fighter_l_name', 'fighter_dob'])

# Then the datasets are joined so that each row holds the fight and fighter stats for one fighter in one fight.
# ufc_fights and ufc_stats both have a `winner` column; the suffixes give the names used later (winner.x, winner.y).
df = (ufc_fights.merge(ufc_stats, on='fight_id', how='outer', suffixes=('.x', '.y'))
      .merge(ufc_fighters, on='fighter_id', how='outer'))
df = df.drop(columns=['fight_url', 'fighter_url'])  # URL columns are not relevant

# Control time and finish time are stored as M:SS strings, which is not ideal; convert both to seconds
def mmss_to_seconds(col):
    # Rows that do not match M:SS (e.g. missing values) become NaN instead of raising an error
    parts = col.str.extract(r'^(\d+):(\d+)$').astype(float)
    return 60 * parts[0] + parts[1]

df['ctrl_time'] = mmss_to_seconds(df['ctrl_time'].replace('NULL', '0:00'))
# Elapsed fight time: completed five-minute rounds plus the time into the final round
df['finish_time'] = mmss_to_seconds(df['finish_time']) + 300 * (df['finish_round'] - 1)
print(df.head())
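
To make the time conversion concrete, here is a minimal, dependency-free sketch of parsing a single `M:SS` value. The helper name `parse_mmss` is hypothetical, and returning `None` for unparseable input is an assumption made for illustration (the cell above instead treats a `'NULL'` control time as `0:00`).

```python
from typing import Optional


def parse_mmss(value: Optional[str]) -> Optional[int]:
    """Convert an 'M:SS' string to seconds; return None when it cannot be parsed."""
    if value is None:
        return None
    parts = value.strip().split(":")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    minutes, seconds = int(parts[0]), int(parts[1])
    return 60 * minutes + seconds


print(parse_mmss("4:32"))  # 272
print(parse_mmss("NULL"))  # None
```

Whether a missing value should become 0 or stay missing depends on the column: zero control time is a plausible reading of a missing `ctrl_time`, but a missing `finish_time` is better left missing.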


We now have fight and fighter statistics merged for every fight. It is important to note that every fight is mirrored: two consecutive rows depict the same fight, with each row containing the in-fight statistics (striking, takedowns, submissions, control etc.) for one of the two fighters. For example, the first two rows have the same fight_id value (1), but their fighter_fight_id values differ (1 & 2). This is because row 1 contains in-fight data for Sean Daugherty, whereas row 2 contains in-fight data for Scott Morris.

The next step was to add the opponent ID and opponent full name for each row. For context, f_1 and f_2 are variables which contain the ID number for both fighters. We assigned the opponent_id value by using the ID (f_1 or f_2) which did not match the fighter_id column. The opponent_fullname was then assigned by matching the opponent_id between df and ufc_fighters, pulling out the fullname for this ID.

In [None]:
# Setting opponent ID and fullname as NA
df$opponent_id <- NA
df$opponent_fullname <- NA

# Detecting rows where fighter ID data is not available
na_or_null <- is.na(df$f_1) | is.null(df$f_1)

# Setting opponent ID, then pulling the full name by matching the ID
df$opponent_id[!na_or_null] <- ifelse(df$f_1[!na_or_null] == df$fighter_id[!na_or_null], df$f_2[!na_or_null], df$f_1[!na_or_null])
df$opponent_fullname <- ufc_fighters$fighter_fullname[match(df$opponent_id, ufc_fighters$fighter_id)]

# Removing some irrelevant columns
df <- df %>%
  select(-c('referee', 'f_1', 'f_2', 'winner.x', 'num_rounds', 'weight_class_rank', 'result_details', 'finish_round',
            'fighter_fight_id', 'fighter_nickname'))
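
The opponent lookup described above can be sketched in a few lines of plain Python. The ids 10 and 20 are made up for illustration, and `opponent_of` is a hypothetical helper. Unlike the cell above, it returns `None` when `fighter_id` matches neither `f_1` nor `f_2`, which is an assumption about how such rows should be treated.

```python
def opponent_of(fighter_id, f_1, f_2):
    """Return whichever of f_1 / f_2 is not fighter_id, or None if that cannot be decided."""
    if f_1 is None or f_2 is None:
        return None
    if fighter_id == f_1:
        return f_2
    if fighter_id == f_2:
        return f_1
    return None


# Two mirrored rows for the same fight (ids are hypothetical)
rows = [
    {"fight_id": 1, "fighter_id": 10, "f_1": 10, "f_2": 20},
    {"fight_id": 1, "fighter_id": 20, "f_1": 10, "f_2": 20},
]
names = {10: "Sean Daugherty", 20: "Scott Morris"}

for row in rows:
    row["opponent_id"] = opponent_of(row["fighter_id"], row["f_1"], row["f_2"])
    row["opponent_fullname"] = names.get(row["opponent_id"])
    print(row["fighter_id"], "->", row["opponent_fullname"])
```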

### Data Engineering

We have now done some of the more tedious work. You may have noticed that these statistics describe fights that have already happened, whereas we are interested in predicting future outcomes. In-fight data will not be available when making a prediction, since the fight will not have happened yet.

This is where the project gets interesting: we need to create our own statistics which will be useful when predicting upcoming fights.

Typical statistics used include striking, takedown, knockdown, submission (sub), reversal, control and finish time data.

These statistics can be calculated in different ways, including success percentage and average. Two functions were created, the first for calculating percentage accuracy, the second for a simple average.

Each fighter's percentage accuracy and average statistics were calculated over their whole career (I believe this gives the best representation of their output at any given time) and then added to a dataframe which can be merged with the base dataframe (df).

In [None]:
# Creating an empty data frame where statistics will be stored for each fighter
my_stats_df <- data.frame(fighter_id = 1:nrow(ufc_fighters))

# Mutating relevant variables into a numeric type
df <- df%>%
  mutate(across(c(knockdowns:ctrl_time), as.numeric))

# Percentage accuracy function
# a = name for the per-fight accuracy column, b = successes column, c = attempts column, d = output column
percentage.stat <- function(a, b, c, d){
  for(i in seq_len(nrow(ufc_fighters))){
    fighter.data <- subset(df, fighter_id == i)
    if(nrow(fighter.data) > 0){
      # Per-fight accuracy (successes / attempts), then the career mean
      fighter.data[, a] <- fighter.data[, b]/fighter.data[, c]
      my_stats_df[i, d] <- mean(fighter.data[, a])
      # A fight with no attempts gives 0/0 = NaN, which makes the mean NaN; this is recorded as 0
      if(is.nan(my_stats_df[i, d])){
        my_stats_df[i, d] <- 0
      }
    }
  }
  return(my_stats_df)
}

# Average function
# a = output column, b = column to average
average.stat <- function(a, b){
  for(i in seq_len(nrow(ufc_fighters))){
    fighter.data <- subset(df, fighter_id == i)
    if(nrow(fighter.data) > 0){
      my_stats_df[i, a] <- mean(fighter.data[, b], na.rm = TRUE)
      # If every value is missing the mean is NaN, which is stored as NA
      if(is.nan(my_stats_df[i, a])){
        my_stats_df[i, a] <- NA
      }
    }
  }
  return(my_stats_df)
}

# Calculating percentage accuracy for strikes, significant strikes and takedowns
my_stats_df <- percentage.stat('acc_strike', 'total_strikes_succ', 'total_strikes_att', 'strike_perc')
my_stats_df <- percentage.stat('acc_sigstrike', 'sig_strikes_succ', 'sig_strikes_att', 'sigstrike_perc')
my_stats_df <- percentage.stat('acc_td', 'takedown_succ', 'takedown_att', 'takedown_perc')

# Calculating average for strikes, significant strikes, knockdowns, takedowns, submission attempts, reversals, control and finish time
my_stats_df <- average.stat('strike_avg', 'total_strikes_succ')
my_stats_df <- average.stat('sig_strike_avg', 'sig_strikes_succ')
my_stats_df <- average.stat('knockdown_avg', 'knockdowns')
my_stats_df <- average.stat('takedown_avg', 'takedown_succ')
my_stats_df <- average.stat('sub_avg', 'submission_att')
my_stats_df <- average.stat('reversal_avg', 'reversals')
my_stats_df <- average.stat('control_avg', 'ctrl_time')
my_stats_df <- average.stat('finish_avg', 'finish_time')

# We then calculate the average strikes and significant strikes per minute
my_stats_df <- my_stats_df %>%
  mutate(strike_per_minute = strike_avg/(finish_avg/60), sig_strike_per_minute = sig_strike_avg/(finish_avg/60))

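# We also calculate win percentage from each fighter's record (wins, losses and draws at the time of scraping)
my_stats_df <- my_stats_df %>%
  mutate(win_perc = ufc_fighters$fighter_w/rowSums(ufc_fighters[, c('fighter_w', 'fighter_l', 'fighter_d')]))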
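
As a worked example of the two kinds of career statistic, here is a small Python sketch using made-up numbers for one fighter. The function names are hypothetical. One deliberate difference from `percentage.stat` above is labelled as an assumption: this sketch skips fights with zero attempts, whereas the R function records 0 for the whole career as soon as one fight has zero attempts (because 0/0 is NaN and propagates through `mean`).

```python
from statistics import mean


def percentage_stat(fights, succ_key, att_key):
    """Career mean of per-fight success ratios; fights with zero attempts are skipped."""
    ratios = [f[succ_key] / f[att_key] for f in fights if f[att_key] > 0]
    return mean(ratios) if ratios else 0.0


def average_stat(fights, key):
    """Career mean of a raw count, ignoring missing (None) values."""
    values = [f[key] for f in fights if f[key] is not None]
    return mean(values) if values else None


# Three hypothetical fights for a single fighter
fights = [
    {"sig_strikes_succ": 20, "sig_strikes_att": 40},
    {"sig_strikes_succ": 10, "sig_strikes_att": 40},
    {"sig_strikes_succ": 0, "sig_strikes_att": 0},
]

sigstrike_perc = percentage_stat(fights, "sig_strikes_succ", "sig_strikes_att")
sig_strike_avg = average_stat(fights, "sig_strikes_succ")
print(sigstrike_perc)  # 0.375
print(sig_strike_avg)  # 10
```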

We will now create a new dataframe such that it contains the opponent's fight statistics, making it easier to merge.

We then merge the three dataframes such that on one row we have fight statistics, in-fight statistics, fighter statistics and opponent statistics.

In [None]:
# Opponent stat dataframe
opp_stats_df <- my_stats_df
for(i in 2:ncol(opp_stats_df)){
  colnames(opp_stats_df)[i] <- paste0('opp_', colnames(my_stats_df[i]))
}

# Joining the dataframes
df <- df %>%
  full_join(my_stats_df, by = 'fighter_id')%>%
  full_join(opp_stats_df, by = c('opponent_id' = 'fighter_id'))

Next, we calculate the stat differentials between each fighter and their opponent.

In [None]:
# Calculating differentials
df <- df %>%
  mutate(strike_diff = strike_avg - opp_strike_avg, sig_strike_diff = sig_strike_avg - opp_sig_strike_avg,
         spm_diff = strike_per_minute - opp_strike_per_minute, sspm_diff = sig_strike_per_minute - opp_sig_strike_per_minute,
         knockdown_diff = knockdown_avg - opp_knockdown_avg, takedown_diff = takedown_avg - opp_takedown_avg,
         sub_diff = sub_avg - opp_sub_avg, control_diff = control_avg - opp_control_avg)

# Removing some columns which are now not useful, Ordering data frame as wanted.
df <- df %>%
  select(-c('fighter_d', 'fighter_l', 'fighter_nc_dq'))%>%
  select(event_id, fight_id, fighter_id, fighter_fullname, fighter_age, fighter_height_cm:fighter_stance, 
         title_fight, weight_class, gender, winner.y, result: finish_time, total_strikes_att:ctrl_time, strike_perc:win_perc,
         fighter_w, opponent_id, opponent_fullname, opp_strike_perc:control_diff)%>%
  mutate(across(fighter_age:fighter_reach_cm, as.numeric))# Changing variable types to numeric

As visible in the table above, there seems to be a lot of missing data. A quick visualisation of where the NA values are located reveals that most of them are in the earlier section of the dataframe. This is because the UFC did not consistently track metrics such as age, reach, height and control time in the earlier stages of the organisation. A decision was made to remove rows with NA values, for two reasons:
* For simplicity - removing NA values will facilitate the use of machine learning models
* Old data - this data is fairly old, dating back to the early 2000s, and thus may not be very useful for predicting fight outcomes some 20 years later.

Removing these rows reduced the data by about 10.8%, which is substantial but makes sense. The ufc_stats dataframe, which contains the in-fight statistics, originally had 14148 rows, whereas our reduced df dataframe contains 12626 rows.

In [None]:
# Visualising Location of NA Values
# The yellow lines at the top represent the missing values from the earliest fights.
# The large yellow block at the bottom corresponds to fighters who have not fought in the UFC, so that data is not useful either.
library(Amelia)
missmap(df, col = c('yellow', 'black'))

# Removing rows with NA values
df <- na.omit(df)

# Replacing NULL values with Unknown, then changing variable types from character to factor
df$fighter_stance <- ifelse(df$fighter_stance == 'NULL', 'Unknown', df$fighter_stance)
df$weight_class <- ifelse(df$weight_class == 'NULL' | is.na(df$weight_class), 'Unknown', df$weight_class)
df <- df %>%
  mutate(weight_class = factor(weight_class, levels = c("Women's Strawweight", "Women's Flyweight", "Women's Bantamweight", 
                                                        "Women's Featherweight", "Flyweight", 'Bantamweight', 'Featherweight', 
                                                        'Lightweight', 'Welterweight', 'Middleweight', 'Light Heavyweight', 
                                                        'Heavyweight', 'Catch Weight', 'Open Weight', 'Unknown')))%>%
  mutate(across(c('fighter_stance', 'title_fight', 'gender', 'winner.y', 'result'), as.factor))

dim(df) # 12626 x 63 
dim(ufc_stats) # 14148 x 15

The next step involved calculating a new set of statistics. First, the original data contains overall records, whereas the UFC record should be used for our models, because the lower level of competition outside the UFC can inflate a fighter's number of victories. Thus, the number of wins within the UFC was calculated for a more accurate representation of each fighter's success within the organisation.

Next, the form of the fighter was calculated, both before and after the fight. Essentially, the last 5 fights were used (or however many are available if the fighter does not have 5 fights in the UFC): a victory scores 1 point, whilst a loss scores -1. Therefore, a fighter who is 3-2 in their last 5 scores 1, whereas a fighter who is 1-4 scores -3. The form after the fight was also calculated because it simplifies things later down the line.

Similarly, the streak before and after the fight was calculated. A streak of 8 wins scores 8, whereas a losing streak of 3 fights scores -3. When a fighter's winning streak is broken, their streak is reset to -1, and when a losing streak is broken, their streak is reset to 1.

(newlast5 and newstreak variables will not be used in models)

In [None]:
# Calculating wins with the UFC
full_df <- data.frame()
for(name in unique(df$fighter_fullname)){
  fighter_df <- df %>%
    filter(fighter_fullname == name)
  for(i in 1:nrow(fighter_df)){
    fighter_df$fighter_w[i] <- sum(fighter_df$winner.y[0:(i-1)] == 'T')
  }
  full_df <- rbind(full_df, fighter_df)
}
df <- full_df # Actual fighter_w

# Calculating form from last 5 fights
full_df <- data.frame()
for(name in unique(df$fighter_fullname)){
  vector <- c()
  fighter_df <- df %>%
    filter(fighter_fullname == name)
  for(i in 1:nrow(fighter_df)){
    if(i <4){
      newlast5 <- sum(fighter_df[0:(i), 'winner.y'] == 'T') - sum(fighter_df[0:(i), 'winner.y'] == 'F')
    }else{
      newlast5 <- sum(fighter_df[(i-4):(i), 'winner.y'] == 'T') - sum(fighter_df[(i-4):(i), 'winner.y'] == 'F')
    }
    vector[i] <- newlast5
  }
  fighter_df$last5 <- append(vector[-length(vector)], 0, 0) # last5 = last 5 fights form before fight
  fighter_df$newlast5 <- vector # newlast5 = updated last 5 fights form after fight
  full_df <- rbind(full_df, fighter_df)
} # last 5 fight form (before and after fight)
df <- full_df

# Calculating the streak
full_df <- data.frame()
for(name in unique(df$fighter_fullname)){
  streak <- 0
  full_streak <- c()
  fighter_df <- df%>%
    filter(fighter_fullname == name)
  for(j in 1:nrow(fighter_df)){
    if(streak == 0){
      if(fighter_df$winner.y[j] == 'T'){
        streak <- 1
      }else{
        streak <- -1
      }
    }else if(streak<0){
      if(fighter_df$winner.y[j] == 'T'){
        streak <- 1
      }else{
        streak <- streak - 1
      }
    }else{
      if(fighter_df$winner.y[j] == 'T'){
        streak <- streak + 1
      }else{
        streak <- -1
      }
    }
    full_streak <- append(full_streak, streak)
  }
  fighter_df$streak <- append(full_streak[-length(full_streak)],0,0) # streak = streak before fight
  fighter_df$newstreak <- full_streak # newstreak = streak after fight
  full_df <- rbind(full_df, fighter_df)
}
df <- full_df

# Making a copy of current df dataframe
full_df <- df

head(df)
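
The form and streak rules above can be checked on a small example. The sketch below is a plain-Python illustration with hypothetical function names, not a translation of the loops above; it assumes the results are already in chronological order and contain only wins and losses.

```python
def form_before_each_fight(results, window=5):
    """Form entering each fight: +1 per win and -1 per loss over the previous `window` fights."""
    form = []
    for i in range(len(results)):
        recent = results[max(0, i - window):i]
        form.append(sum(1 if won else -1 for won in recent))
    return form


def streak_before_each_fight(results):
    """Streak entering each fight: positive for consecutive wins, negative for consecutive losses."""
    streaks, streak = [], 0
    for won in results:
        streaks.append(streak)
        if won:
            streak = streak + 1 if streak > 0 else 1
        else:
            streak = streak - 1 if streak < 0 else -1
    return streaks


# True = win, False = loss, oldest fight first
results = [True, True, False, True, True, False]
print(form_before_each_fight(results))    # [0, 1, 2, 1, 2, 3]
print(streak_before_each_fight(results))  # [0, 1, 2, -1, 1, 2]
```

Both lists start at 0 because a debuting fighter has no UFC history, which matches the 0 prepended to `last5` and `streak` in the cell above.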

We are done with feature engineering: we have calculated a range of interesting statistics and converted them to the appropriate types.

### Final Data Cleaning

There is still an issue with the df dataset: most fights appear twice, since we have one row of statistics for each fighter in the same fight. This risks heavily correlated error terms, since if one fighter wins, the other must lose. We also would not be able to calculate true accuracy and precision.

In this section, the dataset was reduced so that only one side of each fight is included. Particular care was taken to keep an equal number of winner.y = TRUE (won the fight) and winner.y = FALSE (lost the fight) rows, so that the models are not biased towards predicting either outcome.

In [None]:
# For replicating results
set.seed(5550)

# Finding fights which have stats for both fighters (some only have stats for 1)
duplicated_fights <- duplicated(df$fight_id) | duplicated(df$fight_id, fromLast = T)

# Splitting fights with a single row of stats from fights with a row for each fighter
df.unique.fights <- df[!duplicated_fights,] # Not duplicated: automatically used in our final dataframe
df.duplicated.fights <- df[duplicated_fights,] # Duplicated: one row per fight is sampled below

# Getting random sample from remaining fights, ensuring TRUE and FALSE are equally spread
sample <- sample(unique(df.duplicated.fights$fight_id), ((length(unique(df$fight_id))/2)-sum(df.unique.fights$winner.y == 'T')), replace = FALSE)
df.duplicated.winners <- df.duplicated.fights %>%
  filter(winner.y == 'T')%>%
  subset(fight_id%in%sample)
df.duplicated.losers <- df.duplicated.fights %>%
  filter(winner.y == 'F')%>%
  subset(!fight_id%in%sample)
df <- rbind(df.unique.fights,df.duplicated.winners, df.duplicated.losers)%>%
  arrange(fight_id)

# Checking we have an equal spread
dim(df) # 6707 x 67 (6707 unique fights)
sum(df$winner.y == 'T') # 3353 of those fights contain stats on winning fighter stats

# Removing some final columns
df <- df%>%
  select(-event_id, -(total_strikes_att:ctrl_time), -finish_time, -newlast5, -newstreak)

head(df)
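
The balancing step can be illustrated with a short Python sketch. It follows the same idea as the cell above (keep single-row fights as they are, then keep the winner's row for a random subset of the mirrored fights and the loser's row for the rest), but the function name, the `won` field and the toy data are assumptions made for illustration. It also assumes every mirrored fight has exactly one winning and one losing row, and the seed only makes the sketch repeatable; it does not reproduce R's `sample`.

```python
import random


def one_row_per_fight(rows, seed=5550):
    """Keep one row per fight so that roughly half of the kept rows are winners."""
    by_fight = {}
    for row in rows:
        by_fight.setdefault(row["fight_id"], []).append(row)

    singles = [rs[0] for rs in by_fight.values() if len(rs) == 1]
    paired_ids = sorted(fid for fid, rs in by_fight.items() if len(rs) == 2)

    # Number of mirrored fights for which the winner's row is kept
    target = len(by_fight) // 2 - sum(r["won"] for r in singles)
    target = max(0, min(target, len(paired_ids)))
    keep_winner = set(random.Random(seed).sample(paired_ids, target))

    kept = list(singles)
    for fid in paired_ids:
        want_winner = fid in keep_winner
        kept.extend(r for r in by_fight[fid] if r["won"] == want_winner)
    return sorted(kept, key=lambda r: r["fight_id"])


# Ten hypothetical mirrored fights: one winning row and one losing row each
rows = [{"fight_id": fid, "won": won} for fid in range(1, 11) for won in (True, False)]
kept = one_row_per_fight(rows)
print(len(kept), sum(r["won"] for r in kept))  # 10 5
```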


The dataset is ready to be used.

## Data Visualisation

Some data visualisation work was carried out, partly to understand which variables might have an impact on winning, and partly just for fun and practice.

In [None]:
# Visualising if significant strike output impacts winning
ggplot(df, aes(x = winner.y, y = sspm_diff, fill = winner.y)) + geom_boxplot() + theme_economist_white() +
  labs(x = 'Victory?', y = 'Difference in Significant Strikes per Minute', title = 'Difference in Striking Output') +
  scale_fill_discrete(guide = guide_legend('Winner?')) +
  theme(legend.position = 'top') # legend.position is a theme() setting, not a guide_legend() argument

In [None]:
# Finding if there are any discrepancies between males and females in fighting stance. This also allows us to visualise if the stance impacts winning.
# The fill colour shows the fight outcome (winner.y); gender is shown by the facets.
df %>%
  filter(fighter_stance != 'Open Stance' & fighter_stance != 'Unknown')%>%
  ggplot(aes(x = fighter_stance, fill = winner.y)) + geom_bar(col = 'black', position = 'fill') + scale_fill_discrete(guide = guide_legend('Winner?', title.position = 'bottom',label.position = 'top')) + theme(legend.position = 'top') + facet_grid(.~gender)

In [None]:
# Plotting average age by weightclass
myplot <- df%>%
  group_by(weight_class)%>%
  mutate(mean_age = mean(fighter_age))%>%
  ungroup()%>%
  mutate(overall.mean.age = mean(fighter_age))%>%
  filter(weight_class != 'Open Weight' & weight_class != 'Unknown')%>%
  ggplot(aes(x = fighter_age, y = weight_class)) + geom_jitter(aes(col = weight_class), alpha = 0.5) + 
  geom_vline(xintercept = mean(df$fighter_age),lty = 2, lwd = 1) + 
  geom_segment(aes(x = mean_age, xend = overall.mean.age, y = weight_class, yend = weight_class)) + # segment from each class mean to the overall mean
  stat_summary(geom = 'point', size = 6, fun = mean, col = 'black')
myplot # There aren't any particular differences, maybe fighters get older in heavier weight classes? Looks colourful though

## Machine Learning

### Prepping Training and Test Sets

When carrying out Machine Learning, it is important to separate the whole dataset into a training portion and a test portion. For this project, a 70/30 ratio was used, guaranteeing that enough data was available for training whilst having a large enough sample for testing.

In [None]:
# Removing irrelevant columns
df <- df%>%
  select(-c(fight_id:fighter_fullname, result, opponent_id, opponent_fullname)) # Final selection of variables
df_save <- df # Making a copy, useful later

# Training and Test Data
set.seed(123)
split <- sample.split(df$winner.y, SplitRatio = 0.7)
training <- subset(df, split == TRUE)
test <- subset(df, split == FALSE)
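
`sample.split` from caTools keeps the ratio of labels roughly the same in both parts, i.e. it performs a stratified split. As a rough illustration of that idea (not of the caTools implementation), here is a minimal Python sketch; the function name and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, train_ratio=0.7, seed=123):
    """Return a list of booleans (True = training row), sampling each label separately."""
    rng = random.Random(seed)
    in_train = [False] * len(labels)
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        n_train = round(train_ratio * len(idx))
        for i in idx[:n_train]:
            in_train[i] = True
    return in_train


# 50 wins and 50 losses
labels = ["T"] * 50 + ["F"] * 50
mask = stratified_split(labels)

train_labels = [y for y, m in zip(labels, mask) if m]
test_labels = [y for y, m in zip(labels, mask) if not m]
print(len(train_labels), train_labels.count("T"))  # 70 35
print(len(test_labels), test_labels.count("T"))    # 30 15
```

Because the dataset was balanced between wins and losses beforehand, a stratified split keeps both the training and the test set balanced as well.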

### Forest Method

Tree methods are useful, but somewhat limited when intricate trends need to be identified. A Random Forest is a good alternative, as it combines multiple trees, each trained on a different sample of the training dataset and considering only a subset of the variables at each split.

The number of wins and control differential were given particular importance as, intuitively, they should reflect the fighter's probability of winning fairly well.

In [None]:
# Model
set.seed(101)
# I(x^2) adds a squared term; written as I(x)^2, the formula term reduces to x itself
forest <- randomForest(data = training, winner.y ~ . + I(fighter_w^2) + I(control_diff^2) - fighter_stance - fighter_age -
                         fighter_height_cm - finish_avg, importance = TRUE)
forest$importance # Insight into the most influential predictors (higher MeanDecreaseGini = more impact)
forest.pred <- predict(forest, test)
forest.table <- table(forest.pred, test$winner.y) # rows = predicted, columns = actual
forest.table
confidence.forest <- mean(forest.pred == test$winner.y) # accuracy
precision.forest <- forest.table[2,2]/sum(forest.table[2,]) # true positives / all predicted positives
confidence.forest # 70%
precision.forest # 71%
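
The two metrics reported for each model are accuracy (called "confidence" in the code) and precision. The sketch below shows how both are derived from predictions, using five made-up predictions rather than real model output; the function name is hypothetical.

```python
def accuracy_and_precision(predicted, actual, positive="T"):
    """Accuracy = share of correct predictions; precision = true positives / predicted positives."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    false_pos = sum(p == positive and a != positive for p, a in pairs)
    accuracy = sum(p == a for p, a in pairs) / len(pairs)
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else float("nan")
    return accuracy, precision


predicted = ["T", "T", "F", "F", "T"]
actual = ["T", "F", "F", "T", "T"]
accuracy, precision = accuracy_and_precision(predicted, actual)
print(accuracy)   # 0.6
print(precision)  # 2 of the 3 predicted wins were actual wins
```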

### Support Vector Machine

Support Vector Machine models aim to separate the classes with a decision boundary, which can be linear, radial or polynomial depending on the kernel. The boundary is determined by the support vectors: the training points which lie closest to it (on or inside the margin). These points are identified by the model during fitting rather than chosen by the user.

In large datasets with many predictors, it is very often impossible to completely separate the classes with a boundary, so the model allows misclassifications. This means that some of the data points are purposely allowed to fall on the wrong side of the decision boundary in order to benefit the model as a whole. The amount of misclassification tolerated is controlled by the 'cost': a low cost allows a lot of misclassification.

To decide on the decision boundary, the model follows this process. Since misclassifications are allowed, many decision lines/curves could separate the data into two categories (in our case, TRUE and FALSE) based on the predictors. For each candidate, the margin is the distance between the decision line/curve and the nearest points, the support vectors. The model chooses the decision line/curve which maximizes this margin, subject to the penalty on misclassified points.

By decreasing the cost (allowing more misclassifications), we make the model more robust to changes in the data, but we also increase its bias. Similarly, by increasing the cost, which reduces the number of misclassifications, the model becomes more susceptible to changes in the data but its bias is lower. You must strike a good balance between the two.

In this model, a radial kernel (decision boundary) was utilised due to the complexity of the dataset.
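
For readers who prefer a formula, the trade-off between margin and misclassification is usually written as the soft-margin objective below. The notation is introduced here for illustration and does not appear elsewhere in the notebook: $x_i$ are the predictors of fight $i$, $y_i \in \{-1, +1\}$ encodes lost/won, $w$ and $b$ define the boundary, $\xi_i$ measures how far point $i$ violates the margin, and $C$ is the cost.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\left(w^\top\phi(x_i) + b\right) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
$$

With a radial kernel, the feature map $\phi$ is never computed explicitly; similarities between fights are measured through

$$
K(x, x') = \exp\left(-\gamma\,\lVert x - x'\rVert^2\right),
$$

where $\gamma > 0$ controls how quickly similarity decays with distance. A small $C$ makes margin violations cheap (wider margin, more bias), while a large $C$ penalises them heavily (narrower margin, more variance).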

In [None]:
# Model
set.seed(123)
# I(x^2) adds a squared term; written as I(x)^2, the formula term reduces to x itself
mysvm <- svm(data = training, winner.y ~ . + I(fighter_w^2) + I(control_diff^2) - fighter_stance - fighter_age -
               fighter_height_cm, kernel = 'radial')
svm.pred <- predict(mysvm, test)
table(svm.pred, test$winner.y)
confidence.svm <- mean(svm.pred == test$winner.y)
precision.svm <-  table(svm.pred, test$winner.y)[2,2]/(rowSums(table(svm.pred, test$winner.y))[2])
confidence.svm # 71%
precision.svm # 72%

### Logistic Regression

Logistic Regression is one of the simpler classification models. It works in the following way:
* Fitting a curve - since we only have two outcomes, the model adjusts the predictor coefficients so that a linear combination of the predictors, passed through the inverse of the logit link function, produces an S-shaped curve of probabilities which matches the TRUE and FALSE points as closely as possible.
* Maximizing the likelihood function - the likelihood function (LF) expresses the probability of observing the training outcomes if the curve assumed in the previous step is true. Thus, by maximizing the LF, we find the predictor coefficients which are most likely to have produced the training data.
* Classifying test data - once we have our function, the test data can be classified by calculating the probability of each data point being in the class of interest. If this probability is above a threshold (usually 0.5/50%, but it can be chosen), the data point is classified as the class of interest.

In this section, we investigate this model and how the decision threshold affects the accuracy/precision of the model.

In [None]:
# Model
set.seed(123)
# I(x^2) adds a squared term; written as I(x)^2, the formula term reduces to x itself
logmodel <- glm(data = training, winner.y ~ . + I(fighter_w^2) + I(control_diff^2), family = binomial(link = logit))
logmodel <- step(logmodel)

# Investigating impact of decision threshold on accuracy/confidence
thresholds <- seq(0.4, 0.6, by = 0.01)
log.prob <- predict(logmodel, test, type = 'response') # predicted probability of winning
confidence.log <- c()
for(i in thresholds){
  log.pred <- ifelse(log.prob > i, 'T', 'F')
  confidence.log <- append(confidence.log, mean(log.pred == test$winner.y))
}
plot(thresholds, confidence.log)
lines(thresholds, confidence.log) # max accuracy at i = 0.51

# Setting decision threshold at 0.51 (peak accuracy)
log.pred <- predict(logmodel, test, type = 'response')
log.pred <- ifelse(log.pred > 0.51, 'T', 'F')
confidence.log <- mean(log.pred == test$winner.y)
precision.log <- table(log.pred, test$winner.y)[2,2]/(rowSums(table(log.pred, test$winner.y))[2])
confidence.log # 70%
precision.log # 71%
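
The classification steps listed above (turn a linear score into a probability, then compare it with a threshold) can be sketched in plain Python. The probabilities and outcomes below are made up, and the function names are hypothetical; this only illustrates the threshold sweep, not the fitted `logmodel`.

```python
import math


def sigmoid(z):
    """Inverse of the logit link: maps a linear score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probabilities, threshold=0.5):
    """Label a point 'T' when its probability is above the threshold, otherwise 'F'."""
    return ["T" if p > threshold else "F" for p in probabilities]


def accuracy_by_threshold(probabilities, actual, thresholds):
    """Accuracy obtained at each candidate threshold."""
    scores = {}
    for t in thresholds:
        predicted = classify(probabilities, t)
        scores[t] = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
    return scores


print(sigmoid(0.0))  # 0.5: log-odds of zero correspond to an even chance

probabilities = [0.2, 0.45, 0.55, 0.9]
actual = ["F", "T", "T", "T"]
scores = accuracy_by_threshold(probabilities, actual, [0.4, 0.5, 0.6])
print(scores)
```

One caveat worth keeping in mind: a threshold picked by maximising accuracy on the test set is tuned to that test set, so the reported accuracy may be slightly optimistic. Choosing the threshold on a separate validation split would avoid this.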

### Neural Network

A Neural Network is a deep learning method which is complicated, and I do not have enough knowledge to explain it accurately. In short, it uses a number of nodes and adjusts the weights attached to the predictors based on the accuracy of its previous predictions.

In Neural Networks, it is important to scale your data so that no predictor is initially treated as more important than another simply because of its units.

In [None]:
# Scaling training and testing dataframes
scaled.df <- df%>%
  select(-c(fighter_stance:gender, winner.y))%>% # removing factors
  scale()%>%
  data.frame()%>%
  mutate(winner.y = df$winner.y)

set.seed(123)
split <- sample.split(scaled.df$winner.y, SplitRatio = 0.7)
scaled.training <- subset(scaled.df, split == TRUE)
scaled.test <- subset(scaled.df, split == FALSE)

# Model
n <- names(scaled.training)
f <- as.formula(paste("winner.y ~", paste(n[!n %in% "winner.y"], collapse = " + ")))
nn <- neuralnet(data = scaled.training, f, linear.output = FALSE)
nn.pred <- predict(nn, scaled.test[,-ncol(scaled.test)])
nn.pred <- ifelse(nn.pred[,2] >0.5, 'T', 'F')
table(nn.pred, scaled.test$winner.y)
confidence.nn <- mean(nn.pred == scaled.test$winner.y)
precision.nn <- table(nn.pred, scaled.test$winner.y)[2,2]/(rowSums(table(nn.pred, scaled.test$winner.y))[2])
confidence.nn # 70%
precision.nn # 71%
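
Scaling means subtracting a column's mean and dividing by its standard deviation. The cell above scales the whole dataframe before splitting; a common alternative, sketched below as an assumption rather than as what the notebook does, is to compute the mean and standard deviation on the training rows only and reuse them for the test rows. The function names and the heights are made up for illustration.

```python
from statistics import mean, stdev


def fit_scaler(column):
    """Mean and sample standard deviation of a training column (the same convention as R's scale())."""
    return mean(column), stdev(column)


def apply_scaler(column, centre, spread):
    """Standardise a column using a previously fitted mean and standard deviation."""
    return [(x - centre) / spread for x in column]


train_height = [170.0, 180.0, 190.0]
test_height = [185.0]

centre, spread = fit_scaler(train_height)
print(apply_scaler(train_height, centre, spread))  # [-1.0, 0.0, 1.0]
print(apply_scaler(test_height, centre, spread))   # [0.5]
```

Fitting the scaler on the training rows keeps information about the test rows out of the model, at the cost of test values that are no longer exactly centred on zero.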


### Final Function And Prediction Times

The models have now been built and we are ready to predict fight outcomes. All you need to input are the fighter names, TRUE/FALSE (whether it is a title fight or not), M/F (for gender), the weight class and the model of choice.

The options for the model are:
* mysvm - Support Vector Machine (type in mysvm as final argument in function)
* forest - Random Forest
* logmodel - Logistic Regression
* nn - Neural Network

Please ensure to correctly write the fighter's names and weightclass (if you go back up, the correct spelling will be available where the weight class predictor was converted from character to factor type).

Also bear in mind that some fighters are not in the original database, since they had not made their debut yet. You also have to be careful when choosing a fighter whose name may be replaced with their nickname (Mike Mathetha appears as Blood Diamond in the dataset) or who has multiple names within their name, as the original dataset may only include part of the whole name.

In [None]:
# Creating the Function
my.function <- function(a, b, title, gender, category, model){
  fighter_names <- unique(full_df$fighter_fullname)
  get_fighter1_data <- function(name) {
    fighter_data <- full_df %>%
      filter(fighter_fullname == name) %>%
      slice_max(fight_id)
    return(fighter_data)
  }
  get_fighter2_data <- function(name){
    fighter_data <- full_df %>%
      filter(opponent_fullname == name) %>%
      slice_max(fight_id)
    return(fighter_data)
  }
  
  if(a%in%fighter_names & b%in%fighter_names){
    get_prediction <- function(a,b){
      fighter1 <- get_fighter1_data(a)%>%
        select(fighter_age:fighter_stance, strike_perc:fighter_w, newstreak, newlast5)%>%
        rename(last5 = newlast5, streak = newstreak)
      fighter2 <- get_fighter2_data(b)%>%
        select(opp_strike_perc:opp_win_perc)
      
      df <- cbind(fighter1, fighter2)
      df <- df %>%
        mutate(strike_diff = strike_avg - opp_strike_avg, sig_strike_diff = sig_strike_avg - opp_sig_strike_avg,
               spm_diff = strike_per_minute - opp_strike_per_minute, sspm_diff = sig_strike_per_minute - opp_sig_strike_per_minute,
               knockdown_diff = knockdown_avg - opp_knockdown_avg, takedown_diff = takedown_avg - opp_takedown_avg,
               sub_diff = sub_avg - opp_sub_avg, control_diff = control_avg - opp_control_avg)
      df$title_fight <- factor(title, levels = c('FALSE', 'TRUE'))
      df$gender <- factor(gender, levels = c('F', 'M'))
      df$weight_class <- factor(category, levels = c("Women's Strawweight", "Women's Flyweight", "Women's Bantamweight", 
                                                     "Women's Featherweight", "Flyweight", 'Bantamweight', 'Featherweight', 
                                                     'Lightweight', 'Welterweight', 'Middleweight', 'Light Heavyweight', 
                                                     'Heavyweight', 'Catch Weight', 'Open Weight', 'Unknown'))
      
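      if(identical(model, nn)){
        # The network takes scaled numeric inputs: append the new row to the
        # saved data, drop the factor columns, rescale, then keep the new row
        df <- df_save%>%
          select(-winner.y)%>%
          rbind(df)%>%
          select(-c(fighter_stance:gender))%>% # removing factors
          scale()%>%
          data.frame()%>%
          tail(1)
        prediction <- predict(nn, df)
        # Label the fight 'T' when the second output column exceeds 0.5
        prediction <- ifelse(prediction[,2] > 0.5, 'T', 'F')
      }else{
        prediction <- predict(model, df)
        if(identical(model, logmodel)){
          # Logistic regression returns a number, so apply the 0.52 cut-off
          prediction <- ifelse(prediction > 0.52, 'T', 'F')
        }
      }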
      return(prediction)
    }
    
    prediction <- get_prediction(a,b)
    prediction.bis <- get_prediction(b,a)
    
    models <- list(
      mysvm = c(confidence.svm, precision.svm, 'Support Vector Machine Model'),
      forest = c(confidence.forest, precision.forest, 'Random Forest Model'),
      logmodel =c(confidence.log, precision.log, 'Logistic Regression Model'),
      nn = c(confidence.nn, precision.nn, 'Neural Network Model')
    )
    model <- deparse(substitute(model))
    confidence <- as.numeric(models[[model]][1])
    precision <- as.numeric(models[[model]][2])
    model <- models[[model]][3]
    
    if(prediction == prediction.bis){
      if(prediction == 'T'){ # both orderings predict a win for the first-named fighter
        print("Sorry, I can't accurately predict this fight, I feel like both should win ;)")
      }else{
        print("Sorry, I can't accurately predict this fight, I feel like both will lose :(")
      }
    }else if(prediction == 'T'){
      paste0('I predict ', a,  ' wins this fight. This has been predicted using a ', model, ' with ', round(confidence*100), '% confidence and ', round(precision*100), '% precision')
    }else{
      paste0('I predict ', b,  ' wins this fight. This has been predicted using a ', model, ' with ', round(confidence*100), '% confidence and ', round(precision*100), '% precision')
    }
  }else{
    if(!a %in% fighter_names && !b %in% fighter_names){
      cat('Sorry, I do not have data on either fighter, did you spell them correctly?')
    }else if(a %in% fighter_names){
      cat('Sorry, I do not have data on', b, 'unfortunately, have you spelt it correctly?')
    }else{
      cat('Sorry, I do not have data on', a, 'unfortunately, have you spelt it correctly?')
    }
  }
}
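
The neural network branch of the function rescales a new fight by appending it to `df_save` and calling `scale()` on the whole table. A possible alternative, sketched here with toy data, is to compute the centring and scaling values once on the training data and reuse them for every new row. The data frames `train` and `new_row` and all their values are invented for illustration, and the sketch assumes the network was trained on data scaled in the same way.

```r
# Toy training data standing in for the numeric columns of df_save
train <- data.frame(strike_avg = c(40, 55, 70), takedown_avg = c(1, 2, 3))

# Scale the training data once and keep the parameters that were used
scaled_train <- scale(train)
centre <- attr(scaled_train, "scaled:center")
spread <- attr(scaled_train, "scaled:scale")

# Apply the stored parameters to a new, unseen row
new_row <- data.frame(strike_avg = 55, takedown_avg = 3)
scaled_new <- scale(new_row, center = centre, scale = spread)
scaled_new
```

With this approach the scaled value of a new fight does not depend on the new row itself, and the full table does not need to be rescaled for each prediction.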

Testing the function on various fictitious fights

In [None]:
my.function("Sean O'Malley", "Aljamain Sterling", 'TRUE', 'M', "Bantamweight", mysvm)
my.function("Jon Jones", "Francis Ngannou", 'TRUE', 'M', "Heavyweight", forest) 
my.function("Paddy Pimblett", "Charles Oliveira", 'FALSE', 'M', "Lightweight", logmodel)
my.function("Erin Blanchfield", "Alexa Grasso", 'TRUE', 'F', "Women's Flyweight", nn)
# This fight is not in the original dataset

## Conclusions

This is the end of the project. It has been very fun to make and share.

For a first project, I think I have been able to dip my toes into a range of different aspects of coding and Machine Learning. Plenty of work remains to be done on optimising my code and the models, but I can't wait to see the progression.

To be honest, I doubt anybody will see this project, but if you do, please give me some feedback. I would really appreciate it.

One main drawback of this project is that the dataset does not include a metric for the level of competition a fighter has faced. This may explain the Paddy Pimblett vs Charles Oliveira prediction. In short, Paddy Pimblett has fought low-level fighters, so his stats are inflated, whereas Charles Oliveira has faced the best, so his stats will not look as good. To counter this, the total wins predictor (winner.y) was given higher importance, since it reflects the longevity and quality of the fighter, but it will not always be enough to tip the balance.
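
As a rough illustration of what such a metric could look like, the sketch below averages the win percentage of each fighter's past opponents. It is only a sketch under assumptions: the `fights` table and its values are invented, `avg_opp_win_perc` is a hypothetical feature name, and this feature was not built or tested in the project.

```r
# Toy fight history: one row per fighter per fight (all values invented)
fights <- data.frame(
  fighter      = c("A", "A", "B", "B"),
  opp_win_perc = c(0.40, 0.50, 0.80, 0.90)
)

# Mean opponent win percentage per fighter, a crude proxy for competition level
schedule <- aggregate(opp_win_perc ~ fighter, data = fights, FUN = mean)
names(schedule)[2] <- "avg_opp_win_perc"
schedule
```

A fighter with a higher `avg_opp_win_perc` has, on average, faced opponents with better records, which could help put inflated stats into context.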