In [11]:
import math

import matplotlib.pyplot as plt
import openpyxl  # Excel engine used by pd.read_excel
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split



# Background

One of the most pervasive controversies in fighting is weight cutting. Fighters are divided into weight classes for the sake of fairness, and each fighter chooses which class to compete in. However, fighters have realized they can gain a massive advantage through water cutting, the process of temporarily losing weight through sweat, to make weight classes that would otherwise be impossible for their body composition. Fighters have been known to regain 20 lbs or more between weigh-in and fight night, with the highest ever gain from Geoff Neal, who gained 30.3 pounds at UFC 298 (https://www.espn.com/mma/story/_/id/39610394/seven-ufc-298-fighters-flagged-rehydration-issue).

Since weigh-ins are held the day before fights, fighters can recover the lost water and return to their "real" weight by fight night. The advantage of weight cutting, especially against an opponent who cuts less, is drastic. Fighters who weigh more hit harder, can more easily smother their opponents in wrestling, and can have a significant reach advantage.

Because of this significant advantage from weight cutting, fighting has increasingly become a game of who can cut the most weight. On one hand, this phenomenon hurts the sport by making fighting skill less important to the overall equation. Some fighters may be less skilled, but their bodies are naturally adapted to rapidly losing and regaining water weight. Ultimately, this hurts spectators when fights become "weight-bullying" contests rather than tests of skill and ability. Furthermore, exciting fights can get cancelled when fighters miss weight after attempting too large a cut, or when they have to pull out due to health complications from bad water cuts (https://www.mmamania.com/2022/9/9/23345292/ufc-279-dana-white-reveals-khamzat-chimaevs-weight-cut-ended-after-locking-and-cramping). Even more concerning is the danger that comes with weight cutting. Fighters have not only suffered serious health complications from bad weight cuts, but have even died from them (https://www.espn.com/mma/story/_/id/14344041/chinese-mma-fighter-yang-jian-bing-dies-trying-make-weight).

To combat this issue, organizations such as ONE Championship, a rival organization to the UFC, have implemented measures such as measuring hydration levels post weigh-in to ensure fighters are not cutting too much water weight. However, the UFC, the biggest and most popular MMA organization in the world, currently has no such measures.

Given how pervasive weight cutting has become, it is important to investigate how much it actually matters to fight outcomes. In this study, I intend to investigate the impact weight cutting has on the odds of winning fights.


# Explanation of the Data and the Data Sources

- Variables: lo

- Weight regain percentage: Done

# Hypothesis

In [2]:
stats_path = '/Users/caseymoser/Desktop/UFC Analysis/UFC/ufc_fight_stats.csv'

# data source: https://www.reddit.com/r/MMA/comments/evbnjd/released_offical_ufc_fight_night_weights/
weight_path = '/Users/caseymoser/Desktop/UFC Analysis/UFC/UFC Fight Night Weights.xlsx'

results_path = '/Users/caseymoser/Desktop/UFC Analysis/UFC/ufc_fight_results.csv'

stats_df = pd.read_csv(stats_path)

weight_df = pd.read_excel(weight_path)

results_df = pd.read_csv(results_path)


In [3]:
# Step 1: Split fighters into separate columns
results_df[['Fighter_1', 'Fighter_2']] = results_df['BOUT'].str.split(' vs. ', expand=True)

fighter1_df = results_df.copy()
fighter1_df['FIGHTER'] = fighter1_df['Fighter_1']
fighter1_df['RESULT'] = fighter1_df['OUTCOME'].str[0].map({'W': 'Win', 'L': 'Loss'})
fighter1_df = fighter1_df.drop(columns=['Fighter_1','Fighter_2'])


fighter2_df = results_df.copy()
fighter2_df['FIGHTER'] = fighter2_df['Fighter_2']
fighter2_df['RESULT'] = fighter2_df['OUTCOME'].str[2].map({'W': 'Win', 'L': 'Loss'})
fighter2_df = fighter2_df.drop(columns=['Fighter_1','Fighter_2'])


# Note to self: clean this code so that I first drop Fighter_1 and Fighter_2, then combine the data sets together.

# Combine both into a single dataframe
results_df_clean = pd.concat([fighter1_df, fighter2_df])

results_df_clean['UFC_EVENT'] = results_df_clean['EVENT'].str.extract(r'(UFC \d+)', expand=False)





In [4]:
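# Extract the numbered event key (e.g. "UFC 298") used to join the two tables.
# Event names with no "UFC <number>" pattern get NaN here.
weight_df['UFC_EVENT'] = weight_df['EVENT'].str.extract(r'(UFC \d+)', expand=False)

# Normalize the join keys so spacing and capitalization differences do not block matches
weight_df['FIGHTER'] = weight_df['FIGHTER'].str.strip().str.lower()
weight_df['UFC_EVENT'] = weight_df['UFC_EVENT'].str.strip().str.upper()

results_df_clean['FIGHTER'] = results_df_clean['FIGHTER'].str.strip().str.lower()
results_df_clean['UFC_EVENT'] = results_df_clean['UFC_EVENT'].str.strip().str.upper()

# Left join: keep every fighter-bout row and attach weights where available.
# Note: pandas matches NaN join keys to each other, so rows with a missing
# UFC_EVENT in both tables can pair up on FIGHTER alone.
merged_weight = pd.merge(
    results_df_clean,
    weight_df,
    on=['FIGHTER', 'UFC_EVENT'],
    how='left',
    suffixes=('_result', '_weight')
)

# Keep only rows that have a recorded weigh-in weight
merged_weight_clean = merged_weight.dropna(subset=['WEIGH IN WEIGHT (lbs)'])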



In [5]:
#df.groupby('FIGHTER')['SIG.STR. %'].mean()

# Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
win_weight_df = merged_weight_clean.copy()

# Encode the outcome as a binary target: 1 = win, 0 = loss (any other outcome becomes NaN)
win_weight_df['RESULT'] = win_weight_df['RESULT'].map({'Win': 1, 'Loss': 0})

# Drop rows with no usable outcome or no recorded weight increase
win_weight_df_clean = win_weight_df.dropna(subset=['RESULT', 'WEIGHT INCREASE (lbs)']).copy()

# Percent regain relative to weigh-in weight (heavyweights are more likely to just have
# higher raw numbers because they have more fat than lower weight classes)
win_weight_df_clean['PERCENT_REGAIN'] = (
    win_weight_df_clean['WEIGHT INCREASE (lbs)'] / win_weight_df_clean['WEIGH IN WEIGHT (lbs)'] * 100
)


### Choice of Regression

Given that the dependent variable is binary (win or loss), I have chosen to fit a logistic model to predict the log-odds of winning fights based on percentage weight regained.
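
For reference, the model can be written out as follows. The symbols are my own notation rather than anything from the data: $p_i$ is the probability that fighter-bout observation $i$ is a win, and $\beta_0$ and $\beta_1$ are the intercept and slope.

$$
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 \,\text{PERCENT\_REGAIN}_i
\qquad\Longleftrightarrow\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \,\text{PERCENT\_REGAIN}_i)}}
$$

Exponentiating $\beta_1$ gives the factor by which the odds $p_i / (1 - p_i)$ are multiplied for each additional percentage point of weight regained.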

In [12]:
#Logit Model

X = win_weight_df_clean[['PERCENT_REGAIN']]


y = win_weight_df_clean['RESULT']

# Add a constant (intercept) to the independent variable
X = sm.add_constant(X)

# Fit the logit model by maximum likelihood
model = sm.Logit(y, X)
result = model.fit()
print(result.summary())


# Odds ratio implied by the PERCENT_REGAIN coefficient (0.0432)
math.exp(0.0432)

Optimization terminated successfully.
         Current function value: 0.684487
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:                 RESULT   No. Observations:                  437
Model:                          Logit   Df Residuals:                      435
Method:                           MLE   Df Model:                            1
Date:                Tue, 29 Jul 2025   Pseudo R-squ.:                0.004868
Time:                        12:29:07   Log-Likelihood:                -299.12
converged:                       True   LL-Null:                       -300.58
Covariance Type:            nonrobust   LLR p-value:                   0.08714
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -0.2334      0.275     -0.849      0.396      -0.772       0.306
PERCENT_REGAIN   

1.0441467033097327


In this first logit model, the coefficient on PERCENT_REGAIN is 0.0432, so each additional percentage point of weight regained multiplies the odds of winning by exp(0.0432) ≈ 1.044. In other words, a 1 percentage point increase in weight regained after the weigh-in is linked to a 4.4% increase in the odds of winning a fight.
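
As a rough illustration of how to read that coefficient, the sketch below converts it into an odds multiplier and into predicted win probabilities. It uses the rounded intercept (-0.2334) and slope (0.0432) from the output and text above; the helper names `odds_ratio` and `win_probability` are hypothetical and not part of the analysis code.

```python
import math

# Rounded estimates copied from the summary output and text above
INTERCEPT = -0.2334
SLOPE = 0.0432  # coefficient on PERCENT_REGAIN


def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit increase in the predictor."""
    return math.exp(coef * delta)


def win_probability(percent_regain, intercept=INTERCEPT, slope=SLOPE):
    """Predicted win probability implied by the fitted log-odds line."""
    log_odds = intercept + slope * percent_regain
    return 1.0 / (1.0 + math.exp(-log_odds))


print(odds_ratio(SLOPE))           # odds multiplier per 1 percentage point
print(odds_ratio(SLOPE, delta=5))  # odds multiplier per 5 percentage points
for pct in (0, 5, 10):
    print(pct, round(win_probability(pct), 3))
```

Note that a 4.4% increase in odds is not a 4.4 percentage point increase in win probability. Under these rounded coefficients, moving from 5% to 10% regain raises the predicted probability by roughly five percentage points.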

This model is not statistically significant at alpha = 0.05, but it is significant at alpha = 0.10 (LLR p-value = 0.087). Because PERCENT_REGAIN is the only predictor, this means its association with winning is statistically significant at the 90% confidence level, but not at the 95% level. However, the pseudo R$^{2}$ value is only 0.005, meaning the model improves on the intercept-only log-likelihood by less than 1%. In this way, this model has insufficient predictive power to be useful for fight prediction.
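
The two fit statistics discussed here can be recomputed by hand from the log-likelihoods in the summary. The sketch below assumes the reported pseudo R$^{2}$ is McFadden's, which is what statsmodels prints for `Logit`. The function names are hypothetical, and the inputs are the rounded values from the output, so the results agree with the summary only approximately.

```python
import math

# Log-likelihoods copied from the summary output above (rounded to two decimals)
LL_MODEL = -299.12  # Log-Likelihood
LL_NULL = -300.58   # LL-Null (intercept-only model)


def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's pseudo R^2: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null


def llr_p_value_df1(ll_model, ll_null):
    """Likelihood-ratio test p-value when the model adds one parameter.

    For 1 degree of freedom, the chi-square survival function reduces to
    erfc(sqrt(stat / 2)), so SciPy is not required.
    """
    stat = 2.0 * (ll_model - ll_null)
    return math.erfc(math.sqrt(stat / 2.0))


print(mcfadden_pseudo_r2(LL_MODEL, LL_NULL))
print(llr_p_value_df1(LL_MODEL, LL_NULL))
```

Both values should land close to the reported Pseudo R-squ. (0.004868) and LLR p-value (0.08714), with small differences from the rounding of the log-likelihoods.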


In [13]:
# Out-of-sample check: hold out a random 20% of observations and measure
# predictive performance (this tests prediction, not causality)

X = win_weight_df_clean[['PERCENT_REGAIN']]

y = win_weight_df_clean['RESULT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Logistic Regression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

print("Logistic Accuracy:", accuracy_score(y_test, y_pred_log))
print("Logistic AUC:", roc_auc_score(y_test, log_model.predict_proba(X_test)[:, 1]))

### Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("\nRF Accuracy:", accuracy_score(y_test, y_pred_rf))
print("RF AUC:", roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1]))

Logistic Accuracy: 0.5568181818181818
Logistic AUC: 0.4728947368421053

RF Accuracy: 0.5227272727272727
RF AUC: 0.4723684210526315


The out-of-sample performance of both classifiers further confirms that weight regain alone is insufficient to predict fight outcomes. On the 88 held-out observations, both AUC values are about 0.47, slightly below the 0.5 expected from random guessing, so neither model ranks winners above losers any better than chance. In this way, the amount of weight regained by a fighter does not have sufficient predictive power on its own.
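
To make the 0.5 benchmark concrete: AUC can be read as the probability that a randomly chosen winner receives a higher predicted score than a randomly chosen loser, with ties counted as half. The sketch below computes it that way on made-up toy data. `auc_from_scores` is a hypothetical helper, not the scikit-learn implementation, although it should agree with `roc_auc_score` on the same inputs.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (winner, loser) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label.")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data for illustration only (not taken from the fight data)
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.1, 0.4, 0.6]
print(auc_from_scores(toy_labels, toy_scores))       # 3 of 4 pairs ranked correctly
print(auc_from_scores(toy_labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

A model that assigns every fighter the same score gets exactly 0.5, which is why values near 0.47 on a small test set indicate no usable ranking signal.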

In [15]:
# Model with squared term

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
win_weight_df_clean = win_weight_df_clean.assign(
    PERCENT_REGAIN_SQ=win_weight_df_clean['PERCENT_REGAIN'] ** 2
)

X = win_weight_df_clean[['PERCENT_REGAIN', 'PERCENT_REGAIN_SQ']]
X = sm.add_constant(X)

y = win_weight_df_clean['RESULT']

# Fit logistic regression
logit_model = sm.Logit(y, X)
results = logit_model.fit()

# Print summary
print(results.summary())


Optimization terminated successfully.
         Current function value: 0.683942
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:                 RESULT   No. Observations:                  437
Model:                          Logit   Df Residuals:                      434
Method:                           MLE   Df Model:                            2
Date:                Tue, 29 Jul 2025   Pseudo R-squ.:                0.005661
Time:                        12:31:57   Log-Likelihood:                -298.88
converged:                       True   LL-Null:                       -300.58
Covariance Type:            nonrobust   LLR p-value:                    0.1824
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -0.5447      0.533     -1.022      0.307      -1.589       0.500
PERCENT_


In this third logistic model, the coefficients on PERCENT_REGAIN and PERCENT_REGAIN_SQ are both statistically insignificant at the 5% and 10% significance levels, and the model as a whole is not significant either (LLR p-value = 0.182). The squared term was included to test whether the benefit of regaining weight falls off at higher percentage regains, for example because of dehydration issues. Since we fail to reject the null hypothesis that its coefficient is zero, the data show no evidence of such a downside at higher levels of weight regain.
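
For context on what the squared term would have shown had it been significant: diminishing returns would appear as a positive coefficient on PERCENT_REGAIN and a negative one on PERCENT_REGAIN_SQ, with the log-odds peaking where the derivative is zero. The sketch below computes that peak. `turning_point` is a hypothetical helper, and the coefficients are invented for illustration, not the fitted values.

```python
def turning_point(b1, b2):
    """Value of x at which b1*x + b2*x**2 peaks (b2 < 0) or bottoms out (b2 > 0)."""
    if b2 == 0:
        raise ValueError("No turning point: the squared-term coefficient is zero.")
    return -b1 / (2.0 * b2)


# Hypothetical coefficients for illustration only (NOT the fitted values)
b1_example = 0.20   # linear term
b2_example = -0.01  # squared term; negative means the benefit eventually falls off

peak = turning_point(b1_example, b2_example)
print(f"Log-odds of winning would peak at about {peak:.1f}% regain")
```

With insignificant coefficients, as in the model above, a turning point computed this way would not be a reliable estimate.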

Again, 

### Limitations of the Weight Regain Models and Room to Expand

# Note to Self: Complete the write up here

The biggest limitation of this analysis is the data set. Not all regions require fighters to publicize their fight night weight, meaning many weight regains are not captured in my analysis. Having more fight night weight data would let us assess more accurately whether there is a limit to the benefit of cutting and regaining weight.

Unnamed: 0,EVENT,BOUT,ROUND,FIGHTER,KD,SIG.STR.,SIG.STR. %,TOTAL STR.,TD,TD %,SUB.ATT,REV.,CTRL,HEAD,BODY,LEG,DISTANCE,CLINCH,GROUND
0,UFC 318: Holloway vs. Poirier 3,Max Holloway vs. Dustin Poirier,Round 1,Max Holloway,1.0,26 of 64,40%,26 of 64,0 of 0,---,0.0,0.0,0:08,9 of 39,11 of 16,6 of 9,23 of 57,0 of 0,3 of 7
1,UFC 318: Holloway vs. Poirier 3,Max Holloway vs. Dustin Poirier,Round 2,Max Holloway,0.0,44 of 67,65%,47 of 70,0 of 0,---,0.0,0.0,1:24,31 of 52,11 of 12,2 of 3,31 of 51,0 of 0,13 of 16
2,UFC 318: Holloway vs. Poirier 3,Max Holloway vs. Dustin Poirier,Round 3,Max Holloway,0.0,38 of 61,62%,38 of 61,0 of 0,---,0.0,0.0,0:00,18 of 36,9 of 12,11 of 13,38 of 61,0 of 0,0 of 0
3,UFC 318: Holloway vs. Poirier 3,Max Holloway vs. Dustin Poirier,Round 4,Max Holloway,0.0,44 of 84,52%,44 of 84,0 of 0,---,0.0,0.0,0:00,23 of 57,15 of 18,6 of 9,44 of 84,0 of 0,0 of 0
4,UFC 318: Holloway vs. Poirier 3,Max Holloway vs. Dustin Poirier,Round 5,Max Holloway,0.0,46 of 99,46%,46 of 99,0 of 0,---,0.0,0.0,0:00,22 of 69,18 of 22,6 of 8,46 of 98,0 of 1,0 of 0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38663,UFC 2: No Way Out,Johnny Rhodes vs. David Levicki,Round 1,David Levicki,0.0,4 of 5,80%,95 of 102,0 of 0,---,0.0,0.0,--,4 of 5,0 of 0,0 of 0,1 of 2,2 of 2,1 of 1
38664,UFC 2: No Way Out,Patrick Smith vs. Ray Wizard,Round 1,Patrick Smith,0.0,1 of 1,100%,1 of 1,0 of 1,0%,1.0,0.0,--,0 of 0,1 of 1,0 of 0,0 of 0,1 of 1,0 of 0
38665,UFC 2: No Way Out,Patrick Smith vs. Ray Wizard,Round 1,Ray Wizard,0.0,1 of 1,100%,2 of 2,0 of 0,---,0.0,0.0,--,0 of 0,0 of 0,1 of 1,1 of 1,0 of 0,0 of 0
38666,UFC 2: No Way Out,Scott Morris vs. Sean Daugherty,Round 1,Scott Morris,0.0,1 of 1,100%,2 of 2,1 of 1,100%,1.0,0.0,--,1 of 1,0 of 0,0 of 0,0 of 0,1 of 1,0 of 0
