# Build a Prediction Model Using the DataFrame

### In this file, I will train and test several models for predicting fight outcomes to select the best one that provides the highest accuracy and meets all necessary functional requirements. Comparing the models will help identify which one is most suitable for our task and demonstrates the best performance in the context of fight outcome prediction.

## Data Loading and Missing Values Analysis

This code loads the cleaned UFC dataset and performs an analysis to identify any missing values. It calculates the count and percentage of missing values for each column, displaying a table with columns that contain missing data.

In [14]:
import pandas as pd

In [18]:

# Step 1: Load data
ufc_data = pd.read_csv('../data/processed/ufc_fight_data_cleaned.csv')

# Step 2: Check for missing values
missing_values = ufc_data.isnull().sum()
missing_percentage = (missing_values / len(ufc_data)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data = missing_data[missing_data['Missing Values'] > 0]  # Only columns with missing values

print("Columns with missing values and their percentage:")
print(missing_data)


Columns with missing values and their percentage:
              Missing Values  Percentage
date                      12    0.161312
location                  12    0.161312
total_rounds              31    0.416723
referee                   32    0.430165
r_age                     76    1.021643
r_reach                  412    5.538379
r_stance                  26    0.349509
b_age                    190    2.554107
b_reach                  888   11.937088
b_stance                  68    0.914101
age_diff                 213    2.863288
reach_diff              1038   13.953488


## Handling Missing Values in Numeric and Categorical Columns

his code fills missing values in the dataset. Numeric columns are filled with the mean of each column, while categorical columns are filled with the mode (most frequent value) of each column. A final check is then performed to ensure there are no remaining missing values.

In [19]:
# Step 3: Separate and fill missing values

# Fill numeric columns with mean
numeric_columns = ufc_data.select_dtypes(include=['number']).columns
numeric_missing = missing_data.loc[numeric_columns.intersection(missing_data.index)]
ufc_data[numeric_columns] = ufc_data[numeric_columns].fillna(ufc_data[numeric_columns].mean())

print("\nNumeric columns filled with mean:")
print(numeric_missing)

# Fill categorical columns with mode
categorical_columns = ufc_data.select_dtypes(include=['object']).columns
categorical_missing = missing_data.loc[categorical_columns.intersection(missing_data.index)]
for column in categorical_columns:
    ufc_data[column].fillna(ufc_data[column].mode()[0], inplace=True)

print("\nCategorical columns filled with mode:")
print(categorical_missing)

# Final check for any remaining missing values
print("\nRemaining missing values after filling:")
print(ufc_data.isnull().sum().sum())  # Total count of missing values


Numeric columns filled with mean:
              Missing Values  Percentage
total_rounds              31    0.416723
r_age                     76    1.021643
r_reach                  412    5.538379
b_age                    190    2.554107
b_reach                  888   11.937088
age_diff                 213    2.863288
reach_diff              1038   13.953488

Categorical columns filled with mode:
          Missing Values  Percentage
date                  12    0.161312
location              12    0.161312
referee               32    0.430165
r_stance              26    0.349509
b_stance              68    0.914101

Remaining missing values after filling:
0


## Encoding Categorical Variables and Defining Features and Target Variable

This code converts categorical variables to a numeric format using one-hot encoding. It then defines the feature matrix `X`, which includes all columns except `winner_Red`, and the target variable `y`, represented by the `winner_Red` column, indicating the victory of the fighter in the red corner.

In [20]:
# Encode categorical variables
ufc_data_encoded = pd.get_dummies(ufc_data, drop_first=True)

# Define features and target variable
X = ufc_data_encoded.drop(columns=['winner_Red'])  # Adjust `winner_Red` if needed
y = ufc_data_encoded['winner_Red']


## Splitting Data into Training and Testing Sets

This code splits the dataset into training and testing sets using the `train_test_split` function from scikit-learn. Here, 20% of the data is allocated for testing, while the remaining 80% is used for training the model. The `random_state=42` parameter ensures reproducibility of the data split.

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Training and Evaluating Models for Fight Outcome Prediction

This code applies three models — Logistic Regression, Random Forest, and XGBoost — to predict the fight winner. Each model is trained on the training data and evaluated on the test data. Logistic Regression includes data scaling, while Random Forest and XGBoost operate without it. For each model, accuracy is calculated, and a classification report is displayed to analyze the performance metrics for each model.

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline

# Function to train and evaluate a model
def train_and_evaluate_model(model, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    print(f"\n=== {model_name} ===")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)

# Logistic Regression with scaling
log_reg_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=2000, random_state=42)
)
train_and_evaluate_model(log_reg_model, "Logistic Regression")

# Random Forest (no need for scaling)
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
train_and_evaluate_model(rf_model, "Random Forest")

# XGBoost (ensure no warnings by using updated parameters)
xgb_model = XGBClassifier(scale_pos_weight=(y_train.value_counts().iloc[0] / y_train.value_counts().iloc[1]), eval_metric='logloss')
train_and_evaluate_model(xgb_model, "XGBoost")



=== Logistic Regression ===
Accuracy: 0.8158602150537635
Classification Report:
              precision    recall  f1-score   support

       False       0.72      0.77      0.74       517
        True       0.87      0.84      0.86       971

    accuracy                           0.82      1488
   macro avg       0.80      0.81      0.80      1488
weighted avg       0.82      0.82      0.82      1488


=== Random Forest ===
Accuracy: 0.8870967741935484
Classification Report:
              precision    recall  f1-score   support

       False       0.88      0.79      0.83       517
        True       0.89      0.94      0.92       971

    accuracy                           0.89      1488
   macro avg       0.88      0.86      0.87      1488
weighted avg       0.89      0.89      0.89      1488


=== XGBoost ===
Accuracy: 0.918010752688172
Classification Report:
              precision    recall  f1-score   support

       False       0.90      0.86      0.88       517
        True 

## Fighter Data Preparation and Outcome Prediction

This code implements functions to prepare data for two fighters and make predictions using trained models. The `prepare_fight_data` function extracts and averages numeric data for each fighter, creating a test dataset with their characteristics. The `predict_for_fighters` function then uses this dataset to predict the outcome of a match between two selected fighters, assigning each to a corner (Red or Blue). Prediction results are displayed for each model.

In [24]:
import pandas as pd

# Load the main dataset
ufc_data = pd.read_csv('../data/processed/ufc_fight_data_cleaned.csv')

# Prepare fight data function
def prepare_fight_data(df, r_fighter_name, b_fighter_name):
    r_fighter_data = df[df['r_fighter'] == r_fighter_name]
    b_fighter_data = df[df['b_fighter'] == b_fighter_name]

    if not r_fighter_data.empty and not b_fighter_data.empty:
        r_fighter_avg = r_fighter_data.select_dtypes(include=['number']).mean()
        b_fighter_avg = b_fighter_data.select_dtypes(include=['number']).mean()

        test_data = pd.DataFrame()
        
        for col in df.columns:
            if col.startswith('r_'):
                test_data[col] = [r_fighter_avg.get(col, 0)]
            elif col.startswith('b_'):
                test_data[col] = [b_fighter_avg.get(col, 0)]

        test_data_encoded = pd.get_dummies(test_data, drop_first=True)
        test_data_encoded = test_data_encoded.reindex(columns=X_train.columns, fill_value=0)

        return test_data_encoded
    else:
        print("Data for one or both fighters not available in the dataset.")
        return None

# Testing the model predictions
def predict_for_fighters(models, r_fighter_name, b_fighter_name):
    test_data_encoded = prepare_fight_data(ufc_data, r_fighter_name, b_fighter_name)
    
    if test_data_encoded is not None:
        print(f"\nPrediction results for {r_fighter_name} (Red) vs {b_fighter_name} (Blue):")
        for model, model_name in models:
            prediction = model.predict(test_data_encoded)
            result = f"Red ({r_fighter_name})" if prediction[0] else f"Blue ({b_fighter_name})"
            print(f"{model_name} Prediction: {result}")

# Define models list
models = [
    (log_reg_model, "Logistic Regression"),
    (rf_model, "Random Forest"),
    (xgb_model, "XGBoost")
]

# Predictions with Islam Makhachev as Red and Dustin Poirier as Blue
predict_for_fighters(models, 'Islam Makhachev', 'Dustin Poirier')

# Predictions with Dustin Poirier as Red and Islam Makhachev as Blue
predict_for_fighters(models, 'Dustin Poirier', 'Islam Makhachev')



Prediction results for Islam Makhachev (Red) vs Dustin Poirier (Blue):
Logistic Regression Prediction: Blue (Dustin Poirier)
Random Forest Prediction: Red (Islam Makhachev)
XGBoost Prediction: Red (Islam Makhachev)

Prediction results for Dustin Poirier (Red) vs Islam Makhachev (Blue):
Logistic Regression Prediction: Red (Dustin Poirier)
Random Forest Prediction: Red (Dustin Poirier)
XGBoost Prediction: Red (Dustin Poirier)


In [29]:
print(ufc_data.columns)


Index(['event_name', 'date', 'location', 'winner', 'weight_class',
       'is_title_bout', 'gender', 'method', 'finish_round', 'total_rounds',
       'time_sec', 'referee', 'kd_diff', 'sig_str_diff', 'sig_str_att_diff',
       'sig_str_acc_diff', 'str_diff', 'str_att_diff', 'str_acc_diff',
       'td_diff', 'td_att_diff', 'td_acc_diff', 'sub_att_diff', 'rev_diff',
       'ctrl_sec_diff', 'wins_total_diff', 'losses_total_diff', 'age_diff',
       'height_diff', 'weight_diff', 'reach_diff', 'SLpM_total_diff',
       'SApM_total_diff', 'sig_str_acc_total_diff', 'td_acc_total_diff',
       'str_def_total_diff', 'td_def_total_diff', 'sub_avg_diff',
       'td_avg_diff'],
      dtype='object')


In [26]:
pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.12.4-py3-none-any.whl.metadata (8.3 kB)
Downloading imbalanced_learn-0.12.4-py3-none-any.whl (258 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.12.4

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Data Preparation, Model Training, and Evaluation with Class Balancing and Calibration

This code involves loading and preprocessing UFC fight data, including filling missing values, creating corner-independent features, and encoding categorical variables. The dataset is balanced using SMOTE to address class imbalance, and features are standardized to prevent dominance by any specific feature. The data is split into training and testing sets, after which three models (Logistic Regression, Random Forest, and XGBoost) are trained, calibrated, and evaluated based on accuracy and classification metrics.

In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

# Step 1: Load data
ufc_data = pd.read_csv('../data/processed/ufc_fight_data_cleaned.csv')

# Step 2: Fill missing values
numeric_columns = ufc_data.select_dtypes(include=['number']).columns
ufc_data[numeric_columns] = ufc_data[numeric_columns].fillna(ufc_data[numeric_columns].mean())
categorical_columns = ufc_data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    ufc_data[column].fillna(ufc_data[column].mode()[0], inplace=True)

# Step 3: Feature Engineering - Create Corner-Independent Features
ufc_data['kd_diff'] = abs(ufc_data['r_kd'] - ufc_data['b_kd'])
ufc_data['sig_str_diff'] = abs(ufc_data['r_sig_str'] - ufc_data['b_sig_str'])
ufc_data['td_diff'] = abs(ufc_data['r_td'] - ufc_data['b_td'])
ufc_data['kd_ratio'] = ufc_data['r_kd'] / (ufc_data['b_kd'] + 1e-5)
ufc_data['sig_str_ratio'] = ufc_data['r_sig_str'] / (ufc_data['b_sig_str'] + 1e-5)
ufc_data['td_ratio'] = ufc_data['r_td'] / (ufc_data['b_td'] + 1e-5)

# Specify which columns to drop and avoid dropping 'r_fighter' and 'b_fighter'
features_to_drop = [col for col in ufc_data.columns if col.startswith(('r_', 'b_')) and col not in ['r_fighter', 'b_fighter']]
ufc_data = ufc_data.drop(columns=features_to_drop)

# Step 4: Encode categorical variables
ufc_data_encoded = pd.get_dummies(ufc_data, drop_first=True)

# Separate features and target variable
X = ufc_data_encoded.drop(columns=['winner_Red'])
y = ufc_data_encoded['winner_Red']

# Step 5: Handle Imbalance in the Data with SMOTE
X, y = SMOTE().fit_resample(X, y)

# Step 6: Standardize features to avoid dominance of any specific feature
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to train, calibrate, and evaluate each model
def train_and_evaluate_model(model, model_name):
    if model_name in ["Random Forest", "XGBoost"]:
        model = CalibratedClassifierCV(estimator=model, method='sigmoid', cv=5)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    print(f"\n=== {model_name} ===")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    
    return model  # Return the fitted model

# Initialize and evaluate models
log_reg_model = LogisticRegression(class_weight='balanced', max_iter=2000, solver='saga', random_state=42)
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
xgb_model = XGBClassifier(scale_pos_weight=(y_train.value_counts().iloc[0] / y_train.value_counts().iloc[1]), eval_metric='logloss')

# Train and evaluate models
log_reg_model = train_and_evaluate_model(log_reg_model, "Logistic Regression")
rf_model = train_and_evaluate_model(rf_model, "Random Forest")
xgb_model = train_and_evaluate_model(xgb_model, "XGBoost")


=== Logistic Regression ===
Accuracy: 0.8713480266529985
Classification Report:
              precision    recall  f1-score   support

       False       0.86      0.89      0.88       991
        True       0.89      0.85      0.87       960

    accuracy                           0.87      1951
   macro avg       0.87      0.87      0.87      1951
weighted avg       0.87      0.87      0.87      1951


=== Random Forest ===
Accuracy: 0.9144028703229113
Classification Report:
              precision    recall  f1-score   support

       False       0.91      0.93      0.92       991
        True       0.92      0.90      0.91       960

    accuracy                           0.91      1951
   macro avg       0.91      0.91      0.91      1951
weighted avg       0.91      0.91      0.91      1951


=== XGBoost ===
Accuracy: 0.9287544848795489
Classification Report:
              precision    recall  f1-score   support

       False       0.93      0.93      0.93       991
        True

## Fight Outcome Prediction for Specific Fighters

This code includes a function for preparing data and predicting fight outcomes between two fighters. The `prepare_fight_data` function calculates absolute differences and ratio-based features for key characteristics of each fighter, creating test data aligned with the training features. This data is scaled and fed into trained models (Logistic Regression, Random Forest, and XGBoost) to predict the fight winner. A prediction example is provided for fights between Islam Makhachev and Dustin Poirier, with reversed roles (Red and Blue).

In [41]:
# Prediction example for a specific fight
def prepare_fight_data(df, r_fighter_name, b_fighter_name):
    # Select relevant rows for each fighter
    r_fighter_data = df[df['r_fighter'] == r_fighter_name]
    b_fighter_data = df[df['b_fighter'] == b_fighter_name]
    if not r_fighter_data.empty and not b_fighter_data.empty:
        r_avg = r_fighter_data.select_dtypes(include=['number']).mean()
        b_avg = b_fighter_data.select_dtypes(include=['number']).mean()
        # Create absolute difference and ratio-based test data
        test_data = pd.DataFrame({
            'kd_diff': [abs(r_avg.get('kd_diff', 0) - b_avg.get('kd_diff', 0))],
            'sig_str_diff': [abs(r_avg.get('sig_str_diff', 0) - b_avg.get('sig_str_diff', 0))],
            'td_diff': [abs(r_avg.get('td_diff', 0) - b_avg.get('td_diff', 0))],
            'kd_ratio': [r_avg.get('kd_diff', 1) / (b_avg.get('kd_diff', 1) + 1e-5)],
            'sig_str_ratio': [r_avg.get('sig_str_diff', 1) / (b_avg.get('sig_str_diff', 1) + 1e-5)],
            'td_ratio': [r_avg.get('td_diff', 1) / (b_avg.get('td_diff', 1) + 1e-5)]
        })
        
        # Align with training features and scale
        test_data_aligned = test_data.reindex(columns=X_train.columns, fill_value=0)
        test_data_scaled = pd.DataFrame(scaler.transform(test_data_aligned), columns=X_train.columns)
        
        return test_data_scaled

    return None

# Test prediction for Islam Makhachev vs Dustin Poirier
test_data_encoded = prepare_fight_data(ufc_data, 'Islam Makhachev', 'Dustin Poirier')
if test_data_encoded is not None:
    for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
        try:
            prediction = model.predict(test_data_encoded)
            result = "Red (Islam Makhachev)" if prediction[0] else "Blue (Dustin Poirier)"
            print(f"{name} Prediction: {result}")
        except NotFittedError:
            print(f"{name} model is not fitted.")

Logistic Regression Prediction: Red (Islam Makhachev)
Random Forest Prediction: Red (Islam Makhachev)
XGBoost Prediction: Red (Islam Makhachev)


## Additional Check with Switched Fighter Corners

 This code provides an additional check by switching fighter roles in the corners, with Dustin Poirier now in the red corner and Islam Makhachev in the blue corner. The `prepare_fight_data` function prepares the fight data accordingly. Predictions are made using the three trained models (Logistic Regression, Random Forest, and XGBoost), and the outcome for the winner is printed. This approach helps evaluate the effect of role switching on prediction results.

In [42]:
# Test prediction for Dustin Poirier vs Islam Makhachev (reversed roles)
test_data_encoded = prepare_fight_data(ufc_data, 'Dustin Poirier', 'Islam Makhachev')
if test_data_encoded is not None:
    for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
        try:
            prediction = model.predict(test_data_encoded)
            result = "Red (Dustin Poirier)" if prediction[0] else "Blue (Islam Makhachev)"
            print(f"{name} Prediction: {result}")
        except NotFittedError:
            print(f"{name} model is not fitted.")


Logistic Regression Prediction: Red (Dustin Poirier)
Random Forest Prediction: Red (Dustin Poirier)
XGBoost Prediction: Red (Dustin Poirier)


## Model Training and Fight Outcome Prediction with Enhanced Features and Class Balancing


This code fully prepares and trains models for predicting UFC fight outcomes. Data is preprocessed to fill missing values, enhanced features like difference and ratio of fighter statistics are added, and class imbalance is addressed using SMOTE. The data is then standardized, and models (Logistic Regression, Random Forest, and XGBoost) are trained and calibrated to improve predictive accuracy. The prediction function `prepare_fight_data` prepares data for specific fighters using the selected features. Additional predictions are made by swapping corner assignments (Red and Blue) for Islam Makhachev and Dustin Poirier to assess the impact of corner assignments on prediction results.

In [43]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

# Step 1: Load data
ufc_data = pd.read_csv('../data/processed/ufc_fight_data_cleaned.csv')

# Step 2: Fill missing values
numeric_columns = ufc_data.select_dtypes(include=['number']).columns
ufc_data[numeric_columns] = ufc_data[numeric_columns].fillna(ufc_data[numeric_columns].mean())
categorical_columns = ufc_data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    ufc_data[column].fillna(ufc_data[column].mode()[0], inplace=True)

# Step 3: Feature Engineering - Enhanced Features
ufc_data['kd_diff'] = ufc_data['r_kd'] - ufc_data['b_kd']
ufc_data['sig_str_diff'] = ufc_data['r_sig_str'] - ufc_data['b_sig_str']
ufc_data['td_diff'] = ufc_data['r_td'] - ufc_data['b_td']
ufc_data['kd_ratio'] = ufc_data['r_kd'] / (ufc_data['b_kd'] + 1e-5)
ufc_data['sig_str_ratio'] = ufc_data['r_sig_str'] / (ufc_data['b_sig_str'] + 1e-5)
ufc_data['td_ratio'] = ufc_data['r_td'] / (ufc_data['b_td'] + 1e-5)
ufc_data['total_str_diff'] = (ufc_data['r_sig_str'] + ufc_data['r_str']) - (ufc_data['b_sig_str'] + ufc_data['b_str'])

# Drop original corner-specific columns
features_to_drop = [col for col in ufc_data.columns if col.startswith(('r_', 'b_')) and col not in ['r_fighter', 'b_fighter']]
ufc_data = ufc_data.drop(columns=features_to_drop)

# Step 4: Encode categorical variables
ufc_data_encoded = pd.get_dummies(ufc_data, drop_first=True)

# Separate features and target variable
X = ufc_data_encoded.drop(columns=['winner_Red'])
y = ufc_data_encoded['winner_Red']

# Step 5: Handle Imbalance with SMOTE
X, y = SMOTE().fit_resample(X, y)

# Step 6: Standardize features
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training function
def train_and_evaluate_model(model, model_name):
    if model_name in ["Random Forest", "XGBoost"]:
        model = CalibratedClassifierCV(estimator=model, method='sigmoid', cv=5)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    print(f"\n=== {model_name} ===")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    
    return model

# Initialize models
log_reg_model = LogisticRegression(class_weight='balanced', max_iter=2000, solver='saga', random_state=42)
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
xgb_model = XGBClassifier(scale_pos_weight=(y_train.value_counts().iloc[0] / y_train.value_counts().iloc[1]), eval_metric='logloss')

# Train and evaluate models
log_reg_model = train_and_evaluate_model(log_reg_model, "Logistic Regression")
rf_model = train_and_evaluate_model(rf_model, "Random Forest")
xgb_model = train_and_evaluate_model(xgb_model, "XGBoost")

# Prediction function with balanced features for fighters
def prepare_fight_data(df, r_fighter_name, b_fighter_name):
    r_fighter_data = df[df['r_fighter'] == r_fighter_name]
    b_fighter_data = df[df['b_fighter'] == b_fighter_name]
    if not r_fighter_data.empty and not b_fighter_data.empty:
        r_avg = r_fighter_data.select_dtypes(include=['number']).mean()
        b_avg = b_fighter_data.select_dtypes(include=['number']).mean()
        test_data = pd.DataFrame({
            'kd_diff': [r_avg.get('kd_diff', 0) - b_avg.get('kd_diff', 0)],
            'sig_str_diff': [r_avg.get('sig_str_diff', 0) - b_avg.get('sig_str_diff', 0)],
            'td_diff': [r_avg.get('td_diff', 0) - b_avg.get('td_diff', 0)],
            'kd_ratio': [r_avg.get('kd_diff', 1) / (b_avg.get('kd_diff', 1) + 1e-5)],
            'sig_str_ratio': [r_avg.get('sig_str_diff', 1) / (b_avg.get('sig_str_diff', 1) + 1e-5)],
            'td_ratio': [r_avg.get('td_diff', 1) / (b_avg.get('td_diff', 1) + 1e-5)],
            'total_str_diff': [(r_avg.get('sig_str_diff', 0) + r_avg.get('str_diff', 0)) - (b_avg.get('sig_str_diff', 0) + b_avg.get('str_diff', 0))]
        })
        test_data_aligned = test_data.reindex(columns=X_train.columns, fill_value=0)
        test_data_scaled = pd.DataFrame(scaler.transform(test_data_aligned), columns=X_train.columns)
        
        return test_data_scaled

# Test with fighters swapped
for r_fighter, b_fighter in [('Islam Makhachev', 'Dustin Poirier'), ('Dustin Poirier', 'Islam Makhachev')]:
    print(f"\nTesting {r_fighter} (Red) vs {b_fighter} (Blue)")
    test_data_encoded = prepare_fight_data(ufc_data, r_fighter, b_fighter)
    if test_data_encoded is not None:
        for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
            try:
                prediction = model.predict(test_data_encoded)
                result = f"Winner: {'Red' if prediction[0] else 'Blue'} ({r_fighter if prediction[0] else b_fighter})"
                print(f"{name} Prediction: {result}")
            except NotFittedError:
                print(f"{name} model is not fitted.")



=== Logistic Regression ===
Accuracy: 0.8749359302921579
Classification Report:
              precision    recall  f1-score   support

       False       0.86      0.90      0.88       991
        True       0.89      0.85      0.87       960

    accuracy                           0.87      1951
   macro avg       0.88      0.87      0.87      1951
weighted avg       0.88      0.87      0.87      1951


=== Random Forest ===
Accuracy: 0.9072270630445926
Classification Report:
              precision    recall  f1-score   support

       False       0.90      0.92      0.91       991
        True       0.91      0.90      0.90       960

    accuracy                           0.91      1951
   macro avg       0.91      0.91      0.91      1951
weighted avg       0.91      0.91      0.91      1951


=== XGBoost ===
Accuracy: 0.9323423885187083
Classification Report:
              precision    recall  f1-score   support

       False       0.93      0.93      0.93       991
        True

In [47]:
# To display all column names in your DataFrame
print(ufc_data.columns.tolist())


['event_name', 'date', 'location', 'r_fighter', 'b_fighter', 'winner', 'weight_class', 'is_title_bout', 'gender', 'method', 'finish_round', 'total_rounds', 'time_sec', 'referee', 'r_kd', 'r_sig_str', 'r_sig_str_att', 'r_sig_str_acc', 'r_str', 'r_str_att', 'r_str_acc', 'r_td', 'r_td_att', 'r_td_acc', 'r_sub_att', 'r_rev', 'r_ctrl_sec', 'r_wins_total', 'r_losses_total', 'r_age', 'r_height', 'r_weight', 'r_reach', 'r_stance', 'r_SLpM_total', 'r_SApM_total', 'r_sig_str_acc_total', 'r_td_acc_total', 'r_str_def_total', 'r_td_def_total', 'r_sub_avg', 'r_td_avg', 'b_kd', 'b_sig_str', 'b_sig_str_att', 'b_sig_str_acc', 'b_str', 'b_str_att', 'b_str_acc', 'b_td', 'b_td_att', 'b_td_acc', 'b_sub_att', 'b_rev', 'b_ctrl_sec', 'b_wins_total', 'b_losses_total', 'b_age', 'b_height', 'b_weight', 'b_reach', 'b_stance', 'b_SLpM_total', 'b_SApM_total', 'b_sig_str_acc_total', 'b_td_acc_total', 'b_str_def_total', 'b_td_def_total', 'b_sub_avg', 'b_td_avg', 'kd_diff', 'sig_str_diff', 'sig_str_att_diff', 'sig_str

## Advanced Fight Outcome Predictions Using Corner-Independent Features

This code performs advanced predictions of UFC fight outcomes using an extended set of features calculated as differences and ratios between fighters' statistics. Data preprocessing includes filling missing values, feature engineering for corner-independent fighter characteristics, class balancing with SMOTE, and feature standardization. After data splitting, models (Logistic Regression, Random Forest, and XGBoost) are trained and calibrated for accurate predictions. The `prepare_fight_data` function prepares data for two fighters, aligning it with training data features. Prediction examples are provided for fights between Islam Makhachev and Dustin Poirier with reversed roles (Red and Blue) to assess the impact of corner assignments on the fight outcome.

In [46]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

# Step 1: Load data
ufc_data = pd.read_csv('../data/processed/ufc_fight_data_cleaned.csv')

# Step 2: Fill missing values
numeric_columns = ufc_data.select_dtypes(include=['number']).columns
ufc_data[numeric_columns] = ufc_data[numeric_columns].fillna(ufc_data[numeric_columns].mean())
categorical_columns = ufc_data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    ufc_data[column].fillna(ufc_data[column].mode()[0], inplace=True)

# Step 3: Feature Engineering - Calculate Corner-Independent Features
feature_cols = [
    'kd_diff', 'sig_str_diff', 'sig_str_att_diff', 'sig_str_acc_diff', 'str_diff',
    'str_att_diff', 'str_acc_diff', 'td_diff', 'td_att_diff', 'td_acc_diff', 'sub_att_diff',
    'rev_diff', 'ctrl_sec_diff', 'wins_total_diff', 'losses_total_diff', 'age_diff',
    'height_diff', 'weight_diff', 'reach_diff', 'SLpM_total_diff', 'SApM_total_diff',
    'sig_str_acc_total_diff', 'td_acc_total_diff', 'str_def_total_diff', 'td_def_total_diff',
    'sub_avg_diff', 'td_avg_diff', 'kd_ratio', 'sig_str_ratio', 'td_ratio', 'total_str_diff'
]

# Ensure required columns are in DataFrame
for col in feature_cols:
    if col not in ufc_data.columns:
        ufc_data[col] = 0  # Handle missing columns if necessary

# Step 4: Encode categorical variables
ufc_data_encoded = pd.get_dummies(ufc_data, drop_first=True)

# Separate features and target variable
X = ufc_data_encoded.drop(columns=['winner_Red'])
y = ufc_data_encoded['winner_Red']

# Step 5: Handle Imbalance in the Data with SMOTE
X, y = SMOTE().fit_resample(X, y)

# Step 6: Standardize features to avoid dominance of any specific feature
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to train, calibrate, and evaluate each model
def train_and_evaluate_model(model, model_name):
    if model_name in ["Random Forest", "XGBoost"]:
        model = CalibratedClassifierCV(estimator=model, method='sigmoid', cv=5)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    print(f"\n=== {model_name} ===")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    
    return model  # Return the fitted model

# Initialize and evaluate models
log_reg_model = LogisticRegression(class_weight='balanced', max_iter=2000, solver='saga', random_state=42)
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
xgb_model = XGBClassifier(scale_pos_weight=(y_train.value_counts().iloc[0] / y_train.value_counts().iloc[1]), eval_metric='logloss')

# Train and evaluate models
log_reg_model = train_and_evaluate_model(log_reg_model, "Logistic Regression")
rf_model = train_and_evaluate_model(rf_model, "Random Forest")
xgb_model = train_and_evaluate_model(xgb_model, "XGBoost")

# Prediction example for a specific fight with all feature columns
def prepare_fight_data(df, r_fighter_name, b_fighter_name):
    # Select relevant rows for each fighter
    r_fighter_data = df[df['r_fighter'] == r_fighter_name]
    b_fighter_data = df[df['b_fighter'] == b_fighter_name]
    if not r_fighter_data.empty and not b_fighter_data.empty:
        r_avg = r_fighter_data.select_dtypes(include=['number']).mean()
        b_avg = b_fighter_data.select_dtypes(include=['number']).mean()
        
        # Prepare all specified features for prediction
        test_data = pd.DataFrame({feature: [abs(r_avg.get(feature, 0) - b_avg.get(feature, 0))] for feature in feature_cols})
        
        # Align with training features and scale
        test_data_aligned = test_data.reindex(columns=X_train.columns, fill_value=0)
        test_data_scaled = pd.DataFrame(scaler.transform(test_data_aligned), columns=X_train.columns)
        
        return test_data_scaled

    return None

# Test prediction for Islam Makhachev vs Dustin Poirier
print("Testing Islam Makhachev (Red) vs Dustin Poirier (Blue)")
test_data_1 = prepare_fight_data(ufc_data, 'Islam Makhachev', 'Dustin Poirier')
if test_data_1 is not None:
    for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
        prediction = model.predict(test_data_1)
        result = "Red (Islam Makhachev)" if prediction[0] else "Blue (Dustin Poirier)"
        print(f"{name} Prediction: Winner: {result}")

print("\nTesting Dustin Poirier (Red) vs Islam Makhachev (Blue)")
test_data_2 = prepare_fight_data(ufc_data, 'Dustin Poirier', 'Islam Makhachev')
if test_data_2 is not None:
    for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
        prediction = model.predict(test_data_2)
        result = "Red (Dustin Poirier)" if prediction[0] else "Blue (Islam Makhachev)"
        print(f"{name} Prediction: Winner: {result}")



=== Logistic Regression ===
Accuracy: 0.8918503331624807
Classification Report:
              precision    recall  f1-score   support

       False       0.88      0.91      0.90       991
        True       0.90      0.87      0.89       960

    accuracy                           0.89      1951
   macro avg       0.89      0.89      0.89      1951
weighted avg       0.89      0.89      0.89      1951


=== Random Forest ===
Accuracy: 0.9195284469502819
Classification Report:
              precision    recall  f1-score   support

       False       0.92      0.93      0.92       991
        True       0.92      0.91      0.92       960

    accuracy                           0.92      1951
   macro avg       0.92      0.92      0.92      1951
weighted avg       0.92      0.92      0.92      1951


=== XGBoost ===
Accuracy: 0.9359302921578677
Classification Report:
              precision    recall  f1-score   support

       False       0.94      0.94      0.94       991
        True

## UFC Fight Outcome Prediction with Difference and Ratio-Based Features

This code performs UFC fight outcome predictions using machine learning models with features that are independent of fighter corners. The initial steps load and preprocess the data to fill missing values and calculate new features based on differences and ratios between fighters' statistics. The data is balanced with SMOTE to address class imbalance and standardized for consistency. Models (Logistic Regression, Random Forest, and XGBoost) are trained and calibrated for improved predictive accuracy. The `prepare_fight_data` function prepares data for two fighters by calculating relevant features, which are then fed into the models to predict fight outcomes. Examples of predictions are provided for fights between Islam Makhachev and Dustin Poirier with alternating corner assignments.

In [50]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError
import numpy as np

# Step 1: Load data
ufc_data = pd.read_csv('../data/processed/ufc_fight_data_cleaned.csv')

# Step 2: Fill missing values
numeric_columns = ufc_data.select_dtypes(include=['number']).columns
ufc_data[numeric_columns] = ufc_data[numeric_columns].fillna(ufc_data[numeric_columns].mean())
categorical_columns = ufc_data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    ufc_data[column].fillna(ufc_data[column].mode()[0], inplace=True)

# Step 3: Feature Engineering - Create Corner-Independent Features
feature_list = [
    'kd', 'sig_str', 'sig_str_att', 'sig_str_acc', 'str', 'str_att', 'str_acc', 'td', 
    'td_att', 'td_acc', 'sub_att', 'rev', 'ctrl_sec', 'wins_total', 'losses_total', 
    'age', 'height', 'weight', 'reach', 'SLpM_total', 'SApM_total', 'sig_str_acc_total', 
    'td_acc_total', 'str_def_total', 'td_def_total', 'sub_avg', 'td_avg'
]

# Generate diff and ratio features in one step for better performance
diff_features = pd.DataFrame()
ratio_features = pd.DataFrame()

for feature in feature_list:
    diff_features[f'{feature}_diff'] = abs(ufc_data[f'r_{feature}'] - ufc_data[f'b_{feature}'])
    ratio_features[f'{feature}_ratio'] = ufc_data[f'r_{feature}'] / (ufc_data[f'b_{feature}'] + 1e-5)

# Concatenate new features with the original DataFrame
ufc_data = pd.concat([ufc_data, diff_features, ratio_features], axis=1)

# Drop original `r_` and `b_` features after creating difference and ratio features
features_to_drop = [f'r_{feature}' for feature in feature_list] + [f'b_{feature}' for feature in feature_list]
ufc_data = ufc_data.drop(columns=features_to_drop)

# Step 4: Encode categorical variables
ufc_data_encoded = pd.get_dummies(ufc_data, drop_first=True)

# Separate features and target variable
X = ufc_data_encoded.drop(columns=['winner_Red'])
y = ufc_data_encoded['winner_Red']

# Step 5: Handle Imbalance in the Data with SMOTE
X, y = SMOTE().fit_resample(X, y)

# Step 6: Standardize features to avoid dominance of any specific feature
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Function to train, calibrate, and evaluate each model
def train_and_evaluate_model(model, model_name):
    if model_name in ["Random Forest", "XGBoost"]:
        model = CalibratedClassifierCV(estimator=model, method='sigmoid', cv=5)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    print(f"\n=== {model_name} ===")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    
    return model  # Return the fitted model

# Initialize and evaluate models
log_reg_model = LogisticRegression(class_weight='balanced', max_iter=2000, solver='saga', random_state=42)
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
xgb_model = XGBClassifier(scale_pos_weight=(y_train.value_counts().iloc[0] / y_train.value_counts().iloc[1]), eval_metric='logloss')

# Train and evaluate models
log_reg_model = train_and_evaluate_model(log_reg_model, "Logistic Regression")
rf_model = train_and_evaluate_model(rf_model, "Random Forest")
xgb_model = train_and_evaluate_model(xgb_model, "XGBoost")

# Test prediction function and predictions
def prepare_fight_data(df, r_fighter_name, b_fighter_name):
    # Select relevant rows for each fighter
    r_fighter_data = df[df['r_fighter'] == r_fighter_name]
    b_fighter_data = df[df['b_fighter'] == b_fighter_name]
    if not r_fighter_data.empty and not b_fighter_data.empty:
        r_avg = r_fighter_data.select_dtypes(include=['number']).mean()
        b_avg = b_fighter_data.select_dtypes(include=['number']).mean()
        
        # Prepare diff and ratio features for prediction
        test_data = {f'{feature}_diff': [abs(r_avg.get(f'r_{feature}', 0) - b_avg.get(f'b_{feature}', 0))] for feature in feature_list}
        test_data.update({f'{feature}_ratio': [r_avg.get(f'r_{feature}', 1) / (b_avg.get(f'b_{feature}', 1) + 1e-5)] for feature in feature_list})
        
        # Convert dictionary to DataFrame and align with training features
        test_data_df = pd.DataFrame(test_data)
        test_data_aligned = test_data_df.reindex(columns=X.columns, fill_value=0)
        
        # Scale the test data
        test_data_scaled = scaler.transform(test_data_aligned)
        
        return test_data_scaled

    return None

# Test predictions
print("Testing Islam Makhachev (Red) vs Dustin Poirier (Blue)")
test_data_1 = prepare_fight_data(ufc_data, 'Islam Makhachev', 'Dustin Poirier')
if test_data_1 is not None:
    for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
        prediction = model.predict(test_data_1)
        result = "Red (Islam Makhachev)" if prediction[0] else "Blue (Dustin Poirier)"
        print(f"{name} Prediction: Winner: {result}")

print("\nTesting Dustin Poirier (Red) vs Islam Makhachev (Blue)")
test_data_2 = prepare_fight_data(ufc_data, 'Dustin Poirier', 'Islam Makhachev')
if test_data_2 is not None:
    for model, name in zip([log_reg_model, rf_model, xgb_model], ["Logistic Regression", "Random Forest", "XGBoost"]):
        prediction = model.predict(test_data_2)
        result = "Red (Dustin Poirier)" if prediction[0] else "Blue (Islam Makhachev)"
        print(f"{name} Prediction: Winner: {result}")



=== Logistic Regression ===
Accuracy: 0.8856996412096361
Classification Report:
              precision    recall  f1-score   support

       False       0.87      0.91      0.89       991
        True       0.90      0.87      0.88       960

    accuracy                           0.89      1951
   macro avg       0.89      0.89      0.89      1951
weighted avg       0.89      0.89      0.89      1951


=== Random Forest ===
Accuracy: 0.9082521783700667
Classification Report:
              precision    recall  f1-score   support

       False       0.90      0.92      0.91       991
        True       0.91      0.90      0.91       960

    accuracy                           0.91      1951
   macro avg       0.91      0.91      0.91      1951
weighted avg       0.91      0.91      0.91      1951


=== XGBoost ===
Accuracy: 0.9343926191696565
Classification Report:
              precision    recall  f1-score   support

       False       0.93      0.94      0.94       991
        True

In [2]:
pip install xgboost


Collecting xgboost
  Downloading xgboost-2.1.2-py3-none-macosx_12_0_arm64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.2-py3-none-macosx_12_0_arm64.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: xgboost
Successfully installed xgboost-2.1.2
Note: you may need to restart the kernel to use updated packages.
