In [31]:
import pandas as pd
from datetime import datetime
import re
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Potential Features

- **Win-Loss Record**: The number of wins, losses, and possibly draws in a fighter's career.
- **Opponent’s Win-Loss Record**: The record of the current opponent, providing context for the difficulty of the matchup.
- **Knockout (KO) and Submission Wins**: The number of wins by knockout and submission, providing insight into the fighter's preferred and most successful methods of victory.
- **Recent Win Rate** (`recent_win_rate`): This feature captures the fighter's recent form, providing insight into their current performance level.
- **Win Streak** (`win_streak`): A fighter on a win streak may have momentum and confidence, which can be a psychological advantage.
- **Fight Activity**
- **Number of Fights/Experience**
- **Fighting Style**: Whether the fighter is more of a striker, grappler, or mixed.
- **Fighter's Age**: The age of the fighter, as younger or older fighters may have different performance patterns.
- **Age Difference** (`age_difference`): Age can significantly impact stamina, strength, and experience, making this a crucial feature.
- **Height Difference** (`height_difference`): Height can influence a fighter's reach and striking ability, which are critical factors in a fight.
- **Reach Difference** (`reach_difference`): A longer reach can provide a strategic advantage in striking, making this an important feature.
- **Striking Accuracy**: The percentage of strikes landed out of attempted strikes, indicating striking effectiveness.
- **Takedown Accuracy**: The percentage of successful takedowns, indicating grappling effectiveness.
- **Defense**: Statistics like significant strikes absorbed per minute and takedown defense, which show a fighter's ability to avoid damage and resist takedowns.

---

# Organization of the Data for Training

In my opinion, the best organization of the dataframe is one row per match. Then you set a feature that subtracts fighter2’s stats from fighter1’s stat. Keep only these difference stats, and then create the model so it predicts only if fighter1 will win. It halves the amount of features the model has to look at with the same outcome.

Next, we limit the data to only fights in 2014. The reason for this is the early UFC fights were under different rulesets and very different skillsets. We’re trying to predict future fights and 2014 is a pretty good cutoff for the modern era of MMA. Additionally, the UFC greatly increased the fights per year starting in 2014. (***from mma-ai.net***)

Removing women’s fights, DQs, overturned, win by points deducted or illegal move, no contests, draws and last, split decisions.
- Women’s fights are harder to predict and they fight in different statistical patterns.
- We don’t want the model to predict a fight that was lost by points or illegal moves because those endings don’t show up in the fight stats.
- No contests and draws happen so rarely it’s worth removing them so the model doesn’t try to predict fights go to a draw.
- Split decisions are sketchy in the UFC because judges are bad or possibly because they’re influenced. Removing these from the dataset saw a 1% increase in prediction accuracy.

***MERGE ALL DATA TO EVENT_DATA_SHERDOG.CSV***

---

In [32]:
!rm -rf data-bak
!mv data data-bak
!mkdir data
!cp ../Scrapers/data/fighter_info.csv data/
!cp ../Scrapers/data/event_data_sherdog.csv data/
!cp ../Scrapers/data/github/master.csv data/
!cp -R ../Scrapers/data/fighters data/

# Generate Features

In [33]:
# fighter1_age_on_fightnight
# fighter2_age_on_fightnight
# age_difference

event_data = pd.read_csv('data/event_data_sherdog.csv')
fighter_info = pd.read_csv('data/fighter_info.csv')

# Clean the fighter info data and convert 'Birth Date' to datetime
fighter_info_cleaned = fighter_info[fighter_info['Birth Date'] != '-'].copy()
fighter_info_cleaned['Birth Date'] = pd.to_datetime(fighter_info_cleaned['Birth Date'], errors='coerce')

# Convert 'Event Date' to datetime, ensuring it becomes timezone-naive
event_data['Event Date'] = pd.to_datetime(event_data['Event Date'], errors='coerce', utc=True).dt.tz_localize(None)

# Merge age data with event data
event_data_merged = event_data.merge(fighter_info_cleaned[['Fighter_ID', 'Birth Date']], left_on='Fighter 1 ID', right_on='Fighter_ID')
event_data_merged = event_data_merged.merge(fighter_info_cleaned[['Fighter_ID', 'Birth Date']], left_on='Fighter 2 ID', right_on='Fighter_ID', suffixes=('_fighter1', '_fighter2'))

# Ensure that 'Birth Date' columns are timezone-naive
event_data_merged['Birth Date_fighter1'] = event_data_merged['Birth Date_fighter1'].dt.tz_localize(None)
event_data_merged['Birth Date_fighter2'] = event_data_merged['Birth Date_fighter2'].dt.tz_localize(None)

# Calculate ages on fight dates
event_data_merged['fighter_age_fighter1'] = ((event_data_merged['Event Date'] - event_data_merged['Birth Date_fighter1']).dt.days // 365)
event_data_merged['fighter_age_fighter2'] = ((event_data_merged['Event Date'] - event_data_merged['Birth Date_fighter2']).dt.days // 365)

# Rename the columns
event_data_merged = event_data_merged.rename(columns={
    'fighter_age_fighter1': 'fighter1_age_on_fightnight',
    'fighter_age_fighter2': 'fighter2_age_on_fightnight'
})

# Calculate age difference
event_data_merged['age_difference'] = event_data_merged['fighter1_age_on_fightnight'] - event_data_merged['fighter2_age_on_fightnight']

# Drop unnecessary columns
event_data_final = event_data_merged.drop(columns=['Fighter_ID_fighter1', 'Fighter_ID_fighter2', 'Birth Date_fighter1', 'Birth Date_fighter2'])

# Save the updated DataFrame back to the CSV file
event_data_final.to_csv('data/event_data_sherdog.csv', index=False)

In [34]:
# current_win_streak
# fighter1_current_win_streak
# fighter2_current_win_streak

event_data = pd.read_csv('data/event_data_sherdog.csv')
fighter_info = pd.read_csv('data/fighter_info.csv')

# Initialize a dictionary to store win streaks
win_streaks = {}

# Recalculate win streaks based on Fighter_ID matching with Fighter 1 ID and Fighter 2 ID
for fighter_id in fighter_info['Fighter_ID']:
    # Filter event data for fights involving this fighter based on Fighter 1 ID and Fighter 2 ID
    fights = event_data[(event_data['Fighter 1 ID'] == fighter_id) | (event_data['Fighter 2 ID'] == fighter_id)].copy()
    
    # Sort fights by event date to consider the most recent first
    fights = fights.sort_values(by='Event Date', ascending=False)
    
    # Initialize win streak counter
    win_streak = 0
    
    # Iterate through the sorted fights to calculate win streak
    for _, fight in fights.iterrows():
        if (fight['Fighter 1 ID'] == fighter_id and fight['Winning Fighter'] == fight['Fighter 1']) or \
           (fight['Fighter 2 ID'] == fighter_id and fight['Winning Fighter'] == fight['Fighter 2']):
            win_streak += 1
        else:
            break  # Stop counting when a loss is encountered
    
    # Store the win streak for this fighter
    win_streaks[fighter_id] = win_streak

# Append the current win streak to all rows for each fighter
event_data['fighter1_current_win_streak'] = event_data['Fighter 1 ID'].map(win_streaks)
event_data['fighter2_current_win_streak'] = event_data['Fighter 2 ID'].map(win_streaks)

# Save the updated event_data DataFrame back to a CSV file
event_data.to_csv('data/event_data_sherdog.csv', index=False)

# Create a new column 'current_win_streak' in the fighter_info DataFrame using the Fighter_ID
fighter_info['current_win_streak'] = fighter_info['Fighter_ID'].map(win_streaks)

# Save the updated fighter_info DataFrame back to the CSV file
fighter_info.to_csv('data/fighter_info.csv', index=False)


In [36]:
# recent_win_rate_7fights
# recent_win_rate_5fights
# recent_win_rate_3fights
# fighter1_recent_win_rate_7fights
# fighter1_recent_win_rate_5fights
# fighter1_recent_win_rate_3fights
# fighter2_recent_win_rate_7fights
# fighter2_recent_win_rate_5fights
# fighter2_recent_win_rate_3fights

event_data = pd.read_csv('data/event_data_sherdog.csv')
fighter_info = pd.read_csv('data/fighter_info.csv')

# Ensure 'Event Date' is in datetime format
event_data['Event Date'] = pd.to_datetime(event_data['Event Date'])

# Define the number of recent fights to consider
recent_fights_list = [7, 5, 3]

# Initialize a dictionary to store recent win rates for each N
recent_win_rates = {N: {} for N in recent_fights_list}

# Loop through each fighter in fighter_info using their Fighter_ID
for fighter_id in fighter_info['Fighter_ID']:
    # Filter event data for fights involving this fighter based on Fighter 1 ID and Fighter 2 ID
    fights = event_data[(event_data['Fighter 1 ID'] == fighter_id) | (event_data['Fighter 2 ID'] == fighter_id)].copy()
    
    # Sort the fights by event date in descending order (most recent first)
    fights = fights.sort_values(by='Event Date', ascending=False)
    
    # Calculate the win rate for each N in recent_fights_list
    for N in recent_fights_list:
        recent_fights = fights.head(N)
        
        if len(recent_fights) > 0:
            wins = sum((recent_fights['Fighter 1 ID'] == fighter_id) & (recent_fights['Winning Fighter'] == recent_fights['Fighter 1']) |
                       (recent_fights['Fighter 2 ID'] == fighter_id) & (recent_fights['Winning Fighter'] == recent_fights['Fighter 2']))
            win_rate = wins / len(recent_fights)
        else:
            win_rate = 0  # No fights, so win rate is 0
        
        recent_win_rates[N][fighter_id] = win_rate

# These lines were added:
for N in recent_fights_list:
    event_data[f'fighter1_recent_win_rate_{N}fights'] = event_data['Fighter 1 ID'].map(recent_win_rates[N])
    event_data[f'fighter2_recent_win_rate_{N}fights'] = event_data['Fighter 2 ID'].map(recent_win_rates[N])

# This line was also added to save the updated event_data:
event_data.to_csv('data/event_data_sherdog.csv', index=False)

# Create new columns in the fighter_info DataFrame for each N
for N in recent_fights_list:
    fighter_info[f'recent_win_rate_{N}fights'] = fighter_info['Fighter_ID'].map(recent_win_rates[N])

# Save the updated fighter_info DataFrame back to the CSV file
fighter_info.to_csv('data/fighter_info.csv', index=False)

# Display the first few rows to verify the results
fighter_info[['Fighter', 'recent_win_rate_7fights', 'recent_win_rate_5fights', 'recent_win_rate_3fights']].head()

Unnamed: 0,Fighter,recent_win_rate_7fights,recent_win_rate_5fights,recent_win_rate_3fights
0,ikram aliskerov,0.666667,0.666667,0.666667
1,robert whittaker,0.714286,0.6,0.666667
2,antonio trocoli,0.0,0.0,0.0
3,felipe lima,1.0,1.0,1.0
4,daniel rodriguez,0.571429,0.4,0.0


In [38]:
# current_layoff
# fighter1_current_layoff
# fighter2_current_layoff

event_data = pd.read_csv('data/event_data_sherdog.csv')
fighter_info = pd.read_csv('data/fighter_info.csv')

# Ensure 'Event Date' is in datetime format
event_data['Event Date'] = pd.to_datetime(event_data['Event Date'])

# Initialize a dictionary to store layoff times
layoff_times = {}

# Get the current date
current_date = datetime.now()

# Loop through each fighter in fighter_info using their Fighter_ID
for fighter_id in fighter_info['Fighter_ID']:
    # Filter event data for fights involving this fighter based on Fighter 1 ID and Fighter 2 ID
    fights = event_data[(event_data['Fighter 1 ID'] == fighter_id) | (event_data['Fighter 2 ID'] == fighter_id)].copy()
    
    # If the fighter has participated in any fights
    if not fights.empty:
        # Find the most recent fight date
        most_recent_fight_date = fights['Event Date'].max()
        
        # Calculate the layoff time in days
        layoff_time = (current_date - most_recent_fight_date).days
    else:
        layoff_time = None  # If no fights are found, set layoff time to None
    
    # Store the layoff time for this fighter
    layoff_times[fighter_id] = layoff_time

# Map the layoff times to the event_data DataFrame for both fighters
event_data['fighter1_current_layoff'] = event_data['Fighter 1 ID'].map(layoff_times)
event_data['fighter2_current_layoff'] = event_data['Fighter 2 ID'].map(layoff_times)

# Save the updated event_data DataFrame back to a CSV file
event_data.to_csv('data/event_data_sherdog.csv', index=False)

# Create a new column 'layoff' in the fighter_info DataFrame using the Fighter_ID
fighter_info['current_layoff'] = fighter_info['Fighter_ID'].map(layoff_times)

# Save the updated fighter_info DataFrame back to the CSV file
fighter_info.to_csv('data/fighter_info.csv', index=False)

# Display the first few rows to verify the results
print(fighter_info[['Fighter', 'current_layoff']].head())


            Fighter  current_layoff
0   ikram aliskerov            55.0
1  robert whittaker            55.0
2   antonio trocoli            55.0
3       felipe lima            55.0
4  daniel rodriguez            55.0


In [40]:
# height_in_inches
# fighter1_height_in_inches
# fighter2_height_in_inches

file_path = 'data/fighter_info.csv'
df = pd.read_csv(file_path)

# Function to convert height to inches
def convert_height_to_inches(height_str):
    if pd.isna(height_str):  # Handle missing values
        return None
    try:
        # Split the height string into feet and inches
        parts = height_str.split("'")
        
        # If there are no inches, assume 0 inches
        feet = int(parts[0])
        inches = int(parts[1].replace('"', '')) if len(parts) > 1 else 0
        
        return feet * 12 + inches
    except ValueError:
        # Handle unexpected formats
        return None

# Apply the conversion to the 'Height' column
df['height_in_inches'] = df['Height'].apply(convert_height_to_inches)

# Save the updated fighter_info DataFrame back to the CSV file
df.to_csv(file_path, index=False)

# Load the event data
event_data = pd.read_csv('data/event_data_sherdog.csv')

# Map the heights in inches to the event_data DataFrame
event_data['fighter1_height_in_inches'] = event_data['Fighter 1 ID'].map(df.set_index('Fighter_ID')['height_in_inches'])
event_data['fighter2_height_in_inches'] = event_data['Fighter 2 ID'].map(df.set_index('Fighter_ID')['height_in_inches'])

# Save the updated event_data DataFrame back to a CSV file
event_data.to_csv('data/event_data_sherdog.csv', index=False)

# Display the first few rows to verify the results
print(event_data[['Fighter 1', 'fighter1_height_in_inches', 'Fighter 2', 'fighter2_height_in_inches']].head())

               Fighter 1  fighter1_height_in_inches         Fighter 2  \
0       Robert Whittaker                       72.0   Ikram Aliskerov   
1       Alexander Volkov                       79.0  Sergei Pavlovich   
2        Kelvin Gastelum                       69.0  Daniel Rodriguez   
3  Sharabutdin Magomedov                       74.0   Antonio Trocoli   
4        Volkan Oezdemir                       73.0     Johnny Walker   

   fighter2_height_in_inches  
0                       72.0  
1                       75.0  
2                       72.0  
3                       77.0  
4                       77.0  


In [42]:
# reach dictionary

df = pd.read_csv('data/master.csv')

# Create an empty dictionary to store the fighter reaches
fighter_reaches = {}

# Loop through each row in the DataFrame
for _, row in df.iterrows():
    # Check if FIGHTER1 has a reach value and is not already in the dictionary
    if pd.notna(row['FIGHTER1_REACH']) and row['FIGHTER1'] not in fighter_reaches:
        fighter_reaches[row['FIGHTER1']] = row['FIGHTER1_REACH']
    
    # Check if FIGHTER2 has a reach value and is not already in the dictionary
    if pd.notna(row['FIGHTER2_REACH']) and row['FIGHTER2'] not in fighter_reaches:
        fighter_reaches[row['FIGHTER2']] = row['FIGHTER2_REACH']

# Convert the dictionary to a DataFrame
reach_df = pd.DataFrame(list(fighter_reaches.items()), columns=['Fighter', 'Reach'])

# Save the DataFrame to a CSV file in the 'data/' directory
reach_df.to_csv('data/reach_dictionary.csv', index=False)

print("Reach dictionary has been saved to data/reach_dictionary.csv")


Reach dictionary has been saved to data/reach_dictionary.csv


In [43]:
# reach

updated_master = pd.read_csv('data/master.csv')
fighter_info = pd.read_csv('data/fighter_info.csv')

# Standardize the fighter names in both datasets (lowercase and stripped of extra spaces)
updated_master['FIGHTER1'] = updated_master['FIGHTER1'].str.lower().str.strip()
updated_master['FIGHTER2'] = updated_master['FIGHTER2'].str.lower().str.strip()
fighter_info['Fighter'] = fighter_info['Fighter'].str.lower().str.strip()

# Merge reach data for FIGHTER1
fighter_info = fighter_info.merge(
    updated_master[['FIGHTER1', 'FIGHTER1_REACH']],
    left_on='Fighter',
    right_on='FIGHTER1',
    how='left'
).rename(columns={'FIGHTER1_REACH': 'Reach'})

# Merge reach data for FIGHTER2 (ensure it doesn't overwrite FIGHTER1 reach if already filled)
fighter_info = fighter_info.merge(
    updated_master[['FIGHTER2', 'FIGHTER2_REACH']],
    left_on='Fighter',
    right_on='FIGHTER2',
    how='left'
)

# Fill missing reach values with the reach from FIGHTER2, if available
fighter_info['Reach'] = fighter_info['Reach'].fillna(fighter_info['FIGHTER2_REACH'])

# Drop the helper columns used for merging
fighter_info = fighter_info.drop(columns=['FIGHTER1', 'FIGHTER2', 'FIGHTER2_REACH'])

# Save the updated fighter_info DataFrame back to the CSV file
fighter_info.to_csv('data/fighter_info.csv', index=False)

# Count the number of fighters with no match (Reach is still NaN)
unmatched_count = fighter_info['Reach'].isna().sum()
missing_fighters = fighter_info[fighter_info['Reach'].isna()]['Fighter']

# Display the first few rows to verify the results
print(f"Number of fighters with no match: {unmatched_count}")
print("Fighters with no match:")
print(missing_fighters.tolist())


Number of fighters with no match: 380
Fighters with no match:
['muhammadjon naimov', 'jose mariscal', 'chang ho lee', 'sharabutdin magomedov', 'long xiao', 'jhonata diniz', 'elves brener', 'elves brener', 'elves brener', 'elves brener', 'elves brener', 'elves brener', 'tokitaka nakanishi', 'yunfeng li', 'vinicius oliveira', 'jieleyisi baergeng', 'kierandip singh sahota', 'shuai yin', 'lisa kyriacou', 'qihui yan', 'kyu sung kim', 'seung woo choi', 'douglassilva silva de andrade', 'jeong yeong lee', 'doo ho choi', 'timothy cuamba', 'asu almabaev', 'nathan maness', 'nuerdanbieke shayilan', 'christian leroy duncan', 'christian leroy duncan', 'christian leroy duncan', 'mick parkin', 'mick parkin', 'mick parkin', 'kalinn williams', 'george tokkos', 'shin haraguchi', 'abusupiyan magomedov', 'tatsuya ando', 'jun young hong', 'bahatebole batebolati', 'zachary reese', 'han seul kim', 'kangjie zhu', 'brunno ferreira', 'brunno ferreira', 'brunno ferreira', 'brunno ferreira', 'heili alateng', 'char

In [49]:
# fighter1_total_wins
# fighter1_total_losses
# fighter2_total_wins
# fighter2_total_losses

fighter_info = pd.read_csv('data/fighter_info.csv')
event_data = pd.read_csv('data/event_data_sherdog.csv')

# Ensure 'Fighter' is set as the index in fighter_info for mapping
fighter_info.set_index('Fighter', inplace=True)

# Normalize the fighter names in both DataFrames (lowercase and strip spaces)
fighter_info.index = fighter_info.index.str.lower().str.strip()
event_data['Fighter 1'] = event_data['Fighter 1'].str.lower().str.strip()
event_data['Fighter 2'] = event_data['Fighter 2'].str.lower().str.strip()

# Drop duplicates by keeping the first occurrence
fighter_info_unique = fighter_info[~fighter_info.index.duplicated(keep='first')]

# Map the total wins and losses to the event_data DataFrame for Fighter 1 and Fighter 2
event_data['fighter1_total_wins'] = event_data['Fighter 1'].map(fighter_info_unique['Wins'])
event_data['fighter1_total_losses'] = event_data['Fighter 1'].map(fighter_info_unique['Losses'])
event_data['fighter2_total_wins'] = event_data['Fighter 2'].map(fighter_info_unique['Wins'])
event_data['fighter2_total_losses'] = event_data['Fighter 2'].map(fighter_info_unique['Losses'])

# Save the updated event_data DataFrame back to a CSV file
event_data.to_csv('data/event_data_sherdog.csv', index=False)

# Display the first few rows to verify the results
print(event_data[['Fighter 1', 'fighter1_total_wins', 'fighter1_total_losses', 
                  'Fighter 2', 'fighter2_total_wins', 'fighter2_total_losses']].head())

               Fighter 1  fighter1_total_wins  fighter1_total_losses  \
0       robert whittaker                 26.0                    7.0   
1       alexander volkov                 38.0                   10.0   
2        kelvin gastelum                 19.0                    9.0   
3  sharabutdin magomedov                 14.0                    0.0   
4        volkan oezdemir                 20.0                    7.0   

          Fighter 2  fighter2_total_wins  fighter2_total_losses  
0   ikram aliskerov                 15.0                    2.0  
1  sergei pavlovich                 18.0                    3.0  
2  daniel rodriguez                 17.0                    5.0  
3   antonio trocoli                 12.0                    4.0  
4     johnny walker                 21.0                    9.0  


In [51]:
# # Assign features to variables

# event_data = pd.read_csv('data/event_data_sherdog.csv')
# fighter_info = pd.read_csv('data/fighter_info.csv')

# # Now the features are stored in variables within the DataFrame
# fighter_age_fighter1 = event_data['fighter1_age_on_fightnight']
# fighter_age_fighter2 = event_data['fighter2_age_on_fightnight']
# age_difference = event_data['age_difference']

# # Convert the 'Height' column in fighter_info to inches directly in the DataFrame
# fighter_info['Height_inches'] = fighter_info['Height'].apply(lambda x: int(re.search(r"(\d+)'", x).group(1)) * 12 +
#                                                                      int(re.search(r"'(\d+)\"", x).group(1)) if isinstance(x, str) and re.search(r"(\d+)'", x) and re.search(r"'(\d+)\"", x) else None)

# # Extract the relevant columns for fighter1 and fighter2 height
# fighter1_height = fighter_info[['Fighter_ID', 'Height_inches']].rename(columns={'Height_inches': 'fighter1_height'})
# fighter2_height = fighter_info[['Fighter_ID', 'Height_inches']].rename(columns={'Height_inches': 'fighter2_height'})

# # At this point, you can directly use these height columns (fighter1_height, fighter2_height) in your model preparation code
# print(fighter1_height.head())
# print(fighter2_height.head())

# fighter_info.to_csv('data/fighter_info.csv', index=False)

---

# Basic Model 1

1. Total Wins
2. Total Losses
<!-- 3. Reach -->
4. Height
5. Current Win Streak
<!-- 6. Opponent’s Win-Loss Record -->

In [52]:
# Check final dataset is clean

event_data = pd.read_csv('data/event_data_sherdog.csv')

# List all columns
print("Columns in the dataset:")
print(event_data.columns)

# Display the first few rows to understand the structure
print("\nFirst few rows of the dataset:")
print(event_data.head())

# Check for missing values
print("\nMissing values in each column:")
print(event_data.isnull().sum())

# Check for duplicates
print("\nCheck for duplicate rows:")
print(event_data.duplicated().sum())

# Get summary statistics
print("\nSummary statistics for numerical columns:")
print(event_data.describe())

# Check data types
print("\nData types of each column:")
print(event_data.dtypes)


Columns in the dataset:
Index(['Event Name', 'Event Location', 'Event Date', 'Fighter 1', 'Fighter 2',
       'Fighter 1 ID', 'Fighter 2 ID', 'Weight Class', 'Winning Fighter',
       'Winning Method', 'Winning Round', 'Winning Time', 'Referee',
       'Fight Type', 'fighter1_age_on_fightnight',
       'fighter2_age_on_fightnight', 'age_difference',
       'fighter1_current_win_streak', 'fighter2_current_win_streak',
       'fighter1_recent_win_rate_7fights', 'fighter2_recent_win_rate_7fights',
       'fighter1_recent_win_rate_5fights', 'fighter2_recent_win_rate_5fights',
       'fighter1_recent_win_rate_3fights', 'fighter2_recent_win_rate_3fights',
       'fighter1_current_layoff', 'fighter2_current_layoff',
       'fighter1_height_in_inches', 'fighter2_height_in_inches',
       'fighter1_total_wins', 'fighter1_total_losses', 'fighter2_total_wins',
       'fighter2_total_losses'],
      dtype='object')

First few rows of the dataset:
                               Event Name  \
0  UFC

In [None]:
# Skeleton Code

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load the Dataset
# Assuming you have a CSV file with the relevant data
data = pd.read_csv('mma_fights.csv')

# 2. Feature Selection
# Selecting the minimal features we discussed
features = ['Total Wins', 'Total Losses', 'Reach', 'Height', 'Current Win Streak', 'Opponent Win-Loss Record']
X = data[features]

# Target variable (1 for win, 0 for loss)
y = data['Fight Outcome']  # Adjust the column name as needed

# 3. Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Initialize the XGBoost Classifier
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# 5. Train the Model
model.fit(X_train, y_train)

# 6. Make Predictions
y_pred = model.predict(X_test)

# 7. Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Detailed Classification Report
print(classification_report(y_test, y_pred))

# 8. Feature Importance (Optional)
# This helps you understand which features are most important
importances = model.feature_importances_
for feature, importance in zip(features, importances):
    print(f'{feature}: {importance:.2f}')


In [60]:
# Test 1

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset
file_path = 'data/event_data_sherdog.csv'
df = pd.read_csv(file_path)

# Select relevant columns for the model
features = [
    'fighter1_age_on_fightnight',
    'fighter2_age_on_fightnight',
    'age_difference',
    'fighter1_height_in_inches',
    'fighter2_height_in_inches',
    'fighter1_total_wins',
    'fighter1_total_losses',
    'fighter2_total_wins',
    'fighter2_total_losses',
    'fighter1_current_win_streak',
    'fighter2_current_win_streak',
    'fighter1_recent_win_rate_3fights',
    'fighter2_recent_win_rate_3fights',
    'fighter1_current_layoff',
    'fighter2_current_layoff'
]

# Extract features and target variable
X = df[features].copy()
y = df['Winning Fighter']  # Assuming 'Winning Fighter' is the target variable

# Handle missing values by imputing with mean (for numeric columns)
X.fillna(X.mean(), inplace=True)

# Encode target variable if necessary
# y = y.apply(lambda x: 1 if x == 'Fighter 1' else 0)  # Binary encoding (e.g., 1 for Fighter 1 wins, 0 for Fighter 2 wins)
y = y.apply(lambda x: 1 if x == 'Fighter 1' else 0)  # Assuming 'Fighter 1' is the positive class
# df['Winning Fighter'] = df['Winning Fighter'].apply(lambda x: 1 if x == 'Fighter 1' else 0)
# print(df['Winning Fighter'].value_counts()) # Verify the encoding

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and train the XGBoost model
model = XGBClassifier(eval_metric='mlogloss')
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Save the model if needed
import joblib
joblib.dump(model, 'xgboost_model.pkl')


Confusion Matrix:
[[1368]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1368

    accuracy                           1.00      1368
   macro avg       1.00      1.00      1.00      1368
weighted avg       1.00      1.00      1.00      1368





['xgboost_model.pkl']

In [None]:
# Test 2

file_path = 'data/event_data_sherdog.csv'
data = pd.read_csv(file_path)

# Fill missing values for critical features
data['fighter1_total_wins'].fillna(data['fighter1_total_wins'].median(), inplace=True)
data['fighter2_total_wins'].fillna(data['fighter2_total_wins'].median(), inplace=True)
data['fighter1_total_losses'].fillna(data['fighter1_total_losses'].median(), inplace=True)
data['fighter2_total_losses'].fillna(data['fighter2_total_losses'].median(), inplace=True)
data['fighter1_height_in_inches'].fillna(data['fighter1_height_in_inches'].median(), inplace=True)
data['fighter2_height_in_inches'].fillna(data['fighter2_height_in_inches'].median(), inplace=True)

# Select features for the model
features = [
    'fighter1_age_on_fightnight', 'fighter2_age_on_fightnight', 
    'fighter1_height_in_inches', 'fighter2_height_in_inches', 
    'fighter1_total_wins', 'fighter1_total_losses', 
    'fighter2_total_wins', 'fighter2_total_losses', 
    'fighter1_current_win_streak', 'fighter2_current_win_streak', 
    'fighter1_recent_win_rate_3fights', 'fighter2_recent_win_rate_3fights', 
    'fighter1_current_layoff', 'fighter2_current_layoff'
]

# Prepare the cleaned data
cleaned_data = data[features]

# Confirm there are no missing values
print("Missing values after cleaning:")
print(cleaned_data.isnull().sum())

# Check the first few rows of the cleaned data
print("\nCleaned data preview:")
print(cleaned_data.head())


In [None]:
# # Basic Model 1
# # Feature Selection
# features = ['Wins', 'Losses', 'Height_in_inches', 'win_streak']
# X = data[features]

# # Target variable (1 for win, 0 for loss)
# y = data['Fight Outcome']  # Adjust the column name as needed

# # 3. Split the Data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # 4. Initialize the XGBoost Classifier
# model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# # 5. Train the Model
# model.fit(X_train, y_train)

# # 6. Make Predictions
# y_pred = model.predict(X_test)

# # 7. Evaluate the Model
# accuracy = accuracy_score(y_test, y_pred)
# print(f'Accuracy: {accuracy:.2f}')

# # Detailed Classification Report
# print(classification_report(y_test, y_pred))

# # 8. Feature Importance (Optional)
# # This helps you understand which features are most important
# importances = model.feature_importances_
# for feature, importance in zip(features, importances):
#     print(f'{feature}: {importance:.2f}')


In [None]:
# # Basic Model 2

# # Basic Features
# df['age_difference'] = df['fighter_age'] - df['opponent_age']
# df['height_difference'] = df['fighter_height'] - df['opponent_height']
# df['reach_difference'] = df['fighter_reach'] - df['opponent_reach']
# df['recent_win_rate'] = df['fighter_recent_wins'] / (df['fighter_recent_wins'] + df['fighter_recent_losses'])
# df['opponent_recent_win_rate'] = df['opponent_recent_wins'] / (df['opponent_recent_wins'] + df['opponent_recent_losses'])
# df['win_streak'] = df['fighter_win_streak']
# df['opponent_win_streak'] = df['opponent_win_streak']

# # Target variable
# df['outcome'] = df['outcome'].apply(lambda x: 1 if x == 'win' else 0)

# # Features and target
# X = df[['age_difference', 'height_difference', 'reach_difference', 'recent_win_rate', 'opponent_recent_win_rate', 'win_streak', 'opponent_win_streak']]
# y = df['outcome']

# # Train-test split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Initialize the XGBoost model
# model = XGBClassifier()

# # Train the model
# model.fit(X_train, y_train)

# # Make predictions
# y_pred = model.predict(X_test)

# # Evaluate the model
# accuracy = accuracy_score(y_test, y_pred)
# print(f'Accuracy: {accuracy:.2f}')
# print('Classification Report:')
# print(classification_report(y_test, y_pred))
# print('Confusion Matrix:')
# print(confusion_matrix(y_test, y_pred))

# # Feature importance (optional)
# import matplotlib.pyplot as plt
# xgb.plot_importance(model)
# plt.show()


# Dan Hooker XGBoost (GPT)

Certainly! Below is the full Python code you can use to implement XGBoost for predicting Dan Hooker's fight outcomes. This code includes data preparation, model training, and evaluation, along with visualizations for ROC and Precision-Recall curves.

### **Full Code for XGBoost Implementation**

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_curve, precision_recall_curve, auc
import matplotlib.pyplot as plt

# Load your dataset (assuming it's already loaded as a DataFrame)
# Example: dan_hooker_df = pd.read_csv('your_dataset.csv')

# Encoding categorical variables
label_encoder = LabelEncoder()
dan_hooker_df['METHOD_ENCODED'] = label_encoder.fit_transform(dan_hooker_df['METHOD'])
dan_hooker_df['LOCATION_ENCODED'] = label_encoder.fit_transform(dan_hooker_df['LOCATION'])

# Feature selection
features = dan_hooker_df[['ROUND', 'METHOD_ENCODED', 'LOCATION_ENCODED']]
target = dan_hooker_df['Dan_Hooker_Win']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Standardize the feature set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Advanced Model: XGBoost Classifier
xgb_clf = XGBClassifier(random_state=42)
xgb_clf.fit(X_train_scaled, y_train)
xgb_preds = xgb_clf.predict(X_test_scaled)

# Evaluate the XGBoost model
xgb_accuracy = accuracy_score(y_test, xgb_preds)
xgb_report = classification_report(y_test, xgb_preds)

# Feature Importance
feature_importances = xgb_clf.feature_importances_

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, xgb_clf.predict_proba(X_test_scaled)[:,1])
roc_auc = auc(fpr, tpr)

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, xgb_clf.predict_proba(X_test_scaled)[:,1])
pr_auc = auc(recall, precision)

# Plotting ROC and Precision-Recall Curves
plt.figure(figsize=(14, 6))

# ROC Curve
plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")

# Precision-Recall Curve
plt.subplot(1, 2, 2)
plt.plot(recall, precision, color='blue', lw=2, label='PR curve (area = %0.2f)' % pr_auc)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")

plt.show()

# Output the results
print("XGBoost Model Accuracy:", xgb_accuracy)
print("Classification Report:\n", xgb_report)
print("Feature Importances:", feature_importances)
```

### **Steps to Run the Code:**
1. **Install Required Packages**:
   - Ensure you have `xgboost`, `scikit-learn`, and `matplotlib` installed. You can install them using pip:
     ```bash
     pip install xgboost scikit-learn matplotlib
     ```

2. **Prepare Your Data**:
   - Make sure your data is loaded into a pandas DataFrame similar to how `dan_hooker_df` is used in the code.

3. **Run the Code**:
   - Execute the code in your Python environment. The model will be trained, evaluated, and the ROC and Precision-Recall curves will be plotted.

This code should give you a robust implementation of the XGBoost model, along with detailed evaluation metrics and visualizations.

In [None]:
# Delete columns 

event_data = pd.read_csv('data/event_data_sherdog.csv')

# List of columns to remove
columns_to_keep = [
    'Event Name',
    'Event Location',
    'Event Date',
    'Fighter 1',
    'Fighter 2',
    'Fighter 1 ID',
    'Fighter 2 ID',
    'Weight Class',
    'Winning Fighter',
    'Winning Method',
    'Winning Round',
    'Winning Time',
    'Referee',
    'Fight Type',
]

# Keep only the specified columns
event_data = event_data[columns_to_keep]

# Optionally, save the updated DataFrame back to the CSV file
event_data.to_csv('data/event_data_sherdog.csv', index=False)

# Display the first few rows to check
print(event_data.head())



In [62]:
!open data/event_data_sherdog.csv

In [None]:
!open data/fighter_info.csv