In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
### Perform exploratory data analysis



# Returning Fighters Analysis


In [None]:
df = pd.read_csv('./data-dwcs/DWCS_Event_Data_Sherdog.csv')

# Combine the 'Fighter 1' and 'Fighter 2' columns into a single list
all_fighters = df['Fighter 1'].tolist() + df['Fighter 2'].tolist()

# Convert the list into a pandas Series and count the occurrences of each fighter
all_fighters_series = pd.Series(all_fighters)
fighter_counts = all_fighters_series.value_counts()

# Keep only fighters who have fought more than once
repeat_fighters = fighter_counts[fighter_counts > 1]

### GPT Analysis 1

Great, let's dig deeper into the win percentages for fighters who have fought exactly twice on the show. Here's a summary:

Summary of Win Status for Fighters with Exactly 2 Bouts
11 fighters lost both their first and second bouts.
8 fighters lost their first bout but won their second bout.
5 fighters won their first bout but lost their second bout.
11 fighters won both their first and second bouts.
Win Percentage Analysis
Among fighters who won their first bout, approximately 68.75% also won their second bout.
Among fighters who lost their first bout, approximately 42.11% won their second bout.
Taken together, fighters who won their first bout went on to win their second more often (11 of 16, 68.75%) than fighters who lost their first (8 of 19, 42.11%). With only 35 fighters in this group, the gap is suggestive rather than conclusive.

Here's the code I used to conduct this deeper analysis:

In [None]:
df = pd.read_csv('./data-dwcs/DWCS_Event_Data_Sherdog.csv')

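# Create a summary DataFrame from the extracted fighter statistics
# NOTE: summary_rows (one record per two-bout fighter, with 'Won First Bout' and
# 'Won Second Bout' flags) is not defined in this notebook and must be built first
summary_df = pd.DataFrame(summary_rows)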

# Group by 'Won First Bout' and 'Won Second Bout' to get counts
summary_counts = summary_df.groupby(['Won First Bout', 'Won Second Bout']).size().reset_index(name='Count')

# Calculate percentages for specific scenarios
won_first_won_second = summary_df[(summary_df['Won First Bout'] == True) & (summary_df['Won Second Bout'] == True)].shape[0]
won_first_total = summary_df[summary_df['Won First Bout'] == True].shape[0]
percentage_won_first_won_second = (won_first_won_second / won_first_total) * 100

lost_first_won_second = summary_df[(summary_df['Won First Bout'] == False) & (summary_df['Won Second Bout'] == True)].shape[0]
lost_first_total = summary_df[summary_df['Won First Bout'] == False].shape[0]
percentage_lost_first_won_second = (lost_first_won_second / lost_first_total) * 100
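
The cell above depends on `summary_rows`, which the notebook never builds. The following is a minimal sketch of one way to derive it, not the method actually used. It assumes the event data has a `Date` column and a `Winner` column holding the winning fighter's name; both names are hypothetical, since only `Fighter 1` and `Fighter 2` are confirmed by the code above. A small invented frame stands in for the real CSV.

```python
import pandas as pd

# Toy stand-in for the event data. 'Date' and 'Winner' are assumed column
# names; the fighters and results are invented for illustration only.
events = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-01', '2021-01-01',
                            '2021-01-01', '2022-01-01']),
    'Fighter 1': ['A', 'C', 'A', 'D', 'E'],
    'Fighter 2': ['B', 'D', 'C', 'B', 'F'],
    'Winner':    ['A', 'D', 'C', 'D', 'E'],
})

# Reshape to one row per (fighter, bout) so each fighter's bouts can be ordered
long_df = pd.concat([
    events.rename(columns={'Fighter 1': 'Fighter'})[['Date', 'Fighter', 'Winner']],
    events.rename(columns={'Fighter 2': 'Fighter'})[['Date', 'Fighter', 'Winner']],
], ignore_index=True)
long_df['Won'] = long_df['Fighter'] == long_df['Winner']
long_df = long_df.sort_values(['Fighter', 'Date'])

# Keep fighters with exactly two bouts and record the outcome of each bout
summary_rows = []
for fighter, bouts in long_df.groupby('Fighter'):
    if len(bouts) == 2:
        first, second = bouts['Won'].tolist()
        summary_rows.append({'Fighter': fighter,
                             'Won First Bout': first,
                             'Won Second Bout': second})

summary_df = pd.DataFrame(summary_rows)

# Row-normalised cross-tabulation: second-bout outcome shares given the first
second_given_first = pd.crosstab(summary_df['Won First Bout'],
                                 summary_df['Won Second Bout'],
                                 normalize='index') * 100
print(second_given_first)
```

`pd.crosstab` with `normalize='index'` yields both conditional percentages in one table, which avoids repeating the boolean filters for each scenario.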


### GPT Analysis 2

After closely examining the data for returning fighters who have had exactly two bouts, I found an interesting trend related to whether fights go the distance:

Fights Going the Distance (Ending in Round 3)
In the first bouts, 21 fights went the distance, which is approximately 60% of the total fights.
In the second bouts, 24 fights went the distance, which is approximately 68.57% of the total fights.
Fights Not Going the Distance (Ending Before Round 3)
In the first bouts, approximately 40% of the fights did not go the distance.
In the second bouts, approximately 31.43% of the fights did not go the distance.
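From this, we can observe that bouts were somewhat more likely to reach Round 3 in the second bout (68.57%) than in the first (60%) for returning fighters. Note that "ending in Round 3" is only a proxy for going the distance, since a Round 3 stoppage is counted the same way as a decision.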

Here's the code I used for this analysis:

In [None]:
df = pd.read_csv('./data-dwcs/DWCS_Event_Data_Sherdog.csv')

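# Create a round summary DataFrame from extracted 'Winning Round' data
# NOTE: round_summary_rows (one record per two-bout fighter, with 'First Bout Round'
# and 'Second Bout Round') is not defined in this notebook and must be built first
round_summary_df = pd.DataFrame(round_summary_rows)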

# Count the number of fights that went the distance (i.e., ended in round 3) in the first and second bouts
went_distance_first = round_summary_df[round_summary_df['First Bout Round'] == 3].shape[0]
went_distance_second = round_summary_df[round_summary_df['Second Bout Round'] == 3].shape[0]

# Calculate the percentages of fights that went the distance in the first and second bouts
total_fights = len(round_summary_rows)
percentage_went_distance_first = (went_distance_first / total_fights) * 100
percentage_went_distance_second = (went_distance_second / total_fights) * 100
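
The round-based count above can be cross-checked against the winning method. This is a hedged sketch only. It assumes a hypothetical `Winning Method` column whose decision entries begin with "Decision", like the "Decision (Unanimous)" value reported elsewhere in this notebook. The rows are invented.

```python
import pandas as pd

# Toy bouts; 'Winning Method' is an assumed column name and the rows are invented
bouts = pd.DataFrame({
    'Winning Round': [3, 3, 1, 2, 3],
    'Winning Method': ['Decision (Unanimous)', 'TKO (Punches)',
                       'Submission (Armbar)', 'KO (Punch)', 'Decision (Split)'],
})

# Round-based check, as in the cell above: the bout ended in round 3
ended_in_round_3 = bouts['Winning Round'] == 3

# Method-based check: the bout was decided by the judges
went_to_decision = bouts['Winning Method'].str.startswith('Decision', na=False)

pct_round_3 = ended_in_round_3.mean() * 100
pct_decision = went_to_decision.mean() * 100

# Bouts that ended in round 3 without going to a decision
round_3_finishes = int((ended_in_round_3 & ~went_to_decision).sum())
print(pct_round_3, pct_decision, round_3_finishes)
```

If `round_3_finishes` came out as zero on the real data, the two definitions would agree and the percentages above would stand unchanged.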


### GPT Analysis 3

After digging deeper into the original dataset, here are five more insightful and interesting trends:

1. Distribution by Weight Class
The most common weight class for fights is Middleweight with 88 bouts, followed by Light Heavyweight with 67 bouts and Lightweight with 50 bouts. The least represented weight classes are Strawweight and 188lb Catchweight, each with just 5 bouts.
2. Trends in Winning Methods over Time
"Decision (Unanimous)" is the most common winning method in every year covered by the data.
3. Event Locations
A significant majority of the events took place at the UFC Training Center in Las Vegas, Nevada, United States (262 bouts). The UFC Apex in Las Vegas hosted 43 bouts, making it the second most common venue.
4. Average Fight Duration by Weight Class
Bantamweight fights tend to last the longest, with an average duration of approximately 4.06 minutes. Strawweight fights are the shortest on average, lasting around 2.90 minutes.
5. Winning Fighters by Weight Class
No single fighter dominates any weight class in the available data. Wins are spread fairly evenly across many fighters.
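
The code behind these five observations is not included in the notebook. As an illustration of items 1 and 4, here is a minimal sketch assuming hypothetical `Weight Class` and `Winning Time` columns (finishing time within the final round, formatted `m:ss`) alongside `Winning Round`, and assuming five-minute rounds. The rows are invented, and "duration" is taken to mean total elapsed fight time, which may differ from the definition used above.

```python
import pandas as pd

# Toy bouts; 'Weight Class' and 'Winning Time' are assumed column names
bouts = pd.DataFrame({
    'Weight Class': ['Middleweight', 'Middleweight', 'Bantamweight'],
    'Winning Round': [1, 3, 2],
    'Winning Time': ['2:30', '5:00', '1:15'],
})

ROUND_MINUTES = 5  # assumes five-minute rounds

# Convert the "m:ss" finishing time within the final round to minutes
time_parts = bouts['Winning Time'].str.split(':', expand=True).astype(int)
minutes_in_final_round = time_parts[0] + time_parts[1] / 60

# Total elapsed time = completed rounds plus time into the final round
bouts['Duration (min)'] = (
    (bouts['Winning Round'] - 1) * ROUND_MINUTES + minutes_in_final_round
)

# Item 1: number of bouts per weight class
bouts_per_class = bouts['Weight Class'].value_counts()

# Item 4: average elapsed fight time per weight class
avg_duration = bouts.groupby('Weight Class')['Duration (min)'].mean()
print(bouts_per_class)
print(avg_duration)
```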

# Previous Fight Leagues Results (Cage Warriors)

In [1]:
import pandas as pd
import os

fighters_folder_path = 'data-dwcs/fighters/'

# Function to process each fighter's file and retain DWCS fight, plus the fight after
def filter_fighter_fights_extended(df):
    # Identify rows where the 'Event Name' contains "DWCS" or "Contender Series"
    dwcs_mask = df['Event Name'].str.contains('DWCS|Contender Series', case=False, na=False)

    # Work with row positions so the function does not depend on the index labels
    dwcs_positions = [pos for pos, is_dwcs in enumerate(dwcs_mask) if is_dwcs]

    # Collect each DWCS row and the row that follows it in the file, if it exists
    rows_to_keep = []
    for pos in dwcs_positions:
        for candidate in (pos, pos + 1):
            # Skip positions past the end and rows already selected
            # (e.g. two consecutive DWCS fights would otherwise be duplicated)
            if candidate < len(df) and candidate not in rows_to_keep:
                rows_to_keep.append(candidate)

    # Filter the DataFrame to keep only the selected rows
    filtered_df = df.iloc[rows_to_keep]

    return filtered_df

# Iterate through each file in the directory
for filename in os.listdir(fighters_folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(fighters_folder_path, filename)

        # Read the CSV file into a DataFrame
        df = pd.read_csv(file_path)

        # Process the DataFrame to keep only DWCS fight and the fight after
        filtered_df = filter_fighter_fights_extended(df)

        # Save the filtered DataFrame back to the file (or to a new file if preferred)
        filtered_df.to_csv(file_path, index=False)

        print(f"Processed and saved: {filename}")


Processed and saved: Kyler_Matrix_Phillips_190919.csv
Processed and saved: Ernesta_Heavy-Handed_Kareckaite_319093.csv
Processed and saved: Vanderson_de_Camp_285789.csv
Processed and saved: Brunno_Ferreira_322603.csv
Processed and saved: Armen_Superman_Petrosyan_302783.csv
Processed and saved: Janaina_Jana_Popozinha_Silva_284089.csv
Processed and saved: Geoff_Handz_of_Steel_Neal_72107.csv
Processed and saved: Jasmine_Jasudavicius_311001.csv
Processed and saved: Alex_Perez_12443.csv
Processed and saved: Sean_The_Sniper_Woodson_235563.csv
Processed and saved: Ginseng_Poit_DuJour_172427.csv
Processed and saved: Adam_Samurai_Bramhald_159939.csv
Processed and saved: Weldon_Silva_de_Oliveira_158505.csv
Processed and saved: Bevon_The_Extraordinary_Gentleman_Lewis_211153.csv
Processed and saved: Ikram_Aliskerov_202763.csv
Processed and saved: Dwight_Joseph_106151.csv
Processed and saved: Taylor_Tombstone_Johnson_232443.csv
Processed and saved: Ronnie_The_Heat_Lawrence_102885.csv
Processed and s
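
The selection rule in the cell above (each DWCS row plus the row directly after it in the file) can also be written without an explicit loop. This is a sketch on an invented fight history, not a replacement for the function above. It shifts the DWCS mask down by one row so that each row following a DWCS bout is flagged as well.

```python
import pandas as pd

# Toy fight history in file order; event names are invented for illustration
history = pd.DataFrame({
    'Event Name': ['UFC 300', "Dana White's Contender Series 10",
                   'CW 100', 'LFA 50', 'DWCS 3.1'],
    'Result': ['win', 'win', 'loss', 'win', 'loss'],
})

is_dwcs = history['Event Name'].str.contains('DWCS|Contender Series',
                                             case=False, na=False)

# A row is kept if it is a DWCS bout or if the row directly above it is one
follows_dwcs = is_dwcs.shift(1, fill_value=False)
filtered = history[is_dwcs | follows_dwcs]
print(filtered)
```

Because the two masks are combined with `|`, each row is selected at most once and the rows stay in file order.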

In [4]:
import pandas as pd
import os

fighters_folder_path = 'data-dwcs/fighters/'

# Function to check if fighter has Cage Warriors or CW fights
def has_cage_warriors_fight(df):
    # Identify Cage Warriors fights by looking for "Cage Warriors" or "CW" in 'Event Name'
    cage_warriors_mask = df['Event Name'].str.contains('Cage Warriors|CW', case=False, na=False)
    return cage_warriors_mask.any()

# Collect each qualifying fighter's data in a list and concatenate once at the end
fighter_frames = []

# Iterate through each file in the directory
for filename in os.listdir(fighters_folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(fighters_folder_path, filename)

        # Extract the fighter's name from the file name (assuming format: "First_Last_ID.csv")
        fighter_name = filename.rsplit('_', 1)[0].replace('_', ' ')

        # Read the CSV file into a DataFrame
        df = pd.read_csv(file_path)

        # Check if the fighter has participated in a Cage Warriors or CW event
        if has_cage_warriors_fight(df):
            df['Fighter Name'] = fighter_name  # Add the fighter's name to the DataFrame
            # Keep all of the fighter's data (not just filtered rows)
            fighter_frames.append(df)

# Save the result to a CSV file if there are any valid fighters
if fighter_frames:
    all_fighters_data = pd.concat(fighter_frames)
    all_fighters_data.to_csv("Cage_Warriors_Results.csv", index=False)
    print("Data saved to Cage_Warriors_Results.csv.")
else:
    print("No fighters with Cage Warriors or CW fights found.")


Data saved to Cage_Warriors_Results.csv.
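
Two details of the cell above are easy to check in isolation: how the fighter name is recovered from a file name, and how loosely the `'Cage Warriors|CW'` pattern matches. The sketch below uses the same `rsplit` logic and compares the original pattern with a hypothetical stricter one that requires "CW" to start a word. The stricter pattern and the sample event names are illustrative assumptions, not part of the original analysis.

```python
import re

def fighter_name_from_filename(filename):
    # Drop the trailing "_<id>.csv" part, then turn underscores into spaces
    return filename.rsplit('_', 1)[0].replace('_', ' ')

# Hypothetical stricter pattern: "CW" must start a word instead of matching anywhere
cw_pattern = re.compile(r'Cage Warriors|\bCW', flags=re.IGNORECASE)

# Sample event names for illustration; the last one is invented
event_names = ['Cage Warriors 120', 'CWFC 55', 'MacWilliams Fight Night']

results = {}
for name in event_names:
    loose = re.search('Cage Warriors|CW', name, flags=re.IGNORECASE) is not None
    strict = cw_pattern.search(name) is not None
    results[name] = (loose, strict)
    print(f"{name}: loose={loose}, strict={strict}")

print(fighter_name_from_filename('Alex_Perez_12443.csv'))
```

Whether the stricter pattern is preferable depends on the event names actually present in the fighter files, which would need to be inspected first.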


In [5]:
!open Cage_Warriors_Results.csv

In [4]:
import pandas as pd
import plotly.express as px

# Load the results data
df = pd.read_csv('Cage_Warriors_Results.csv')

# Filter for DWCS fights only (na=False treats missing event names as non-matches)
dwcs_fights = df[df['Event Name'].str.contains("Dana White's Contender Series", case=False, na=False)]

# 1. Bar chart for Win vs. Loss using Plotly
win_loss_fig = px.bar(dwcs_fights, x='Result', color='Result',
                      title='Win vs. Loss Distribution in DWCS Fights for CW Fighters',
                      labels={'Result': 'Fight Result', 'count': 'Count'},
                      color_discrete_sequence=px.colors.diverging.Tealrose)

# Remove gridlines and ensure only whole numbers on the y-axis (the count axis)
win_loss_fig.update_layout(
    yaxis=dict(showgrid=False, tickmode='linear', tick0=0, dtick=1),  # Whole-number ticks, no gridlines
    xaxis_title='Fight Result',
    yaxis_title='Count',
    template='simple_white'  # Removes background gridlines
)

# Save the Win vs. Loss chart
win_loss_fig.write_image("win_loss_distribution.png")  # Save as PNG

# 2. Bar chart for Win Methods using Plotly

# Filter for wins only and make a copy to avoid SettingWithCopyWarning
dwcs_wins = dwcs_fights[dwcs_fights['Result'] == 'win'].copy()

# Extract the method by splitting at the first occurrence of '('
# (vectorized string methods leave missing values as NaN instead of raising)
dwcs_wins['Win Method'] = dwcs_wins['Method/Referee'].str.split('(').str[0].str.strip()

win_method_fig = px.bar(dwcs_wins, y='Win Method', orientation='h', 
                        title='Distribution of Win Methods in DWCS Fights',
                        labels={'Win Method': 'Win Method', 'count': 'Count'},
                        color='Win Method',
                        color_discrete_sequence=px.colors.sequential.Viridis)

# Remove gridlines and ensure only whole numbers on x-axis
win_method_fig.update_layout(
    yaxis=dict(showgrid=False),  # Remove y-axis gridlines
    xaxis=dict(tickmode='linear', tick0=0, dtick=1),  # Ensure whole numbers on x-axis
    xaxis_title='Count',
    yaxis_title='Win Method',
    template='simple_white'  # Removes background gridlines
)

# Save the Win Methods chart
win_method_fig.write_image("win_method_distribution.png")  # Save as PNG

# 3. Bar chart for Loss Methods using Plotly

# Filter for losses only and make a copy to avoid SettingWithCopyWarning
dwcs_losses = dwcs_fights[dwcs_fights['Result'] == 'loss'].copy()

# Extract the method by splitting at the first occurrence of '('
# (vectorized string methods leave missing values as NaN instead of raising)
dwcs_losses['Loss Method'] = dwcs_losses['Method/Referee'].str.split('(').str[0].str.strip()

loss_method_fig = px.bar(dwcs_losses, y='Loss Method', orientation='h', 
                         title='Distribution of Loss Methods in DWCS Fights',
                         labels={'Loss Method': 'Loss Method', 'count': 'Count'},
                         color='Loss Method',
                         color_discrete_sequence=px.colors.sequential.Magma)

# Remove gridlines and ensure only whole numbers on x-axis
loss_method_fig.update_layout(
    yaxis=dict(showgrid=False),  # Remove y-axis gridlines
    xaxis=dict(tickmode='linear', tick0=0, dtick=1),  # Ensure whole numbers on x-axis
    xaxis_title='Count',
    yaxis_title='Loss Method',
    template='simple_white'  # Removes background gridlines
)

# Save the Loss Methods chart
loss_method_fig.write_image("loss_method_distribution.png")  # Save as PNG
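
The charts above pass one row per fight to Plotly. An alternative is to aggregate the counts in pandas first, so the plotted frame has exactly one row per category. This is a minimal sketch on invented rows, reusing the `Result` and `Method/Referee` column names from the cell above.

```python
import pandas as pd

# Toy stand-in for the DWCS rows; values are invented for illustration
fights = pd.DataFrame({
    'Result': ['win', 'win', 'loss', 'win'],
    'Method/Referee': ['TKO (Punches)', 'Decision (Unanimous)',
                       'KO (Punch)', 'TKO (Elbows)'],
})

# Keep wins only and strip the detail in parentheses from the method
wins = fights[fights['Result'] == 'win'].copy()
wins['Win Method'] = wins['Method/Referee'].str.split('(').str[0].str.strip()

# One row per win method with its count, most frequent first
method_counts = (
    wins['Win Method']
    .value_counts()
    .rename_axis('Win Method')
    .reset_index(name='Count')
)
print(method_counts)
```

A frame shaped like `method_counts` could then be passed to `px.bar` with `x='Count'` and `y='Win Method'`, making the bar lengths explicit instead of relying on Plotly Express to count rows.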
