# Statistical DataFrame for Fight Outcome Prediction

#### This DataFrame supports predictive modeling of UFC fight outcomes by consolidating fight statistics, event details, and fighter attributes in one table. It includes the fight method, fighter stance, age, reach, and historical performance records, all of which feed the analysis and prediction of fight results. The dataset is the basis for machine learning models that forecast the outcome of MMA fights from fighters' skills, physical attributes, and fight histories.


1. **event_name**: The name of the UFC event.
2. **date**: The date of the event.
3. **location**: The city, state or region, and country where the event took place (e.g., "Las Vegas, Nevada, USA").
4. **r_fighter / b_fighter**: Names of the fighters in the red and blue corners.
5. **winner**: Indicates the winner of the fight (either "Red" or "Blue").
6. **weight_class**: The weight division for the fight (e.g., Featherweight, Welterweight).
7. **is_title_bout**: Indicates if the fight was for a championship title (1 for yes, 0 for no).
8. **gender**: Gender category of the fighters ("Men" or "Women").
9. **method**: How the fight was won (e.g., "Decision", "KO/TKO", "Submission").
10. **round**: The round in which the fight ended.
11. **time**: The time in the specified round when the fight ended.
12. **time_format**: Format of the time, such as "3 Rnd (5-5-5)".
13. **referee**: The official who oversaw the fight.

### Fight Statistics (for both red and blue fighters)
- **kd**: Number of knockdowns.
- **sig_str**: Significant strikes landed.
- **sig_str_att**: Significant strikes attempted.
- **total_str**: Total strikes landed.
- **total_str_att**: Total strikes attempted.
- **td**: Takedowns landed.
- **td_att**: Takedowns attempted.
- **sub_att**: Submission attempts.
- **pass**: Number of guard passes.
- **rev**: Number of reversals.

### Differences Between Fighters
- **age_diff**: Age difference.
- **height_diff**: Height difference.
- **reach_diff**: Reach difference.
- **weight_diff**: Weight difference.
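
As a rough illustration of how such difference columns could be derived, the sketch below subtracts the blue-corner value from the red-corner value. The `r_`/`b_` prefixes match the columns that appear in the output later in this notebook (`r_age`, `b_reach`, and so on); the red-minus-blue sign convention and the helper name `add_diff_columns` are assumptions, not something the dataset documents.

```python
import pandas as pd


def add_diff_columns(fights, attributes=("age", "height", "reach", "weight")):
    """Hypothetical helper: add '<attr>_diff' = r_<attr> - b_<attr> (assumed sign)."""
    out = fights.copy()
    for attr in attributes:
        # NaN in either corner propagates, so the diff is missing as well.
        out[f"{attr}_diff"] = out[f"r_{attr}"] - out[f"b_{attr}"]
    return out


# Toy data, not taken from the real dataset.
example = pd.DataFrame({
    "r_age": [30.0, 28.0],
    "b_age": [27.0, None],
    "r_reach": [70.0, 72.0],
    "b_reach": [68.0, 74.0],
})
example = add_diff_columns(example, attributes=("age", "reach"))
print(example[["age_diff", "reach_diff"]])
```

Because a missing value in either corner makes the difference missing, a `*_diff` column can have more gaps than either of its source columns.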

### Striking Metrics
- **SLpM_total_diff**: Difference in significant strikes landed per minute.
- **SApM_total_diff**: Difference in significant strikes absorbed per minute.
- **sig_str_acc_total_diff**: Difference in significant strike accuracy.
- **str_def_total_diff**: Difference in striking defense.

### Takedown Metrics
- **td_avg_diff**: Difference in average takedowns per 15 minutes.
- **td_acc_total_diff**: Difference in takedown accuracy.
- **td_def_total_diff**: Difference in takedown defense.

### Submission Metrics
- **sub_avg_diff**: Difference in average submission attempts.

### Fighter Records and Trends
- **current_win_streak**: Current win streak.
- **current_lose_streak**: Current losing streak.
- **draw**: Number of draws.
- **longest_win_streak**: Longest win streak.
- **total_fights**: Total number of fights.
- **total_rounds_fought**: Total rounds fought.
- **total_title_bouts**: Total title bouts.
- **record**: Win-loss-draw record.
- **win_by_KO_TKO**: Wins by KO/TKO.
- **win_by_Submission**: Wins by submission.
- **win_by_Decision**: Wins by decision.

### Fight End and Performance
- **finish**: Indicates if the fight ended in a finish (KO/TKO or Submission).
- **fight_end**: How the fight ended (e.g., "Decision", "KO").
- **avg_performance**: Average performance rating.
- **rating_trend**: Trend in performance rating.

### Win Probability
- **prob_win**: Probability of winning for each fighter.

### Miscellaneous Differences
- **average_fight_time_diff**: Difference in average fight time.
- **finish_rate_diff**: Difference in finish rates.

### Recent Performance Metrics
- **last_5_win_rate**: Win rate in the last 5 fights.
- **last_5_finish_rate**: Finish rate in the last 5 fights.
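
One way a "last 5 fights" rate could be computed is with a per-fighter rolling window. The sketch below assumes a hypothetical long-format table with one row per fighter per fight and columns `fighter`, `date`, and `won`; it also assumes the current fight is excluded from its own window (to avoid leaking the outcome). Neither assumption is confirmed by the dataset.

```python
import pandas as pd


def last_n_win_rate(history: pd.DataFrame, n: int = 5) -> pd.Series:
    """Win rate over each fighter's previous n fights (current fight excluded)."""
    ordered = history.sort_values("date")
    return ordered.groupby("fighter")["won"].transform(
        # shift(1) drops the current fight; rolling(n) averages up to n prior results.
        lambda s: s.shift(1).rolling(n, min_periods=1).mean()
    )


# Toy history, not taken from the real dataset.
history = pd.DataFrame({
    "fighter": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-06-01", "2024-01-01", "2023-03-01"]),
    "won": [1, 0, 1, 1],
})
history["last_5_win_rate"] = last_n_win_rate(history)
print(history)
```

A fighter's first recorded fight has no prior results, so its rate is left as NaN rather than guessed.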

### Conditioning
- **avg_weight_diff_opponent**: Average weight difference with opponents.

### Trends and Experience
- **performance_trend_long**: Long-term performance trend.
- **knockdown_rate**: Knockdown rate.
- **last_title_bout_performance**: Performance in the most recent title bout.
- **total_win_rate_diff**: Difference in win rates.
- **total_finish_rate_diff**: Difference in finish rates.
- **experience_diff**: Difference in overall experience.
- **experience_trend**: Experience trend.
- **career_span_diff**: Difference in career duration.



In [13]:
import pandas as pd

## Comprehensive Data Quality Checks for UFC Fight Analysis

#### This script performs essential data quality checks on UFC-related datasets to ensure readiness for predictive modeling. It loads multiple datasets, examines them for missing values, duplicates, and data type consistency, and provides statistical summaries. The insights generated help validate and prepare the data for further analysis and model training.

In [14]:

# Function for data quality checks
def data_quality_checks(df, df_name):
    # Check for missing values
    missing_values = df.isnull().sum()
    
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    
    # Check data types
    data_types = df.dtypes
    
    # Count of unique values in each column
    unique_values = df.nunique()
    
    # Statistical summary for numerical columns
    stats_summary = df.describe()

    # Get the list of all column names
    column_names = df.columns.tolist()
    
    # Display the results
    print("\nColumn Names:")
    print(column_names)
    print(f"\n=== Data Quality Checks for {df_name} ===")
    print("\nMissing Values:")
    print(missing_values[missing_values > 0])
    print(f"\nNumber of Duplicate Rows: {duplicate_rows}")
    print("\nData Types:")
    print(data_types)
    print("\nUnique Values Count:")
    print(unique_values)
    print("\nStatistical Summary for Numerical Columns:")
    print(stats_summary)


# Load the datasets
fighter_stats = pd.read_csv('../data/raw/Fighter_stats/fighter_stats.csv')
large_dataset = pd.read_csv('../data/raw/Large_set/large_dataset.csv')
medium_dataset = pd.read_csv('../data/raw/Medium_set/medium_dataset.csv')
ufc = pd.read_csv('../data/raw/ufc.csv')
ufc_master = pd.read_csv('../data/raw/ufc-master.csv')
completed_events_small = pd.read_csv('../data/raw/Small_set/completed_events_small.csv')

# Perform data quality checks for each dataset
data_quality_checks(fighter_stats, "Fighter Stats")
data_quality_checks(large_dataset, "Large Dataset")
data_quality_checks(medium_dataset, "Medium Dataset")
data_quality_checks(ufc, "UFC")
data_quality_checks(ufc_master, "UFC Master")
data_quality_checks(completed_events_small, "Completed Events Small")



Column Names:
['name', 'wins', 'losses', 'height', 'weight', 'reach', 'stance', 'age', 'SLpM', 'sig_str_acc', 'SApM', 'str_def', 'td_avg', 'td_acc', 'td_def', 'sub_avg']

=== Data Quality Checks for Fighter Stats ===

Missing Values:
name             1
wins             1
losses           1
height           1
weight           1
reach          656
stance          78
age            161
SLpM             1
sig_str_acc      1
SApM             1
str_def          1
td_avg           1
td_acc           1
td_def           1
sub_avg          1
dtype: int64

Number of Duplicate Rows: 0

Data Types:
name            object
wins           float64
losses         float64
height         float64
weight         float64
reach          float64
stance          object
age            float64
SLpM           float64
sig_str_acc    float64
SApM           float64
str_def        float64
td_avg         float64
td_acc         float64
td_def         float64
sub_avg        float64
dtype: object

Unique Values Count:
na
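
The printed report gets long when repeated for six datasets. As a compact alternative, the sketch below gathers the same per-column checks into a single table. The function name `quality_summary` and the toy frame are illustrative only and are not part of the original notebook.

```python
import pandas as pd


def quality_summary(df):
    """Hypothetical helper: one row per column with dtype, missing, and unique counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": df.isnull().mean() * 100,
        "unique": df.nunique(),
    })


# Toy data, not taken from the real dataset.
toy = pd.DataFrame({"name": ["A", "B", None], "reach": [70.0, None, None]})
summary = quality_summary(toy)
print(summary)
```

Returning a DataFrame instead of printing makes it possible to sort by `missing_pct` or to concatenate the summaries of several datasets.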

## Merging and Organizing UFC Event Datasets

This cell merges two of the datasets loaded above on a common column, `event_name`, to create a unified DataFrame. Because the medium dataset names that column `event`, it is renamed first; only the `date` and `location` fields are taken from it. After merging, the columns are reordered to place `date` and `location` immediately after `event_name`. Finally, the first few rows of the merged DataFrame are printed to confirm the result.

In [15]:

# Rename columns for matching
medium_dataset = medium_dataset.rename(columns={'event': 'event_name'})

# Merge the datasets on the common column (event_name), without method_details
merged_dataset = pd.merge(large_dataset, medium_dataset[['event_name', 'date', 'location']],
                          on='event_name', how='left')

# Reorder columns: place 'date' after 'event_name' and 'location' after 'date'
columns_order = merged_dataset.columns.tolist()
event_name_idx = columns_order.index('event_name')

# Rearrange the column order
new_columns_order = (
    columns_order[:event_name_idx + 1] +  # Up to and including 'event_name'
    ['date', 'location'] +                # Add 'date' and 'location' after 'event_name'
    columns_order[event_name_idx + 1:]    # Remaining columns
)

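# Apply the new column order; dict.fromkeys drops the repeated 'date' and
# 'location' entries while preserving order (pd.unique on a plain list is deprecated)
merged_dataset = merged_dataset.loc[:, list(dict.fromkeys(new_columns_order))]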

# Display the updated DataFrame to confirm the changes
print(merged_dataset.head())


                             event_name       date                location  \
0  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
1  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
2  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
3  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
4  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   

      r_fighter       b_fighter winner       weight_class  is_title_bout  \
0  Amanda Ribas  Rose Namajunas   Blue  Women's Flyweight              0   
1  Amanda Ribas  Rose Namajunas   Blue  Women's Flyweight              0   
2  Amanda Ribas  Rose Namajunas   Blue  Women's Flyweight              0   
3  Amanda Ribas  Rose Namajunas   Blue  Women's Flyweight              0   
4  Amanda Ribas  Rose Namajunas   Blue  Women's Flyweight              0   

  gender                method  ...  weight_diff  reach_diff  SLpM_total_d


## Analysis of Missing and Duplicate Data in UFC Dataset

This code identifies and analyzes missing and duplicate data within the merged UFC dataset. It calculates both the count and percentage of missing values for each column, creating a summary `DataFrame` that highlights only columns with missing data. It also counts the total number of duplicate rows in the dataset. The results provide insights into data completeness and help identify areas requiring data cleaning or imputation.

In [16]:
# Analysis of missing values
missing_values_count = merged_dataset.isnull().sum()
missing_values_percentage = (missing_values_count / len(merged_dataset)) * 100

# Combine the count and percentage of missing values into one DataFrame
missing_data_analysis = pd.DataFrame({
    'Missing Values Count': missing_values_count,
    'Missing Values Percentage': missing_values_percentage
})

# Filter only those columns with missing values
missing_data_analysis = missing_data_analysis[missing_data_analysis['Missing Values Count'] > 0]

# Analysis of duplicate rows
duplicate_rows_count = merged_dataset.duplicated().sum()

# Output the results
print("Analysis of Missing Values:")
print(missing_data_analysis)

print("\nNumber of Duplicate Rows:", duplicate_rows_count)


Analysis of Missing Values:
              Missing Values Count  Missing Values Percentage
date                            12                   0.014380
location                        12                   0.014380
total_rounds                   361                   0.432610
referee                        384                   0.460172
r_age                          718                   0.860426
r_reach                       3880                   4.649658
r_stance                       275                   0.329550
b_age                         1762                   2.111520
b_reach                       8507                  10.194495
b_stance                       713                   0.854435
age_diff                      1962                   2.351193
reach_diff                   10069                  12.066342

Number of Duplicate Rows: 76008
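
A duplicate count this large is consistent with the merge itself multiplying rows: if the right-hand table holds one row per fight rather than one row per event, every fight on the left is matched once for each fight at the same event. The sketch below reproduces that effect on toy data and shows one possible way to avoid it, by reducing the right-hand table to one row per `event_name` before merging. Whether this is what happened in the real data is an assumption; the frames and names here are illustrative.

```python
import pandas as pd

# Toy data, not taken from the real dataset.
fights = pd.DataFrame({
    "event_name": ["Event A", "Event A", "Event B"],
    "r_fighter": ["F1", "F3", "F5"],
})
events_per_fight = pd.DataFrame({
    "event_name": ["Event A", "Event A", "Event B"],
    "date": ["1/1/2024", "1/1/2024", "2/1/2024"],
    "location": ["City X", "City X", "City Y"],
})

# Naive merge: each "Event A" fight matches both "Event A" rows on the right.
naive = pd.merge(fights, events_per_fight, on="event_name", how="left")

# One row per event, then a merge that raises if the right side is not unique.
event_lookup = events_per_fight.drop_duplicates(subset="event_name")
clean = pd.merge(fights, event_lookup, on="event_name", how="left",
                 validate="many_to_one")

print(len(naive), naive.duplicated().sum())
print(len(clean), clean.duplicated().sum())
```

The `validate="many_to_one"` argument makes pandas raise a `MergeError` when the right-hand keys are not unique, so the problem surfaces at merge time instead of in a later duplicate check.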


## Removing Duplicate Rows from Merged UFC Dataset

This code removes duplicate rows from the merged UFC dataset to ensure data integrity. After dropping duplicates, it rechecks for any remaining duplicate rows and outputs the count to confirm their successful removal. The updated `DataFrame` is displayed to verify the changes, ensuring a clean dataset for analysis or modeling.

In [17]:
# Remove duplicate rows
merged_dataset = merged_dataset.drop_duplicates()

# Check again for duplicate rows
duplicate_rows_count_after = merged_dataset.duplicated().sum()

# Output the results
print("Number of Duplicate Rows After Removal:", duplicate_rows_count_after)

# Display the updated DataFrame to confirm the changes
print(merged_dataset.head())


Number of Duplicate Rows After Removal: 0
                              event_name       date                location  \
0   UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
13  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
26  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
39  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   
52  UFC Fight Night: Ribas vs. Namajunas  3/23/2024  Las Vegas, Nevada, USA   

            r_fighter        b_fighter winner       weight_class  \
0        Amanda Ribas   Rose Namajunas   Blue  Women's Flyweight   
13      Karl Williams      Justin Tafa    Red        Heavyweight   
26   Edmen Shahbazyan        AJ Dobson    Red       Middleweight   
39     Payton Talbott  Cameron Saaiman    Red       Bantamweight   
52  Billy Quarantillo    Youssef Zalal   Blue      Featherweight   

    is_title_bout gender                method  ...  weight_diff  reach_di
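
The row labels in the output above (0, 13, 26, ...) are the labels of the surviving rows from before deduplication, because `drop_duplicates` keeps the original index by default. If consecutive labels are preferred, a minimal sketch on toy data looks like this:

```python
import pandas as pd

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({"fight": ["a", "a", "b", "b", "c"]})

kept_labels = toy.drop_duplicates()                   # index stays 0, 2, 4
renumbered = toy.drop_duplicates(ignore_index=True)   # index becomes 0, 1, 2

print(kept_labels.index.tolist())
print(renumbered.index.tolist())
```

Keeping the original labels is harmless for saving with `index=False`, but positional code such as `df.loc[1]` behaves differently depending on which form is used.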

## Assessing Missing and Duplicate Data in UFC Dataset

In [18]:
# Analysis of missing values
missing_values_count = merged_dataset.isnull().sum()
missing_values_percentage = (missing_values_count / len(merged_dataset)) * 100

# Combine the count and percentage of missing values into one DataFrame
missing_data_analysis = pd.DataFrame({
    'Missing Values Count': missing_values_count,
    'Missing Values Percentage': missing_values_percentage
})

# Filter only those columns with missing values
missing_data_analysis = missing_data_analysis[missing_data_analysis['Missing Values Count'] > 0]

# Analysis of duplicate rows
duplicate_rows_count = merged_dataset.duplicated().sum()

# Output the results
print("Analysis of Missing Values:")
print(missing_data_analysis)

print("\nNumber of Duplicate Rows:", duplicate_rows_count)

Analysis of Missing Values:
              Missing Values Count  Missing Values Percentage
date                            12                   0.161312
location                        12                   0.161312
total_rounds                    31                   0.416723
referee                         32                   0.430165
r_age                           76                   1.021643
r_reach                        412                   5.538379
r_stance                        26                   0.349509
b_age                          190                   2.554107
b_reach                        888                  11.937088
b_stance                        68                   0.914101
age_diff                       213                   2.863288
reach_diff                    1038                  13.953488

Number of Duplicate Rows: 0
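
The missing-value analysis is run twice with identical code, once before and once after deduplication. A small helper, sketched below under the hypothetical name `missing_report`, would let both cells share one implementation and make the before/after comparison explicit.

```python
import pandas as pd


def missing_report(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    count = df.isnull().sum()
    report = pd.DataFrame({
        "Missing Values Count": count,
        "Missing Values Percentage": count / len(df) * 100,
    })
    # Keep only the columns that actually have gaps.
    return report[report["Missing Values Count"] > 0]


# Toy data, not taken from the real dataset; rows 1 and 2 are exact duplicates.
before = pd.DataFrame({
    "date": ["1/1/2024", None, None, "2/1/2024"],
    "referee": ["X", "Y", "Y", "Z"],
})
after = before.drop_duplicates()

report_before = missing_report(before)
report_after = missing_report(after)
print(report_before)
print(report_after)
```

Percentages change after deduplication because both the missing counts and the row total shrink, which is why the two tables above report different percentages for the same columns.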


In [19]:
# Saving the current DataFrame to the specified path with a descriptive filename
merged_dataset.to_csv('../data/processed/ufc_fight_data_cleaned.csv', index=False)
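
After saving, it can be worth reading the file back to confirm that it has the expected shape and columns. The sketch below shows one possible check; `save_and_verify` is a hypothetical helper, and the example writes to a temporary directory rather than to the project path used above.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(df, path):
    """Hypothetical helper: write df to CSV and confirm shape and columns round-trip."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "event_name": ["Event A", "Event B"],
    "date": ["3/23/2024", "3/30/2024"],
})

with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(toy, os.path.join(tmp, "toy.csv"))

print(ok)
```

The check compares only shape and column names; dtypes may legitimately differ after a CSV round trip.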
