# MLB Predictive Analysis

### By David Montoto

## Abstract
This project uses Python to build machine learning models on historical Major League Baseball (MLB) data. It has two objectives: first, to create a predictive model that estimates the likelihood of a manager's success from historical performance metrics; second, to analyze the impact of top players on overall team success. Using a dataset spanning 1870 to 2016, we apply a range of machine learning techniques to predict managerial success and use regression analysis to explore the influence of key player performance. The results offer insight into the factors driving team performance, with practical implications for team management and strategic decision-making in MLB. The workflow covers data preprocessing, exploratory data analysis, and model evaluation.

## Goal
The goal of this assignment is to use Python to build predictive models and carry out a detailed data analysis. Specifically, the project aims to:

1. Predict Managerial Success: Create a predictive model to determine the likelihood of a manager's success based on historical MLB data, using various machine learning techniques.

2. Analyze Player Impact: Examine how the performance of top players influences overall team success, employing regression and feature importance analysis.

#### Data Cleaning and Preprocessing


In [68]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

##### Load and Examine Data 

In [124]:
df = pd.read_csv('baseballdata.csv')

# Inspect first few rows
print(df.head())

# Inspect dataset info (df.info() prints its report directly and returns None)
df.info()

   Unnamed: 0  Rk  Year                    Tm       Lg    G   W   L  Ties  \
0           1   1  2016  Arizona Diamondbacks  NL West  162  69  93     0   
1           2   2  2015  Arizona Diamondbacks  NL West  162  79  83     0   
2           3   3  2014  Arizona Diamondbacks  NL West  162  64  98     0   
3           4   4  2013  Arizona Diamondbacks  NL West  162  81  81     0   
4           5   5  2012  Arizona Diamondbacks  NL West  162  81  81     0   

    W.L.  ...    R   RA Attendance BatAge  PAge  X.Bat X.P  \
0  0.426  ...  752  890  2,036,216   26.7  26.4     50  29   
1  0.488  ...  720  713  2,080,145   26.6  27.1     50  27   
2  0.395  ...  615  742  2,073,730   27.6  28.0     52  25   
3  0.500  ...  685  695  2,134,895   28.1  27.6     44  23   
4  0.500  ...  734  688  2,177,617   28.3  27.4     48  23   

            Top.Player                               Managers  \
0       J.Segura (5.7)                         C.Hale (69-93)   
1  P.Goldschmidt (8.8)            

#### Handle Missing Values

In [127]:
# Check for missing values
print(df.isnull().sum())

Unnamed: 0       0
Rk               0
Year             0
Tm               0
Lg               0
G                0
W                0
L                0
Ties             0
W.L.             0
pythW.L.         0
Finish           0
GB               0
Playoffs      2163
R                0
RA               0
Attendance      74
BatAge           0
PAge             0
X.Bat            0
X.P              0
Top.Player       0
Managers         0
current          0
dtype: int64


#### Change NULL in Playoffs Column to 'Did not make it'

In [130]:
# Impute missing values in the 'Playoffs' column with 'Did not make it'
# (assign the result back rather than using inplace=True on a column selection)
df['Playoffs'] = df['Playoffs'].fillna('Did not make it')
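The imputation step can be checked on a small frame before it is applied to the full dataset. The sketch below uses a hypothetical three-row table, and the `Made_Playoffs` flag is an illustrative column that is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the real table.
toy = pd.DataFrame({"Playoffs": ["Won WS (4-3)", np.nan, "Lost LDS (3-2)"]})

# Fill missing playoff results by assigning the filled column back.
toy["Playoffs"] = toy["Playoffs"].fillna("Did not make it")

# Illustrative binary flag: did the team reach the postseason at all?
toy["Made_Playoffs"] = toy["Playoffs"].ne("Did not make it")

print(toy)
```

Assigning the filled column back keeps the operation explicit and avoids relying on in-place modification of a column selection.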

In [132]:
# Drop the Attendance column (74 missing values), which is not needed here
df = df.drop(columns=['Attendance'])

In [None]:
def extract_wins_losses(record):
    # Find the index of the parentheses
    start = record.find('(')
    end = record.find(')')
    
    # Extract the content within parentheses
    if start != -1 and end != -1:
        record_content = record[start + 1:end]
        wins, losses = map(int, record_content.split('-'))
        return wins, losses
    return 0, 0

# Note: this function calls calculate_playoff_score, which is defined in a
# later cell (In [144]); run that cell first.
def process_manager_row(row):
    # Split managers by ',' or ' and '
    managers = [manager.strip() for part in row['Managers'].split(',') for manager in part.split(' and ')]
    
    # Extract manager records
    manager_records = [extract_wins_losses(manager) for manager in managers]
    
    rows = []
    for i, manager in enumerate(managers):
        manager_name = manager.split(' (')[0]
        wins, losses = manager_records[i]
        win_loss_record = f"{wins}-{losses}"
        
        if i == len(managers) - 1:  # Last manager gets the actual playoff result
            playoff_result = row['Playoffs']
        else:  # Other managers get 'Did not make it'
            playoff_result = 'Did not make it'
        
        # Append row for each manager
        rows.append({
            'Manager_Name': manager_name,
            'Wins': wins,
            'Losses': losses,
            'Win_Loss_Record': win_loss_record,
            'Playoff_Result': playoff_result,
            'Playoff_Score': calculate_playoff_score(playoff_result)
        })
    
    return pd.DataFrame(rows)
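
The parsing logic above can be exercised on its own before it is applied to every row. The following is a minimal sketch that uses a regular expression instead of `str.find`. The helper names `parse_manager_entry` and `split_manager_field` are hypothetical, as are the manager names in the second example; only the `Name (W-L)` format is taken from the data shown earlier.

```python
import re

# Hypothetical pattern for entries shaped like "C.Hale (69-93)".
RECORD_PATTERN = re.compile(
    r"^\s*(?P<name>.+?)\s*\((?P<wins>\d+)-(?P<losses>\d+)\)\s*$"
)

def parse_manager_entry(entry):
    """Return (name, wins, losses); fall back to (entry, 0, 0) if no record is found."""
    match = RECORD_PATTERN.match(entry)
    if match is None:
        return entry.strip(), 0, 0
    return match.group("name"), int(match.group("wins")), int(match.group("losses"))

def split_manager_field(field):
    """Split a Managers cell on ',' or ' and ' into individual entries."""
    return [part.strip() for part in re.split(r",| and ", field) if part.strip()]

print(parse_manager_entry("C.Hale (69-93)"))

# Made-up multi-manager cell, for illustration only.
entries = split_manager_field("A.Smith (10-20), B.Jones (30-40) and C.Lee (5-5)")
print([parse_manager_entry(e) for e in entries])
```

Entries that do not fit the two-number `(W-L)` shape fall back to a zero record rather than raising, which mirrors the `return 0, 0` default in `extract_wins_losses`.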

In [None]:
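# Store the per-manager rows in a new frame; overwriting df would drop the
# Managers, W, L and Playoffs columns that later cells still use.
manager_rows_df = pd.concat(df.apply(process_manager_row, axis=1).tolist(), ignore_index=True)
print(manager_rows_df.head())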

In [134]:
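# Keep the text before the first '(' as the manager name.
# Note: for seasons with several managers, only the first one listed is kept.
df['Manager_Name'] = df['Managers'].str.split('(', expand=True)[0].str.strip()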

total_unique_managers = df['Manager_Name'].nunique()
print(f"Total number of unique managers: {total_unique_managers}")

Total number of unique managers: 453
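
Because the name extraction above keeps only the text before the first `(`, a season with several managers contributes only its first manager. A minimal sketch of one way to count every manager is to split the field and use `DataFrame.explode`. The rows and names below are hypothetical and only mimic the `Name (W-L)` format.

```python
import pandas as pd

# Hypothetical rows in the same "Name (W-L)" format as the Managers column.
toy = pd.DataFrame({
    "Year": [2001, 2002],
    "Managers": ["A.Smith (81-81)", "A.Smith (20-30), B.Jones (40-72)"],
})

# One row per manager entry: split on ',' or ' and ', then explode the lists.
per_manager = toy.assign(
    Manager_Entry=toy["Managers"].str.split(r",| and ", regex=True)
).explode("Manager_Entry", ignore_index=True)

# Strip the "(W-L)" suffix to get the bare name.
per_manager["Manager_Name"] = (
    per_manager["Manager_Entry"]
    .str.replace(r"\s*\(.*\)\s*$", "", regex=True)
    .str.strip()
)

print(per_manager[["Year", "Manager_Name"]])
print(per_manager["Manager_Name"].nunique())
```

On the real data this would likely change the unique-manager count, since managers who only ever appear second or later in a season are currently not counted.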


In [136]:
# Get unique manager names
unique_managers = df['Manager_Name'].unique()

# Create a new DataFrame with unique managers
manager_df = pd.DataFrame(unique_managers, columns=['Manager_Name'])

# Reset index if needed
manager_df.reset_index(drop=True, inplace=True)

# Print the new DataFrame
print(manager_df)

     Manager_Name
0          C.Hale
1        K.Gibson
2         A.Hinch
3        B.Melvin
4        B.Brenly
..            ...
448  R.Hartsfield
449    M.Williams
450    T.Runnells
451     J.Fanning
452       K.Kuehl

[453 rows x 1 columns]


In [138]:
# Step 1: Average the team's season wins and losses over each manager's seasons.
# W and L are team totals for the season, so a manager who ran the team for
# only part of a year is credited with the full season record.
manager_stats = df.groupby('Manager_Name').agg({
    'W': 'mean',
    'L': 'mean'
}).reset_index()

manager_stats['Win_Percentage'] = manager_stats['W'] / (manager_stats['W'] + manager_stats['L'])

# Step 2: Add the aggregated data to manager_df (one row per unique manager)
manager_df = pd.merge(manager_df, manager_stats[['Manager_Name', 'W', 'L', 'Win_Percentage']], on='Manager_Name', how='left')

# Print the updated manager_df
print(manager_df.head())

  Manager_Name          W          L  Win_Percentage
0       C.Hale  74.000000  88.000000        0.456790
1     K.Gibson  80.000000  82.000000        0.493827
2      A.Hinch  78.333333  83.666667        0.483539
3     B.Melvin  80.500000  81.500000        0.496914
4     B.Brenly  81.250000  80.750000        0.501543
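
The `Win_Percentage` above is computed from mean wins and mean losses, which is algebraically the same as total wins divided by total decisions, so longer seasons carry more weight. The sketch below contrasts that pooled figure with the mean of per-season percentages. The data and the column names `Season_Pct`, `Mean_Season_Pct`, and `Pooled_Pct` are hypothetical.

```python
import pandas as pd

# Hypothetical seasons; W and L are team totals for the year.
toy = pd.DataFrame({
    "Manager_Name": ["A.Smith", "A.Smith", "B.Jones"],
    "W": [90, 60, 81],
    "L": [72, 40, 81],
})
toy["Season_Pct"] = toy["W"] / (toy["W"] + toy["L"])

# Named aggregation keeps the output column names explicit.
career = toy.groupby("Manager_Name").agg(
    Seasons=("W", "size"),
    Total_W=("W", "sum"),
    Total_L=("L", "sum"),
    Mean_Season_Pct=("Season_Pct", "mean"),
).reset_index()

# Pooled percentage: equivalent to mean(W) / (mean(W) + mean(L)).
career["Pooled_Pct"] = career["Total_W"] / (career["Total_W"] + career["Total_L"])

print(career)
```

Keeping a `Seasons` count alongside the percentage can also help later modelling, since a one-season manager and a twenty-season manager with the same percentage are not equally informative.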


In [140]:
# Find the row for Joe McCarthy
joe_mccarthy_stats = manager_df[manager_df['Manager_Name'] == 'J.McCarthy']

# Print Joe McCarthy's statistics
print(joe_mccarthy_stats)

   Manager_Name          W       L  Win_Percentage
96   J.McCarthy  94.041667  59.125        0.613983


In [142]:
unique_playoff_responses = df['Playoffs'].unique()

# Print the number of unique responses and the responses themselves
print(f"Number of unique playoff responses: {len(unique_playoff_responses)}")
print("Unique playoff responses:")
for response in unique_playoff_responses:
    print(response)

Number of unique playoff responses: 41
Unique playoff responses:
Did not make it
Lost LDS (3-2)
Lost NLCS (4-0)
Lost LDS (3-0)
Won WS (4-3)
Lost LDS (3-1)
Lost NLWC (1-0)
Lost NLCS (4-1)
Lost WS (4-0)
Lost NLCS (4-2)
Lost WS (4-2)
Won WS (4-2)
Lost WS (4-3)
Lost NLCS (3-0)
Won WS (4-0)
Won Series (5-0-1)
Lost ALWC (1-0)
Lost ALCS (4-0)
Lost ALCS (4-2)
Lost ALCS (4-1)
Won WS (4-1)
Lost ALCS (3-1)
Lost ALCS (3-2)
Lost WS (4-1)
Lost ALCS (4-3)
Won WS (5-3)
Lost NLCS (4-3)
Lost NLCS (3-2)
Won WS (4-0-1)
Tied in WS (3-3-1)
Lost WS (5-3)
Won WS (5-2)
Lost WS (4-0-1)
Lost ALCS (3-0)
Lost NLCS (3-1)
Lost WS (5-2)
Lost WS (6-3)
Won WS (6-3)
Won WS (6-4)
Lost WS (6-4)
Lost WS (10-5)


In [144]:
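def calculate_playoff_score(playoff_result):
    """Map a playoff result label to a score from 0 (missed playoffs) to 5 (won the World Series)."""
    if 'Lost ALWC' in playoff_result or 'Lost NLWC' in playoff_result:
        return 1
    elif 'Lost LDS' in playoff_result:
        return 2
    elif 'Lost ALCS' in playoff_result or 'Lost NLCS' in playoff_result:
        return 3
    elif 'Lost WS' in playoff_result or 'Tied in WS' in playoff_result:
        return 4
    elif 'Won WS' in playoff_result:
        return 5
    else:
        # Anything else scores 0, including 'Did not make it' and the
        # 'Won Series (5-0-1)' label, which matches none of the branches above.
        return 0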

test = calculate_playoff_score('Lost WS (6-4)')
print(test)

4
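
The same scoring can be written as an ordered lookup table, which makes the mapping easier to audit against the 41 labels listed earlier. This is a sketch of an alternative, not the notebook's implementation. `ROUND_SCORES` and `score_playoff_label` are hypothetical names, and the patterns are anchored to the start of the label rather than matched as substrings.

```python
import re

# Ordered (pattern, score) pairs mirroring the if/elif chain.
ROUND_SCORES = [
    (re.compile(r"^Lost (AL|NL)WC\b"), 1),      # lost a wild card game
    (re.compile(r"^Lost LDS\b"), 2),            # lost a division series
    (re.compile(r"^Lost (AL|NL)CS\b"), 3),      # lost a league championship series
    (re.compile(r"^(Lost|Tied in) WS\b"), 4),   # reached the World Series without winning it
    (re.compile(r"^Won WS\b"), 5),              # won the World Series
]

def score_playoff_label(label):
    """Return the score of the first matching pattern, or 0 if none match."""
    for pattern, score in ROUND_SCORES:
        if pattern.search(label):
            return score
    return 0

# Labels taken from the unique playoff responses printed earlier.
for label in ["Lost ALWC (1-0)", "Lost LDS (3-2)", "Lost NLCS (4-0)",
              "Tied in WS (3-3-1)", "Won WS (4-3)", "Won Series (5-0-1)",
              "Did not make it"]:
    print(label, score_playoff_label(label))
```

As with the original function, `Won Series (5-0-1)` falls through to 0; whether that label deserves a nonzero score is a modelling decision left open here.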


In [146]:
# Apply the function to the Playoffs column
df['Playoff_Score'] = df['Playoffs'].apply(calculate_playoff_score)
print(df[['Playoffs', 'Playoff_Score']].head(10)) 

          Playoffs  Playoff_Score
0  Did not make it              0
1  Did not make it              0
2  Did not make it              0
3  Did not make it              0
4  Did not make it              0
5   Lost LDS (3-2)              2
6  Did not make it              0
7  Did not make it              0
8  Did not make it              0
9  Lost NLCS (4-0)              3


In [156]:
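# Function to separate managers and create new rows for each manager.
# Note: each entry keeps its record suffix (e.g. 'C.Hale (69-93)'), so these
# names do not match the bare names stored in manager_df['Manager_Name'].
def separate_managers(row):
    managers = row['Managers'].split(', ')
    return pd.DataFrame({
        'Manager_Name': managers,
        'Playoff_Score': [row['Playoff_Score']] * len(managers)
    })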

In [None]:
# Apply the function to each row in the DataFrame and concatenate the results
separated_managers_df = pd.concat(df.apply(separate_managers, axis=1).tolist(), ignore_index=True)

# Group by manager names and sum the Playoff_Score for each manager
manager_playoff_scores = separated_managers_df.groupby('Manager_Name')['Playoff_Score'].sum().reset_index()

# Rename columns for clarity
manager_playoff_scores.columns = ['Manager_Name', 'Total_Playoff_Score']


In [150]:
# Merge the total playoff scores back into manager_df
manager_df = manager_df.merge(manager_playoff_scores, on='Manager_Name', how='left')

# Fill any missing values with 0 (in case a manager never had a playoff score)
manager_df['Total_Playoff_Score'] = manager_df['Total_Playoff_Score'].fillna(0)

# Print the updated manager_df to verify
print(manager_df.head(10))


  Manager_Name          W          L  Win_Percentage  Total_Playoff_Score
0       C.Hale  74.000000  88.000000        0.456790                  0.0
1     K.Gibson  80.000000  82.000000        0.493827                  0.0
2      A.Hinch  78.333333  83.666667        0.483539                  0.0
3     B.Melvin  80.500000  81.500000        0.496914                  0.0
4     B.Brenly  81.250000  80.750000        0.501543                  0.0
5  B.Showalter  82.058824  76.000000        0.519166                  0.0
6   F.Gonzalez  81.500000  80.300000        0.503708                  0.0
7        B.Cox  88.000000  69.428571        0.558984                  0.0
8      R.Nixon  67.333333  94.000000        0.417355                  0.0
9     C.Tanner  77.277778  80.388889        0.490134                  0.0


In [152]:
# Check the number of unique playoff scores and print each unique response
unique_total_playoff_scores = manager_df['Total_Playoff_Score'].unique()

print(f"Number of unique total playoff scores: {len(unique_total_playoff_scores)}")
print("Unique total playoff scores:")
for response in unique_total_playoff_scores:
    print(response)

Number of unique total playoff scores: 1
Unique total playoff scores:
0.0
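
The all-zero result follows from a name mismatch: `separate_managers` keeps the `(W-L)` suffix on every name, while `manager_df` stores bare names, so the left merge finds no matches and the fill step turns every row into 0. The sketch below shows one possible fix, stripping the suffix before grouping. The rows, names, and scores are hypothetical, so the totals here say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical season rows with already-computed playoff scores.
seasons = pd.DataFrame({
    "Managers": ["A.Smith (95-67)", "A.Smith (20-30), B.Jones (40-72)", "B.Jones (90-72)"],
    "Playoff_Score": [5, 0, 3],
})
manager_df = pd.DataFrame({"Manager_Name": ["A.Smith", "B.Jones", "C.Lee"]})

# One row per manager entry, carrying the season's playoff score.
entries = seasons.assign(
    Manager_Name=seasons["Managers"].str.split(", ")
).explode("Manager_Name", ignore_index=True)

# Without this step the names still look like "A.Smith (95-67)" and the merge finds nothing.
entries["Manager_Name"] = entries["Manager_Name"].str.split("(").str[0].str.strip()

totals = (
    entries.groupby("Manager_Name", as_index=False)["Playoff_Score"].sum()
    .rename(columns={"Playoff_Score": "Total_Playoff_Score"})
)

merged = manager_df.merge(totals, on="Manager_Name", how="left")
merged["Total_Playoff_Score"] = merged["Total_Playoff_Score"].fillna(0)

print(merged)
```

Note that this sketch, like `separate_managers`, credits every manager in a multi-manager season with the full playoff score; crediting only the final manager, as `process_manager_row` does, would be a different design choice.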
