# MLB Predictive Analysis

### By David Montoto

## Abstract
This project uses Python to build machine learning models on historical Major League Baseball (MLB) data. It has two objectives: first, to build a predictive model that estimates the likelihood of a manager's success from historical performance metrics; second, to analyze how top players affect overall team success. Using a dataset spanning 1870 to 2016, we apply several machine learning techniques to predict managerial success and use regression analysis to explore the influence of key player performance. The results point to factors that drive team performance, with practical implications for team management and strategic decision-making in MLB. The work covers data preprocessing, exploratory data analysis, and model evaluation.

## Goal
The goal of this assignment is to use Python to build predictive models and carry out a detailed data analysis. Specifically, the project aims to:

1. Predict Managerial Success: Create a predictive model to determine the likelihood of a manager's success based on historical MLB data, using various machine learning techniques.

2. Analyze Player Impact: Examine how the performance of top players influences overall team success, using regression and feature importance analysis.

#### Data Cleaning and Preprocessing


In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

##### Load and Examine Data 

In [3]:
df = pd.read_csv('baseballdata.csv')

# Inspect first few rows
print(df.head())

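# Inspect dataset info (df.info() prints its report directly and returns None,
# so wrapping it in print() would add a stray "None" line)
df.info()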

   Unnamed: 0  Rk  Year                    Tm       Lg    G   W   L  Ties  \
0           1   1  2016  Arizona Diamondbacks  NL West  162  69  93     0   
1           2   2  2015  Arizona Diamondbacks  NL West  162  79  83     0   
2           3   3  2014  Arizona Diamondbacks  NL West  162  64  98     0   
3           4   4  2013  Arizona Diamondbacks  NL West  162  81  81     0   
4           5   5  2012  Arizona Diamondbacks  NL West  162  81  81     0   

    W.L.  ...    R   RA Attendance BatAge  PAge  X.Bat X.P  \
0  0.426  ...  752  890  2,036,216   26.7  26.4     50  29   
1  0.488  ...  720  713  2,080,145   26.6  27.1     50  27   
2  0.395  ...  615  742  2,073,730   27.6  28.0     52  25   
3  0.500  ...  685  695  2,134,895   28.1  27.6     44  23   
4  0.500  ...  734  688  2,177,617   28.3  27.4     48  23   

            Top.Player                               Managers  \
0       J.Segura (5.7)                         C.Hale (69-93)   
1  P.Goldschmidt (8.8)            
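
The `Top.Player` and `Managers` columns in the preview above pack a name and a number (or a win-loss record) into one string, so they likely need to be split before they can feed a model. The following is a minimal sketch of one way to do that, not necessarily the approach used later in this notebook. It assumes every row follows the `Name (value)` and `Name (W-L)` patterns seen above and lists a single manager; the output column names (`top_player_name`, `top_player_value`, `manager_name`, `manager_wins`, `manager_losses`) are hypothetical, and the preview does not say what the player value measures. The second `Managers` entry in the sample frame is made up for illustration.

```python
import pandas as pd

# Small stand-in frame; only the first Managers value comes from the preview,
# the second one is a made-up placeholder.
sample = pd.DataFrame({
    "Top.Player": ["J.Segura (5.7)", "P.Goldschmidt (8.8)"],
    "Managers": ["C.Hale (69-93)", "A.Example (81-81)"],
})

# "Name (value)" -> name and numeric value (hypothetical column names)
player = sample["Top.Player"].str.extract(
    r"^(?P<top_player_name>.+?)\s*\((?P<top_player_value>-?\d+(?:\.\d+)?)\)\s*$"
)
player["top_player_value"] = pd.to_numeric(player["top_player_value"])

# "Name (W-L)" -> name, wins, losses; assumes one manager per row
manager = sample["Managers"].str.extract(
    r"^(?P<manager_name>.+?)\s*\((?P<manager_wins>\d+)-(?P<manager_losses>\d+)\)"
)
manager["manager_wins"] = pd.to_numeric(manager["manager_wins"])
manager["manager_losses"] = pd.to_numeric(manager["manager_losses"])

parsed = pd.concat([player, manager], axis=1)
print(parsed)
```

Rows that do not match a pattern come back as `NaN`, which makes it easy to spot entries (for example, seasons with more than one manager) that need separate handling.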

In [5]:
# 1.2 Handle Missing Values
# Check for missing values
print(df.isnull().sum())

Unnamed: 0       0
Rk               0
Year             0
Tm               0
Lg               0
G                0
W                0
L                0
Ties             0
W.L.             0
pythW.L.         0
Finish           0
GB               0
Playoffs      2163
R                0
RA               0
Attendance      74
BatAge           0
PAge             0
X.Bat            0
X.P              0
Top.Player       0
Managers         0
current          0
dtype: int64
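
Besides `Playoffs`, the only column with gaps is `Attendance` (74 missing). The preview also shows its values with thousands separators (for example `2,036,216`), which suggests the column is stored as text. A minimal sketch of converting it to a numeric type is shown here on a small stand-in frame; the name `attendance_num` is hypothetical, and how to treat the missing seasons (drop, impute, or leave as `NaN`) is left open because it depends on the model.

```python
import pandas as pd

# Stand-in frame: two values copied from the preview plus one missing entry
sample = pd.DataFrame({"Attendance": ["2,036,216", "2,080,145", None]})

# Strip the thousands separators, then convert; unparseable entries become NaN
attendance_num = pd.to_numeric(
    sample["Attendance"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(attendance_num)
print("Missing after conversion:", attendance_num.isna().sum())
```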


In [13]:
# 1.3 Replace missing values in the 'Playoffs' column with 'Did not make it'
# Assign the result back: fillna(inplace=True) on a selected column raises a
# chained-assignment warning in recent pandas and may not modify df
df['Playoffs'] = df['Playoffs'].fillna('Did not make it')
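
After the fill, a quick check can confirm that no missing values remain, and the filled column can be reduced to a binary flag if a model needs one. This is a sketch on a stand-in frame: the `made_playoffs` column name is hypothetical, and the non-missing playoff label in the sample is a made-up placeholder because the real labels are not shown above.

```python
import pandas as pd

# Stand-in frame; "Some playoff result" is a placeholder, not a real label
sample = pd.DataFrame({"Playoffs": [None, "Some playoff result", None]})

# Same fill as in the cell above, applied to the stand-in frame
sample["Playoffs"] = sample["Playoffs"].fillna("Did not make it")

# Confirm that the column has no missing values left
assert sample["Playoffs"].isna().sum() == 0

# Hypothetical binary flag: 1 if the team reached the playoffs, else 0
sample["made_playoffs"] = (sample["Playoffs"] != "Did not make it").astype(int)

print(sample["Playoffs"].value_counts())
print(sample)
```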