## Playoff Prediction Machine Learning
Training a model to predict whether an MLB team will make the playoffs, based on its team-level statistics.

In [1]:
# Dependencies

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras.utils import to_categorical
from keras.models import Sequential

from keras.layers import Dense
from keras.callbacks import EarlyStopping

Using TensorFlow backend.


In [2]:
# Read in csv to pandas
team_df = pd.read_csv('../assets/data/mlb_stats.csv')
print(team_df.head())
print(len(team_df))
print(team_df.columns)

  franchid                name     city state                        curname  \
0      ANA  Los Angeles Angels  Anaheim    CA  Los Angeles Angels of Anaheim   
1      ANA  Los Angeles Angels  Anaheim    CA  Los Angeles Angels of Anaheim   
2      ANA  Los Angeles Angels  Anaheim    CA  Los Angeles Angels of Anaheim   
3      ANA  Los Angeles Angels  Anaheim    CA  Los Angeles Angels of Anaheim   
4      ANA   California Angels  Anaheim    CA  Los Angeles Angels of Anaheim   

  lgid  Year    G   w   l  ...  sho  sv ipouts    ha  hra  bba  soa    e   dp  \
0   AL  1961  162  70  91  ...    5  34   4314  1391  180  713  973  192  154   
1   AL  1962  162  86  76  ...   15  47   4398  1412  118  616  858  175  153   
2   AL  1963  161  70  91  ...   13  31   4365  1317  120  578  889  163  155   
3   AL  1964  162  82  80  ...   28  41   4350  1273  100  530  965  138  168   
4   AL  1965  162  75  87  ...   14  33   4323  1259   91  563  847  123  149   

      fp  
0  0.969  
1  0.972  

In [3]:
# Create runs per game and runs allowed per game from r, ra, and G
team_df['rpg'] = team_df['r'] / team_df['G']
team_df['rapg'] = team_df['ra'] / team_df['G']

In [4]:
# Drop unnecessary/unwanted columns

dropped_columns = ['w', 'franchid', 'city', 'state', 'curname', 'lgid', 'G', 'l',
                   'divwin', 'wcwin', 'lgwin', 'wswin', 'r', 'ra']
df = team_df.drop(dropped_columns, axis=1)

df.head()

print(df.columns)

Index(['name', 'Year', 'postseason', 'ab', 'h', '2B', '3B', 'hr', 'bb', 'so',
       'sb', 'cs', 'hbp', 'sf', 'er', 'era', 'cg', 'sho', 'sv', 'ipouts', 'ha',
       'hra', 'bba', 'soa', 'e', 'dp', 'fp', 'rpg', 'rapg'],
      dtype='object')


## Data Cleaning

In [5]:
# See surface level correlation between postseason and other statistics
import seaborn as sns
import matplotlib.pyplot as plt

# Correlate the numeric columns only ('name' is a string column)
corrmat = df.select_dtypes(include='number').corr()
plt.figure(figsize=(20, 20))

# Plot heat map
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

In [6]:
# Find null values
print(df.isnull().sum(axis=0).tolist())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 200, 838, 838, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Null values appear in caught stealing (cs, 200 missing), hit by pitch (hbp, 838 missing), and sacrifice flies (sf, 838 missing).  We can drop caught stealing altogether, as it is not an especially important metric.  However, we need hbp and sf to compute OBP and OPS later, so we fill their null values with the median of each column.  This does not account for changes in play style across eras, but the model does not account for era in any case, so this should be acceptable.

In [7]:
# Delete caught stealing from dataframe

df.drop('cs', axis=1, inplace=True)

In [8]:
# Fill null values in hbp and sf

df['hbp'] = df['hbp'].fillna(df['hbp'].median())
df['sf'] = df['sf'].fillna(df['sf'].median())
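
As a small illustration of what the median fill does, here is a minimal sketch on a toy frame. The frame `toy` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only
toy = pd.DataFrame({
    "Year": [1901, 1905, 1995, 1998, 1999],
    "hbp": [float("nan"), float("nan"), 40.0, 60.0, 50.0],
})

# median() skips NaN by default, so it is computed from the 3 known values
median_hbp = toy["hbp"].median()

# Every missing row receives the same value, whatever its season
filled = toy["hbp"].fillna(median_hbp)

print(median_hbp)
print(filled.tolist())
```

Both early-season rows receive the median of the later seasons, which is the era-blind behaviour described above.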

## Model Training (Base Stats)
For the first model, we train on the numerical columns as given (with null values imputed) and no further feature engineering.

In [9]:
# Create base dataframe for modeling that drops team name and Year

base_df = df.drop(['name', 'Year'], axis=1)

In [10]:
base_X = base_df.drop("postseason", axis = 1)
base_Y = base_df["postseason"]

print(base_X.shape, base_Y.shape)

(2192, 25) (2192,)


In [11]:
base_X_train, base_X_test, base_Y_train, base_Y_test = train_test_split(
    base_X, base_Y, random_state=1, stratify=base_Y)

base_X_scaler = StandardScaler().fit(base_X_train)
base_X_train_scaled = base_X_scaler.transform(base_X_train)
base_X_test_scaled = base_X_scaler.transform(base_X_test)

In [12]:
# Convert encoded labels to one-hot-encoding

base_Y_train_categorical = to_categorical(base_Y_train)
base_Y_test_categorical = to_categorical(base_Y_test)
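
For readers unfamiliar with one-hot encoding, the following sketch reproduces the idea in plain NumPy for a binary label. The `labels` array is a made-up example, and this is an illustration of the encoding, not the Keras implementation.

```python
import numpy as np

# Hypothetical postseason flags (0 = missed, 1 = made it)
labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype="float32")[labels]

print(one_hot)

# argmax along each row recovers the original class index
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each label becomes a two-element row, matching the two softmax outputs of the network.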

In [13]:
# Create model and add layers

base_model = Sequential()
base_model.add(Dense(units=100, activation='relu', input_dim=25))
base_model.add(Dense(units=100, activation='relu'))
base_model.add(Dense(units=2, activation='softmax'))
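
As a rough size check, the number of trainable parameters in this stack of Dense layers can be worked out by hand: each layer has (inputs × units) weights plus one bias per unit. The helper name `dense_param_count` is hypothetical and not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters in a stack of fully connected layers.

    layer_sizes lists the input width followed by each layer's unit count.
    Each layer contributes (inputs * units) weights plus `units` biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# 25 input features -> 100 units -> 100 units -> 2 softmax outputs
base_params = dense_param_count([25, 100, 100, 2])
print(base_params)
```

With 2,192 rows and the default 75/25 split, about 1,644 rows are used for training, so the network has several times more parameters than training examples. That may be worth keeping in mind when comparing training accuracy with test loss.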

In [14]:
# Compile and fit the model
base_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

base_model.fit(
    base_X_train_scaled,
    base_Y_train_categorical,
    epochs=1000,
    shuffle=True,
    verbose=2,
    callbacks=[EarlyStopping(monitor='accuracy', patience=75, verbose=2)]
)

Epoch 1/1000
 - 0s - loss: 0.4050 - accuracy: 0.8273
Epoch 2/1000
 - 0s - loss: 0.2644 - accuracy: 0.8887
Epoch 3/1000
 - 0s - loss: 0.2444 - accuracy: 0.8990
Epoch 4/1000
 - 0s - loss: 0.2227 - accuracy: 0.9039
Epoch 5/1000
 - 0s - loss: 0.2181 - accuracy: 0.9075
Epoch 6/1000
 - 0s - loss: 0.2064 - accuracy: 0.9106
Epoch 7/1000
 - 0s - loss: 0.1968 - accuracy: 0.9167
Epoch 8/1000
 - 0s - loss: 0.1914 - accuracy: 0.9179
Epoch 9/1000
 - 0s - loss: 0.1795 - accuracy: 0.9270
Epoch 10/1000
 - 0s - loss: 0.1753 - accuracy: 0.9270
Epoch 11/1000
 - 0s - loss: 0.1655 - accuracy: 0.9337
Epoch 12/1000
 - 0s - loss: 0.1605 - accuracy: 0.9380
Epoch 13/1000
 - 0s - loss: 0.1547 - accuracy: 0.9373
Epoch 14/1000
 - 0s - loss: 0.1471 - accuracy: 0.9410
Epoch 15/1000
 - 0s - loss: 0.1451 - accuracy: 0.9367
Epoch 16/1000
 - 0s - loss: 0.1357 - accuracy: 0.9440
Epoch 17/1000
 - 0s - loss: 0.1260 - accuracy: 0.9483
Epoch 18/1000
 - 0s - loss: 0.1204 - accuracy: 0.9489
Epoch 19/1000
 - 0s - loss: 0.1177 - 

<keras.callbacks.callbacks.History at 0x19ff6819c88>

In [15]:
# Evaluate the model using test data split
base_model_loss, base_model_accuracy = base_model.evaluate(
    base_X_test_scaled, base_Y_test_categorical, verbose=2)

print(
    f"Base Normal Neural Network - Loss: {base_model_loss}, Accuracy: {base_model_accuracy}")

Base Normal Neural Network - Loss: 0.9016347971275775, Accuracy: 0.8759124279022217


In [16]:
base_encoded_predictions = base_model.predict_classes(base_X_test_scaled[:25])
# base_prediction_labels = base_label_encoder.inverse_transform(base_encoded_predictions)

In [17]:
# print(f"Base Predicted classes: {list(base_prediction_labels)}")
print(f"Base Predicted classes: {list(base_encoded_predictions)}")
print(f"Base Actual Labels:     {list(base_Y_test[:25])}")

Base Predicted classes: [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Base Actual Labels:     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]


In [18]:
# base_df_table = pd.DataFrame(base_prediction_labels, columns=['Predicted'])
base_df_table = pd.DataFrame(base_encoded_predictions, columns=['Predicted'])
base_df_table.insert(loc=1, column='Actual', value= list(base_Y_test[:25]))

In [19]:
base_df_table

|   | Predicted | Actual |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 1 | 0 |


## Model Training (Advanced Stats)
This time we will create columns for more advanced statistics and run the model again using those.  Each of these advanced statistics is normalized in some way (by games played for rundif, at bats for batting average and slugging percentage, plate appearances for OBP, and innings pitched for WHIP).
Ideally, these normalizations will allow the model to be evaluated at any point in the season, not just at the end, when we already know who made the postseason.

Many other advanced statistics are thought to be better indicators of team and player performance, but they cannot be calculated from the statistics in this dataset.

In [20]:
# Create columns for advanced statistics (OBP, OPS, WHIP) as well as run differential and batting average
# The dataset has no sacrifice hits, but the OBP denominator (ab + bb + hbp + sf) does not use them, so sac flies suffice

# Run Differential per game
df['rundif'] = df['rpg'] - df['rapg']

# Batting Average
df['ave'] = df['h'] / df['ab']

# On Base Percent (OBP) = (hits + walks + hbp)/plate appearances
plate_app = (df['ab'] + df['bb'] + df['sf'] +df['hbp'])
df['obp'] = (df['h'] + df['bb'] + df['hbp']) / plate_app

# Slugging Percent
singles = ((df['h'] - df['2B']) - df['3B']) - df['hr']
df['slug_percent'] = ((df['hr']*4) + (df['3B']*3) + (df['2B']*2) + singles) / df['ab']

# On Base plus Slugging (OPS)
df['ops'] = df['obp'] + df['slug_percent']

# Walks plus Hits per Inning Pitched (whip)
df['whip'] = (df['bba'] + df['ha']) / (df['ipouts']/3)
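
To make the formulas concrete, here is a worked example on a single invented team season. The function names and all input numbers are hypothetical and chosen only to be plausible; the arithmetic mirrors the column expressions above.

```python
def slugging(h, doubles, triples, hr, ab):
    """Total bases divided by at bats."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab


def on_base(h, bb, hbp, sf, ab):
    """(hits + walks + hit by pitch) / (ab + bb + hbp + sf)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def whip(bba, ha, ipouts):
    """Walks plus hits allowed per inning pitched (3 outs per inning)."""
    return (bba + ha) / (ipouts / 3)


# Invented season totals for illustration
slg = slugging(h=1400, doubles=280, triples=30, hr=180, ab=5500)
obp = on_base(h=1400, bb=500, hbp=50, sf=45, ab=5500)
ops = obp + slg
team_whip = whip(bba=480, ha=1350, ipouts=4350)

print(round(slg, 3), round(obp, 3), round(ops, 3), round(team_whip, 3))
```

Because every quantity is a ratio, the same functions apply to partial-season totals, which is the property the normalization is meant to provide.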

In [21]:
df.columns

Index(['name', 'Year', 'postseason', 'ab', 'h', '2B', '3B', 'hr', 'bb', 'so',
       'sb', 'hbp', 'sf', 'er', 'era', 'cg', 'sho', 'sv', 'ipouts', 'ha',
       'hra', 'bba', 'soa', 'e', 'dp', 'fp', 'rpg', 'rapg', 'rundif', 'ave',
       'obp', 'slug_percent', 'ops', 'whip'],
      dtype='object')

In [22]:
adv_df = df.drop(['rpg', 'ab', 'h', '2B', '3B', 'hr', 
                  'bb', 'so', 'sb', 'hbp', 'sf', 'rapg', 'er', 'era',
                  'cg', 'sho', 'sv', 'ipouts', 'ha', 'hra', 'bba', 'soa', 'e', 'dp', 'fp',
                  'ave','obp', 'slug_percent'
                 ], axis=1)

In [23]:
adv_df.tail()

|      | name | Year | postseason | rundif | ops | whip |
|---|---|---|---|---|---|---|
| 2187 | Washington Nationals | 2014 | 1 | 0.808642 | 0.713987 | 1.157978 |
| 2188 | Washington Nationals | 2015 | 0 | 0.419753 | 0.723559 | 1.205855 |
| 2189 | Washington Nationals | 2016 | 1 | 0.932099 | 0.751397 | 1.192053 |
| 2190 | Washington Nationals | 2017 | 1 | 0.907407 | 0.781506 | 1.240783 |
| 2191 | Washington Nationals | 2018 | 0 | 0.549383 | 0.753405 | 1.249654 |


In [24]:
# Split off the 2018 season from adv_df and make a new df_2018

df_2018 = adv_df[adv_df['Year'] == 2018]
adv_df = adv_df[adv_df['Year'] != 2018]

# Delete Year and Team from adv_df

# Reassign instead of dropping in place: adv_df is a filtered slice of the earlier frame
adv_df = adv_df.drop(['name', 'Year'], axis=1)

In [25]:
adv_X = adv_df.drop("postseason", axis = 1)
adv_Y = adv_df["postseason"]

print(adv_X.shape, adv_Y.shape)

(2162, 3) (2162,)


In [26]:
adv_X_train, adv_X_test, adv_Y_train, adv_Y_test = train_test_split(
    adv_X, adv_Y, random_state=1, stratify=adv_Y)

adv_X_scaler = StandardScaler().fit(adv_X_train)
adv_X_train_scaled = adv_X_scaler.transform(adv_X_train)
adv_X_test_scaled = adv_X_scaler.transform(adv_X_test)

In [27]:
# Convert encoded labels to one-hot-encoding
adv_Y_train_categorical = to_categorical(adv_Y_train)
adv_Y_test_categorical = to_categorical(adv_Y_test)

In [28]:
# Create model and add layers

adv_model = Sequential()
adv_model.add(Dense(units=100, activation='relu', input_dim=3))
adv_model.add(Dense(units=100, activation='relu'))
adv_model.add(Dense(units=2, activation='softmax'))

In [29]:
# Compile and fit the model

adv_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

adv_model.fit(
    adv_X_train_scaled,
    adv_Y_train_categorical,
    epochs=1000,
    shuffle=True,
    verbose=0,
    callbacks=[EarlyStopping(monitor='accuracy', patience=75, verbose=2)]
)

<keras.callbacks.callbacks.History at 0x19ff7fe79b0>

In [30]:
# Evaluate the model using test data split

adv_model_loss, adv_model_accuracy = adv_model.evaluate(
    adv_X_test_scaled, adv_Y_test_categorical, verbose=2)

print(
    f"Advanced Stats Normal Neural Network - Loss: {adv_model_loss}, Accuracy: {adv_model_accuracy}")

Advanced Stats Normal Neural Network - Loss: 1.0895163140874253, Accuracy: 0.8336414098739624


In [31]:
adv_encoded_predictions = adv_model.predict_classes(adv_X_test_scaled[:25])
# adv_prediction_labels = adv_label_encoder.inverse_transform(adv_encoded_predictions)

In [32]:
# print(f"Advanced Stats Predicted classes: {list(adv_prediction_labels)}")
print(f"Advanced Stats Predicted classes: {list(adv_encoded_predictions)}")
print(f"Advanced Stats Actual Labels:     {list(adv_Y_test[:25])}")

Advanced Stats Predicted classes: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
Advanced Stats Actual Labels:     [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]


In [33]:
# adv_df_table = pd.DataFrame(adv_prediction_labels, columns=['Predicted'])
adv_df_table = pd.DataFrame(adv_encoded_predictions, columns=['Predicted'])
adv_df_table.insert(loc=1, column='Actual', value= list(adv_Y_test[:25]))

In [34]:
adv_df_table

|   | Predicted | Actual |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |


## Comparison

In [35]:
print(f'Base Stats Normal Neural Network - Loss {base_model_loss}, Accuracy: {base_model_accuracy}')
print(f"Advanced Stats Normal Neural Network - Loss: {adv_model_loss}, Accuracy: {adv_model_accuracy}")

Base Stats Normal Neural Network - Loss 0.9016347971275775, Accuracy: 0.8759124279022217
Advanced Stats Normal Neural Network - Loss: 1.0895163140874253, Accuracy: 0.8336414098739624


## 2018 Test Case
We now evaluate the model on the 2018 season data that we held out of adv_df.  We then join the resulting predicted/actual table with df_2018 to show the results.

In [36]:
# Create Test 2018 dataframe

test_df_2018 = df_2018.drop(['name', 'Year'], axis=1)

# Ready Test 2018 dataframe for predicting

X_2018 = test_df_2018.drop('postseason', axis = 1)
Y_2018 = test_df_2018['postseason']

print(X_2018.shape, Y_2018.shape)

# Scale X data

X_2018_scaled = StandardScaler().fit(X_2018).transform(X_2018)

# One-Hot-Encoding

Y_2018_categorical = to_categorical(Y_2018) 

(30, 3) (30,)
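
The cell above fits a new StandardScaler on the 30 rows from 2018 rather than reusing adv_X_scaler. The sketch below uses invented one-feature data, with the standardization written out in NumPy instead of scikit-learn, to show how the two choices differ. Whether this changes the 2018 predictions here is not something the notebook shows.

```python
import numpy as np

# Invented single-feature data; not taken from the notebook
train = np.array([[0.0], [2.0], [4.0]])
holdout = np.array([[4.0], [6.0]])

# Statistics learned from the training rows (population std, ddof=0)
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# Option A: reuse the training statistics on the held-out rows
reused = (holdout - mu) / sigma

# Option B: refit on the held-out rows themselves
refit = (holdout - holdout.mean(axis=0)) / holdout.std(axis=0)

print(reused.ravel())
print(refit.ravel())
```

Refitting recenters the held-out rows on their own mean, so an above-average season looks average to the model. Reusing the training statistics keeps the inputs on the scale the model saw during training.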


In [37]:
encoded_predictions_2018 = adv_model.predict_classes(X_2018_scaled[:30])
# prediction_labels_2018 = 

In [38]:
df_2018_table = pd.DataFrame(encoded_predictions_2018, columns=['Predicted'])
df_2018_table.insert(loc=1, column='Actual', value=list(Y_2018[:30]))

In [39]:
names = df_2018['name'].tolist()

df_2018_table.insert(0, 'Team', names)
df_2018_table['Year'] = 2018
df_2018_table.set_index('Year', drop=True, inplace=True)

## Result of 2018 Test Case

In [40]:
df_2018_table

| Year | Team | Predicted | Actual |
|---|---|---|---|
| 2018 | Los Angeles Angels of Anaheim | 0 | 0 |
| 2018 | Arizona Diamondbacks | 0 | 0 |
| 2018 | Atlanta Braves | 0 | 1 |
| 2018 | Baltimore Orioles | 0 | 0 |
| 2018 | Boston Red Sox | 1 | 1 |
| 2018 | Chicago Cubs | 0 | 1 |
| 2018 | Chicago White Sox | 0 | 0 |
| 2018 | Cincinnati Reds | 0 | 0 |
| 2018 | Cleveland Indians | 1 | 1 |
| 2018 | Colorado Rockies | 1 | 1 |


Our model correctly predicted that the Red Sox, Indians, Rockies, Astros, Dodgers, and A's would make the 2018 postseason.

It incorrectly predicted that the Mariners and Rays would make the postseason, and that the Braves, Cubs, Brewers, and Yankees would miss it when they in fact qualified.

On this run, the model produced 6 true positives, 2 false positives, and 4 false negatives.
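
These counts can be turned into summary metrics. The sketch below uses only the counts stated above and the 30 teams in the 2018 frame; the variable names are illustrative.

```python
# Counts reported for the 2018 test case
tp, fp, fn = 6, 2, 4
total_teams = 30

# Every remaining team was correctly predicted to miss the postseason
tn = total_teams - tp - fp - fn

accuracy = (tp + tn) / total_teams
precision = tp / (tp + fp)   # share of predicted playoff teams that made it
recall = tp / (tp + fn)      # share of actual playoff teams that were found

print(f"TN={tn}, accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Since 20 of 30 teams miss the postseason, accuracy alone is flattered by the majority class, so precision and recall give a fuller picture.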

We classify this as a massive success, because a model that predicts that the Yankees will miss the playoffs is a model worth trusting.