# Game of Zones: Using ML to Predict the Outcome of an MLB At Bat

Goal: predict the zone of the field where the batter is most likely to hit the ball in an at-bat, given the game situation and the pitcher he is facing.

Input Data: 
- the pitcher's repertoire: each pitcher has a different arsenal of pitches and each pitch moves differently, so we use a cluster analysis to categorize pitch types. This puts every pitcher on the same footing.

- pitcher stats such as groundball and flyball rates

- the game situation: the inning (and top/bottom), the number of outs, positions of baserunners, the count, positions of fielders(?)

- the batter's priors: distribution of batted balls into zones

- any other batter data?

Output: 
- probabilities for each zone on the field where the batter can hit the ball

- contributing factors for each prediction (things the defensive team could use to intervene)

In [12]:
%load_ext autoreload
%autoreload 2
from pybaseball import statcast
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
from matplotlib import patches
%matplotlib inline

from get_data import get_data, get_hit_zone, get_pitch_data, get_situation_data
from pitch_clustering import pitch_clustering
from batter_zone import get_batter_zone_data

# use Statcast data (2015-2018 available) so we can get spin rate, etc.
# only the 2015 season is enabled; uncomment the other seasons to extend the range
train_data_dates = [
    ('2015-04-05', '2015-10-04'),      # 2015 data
    # ('2016-04-03', '2016-10-02'),    # 2016 data
    # ('2017-04-02', '2017-10-01'),    # 2017 data
    # ('2018-03-29', '2018-10-01'),    # 2018 data
]


### Build the Outcome

In [2]:
# get the outcome data
outcome_data = get_data(get_hit_zone, train_data_dates)

# write to file
outcome_data.to_csv("./outcome.csv", index=False)

print(f"Shape of the outcome data: {outcome_data.shape}")
outcome_data.head()

This is a large query, it may take a moment to complete
Shape of the outcome data: (87199, 5)


| | game_pk | index | batter | pitcher | hit_zone |
|---|---|---|---|---|---|
| 0 | 416079 | 247 | 150029 | 544727 | 4 |
| 2 | 416079 | 262 | 547180 | 544727 | 2 |
| 10 | 416079 | 341 | 543685 | 544727 | 1 |
| 15 | 416079 | 404 | 502517 | 595014 | 2 |
| 24 | 416079 | 519 | 434158 | 595014 | 4 |


### Build the Pitcher Data

In [15]:
# get the pitch data (per-pitcher mix of clustered pitch types)
pitch_data = get_data(get_pitch_data, train_data_dates)

# write to file
pitch_data.to_csv("./pitch.csv", index=False)

# print the number of pitchers in the data set
print(f"Number of pitchers in the data: {len(pitch_data['pitcher'].unique())}")

print(f"Shape of training data: {pitch_data.shape}")

pitch_data.head()

This is a large query, it may take a moment to complete
Performing PCA and K-Means Clustering...
Number of components from PCA: 7
The optimal number of K-Means clusters is: 13
Number of pitchers in the data: 721
Shape of training data: (721, 15)


| | pitcher | player_name | pitch_type_0 | pitch_type_1 | pitch_type_2 | pitch_type_3 | pitch_type_4 | pitch_type_5 | pitch_type_6 | pitch_type_7 | pitch_type_8 | pitch_type_9 | pitch_type_10 | pitch_type_11 | pitch_type_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 112526 | Bartolo Colon | 0.0245555 | 0.0897544 | 0.106266 | 0.0808637 | 0.0364098 | 0.132091 | 0.0779001 | 0.0986452 | 0.0719729 | 0.0211685 | 0.0351397 | 0.0321761 | 0.193057 |
| 1 | 115629 | LaTroy Hawkins | 0.0427928 | 0.0923423 | 0.101351 | 0.0945946 | 0.0495495 | 0.0923423 | 0.0990991 | 0.103604 | 0.0675676 | 0.0518018 | 0.0382883 | 0.0292793 | 0.137387 |
| 2 | 136600 | Bruce Chen | 0.0 | 0.218487 | 0.0756303 | 0.134454 | 0.0252101 | 0.12605 | 0.0 | 0.134454 | 0.00840336 | 0.0504202 | 0.00840336 | 0.0756303 | 0.142857 |
| 3 | 150116 | Randy Wolf | 0.0362069 | 0.087931 | 0.134483 | 0.0827586 | 0.125862 | 0.198276 | 0.0672414 | 0.0517241 | 0.0603448 | 0.0344828 | 0.037931 | 0.0413793 | 0.0413793 |
| 4 | 150302 | Jason Marquis | 0.00922509 | 0.0701107 | 0.0571956 | 0.129151 | 0.0369004 | 0.0811808 | 0.0774908 | 0.197417 | 0.0738007 | 0.0424354 | 0.0332103 | 0.0424354 | 0.149446 |
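
The `pitch_type_*` columns look like per-pitcher fractions of pitches falling into each of the 13 clusters. The `pitch_clustering` module itself is not shown, so the following is only a sketch of how such a table could be produced. The function name `pitcher_cluster_mix`, the 95% variance threshold for PCA, the fixed cluster count, and the synthetic feature columns are all assumptions, and the search for the optimal number of clusters is left out.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def pitcher_cluster_mix(pitches, feature_cols, n_clusters, random_state=0):
    """Hypothetical helper: one row per pitcher, fraction of pitches per cluster."""
    pipeline = make_pipeline(
        StandardScaler(),                 # put all pitch features on the same scale
        PCA(n_components=0.95),           # assumed threshold: keep 95% of the variance
        KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state),
    )
    labels = pipeline.fit_predict(pitches[feature_cols])

    # rows: pitchers, columns: clusters, values: share of that pitcher's pitches
    mix = pd.crosstab(pitches["pitcher"], labels, normalize="index")
    mix.columns = [f"pitch_type_{c}" for c in mix.columns]
    return mix.reset_index()


# synthetic pitch-level data standing in for the Statcast query
rng = np.random.default_rng(0)
feature_cols = ["release_speed", "pfx_x", "pfx_z", "release_spin_rate"]
demo_pitches = pd.DataFrame(rng.normal(size=(600, 4)), columns=feature_cols)
demo_pitches["pitcher"] = rng.integers(1, 6, size=600)

demo_mix = pitcher_cluster_mix(demo_pitches, feature_cols, n_clusters=3)
print(demo_mix)
```

Each row of the result sums to 1, which matches the rows in the table above.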


### Build the Batter's Prior Zone Distribution

In [16]:
# use the hit zone data from the outcome (calculated above)
batter_zone_data = pd.read_csv("./outcome.csv")

batter_zone_data_pct = get_batter_zone_data(batter_zone_data)

batter_zone_data_pct.to_csv("./batter_zones.csv", index=False)

print(batter_zone_data_pct.shape)
batter_zone_data_pct.head()

(865, 5)


| | batter | batter_zone_1 | batter_zone_2 | batter_zone_3 | batter_zone_4 |
|---|---|---|---|---|---|
| 0 | 112526 | 0.105263 | 0.421053 | 0.263158 | 0.210526 |
| 1 | 116338 | 0.114094 | 0.278523 | 0.342282 | 0.265101 |
| 2 | 120074 | 0.24924 | 0.024316 | 0.416413 | 0.31003 |
| 3 | 121347 | 0.056738 | 0.319149 | 0.283688 | 0.340426 |
| 4 | 133380 | 0.06129 | 0.232258 | 0.367742 | 0.33871 |
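
The body of `get_batter_zone_data` is not shown. Since each row above sums to 1, it presumably computes the share of a batter's batted balls that landed in each zone. A minimal sketch under that assumption (the function name `batter_zone_priors` and the demo data are hypothetical):

```python
import pandas as pd


def batter_zone_priors(outcomes, zones=(1, 2, 3, 4)):
    """Hypothetical helper: per-batter fraction of batted balls in each zone."""
    # count batted balls per (batter, zone); make sure every zone has a column
    counts = pd.crosstab(outcomes["batter"], outcomes["hit_zone"])
    counts = counts.reindex(columns=list(zones), fill_value=0)

    # divide each row by its total so the zone shares sum to 1
    priors = counts.div(counts.sum(axis=1), axis=0)
    priors.columns = [f"batter_zone_{z}" for z in zones]
    return priors.reset_index()


demo_outcomes = pd.DataFrame({
    "batter":   [10, 10, 10, 10, 20, 20],
    "hit_zone": [1, 3, 3, 4, 2, 2],
})
demo_priors = batter_zone_priors(demo_outcomes)
print(demo_priors)
```

One caveat: the priors in this notebook are computed from all of `outcome.csv` before the train/test split, so each test row contributes to its own batter's prior. Computing the priors from the training rows only would keep the test set clean.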


### Get the Game Situation Features

In [17]:
situation_data = get_data(get_situation_data, train_data_dates)

# write to file
situation_data.to_csv("./situation.csv", index=False)

print(situation_data.shape)
situation_data.head()

This is a large query, it may take a moment to complete
(101234, 14)


| | game_pk | index | batter | pitcher | balls | strikes | outs_when_up | inning | on_1b | on_2b | on_3b | bat_right | pitch_right | score_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 416079 | 247 | 150029 | 544727 | 0.0 | 1.0 | 2.0 | 9.0 | False | True | False | True | True | -1.0 |
| 2 | 416079 | 262 | 547180 | 544727 | 0.0 | 0.0 | 2.0 | 9.0 | False | False | False | False | True | -1.0 |
| 10 | 416079 | 341 | 543685 | 544727 | 3.0 | 1.0 | 0.0 | 9.0 | False | False | False | True | True | -1.0 |
| 15 | 416079 | 404 | 502517 | 595014 | 2.0 | 1.0 | 1.0 | 8.0 | True | False | False | False | True | 1.0 |
| 24 | 416079 | 519 | 434158 | 595014 | 2.0 | 1.0 | 1.0 | 8.0 | False | False | False | False | True | 0.0 |


### Combine the Game Situation, Pitcher and Batter Features along with the Outcome

In [19]:
game_situation_df = pd.read_csv("./situation.csv")

pitch_type_df = pd.read_csv("./pitch.csv")
pitch_type_df.drop('player_name', axis=1, inplace=True)
pitch_type_df['pitcher'] = pitch_type_df['pitcher'].astype(int)

batter_zone_df = pd.read_csv("./batter_zones.csv")

outcome_df = pd.read_csv("./outcome.csv")

full_data = pd.merge(game_situation_df, pitch_type_df, on="pitcher")
full_data = pd.merge(full_data, batter_zone_df, on="batter")
full_data = pd.merge(outcome_df, full_data, on=['game_pk', 'index', 'batter', 'pitcher'])

print(full_data.shape)
full_data.head()

(87102, 32)


| | game_pk | index | batter | pitcher | hit_zone | balls | strikes | outs_when_up | inning | on_1b | ... | pitch_type_7 | pitch_type_8 | pitch_type_9 | pitch_type_10 | pitch_type_11 | pitch_type_12 | batter_zone_1 | batter_zone_2 | batter_zone_3 | batter_zone_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 416079 | 247 | 150029 | 544727 | 4 | 0.0 | 1.0 | 2.0 | 9.0 | False | ... | 0.088235 | 0.073022 | 0.074037 | 0.032454 | 0.036511 | 0.126775 | 0.089888 | 0.219101 | 0.376404 | 0.314607 |
| 1 | 416079 | 262 | 547180 | 544727 | 2 | 0.0 | 0.0 | 2.0 | 9.0 | False | ... | 0.088235 | 0.073022 | 0.074037 | 0.032454 | 0.036511 | 0.126775 | 0.263345 | 0.046263 | 0.409253 | 0.281139 |
| 2 | 416079 | 341 | 543685 | 544727 | 1 | 3.0 | 1.0 | 0.0 | 9.0 | False | ... | 0.088235 | 0.073022 | 0.074037 | 0.032454 | 0.036511 | 0.126775 | 0.097561 | 0.280488 | 0.408537 | 0.213415 |
| 3 | 416079 | 404 | 502517 | 595014 | 2 | 2.0 | 1.0 | 1.0 | 8.0 | True | ... | 0.103785 | 0.083028 | 0.017094 | 0.045177 | 0.034188 | 0.148962 | 0.264881 | 0.089286 | 0.333333 | 0.3125 |
| 4 | 416079 | 519 | 434158 | 595014 | 4 | 2.0 | 1.0 | 1.0 | 8.0 | False | ... | 0.103785 | 0.083028 | 0.017094 | 0.045177 | 0.034188 | 0.148962 | 0.188088 | 0.028213 | 0.46395 | 0.319749 |
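
The three merges are inner joins, so rows without a match on the other side are dropped silently: the outcome data has 87,199 rows and the combined frame has 87,102, a difference of 97. One way to see where rows go missing is an outer join with `indicator=True`. The helper name `audit_merge` and the toy frames below are hypothetical; only the pandas calls are real.

```python
import pandas as pd


def audit_merge(left, right, on):
    """Hypothetical helper: count keys found in both frames or in only one."""
    merged = pd.merge(left, right, on=on, how="outer", indicator=True)
    # the _merge column takes the values 'left_only', 'right_only' or 'both'
    return merged["_merge"].value_counts()


demo_left = pd.DataFrame({"pitcher": [1, 2, 3], "balls": [0, 1, 2]})
demo_right = pd.DataFrame({"pitcher": [2, 3, 4], "pitch_type_0": [0.1, 0.2, 0.3]})

demo_counts = audit_merge(demo_left, demo_right, on="pitcher")
print(demo_counts)
```

In the notebook this could be called as, for example, `audit_merge(outcome_df, game_situation_df, on=['game_pk', 'index', 'batter', 'pitcher'])` before the inner joins are run.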


In [20]:
full_data = full_data.drop(['game_pk', 'index', 'batter', 'pitcher'], axis=1)

# split the dataframe into a feature set and an outcome column
X = full_data.drop('hit_zone', axis=1)
y = full_data['hit_zone']

# split the data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4256)
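
A note on the labels: `hit_zone` takes the values 1 to 4, while recent XGBoost releases may reject multiclass labels that do not start at 0. The sketch below, on synthetic data, shows one way to map the zones to 0..3 with scikit-learn's `LabelEncoder` and back again. It also passes `stratify`, an optional addition that keeps the zone proportions equal in the train and test sets. Neither step is part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# synthetic stand-ins for X and y (zones labeled 1..4, as in hit_zone)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1, 2, 3, 4] * 25)

# map zone labels 1..4 to class ids 0..3
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_demo)

# same split settings as the notebook, plus stratification on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_encoded, test_size=0.3, random_state=4256, stratify=y_encoded
)

# map class ids back to zone labels when reporting predictions
zones_back = encoder.inverse_transform(y_te)
print(sorted(set(y_encoded)), sorted(set(zones_back)))
```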

In [34]:
# ----------------------
# train an XGBoost model
# ----------------------

# small set of hyperparameters to optimize over
# xgb_params = {"max_depth": (3, 5, 10, 15, 20),
#               "learning_rate": (0.01, 0.5, 0.1, 0.2, 0.4),
#               "gamma": (0, 33, 66, 100),
#               "min_child_weight": (0, 33, 66, 100),
#               "colsample_bytree": (0.5, 0.75, 1),
#               "subsample": (0.5, 0.75, 1),}

# # perform the parameter grid search using 5-fold cross validation
# xgb_opt = GridSearchCV(XGBClassifier(objective='multi:softprob', num_class=4), 
#                        param_grid=xgb_params, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)

xgb_opt = XGBClassifier(objective='multi:softprob', num_class=4)

# perform fit and make predictions
xgb_opt.fit(X_train, y_train)
y_pred = xgb_opt.predict(X_test)
y_prob = xgb_opt.predict_proba(X_test)

# compute accuracy
accuracy = round(accuracy_score(y_test, y_pred) * 100, 1)

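# the naive model - predict the zone with the highest prior probability for the batter
def naive_model(df):
    zone_cols = ['batter_zone_1', 'batter_zone_2', 'batter_zone_3', 'batter_zone_4']
    # copy so that renaming the columns does not touch the original frame
    priors = df[zone_cols].copy()
    priors.columns = [1, 2, 3, 4]
    return priors.idxmax(axis=1)
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
y_naive = naive_model(X_test).to_numpy()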

# compute naive accuracy
naive_accuracy = round(accuracy_score(y_test, y_naive) * 100, 1)

print(f"Accuracy of the Naive model: {naive_accuracy}%")
print(f"Accuracy of the XGBoost model: {accuracy}%")

# print the confusion matrix
print()
print("The Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))

Accuracy of the Naive model: 37.4%
Accuracy of the XGBoost model: 37.6%

The Confusion Matrix: 
[[ 596  137 2303 1518]
 [ 189  390 2733 1398]
 [ 339  296 5726 2498]
 [ 411  260 4226 3111]]
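
The confusion matrix says more than the single accuracy figure. The sketch below recomputes per-zone recall, overall accuracy, and a majority-class baseline from the matrix printed above, assuming scikit-learn's convention that rows are true zones and columns are predicted zones. The helper name `summarize_confusion` is hypothetical.

```python
import numpy as np

# confusion matrix copied from the output above (rows: true zone, columns: predicted zone)
cm = np.array([[ 596,  137, 2303, 1518],
               [ 189,  390, 2733, 1398],
               [ 339,  296, 5726, 2498],
               [ 411,  260, 4226, 3111]])


def summarize_confusion(cm):
    """Hypothetical helper: per-class recall, accuracy and majority-class baseline."""
    support = cm.sum(axis=1)                       # number of true examples per zone
    recall = np.diag(cm) / support                 # share of each zone predicted correctly
    accuracy = np.trace(cm) / cm.sum()
    majority_baseline = support.max() / cm.sum()   # always predict the most common zone
    return recall, accuracy, majority_baseline


recall, acc, baseline = summarize_confusion(cm)
predicted_share = cm.sum(axis=0) / cm.sum()        # how often each zone is predicted

print("recall per zone:", recall.round(3))
print("accuracy:", round(acc, 3), "majority baseline:", round(baseline, 3))
print("share of predictions per zone:", predicted_share.round(3))
```

On these numbers, recall is about 65% for zone 3 but only about 8% for zone 2, and roughly 90% of all predictions fall in zones 3 and 4. Always predicting zone 3 would score about 33.9%, so both models sit only a few points above that baseline.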




In [23]:
# print each feature with its importance as a percentage
features = X_train.columns.tolist()
importances = list(xgb_opt.feature_importances_)
for feature, importance in zip(features, importances):
    print(f"{feature}\t{importance * 100.}")

balls	2.109375037252903
strikes	4.257812350988388
outs_when_up	0.11718750465661287
inning	1.4453125186264515
on_1b	0.42968750931322575
on_2b	0.5859375
on_3b	0.0
bat_right	3.203124925494194
pitch_right	1.523437537252903
score_diff	1.796874962747097
pitch_type_0	2.070312574505806
pitch_type_1	3.359375149011612
pitch_type_2	3.476562350988388
pitch_type_3	3.789062425494194
pitch_type_4	3.203124925494194
pitch_type_5	3.203124925494194
pitch_type_6	3.750000149011612
pitch_type_7	3.671874850988388
pitch_type_8	2.773437462747097
pitch_type_9	3.867187350988388
pitch_type_10	3.984374925494194
pitch_type_11	3.398437425494194
pitch_type_12	2.187499962747097
batter_zone_1	11.015625298023224
batter_zone_2	11.562500149011612
batter_zone_3	9.531249850988388
batter_zone_4	9.687499701976776
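
Because the features come in families (game situation, pitcher mix, batter priors), summing importances by family can make the list easier to read. This is a sketch only: the helper name `importance_by_family` and the family labels are hypothetical, and the demo values are made up.

```python
import pandas as pd


def importance_by_family(importances,
                         prefixes=("pitch_type_", "batter_zone_"),
                         default="situation"):
    """Hypothetical helper: sum feature importances by column-name prefix."""
    def family_of(name):
        for prefix in prefixes:
            if name.startswith(prefix):
                return prefix.rstrip("_")
        return default                  # anything else counts as a game-situation feature

    families = importances.index.map(family_of)
    return importances.groupby(families).sum().sort_values(ascending=False)


# made-up importances indexed by feature name
demo_importances = pd.Series({
    "balls": 0.02, "strikes": 0.04,
    "pitch_type_0": 0.02, "pitch_type_1": 0.03,
    "batter_zone_1": 0.11, "batter_zone_2": 0.12,
})
demo_families = importance_by_family(demo_importances)
print(demo_families)
```

In the notebook the input would be `pd.Series(xgb_opt.feature_importances_, index=X_train.columns)`. Summing the values printed above by hand gives roughly 42% for the four `batter_zone_*` columns, 43% for the thirteen `pitch_type_*` columns, and 15% for the ten situation columns.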
