In this project, I will be creating a predictive model based on MLB pitches. The data I will be providing comes from <a href="https://www.kaggle.com/pschale/mlb-pitch-data-20152018#pitches.csv">Kaggle</a>, and covers MLB pitches recorded between 2015 and 2018 along with other related data. The model will use variables to make a binary prediction whether the pitch about to be thrown will be a fastball or not. I also test several different models to see which produces the highest scoring results. This notebook is replicated from a previous capstone project, but is instead done with Dask Dataframes and Arrays instead of Pandas and Numpy.

In [1]:
!pip install --upgrade "dask[complete]"

Requirement already up-to-date: dask[complete] in c:\users\brian\anaconda3\lib\site-packages (2.11.0)


In [53]:
!pip install dask-ml

Collecting dask-ml
  Downloading https://files.pythonhosted.org/packages/f4/ee/65f5b61f0f40b3709b91920bfa9cb4820f542a514590e024f746304c7443/dask_ml-1.2.0-py3-none-any.whl (124kB)
Collecting multipledispatch>=0.4.9
  Downloading https://files.pythonhosted.org/packages/89/79/429ecef45fd5e4504f7474d4c3c3c4668c267be3370e4c2fd33e61506833/multipledispatch-0.6.0-py3-none-any.whl
Collecting dask-glm>=0.2.0
  Downloading https://files.pythonhosted.org/packages/cb/ee/36c6e0e7b51e08406e5c3bb036f35adb77bd0a89335437b2e6f03c948f1a/dask_glm-0.2.0-py2.py3-none-any.whl
Installing collected packages: multipledispatch, dask-glm, dask-ml
Successfully installed dask-glm-0.2.0 dask-ml-1.2.0 multipledispatch-0.6.0


<center><h2>
    Exploratory Data Analysis
    </h2></center>

In [1]:
# Imports here
%matplotlib inline

import matplotlib.pyplot as plt
import dask.array as da
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
import joblib
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import boxcox
from sklearn.model_selection import cross_validate, GridSearchCV
from dask_ml.preprocessing import QuantileTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_curve, precision_recall_curve, roc_auc_score, classification_report, roc_auc_score



In [2]:
from dask.distributed import Client, progress
client = Client(n_workers=6, threads_per_worker=2, memory_limit='3GB')
client

0,1
Client  Scheduler: tcp://127.0.0.1:63298  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 6  Cores: 12  Memory: 18.00 GB


In [3]:
pitches_df = dd.read_csv(r'C:\Users\Brian\Desktop\pitches.csv')

In [4]:
atbat_df = dd.read_csv (r'C:\Users\Brian\Desktop\atbats.csv')

In [5]:
pitches_df.compute()

Unnamed: 0,px,pz,start_speed,end_speed,spin_rate,spin_dir,break_angle,break_length,break_y,ax,...,event_num,b_score,ab_id,b_count,s_count,outs,pitch_num,on_1b,on_2b,on_3b
0,0.416000,2.963000,92.9,84.1,2305.052,159.235,-25.0,3.2,23.7,7.665000,...,3,0.0,2.015000e+09,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.191000,2.347000,92.8,84.1,2689.935,151.402,-40.7,3.4,23.7,12.043000,...,4,0.0,2.015000e+09,0.0,1.0,0.0,2.0,0.0,0.0,0.0
2,-0.518000,3.284000,94.1,85.2,2647.972,145.125,-43.7,3.7,23.7,14.368000,...,5,0.0,2.015000e+09,0.0,2.0,0.0,3.0,0.0,0.0,0.0
3,-0.641000,1.221000,91.0,84.0,1289.590,169.751,-1.3,5.0,23.8,2.104000,...,6,0.0,2.015000e+09,0.0,2.0,0.0,4.0,0.0,0.0,0.0
4,-1.821000,2.083000,75.4,69.6,1374.569,280.671,18.4,12.0,23.8,-10.280000,...,7,0.0,2.015000e+09,1.0,2.0,0.0,5.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4026,0.230280,1.784910,83.9,78.2,608.594,275.800,10.5,8.8,23.9,-5.097492,...,568,1.0,2.018186e+09,3.0,2.0,2.0,9.0,0.0,0.0,0.0
4027,-1.211049,2.479020,97.7,89.5,2262.907,155.009,-26.9,2.9,23.8,9.344343,...,572,1.0,2.018186e+09,0.0,0.0,2.0,1.0,1.0,0.0,0.0
4028,-0.444578,2.619287,97.3,89.6,2514.010,147.166,-40.5,3.3,23.8,13.292081,...,577,1.0,2.018186e+09,1.0,0.0,2.0,2.0,0.0,1.0,0.0
4029,-0.259813,1.336484,95.9,88.6,2318.775,144.921,-34.2,3.8,23.8,12.786338,...,578,1.0,2.018186e+09,1.0,1.0,2.0,3.0,0.0,1.0,0.0


In [6]:
pitches_df.columns

Index(['px', 'pz', 'start_speed', 'end_speed', 'spin_rate', 'spin_dir',
       'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot',
       'sz_top', 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0',
       'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'code', 'type', 'pitch_type',
       'event_num', 'b_score', 'ab_id', 'b_count', 's_count', 'outs',
       'pitch_num', 'on_1b', 'on_2b', 'on_3b'],
      dtype='object')

In [7]:
atbat_df.compute()

Unnamed: 0,ab_id,batter_id,event,g_id,inning,o,p_score,p_throws,pitcher_id,stand,top
0,2015000001,572761,Groundout,201500001,1,1,0,L,452657,L,True
1,2015000002,518792,Double,201500001,1,1,0,L,452657,L,True
2,2015000003,407812,Single,201500001,1,1,0,L,452657,R,True
3,2015000004,425509,Strikeout,201500001,1,2,0,L,452657,R,True
4,2015000005,571431,Strikeout,201500001,1,3,0,L,452657,L,True
...,...,...,...,...,...,...,...,...,...,...,...
740384,2018185570,543768,Groundout,201802431,9,3,1,L,448802,L,True
740385,2018185571,502517,Strikeout,201802431,9,1,3,L,623352,L,False
740386,2018185572,450314,Flyout,201802431,9,2,3,L,623352,R,False
740387,2018185573,595879,Single,201802431,9,2,3,L,623352,R,False


In [5]:
df_joined = dd.merge(left=pitches_df, right=atbat_df)

In [6]:
df_joined.compute()

Unnamed: 0,px,pz,start_speed,end_speed,spin_rate,spin_dir,break_angle,break_length,break_y,ax,...,batter_id,event,g_id,inning,o,p_score,p_throws,pitcher_id,stand,top
0,0.416000,2.963000,92.9,84.1,2305.052,159.235,-25.0,3.2,23.7,7.665000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
1,-0.191000,2.347000,92.8,84.1,2689.935,151.402,-40.7,3.4,23.7,12.043000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
2,-0.518000,3.284000,94.1,85.2,2647.972,145.125,-43.7,3.7,23.7,14.368000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
3,-0.641000,1.221000,91.0,84.0,1289.590,169.751,-1.3,5.0,23.8,2.104000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
4,-1.821000,2.083000,75.4,69.6,1374.569,280.671,18.4,12.0,23.8,-10.280000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4026,0.230280,1.784910,83.9,78.2,608.594,275.800,10.5,8.8,23.9,-5.097492,...,595879,Single,201802431,9,2,3,L,623352,R,False
4027,-1.211049,2.479020,97.7,89.5,2262.907,155.009,-26.9,2.9,23.8,9.344343,...,519203,Flyout,201802431,9,3,3,L,623352,L,False
4028,-0.444578,2.619287,97.3,89.6,2514.010,147.166,-40.5,3.3,23.8,13.292081,...,519203,Flyout,201802431,9,3,3,L,623352,L,False
4029,-0.259813,1.336484,95.9,88.6,2318.775,144.921,-34.2,3.8,23.8,12.786338,...,519203,Flyout,201802431,9,3,3,L,623352,L,False


In [10]:
df_joined.columns

Index(['px', 'pz', 'start_speed', 'end_speed', 'spin_rate', 'spin_dir',
       'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot',
       'sz_top', 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0',
       'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'code', 'type', 'pitch_type',
       'event_num', 'b_score', 'ab_id', 'b_count', 's_count', 'outs',
       'pitch_num', 'on_1b', 'on_2b', 'on_3b', 'batter_id', 'event', 'g_id',
       'inning', 'o', 'p_score', 'p_throws', 'pitcher_id', 'stand', 'top'],
      dtype='object')

In [11]:
df_joined.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 50 entries, px to top
dtypes: object(6), bool(1), float64(36), int64(7)

In [7]:
# I elect to drop all null values here.
df_joined = df_joined.dropna()

In [8]:
df_joined.compute()

Unnamed: 0,px,pz,start_speed,end_speed,spin_rate,spin_dir,break_angle,break_length,break_y,ax,...,batter_id,event,g_id,inning,o,p_score,p_throws,pitcher_id,stand,top
0,0.416000,2.963000,92.9,84.1,2305.052,159.235,-25.0,3.2,23.7,7.665000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
1,-0.191000,2.347000,92.8,84.1,2689.935,151.402,-40.7,3.4,23.7,12.043000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
2,-0.518000,3.284000,94.1,85.2,2647.972,145.125,-43.7,3.7,23.7,14.368000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
3,-0.641000,1.221000,91.0,84.0,1289.590,169.751,-1.3,5.0,23.8,2.104000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
4,-1.821000,2.083000,75.4,69.6,1374.569,280.671,18.4,12.0,23.8,-10.280000,...,572761,Groundout,201500001,1,1,0,L,452657,L,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4026,0.230280,1.784910,83.9,78.2,608.594,275.800,10.5,8.8,23.9,-5.097492,...,595879,Single,201802431,9,2,3,L,623352,R,False
4027,-1.211049,2.479020,97.7,89.5,2262.907,155.009,-26.9,2.9,23.8,9.344343,...,519203,Flyout,201802431,9,3,3,L,623352,L,False
4028,-0.444578,2.619287,97.3,89.6,2514.010,147.166,-40.5,3.3,23.8,13.292081,...,519203,Flyout,201802431,9,3,3,L,623352,L,False
4029,-0.259813,1.336484,95.9,88.6,2318.775,144.921,-34.2,3.8,23.8,12.786338,...,519203,Flyout,201802431,9,3,3,L,623352,L,False


In [9]:
# Creating dataframe I will be cleaning using only the columns needed for feature engineering
df = df_joined[['pitch_type','b_score', 'p_score', 'b_count', 's_count', 'pitch_num', 'outs', 'on_1b', 'on_2b', 'on_3b',
                'inning', 'p_throws', 'stand']]

In [15]:
df.isnull().sum().compute()

pitch_type    0
b_score       0
p_score       0
b_count       0
s_count       0
pitch_num     0
outs          0
on_1b         0
on_2b         0
on_3b         0
inning        0
p_throws      0
stand         0
dtype: int64

In [10]:
df.compute()

Unnamed: 0,pitch_type,b_score,p_score,b_count,s_count,pitch_num,outs,on_1b,on_2b,on_3b,inning,p_throws,stand
0,FF,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,L,L
1,FF,0.0,0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1,L,L
2,FF,0.0,0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,1,L,L
3,FF,0.0,0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,1,L,L
4,CU,0.0,0,1.0,2.0,5.0,0.0,0.0,0.0,0.0,1,L,L
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4026,SL,1.0,3,3.0,2.0,9.0,2.0,0.0,0.0,0.0,9,L,R
4027,FF,1.0,3,0.0,0.0,1.0,2.0,1.0,0.0,0.0,9,L,L
4028,FF,1.0,3,1.0,0.0,2.0,2.0,0.0,1.0,0.0,9,L,L
4029,FF,1.0,3,1.0,1.0,3.0,2.0,0.0,1.0,0.0,9,L,L


In [11]:
# Defining my outcome variable fastball, the four pitch types listed are all different types of fastballs
df['fastball'] = df['pitch_type'].isin(['FF', 'FT', 'FC', 'FS'])

In [12]:
df.head(15)

Unnamed: 0,pitch_type,b_score,p_score,b_count,s_count,pitch_num,outs,on_1b,on_2b,on_3b,inning,p_throws,stand,fastball
0,FF,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,L,L,True
1,FF,0.0,0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1,L,L,True
2,FF,0.0,0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,1,L,L,True
3,FF,0.0,0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,1,L,L,True
4,CU,0.0,0,1.0,2.0,5.0,0.0,0.0,0.0,0.0,1,L,L,False
5,FF,0.0,0,2.0,2.0,6.0,0.0,0.0,0.0,0.0,1,L,L,True
6,FF,0.0,0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1,L,L,True
7,FC,0.0,0,1.0,0.0,2.0,1.0,0.0,0.0,0.0,1,L,L,True
8,FF,0.0,0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1,L,R,True
9,FF,0.0,0,1.0,0.0,2.0,1.0,0.0,1.0,0.0,1,L,R,True


In [13]:
# The next few rows are some feature engineering work in which I create new features
df['same_side'] = (df['p_throws'] == df['stand'])

In [14]:
df = df.drop(columns=['p_throws', 'stand', 'pitch_type'])

In [15]:
df['pitcher_losing'] = (df['p_score'] < df['b_score'])

In [16]:
df['RISP'] = (df['on_2b'] + df['on_3b'] > 0)

In [17]:
df.compute()

Unnamed: 0,b_score,p_score,b_count,s_count,pitch_num,outs,on_1b,on_2b,on_3b,inning,fastball,same_side,pitcher_losing,RISP
0,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,True,True,False,False
1,0.0,0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1,True,True,False,False
2,0.0,0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,1,True,True,False,False
3,0.0,0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,1,True,True,False,False
4,0.0,0,1.0,2.0,5.0,0.0,0.0,0.0,0.0,1,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4026,1.0,3,3.0,2.0,9.0,2.0,0.0,0.0,0.0,9,False,False,False,False
4027,1.0,3,0.0,0.0,1.0,2.0,1.0,0.0,0.0,9,True,True,False,False
4028,1.0,3,1.0,0.0,2.0,2.0,0.0,1.0,0.0,9,True,True,False,True
4029,1.0,3,1.0,1.0,3.0,2.0,0.0,1.0,0.0,9,True,True,False,True


<center><h2>
    Model Evaluation
    </h2></center

Because this is a project that I had previously completed, I only include a GBC model since I know that this model performed better than the other models I tested.

In [18]:
# Fastball is my output variable. RISP, inning, and pitch_num had multicollinearity and dropping them improved model performance
y = df['fastball']
X = df.drop(columns=['fastball', 'RISP', 'inning', 'pitch_num'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state=42)

In [27]:
type(y)

dask.dataframe.core.Series

In [19]:
# making room for cpu memory
del df_joined, pitches_df, atbat_df

<h2>
    Gradient Boosting Classifier
    </h2>

Here's some parameter tuning using Dask. I didn't include too much of the hyperparameter tuning because it takes a long time to run.

In [20]:
gbc = GradientBoostingClassifier()

with joblib.parallel_backend('dask'):
    gbc.fit(X_train.compute(), y_train.compute())
    
preds_train = gbc.predict(X_train.values.compute())
preds_test = gbc.predict(X_test.values.compute())

print("Gradient boosting tree training score is: ", roc_auc_score(preds_train, y_train.values.compute()))
print("Gradient boosting tree test score is: ", roc_auc_score(preds_test, y_test.values.compute()))

Gradient boosting tree training score is:  0.5516374478526636
Gradient boosting tree test score is:  0.5512990322250508


In [22]:
gbc_params = {"max_depth": [2, 4, 6]}

gbc = GradientBoostingClassifier()

grid_search_gbc = GridSearchCV(gbc,
                           param_grid=gbc_params,
                           return_train_score=True,
                           iid=True,
                           cv=4,
                           n_jobs=-1, 
                           scoring='roc_auc')

In [23]:
with joblib.parallel_backend('dask'):
    grid_search_gbc.fit(X_train.compute(), y_train.compute())

In [24]:
print("The best value is: ", grid_search_gbc.best_params_)
print("The test AUC score is: ", grid_search_gbc.score(X_test.compute(), y_test.compute()))

The best value is:  {'max_depth': 6}
The test AUC score is:  0.5860352731843442
