 # Introduction and Summary
 
The code below takes a set of Statcast data and uses rolling averages of pitch results to predict the probability that a batter hits a home run. Statcast data contains a variety of metrics for pitch events that occur during major league games, including pitch velocities, the result of each pitch (swinging strike, ball, line drive, foul ball, etc.), and exit velocities and launch angles for batted balls. The code organizes these features so that a predictive model of a batter's likelihood of hitting a home run can be fit to them.

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 500)
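wP = 2400   #rolling window size, in pitches
mpP = 100   #minimum number of pitches before a rolling average is reported
wBB = 250   #second window/minimum pair (presumably for batted balls); not used in this section
mpBB = 60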

In [2]:
stat = pd.read_csv('statRaw_all.csv')

 # I. Clean Dataset for Analysis

## A. Remove rare (non-predictive) events and clean description field for analysis.

In [3]:
#remove rare or irrelevant events
irrelevant = ['caught_stealing_2b', 'pickoff_caught_stealing_2b', 'run',
              'caught_stealing_3b', 'pickoff_1b', 'pickoff_2b', 'pickoff_caught_stealing_3b',
              'caught_stealing_home', 'pickoff_caught_stealing_home',
              #rare
              'sac_bunt', 'catcher_interf', 'batter_interference', 'sac_bunt_double_play']

stat = stat[~stat['events'].isin(irrelevant)]

#remove rare or irrelevant descriptions
irrelevant = ['pitchout', 'swinging_pitchout', 'automatic_ball', 'pitchout_hit_into_play_score', 'foul_bunt', 'missed_bunt']

stat = stat[~stat['description'].isin(irrelevant)]

In [4]:
desc = pd.read_csv('desc.csv')
stat = pd.merge(stat, desc, on = ['description'], how = 'left')
#stat = pd.merge(stat, bbEvents, on = ['events', 'bb_type', 'description'], how = 'left')

del desc
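
The contents of `desc.csv` are not shown, so the following is only a sketch of how a lookup merge of this kind behaves. The raw-to-clean mapping, the `unknown_code` value, and the frame names are assumptions made for illustration; only the `descClean` values `hit_into_play`, `swinging_strike`, and `foul` appear later in this notebook.

```python
import pandas as pd

# Toy pitch rows; 'unknown_code' stands in for a description missing from the lookup.
pitches = pd.DataFrame({'description': ['hit_into_play_score',
                                        'swinging_strike_blocked',
                                        'foul',
                                        'unknown_code']})

# Hypothetical lookup table in the shape desc.csv is assumed to have.
lookup = pd.DataFrame({'description': ['hit_into_play_score',
                                       'swinging_strike_blocked',
                                       'foul'],
                       'descClean': ['hit_into_play', 'swinging_strike', 'foul']})

# validate='many_to_one' raises if the lookup has duplicate keys,
# which would otherwise silently duplicate pitch rows.
merged = pd.merge(pitches, lookup, on=['description'], how='left',
                  validate='many_to_one')

# A left merge leaves descClean as NaN for descriptions the lookup does not cover.
unmapped = merged.loc[merged['descClean'].isna(), 'description'].unique().tolist()
print(unmapped)
```

Checking `unmapped` after the merge is a cheap way to spot descriptions that would otherwise be counted as takes in the later swing/take split.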

## B. Create Plate Appearance Identifier for each batter

In [5]:
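#reset the index after sorting so that later index-based steps (e.g. dd.from_pandas, which sorts by index) keep this order
stat = stat.sort_values(['game_date', 'sv_id', 'at_bat_number', 'pitch_number']).reset_index(drop=True)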
stat['batter'] = stat['batter'].astype(float)
#stat['idBatter'] = stat.groupby('batter')['batter'].rank(method='first').astype(int)
stat['tempKey'] = range(0, len(stat))

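#PA column: PAind flags the pitch that ends a plate appearance;
#counting those flags from each batter's last pitch backwards, then inverting, numbers the PAs 1, 2, 3, ...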
stat['PAind'] = np.where((stat['events'].notnull()) | (stat['descClean'] == 'hit_into_play'), 1, 0)
stat['temp'] = stat.sort_values('tempKey', ascending = False).groupby('batter').PAind.transform(lambda x: x.eq(1).cumsum())
stat['maxPA'] = stat.groupby('batter').temp.transform('max')
stat['PA'] = stat['maxPA'] + 1 - stat['temp']
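
To make the reverse cumulative sum easier to follow, here is a small worked example on made-up data (the `toy` frame and its values are illustrative, not Statcast rows). It reuses the same four steps and compares them with a forward-counting variant, which is offered only as a possible equivalent, not as the notebook's method.

```python
import pandas as pd

# Two batters; PAind == 1 marks the pitch that ends a plate appearance.
toy = pd.DataFrame({
    'batter': [1, 1, 1, 1, 1, 2, 2, 2],
    'PAind':  [0, 1, 0, 0, 1, 0, 1, 1],
})
toy['tempKey'] = range(len(toy))

# Same steps as the cell above: count PA endings from the last pitch backwards...
toy['temp'] = (toy.sort_values('tempKey', ascending=False)
                  .groupby('batter').PAind
                  .transform(lambda x: x.eq(1).cumsum()))
# ...then invert so the earliest PA is numbered 1.
toy['maxPA'] = toy.groupby('batter').temp.transform('max')
toy['PA'] = toy['maxPA'] + 1 - toy['temp']

# Forward variant: a pitch belongs to PA (number of PAs already completed) + 1.
toy['PA_forward'] = (toy.groupby('batter').PAind
                        .transform(lambda x: x.shift(fill_value=0).cumsum()) + 1)

print(toy[['batter', 'PAind', 'PA', 'PA_forward']])
```

For batter 1 the first two pitches form PA 1 and the next three form PA 2, and both approaches agree on this toy input.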

**Subset relevant fields**

In [6]:
stat = stat[[
'game_year',
'game_date',
'PAind',
'PA',
'batter',
'batter_name',
'b_stands',
'pitcher',
'p_throws',
'hit_distance_sc',
'launch_speed',
'launch_angle',
'zone',
'events',
'descClean',
'bb_type'
]]

**Using Dask, Clean Data Types**

In [7]:
import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client, progress
client = Client(processes = False)
client.restart()

Client: Scheduler: inproc://192.168.1.108/4364/1, Dashboard: http://localhost:8787
Cluster: Workers: 1, Cores: 4, Memory: 8.46 GB


In [8]:
statD = dd.from_pandas(stat, npartitions=5)

def parse_dates(df):
  return pd.to_datetime(df['game_date'], format = '%Y-%m-%d')

#keep the result lazy (no .compute() here) and declare the dtype the function actually returns
statD['game_date'] = statD.map_partitions(parse_dates, meta = ('game_date', 'datetime64[ns]'))

integers = ['PA', 'batter', 'game_year']

for ints in integers:
    statD[ints] = statD[ints].astype(int)

#floats
flt = ['hit_distance_sc', 'launch_speed', 'launch_angle']

for fl in flt:
    statD[fl] = statD[fl].astype(float)

stat = statD.compute()
del statD

 # II. Generate Predictive Features

We generated several different types of features. First, rolling averages of launch angles, exit velocities, and hit distances were calculated for each hitter using a window of 2400 pitches. Then, we applied 0/1 indicators for different outcome types (doubles, triples, home runs, strikeouts, etc.) and calculated league averages for these features.

After that we calculated plate discipline and bat-to-ball metrics. These include metrics related to the percentage of strikes that a batter swung at or did not swing at, the percentage of balls that a batter swung at or did not swing at, and then the results of each swing (e.g. swinging strike, or contact).

In [9]:
#angle
leagueAvg = stat['launch_angle'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanLaunchAngle'] = leagueAvg
stat['q1LaunchAngle'] = stat['launch_angle'].rolling(window = stat.shape[0], min_periods = 1).quantile(.25)
stat['q3LaunchAngle'] = stat['launch_angle'].rolling(window = stat.shape[0], min_periods = 1).quantile(.75)
stat['lowLaunch'] = np.where(stat['launch_angle'] < stat['q1LaunchAngle'], 1, 0)
#each comparison needs its own parentheses because & binds more tightly than <=
stat['medLaunch'] = np.where((stat['launch_angle'] >= stat['q1LaunchAngle']) \
                             & (stat['launch_angle'] <= stat['q3LaunchAngle']), 1, 0)
stat['highLaunch'] = np.where(stat['launch_angle'] > stat['q3LaunchAngle'], 1, 0)
stat['launchAboveAvg'] = stat['launch_angle'] / stat['meanLaunchAngle'] 

#exit
leagueAvg = stat['launch_speed'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanLaunchSpeed'] = leagueAvg

stat['q1LaunchSpeed'] = stat['launch_speed'].rolling(window = stat.shape[0], min_periods = 1).quantile(.25)
stat['q3LaunchSpeed'] = stat['launch_speed'].rolling(window = stat.shape[0], min_periods = 1).quantile(.75)
stat['softContact'] = np.where(stat['launch_speed'] < stat['q1LaunchSpeed'], 1, 0)
#each comparison needs its own parentheses because & binds more tightly than <=
stat['medContact'] = np.where((stat['launch_speed'] >= stat['q1LaunchSpeed']) \
                             & (stat['launch_speed'] <= stat['q3LaunchSpeed']), 1, 0)
stat['hardContact'] = np.where(stat['launch_speed'] > stat['q3LaunchSpeed'], 1, 0)

stat['contactAboveAvg'] = stat['launch_speed'] / stat['meanLaunchSpeed']

#distance
leagueAvg = stat['hit_distance_sc'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanDistance'] = leagueAvg

stat['q1Distance'] = stat['hit_distance_sc'].rolling(window = stat.shape[0], min_periods = 1).quantile(.25)
stat['q3Distance'] = stat['hit_distance_sc'].rolling(window = stat.shape[0], min_periods = 1).quantile(.75)
stat['shortDistance'] = np.where(stat['hit_distance_sc'] < stat['q1Distance'], 1, 0)
#each comparison needs its own parentheses because & binds more tightly than <=
stat['medDistance'] = np.where((stat['hit_distance_sc'] >= stat['q1Distance']) \
                             & (stat['hit_distance_sc'] <= stat['q3Distance']), 1, 0)
stat['longDistance'] = np.where(stat['hit_distance_sc'] > stat['q3Distance'], 1, 0)

stat['distanceAboveAvg'] = stat['hit_distance_sc'] / stat['meanDistance']
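
Two details of the cell above are easy to miss, so here is a minimal sketch on made-up exit velocities (the `speeds` series and the fixed `q1`/`q3` cut-offs are illustrative assumptions). A rolling window as long as the whole frame with `min_periods = 1` behaves like an expanding window: each row's "league average" only reflects the rows up to and including it, in row order. Separately, `&` binds more tightly than `<=` in Python, so a between-quartiles test needs parentheses around each comparison.

```python
import numpy as np
import pandas as pd

# NaN represents a pitch that was not put in play.
speeds = pd.Series([90.0, np.nan, 100.0, 80.0, 110.0])

# Full-length rolling window vs. expanding window: same running mean.
rolled = speeds.rolling(window=len(speeds), min_periods=1).mean()
expanded = speeds.expanding(min_periods=1).mean()
print(pd.DataFrame({'rolled': rolled, 'expanded': expanded}))

# Bucketing with fixed, illustrative cut-offs; comparisons against NaN are False.
q1, q3 = 85.0, 105.0
soft = np.where(speeds < q1, 1, 0)
med = np.where((speeds >= q1) & (speeds <= q3), 1, 0)
hard = np.where(speeds > q3, 1, 0)
print(soft, med, hard)
```

Because the statistics are running values, the result depends on how the rows are sorted at the time the cell runs.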


The features below calculate batted ball averages for the entire league, as well as a player's relative performance for each batted ball type.

In [10]:
#strike indicator
stat['isStrike'] = np.where(stat['zone'] <= 9, 1, 0)
#outcome indicators

#1B
stat['outcome1B'] = np.where(stat['events'] == 'single', 1, 0)
leagueAvg = stat['outcome1B'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['mean1B'] = leagueAvg
#2B
stat['outcome2B'] = np.where(stat['events'] == 'double', 1, 0)
leagueAvg = stat['outcome2B'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['mean2B'] = leagueAvg
#3B
stat['outcome3B'] = np.where(stat['events'] == 'triple', 1, 0)
leagueAvg = stat['outcome3B'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['mean3B'] = leagueAvg
#HR
stat['outcomeHR'] = np.where(stat['events'] == 'home_run', 1, 0)
leagueAvg = stat['outcomeHR'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanHR'] = leagueAvg
#walk
stat['outcomeBB'] = np.where(stat['events'] == 'walk', 1, 0)
leagueAvg = stat['outcomeBB'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanBB'] = leagueAvg
#strikeout
stat['outcomeK'] = np.where(stat['events'] == 'strikeout', 1, 0)
leagueAvg = stat['outcomeK'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanK'] = leagueAvg
#Hit by pitch
stat['outcomeHBP'] = np.where(stat['events'] == 'hit_by_pitch', 1, 0)
leagueAvg = stat['outcomeHBP'].rolling(window = stat.shape[0], min_periods = 1).mean()
stat['meanHBP'] = leagueAvg

#batted ball categories
stat['groundball'] = np.where(stat['bb_type'] == 'ground_ball', 1,0)
stat['linedrive'] = np.where(stat['bb_type'] == 'line_drive', 1,0)
stat['flyball'] = np.where(stat['bb_type'] == 'fly_ball', 1,0)
stat['popup'] = np.where(stat['bb_type'] == 'popup', 1,0)
stat['foul'] = np.where(stat['descClean'] == 'foul', 1,0)

stat.drop(['bb_type'], inplace = True, axis = 1)
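
The outcome blocks above all follow one pattern: a 0/1 indicator for an event, followed by its running league mean. As a sketch only, the same pattern can be generated from a mapping. The `toy` frame and the `outcomes` dictionary are illustrative names, and `expanding()` is used here as a stand-in for the full-length rolling window.

```python
import numpy as np
import pandas as pd

# Toy pitch rows; None marks a pitch that did not end a plate appearance.
toy = pd.DataFrame({'events': ['single', None, 'home_run',
                               'strikeout', None, 'home_run']})

# Column suffix -> event label (a subset of the outcomes used above).
outcomes = {'1B': 'single', 'HR': 'home_run', 'K': 'strikeout'}

for suffix, event in outcomes.items():
    # 0/1 indicator for the event on each pitch.
    toy['outcome' + suffix] = np.where(toy['events'] == event, 1, 0)
    # Running league rate of that event per pitch, in row order.
    toy['mean' + suffix] = toy['outcome' + suffix].expanding(min_periods=1).mean()

print(toy)
```

Note that these means are per pitch, not per plate appearance, because every pitch row contributes a 0 or 1.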

**Plate Discipline and Bat to Ball Ability**

In [11]:
#feature generation
swings = ['hit_into_play', 'foul', 'swinging_strike']

#batter decision
stat['threshSwing'] = np.where(stat['descClean'].isin(swings), 1, 0)

#in or out of zone result
stat['zSwing'] = np.where((stat['isStrike'] == 1) & (stat['threshSwing'] == 1), 1, 0)
stat['oSwing'] = np.where((stat['isStrike'] == 0) & (stat['threshSwing'] == 1), 1, 0)

stat['zTake'] = np.where((stat['isStrike'] == 1) & (stat['threshSwing'] == 0), 1, 0)
stat['oTake'] = np.where((stat['isStrike'] == 0) & (stat['threshSwing'] == 0), 1, 0)

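##dfs
#idBatter numbers each batter's pitches in row order; it must exist before it is selected below
#and is used later to keep the last pitch of each PA
stat['idBatter'] = stat.groupby('batter')['batter'].rank(method='first').astype(int)

swingDf = stat.loc[stat['threshSwing'] == 1, ['idBatter', 'PA', 'batter', 'zSwing', 'oSwing', \
                                              'groundball', 'linedrive', 'flyball', 'popup', \
                                              'foul', 'lowLaunch', 'medLaunch', \
                                             'highLaunch', 'launchAboveAvg', 'softContact', 'medContact', \
                                             'hardContact', 'contactAboveAvg', 'shortDistance', 'medDistance', \
                                             'longDistance', 'distanceAboveAvg']]
takeDf = stat.loc[stat['threshSwing'] == 0, ['idBatter', 'PA', 'batter', 'zTake', 'oTake', \
                                             'descClean']]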

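
The swing/take split can be checked on a handful of made-up pitches. In this sketch the `zone` values follow the rule used above (zones 1 to 9 count as strikes), the `swings` list is the one from the notebook, and the `ball` and `called_strike` labels are assumed `descClean` values used only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'zone': [5, 12, 3, 14, 11, 9],
    'descClean': ['foul', 'ball', 'called_strike',
                  'swinging_strike', 'ball', 'hit_into_play'],
})

swings = ['hit_into_play', 'foul', 'swinging_strike']
toy['isStrike'] = np.where(toy['zone'] <= 9, 1, 0)
toy['threshSwing'] = np.where(toy['descClean'].isin(swings), 1, 0)

# Every pitch falls into exactly one of the four cells.
toy['zSwing'] = np.where((toy['isStrike'] == 1) & (toy['threshSwing'] == 1), 1, 0)
toy['oSwing'] = np.where((toy['isStrike'] == 0) & (toy['threshSwing'] == 1), 1, 0)
toy['zTake'] = np.where((toy['isStrike'] == 1) & (toy['threshSwing'] == 0), 1, 0)
toy['oTake'] = np.where((toy['isStrike'] == 0) & (toy['threshSwing'] == 0), 1, 0)

# Averaging within the swing rows gives the share of swings that chased a ball;
# averaging within the take rows gives the share of takes that were strikes.
chase_share = toy.loc[toy['threshSwing'] == 1, 'oSwing'].mean()
taken_strike_share = toy.loc[toy['threshSwing'] == 0, 'zTake'].mean()
print(chase_share, taken_strike_share)
```

Because the averages are taken within the swing rows and the take rows separately, they describe shares of swings and shares of takes rather than shares of all pitches seen.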


**Swing Events Data:**

In [12]:
swingDf = pd.get_dummies(swingDf, prefix = 'swing', prefix_sep = '_')

#calculate rolling averages
cols = list(swingDf.columns)
del cols[0:3]

for col in cols:
    swingDf[col] = swingDf.groupby('batter')[col].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)

swingDf = swingDf.sort_values('idBatter', ascending=False).drop_duplicates(['PA','batter'])
swingDf = swingDf.sort_values(['batter', 'PA', 'idBatter'])

swingDf.drop('idBatter', inplace = True, axis = 1)
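
The per-batter rolling average and the "keep the last pitch of each PA" step can be traced on a toy frame. This is a sketch under assumptions: the data are made up, the window is shortened to 3 pitches with a minimum of 2 so the numbers are easy to verify, and `idBatter` is built here with `cumcount()` as a per-batter pitch counter.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'batter': [1, 1, 1, 2, 2, 1],
    'PA':     [1, 1, 2, 1, 1, 2],
    'foul':   [1, 0, 1, 0, 1, 0],
})
# Per-batter pitch counter in row order (assumed meaning of idBatter).
toy['idBatter'] = toy.groupby('batter').cumcount() + 1

# groupby().rolling() returns a (batter, original index) MultiIndex;
# dropping level 0 restores the original index so the assignment aligns row by row.
toy['foulRate'] = (toy.groupby('batter')['foul']
                      .rolling(window=3, min_periods=2).mean()
                      .reset_index(0, drop=True))

# Keep only the last pitch of each (batter, PA): sort so it comes first, then dedupe.
last = (toy.sort_values('idBatter', ascending=False)
           .drop_duplicates(['PA', 'batter'])
           .sort_values(['batter', 'PA']))
print(last[['batter', 'PA', 'foulRate']])
```

Each surviving row carries the rolling value as of the final swing in that plate appearance, which is the one-row-per-PA shape the later join expects.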

**Take Events Data:**

In [13]:
takeDf = pd.get_dummies(takeDf, prefix = 'take', prefix_sep = '_')

cols = list(takeDf.columns)
del cols[0:3]

for col in cols:
    takeDf[col] = takeDf.groupby('batter')[col].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)

takeDf = takeDf.sort_values('idBatter', ascending=False).drop_duplicates(['PA','batter'])
takeDf = takeDf.sort_values(['batter', 'PA', 'idBatter'])

takeDf.drop('idBatter', inplace = True, axis = 1)
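
The take frame keeps `descClean` so that `get_dummies` can expand it into one indicator column per take result. A minimal sketch of that expansion follows. The `ball` and `called_strike` labels are assumed values, and `dtype=float` is an addition made here so the indicators average cleanly regardless of pandas version.

```python
import pandas as pd

takes = pd.DataFrame({
    'batter': [1, 1, 1],
    'descClean': ['ball', 'called_strike', 'ball'],
})

# Only the text column is encoded; numeric columns such as batter pass through unchanged.
dummies = pd.get_dummies(takes, prefix='take', prefix_sep='_', dtype=float)
print(list(dummies.columns))

# The mean of an indicator column is the share of takes with that result.
ball_share = dummies['take_ball'].mean()
print(ball_share)
```

Rolling means of these `take_` columns then describe how a batter's taken pitches split between result types.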

**Opponent Pitcher Metrics:**

As a final step in feature generation, we calculate data for a batter's opponent pitcher. This will allow our model to identify when a batter is facing a pitcher who is relatively good or bad at preventing home runs, or limiting balls in play, or limiting fly balls.

In [14]:
stat['pitcherHR'] = stat.groupby('pitcher')['outcomeHR'].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)
stat['pitcherK'] = stat.groupby('pitcher')['outcomeK'].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)
stat['pitcherGB'] = stat.groupby('pitcher')['groundball'].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)
stat['pitcherFB'] = stat.groupby('pitcher')['flyball'].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)
stat['pitcherLD'] = stat.groupby('pitcher')['linedrive'].rolling(window = wP, min_periods = mpP).mean().reset_index(0,drop=True)
stat['platoon'] = np.where(stat['b_stands'] == stat['p_throws'], 1, 0)
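
The pitcher features follow the same grouped rolling pattern, keyed on `pitcher` instead of `batter`. The sketch below uses made-up rows and a shortened window (3 pitches, minimum 2). The lagged column at the end is an optional variation and not part of the notebook: it shows one way to make a pitcher's rate exclude the current pitch, which may matter if the current pitch's outcome is also the prediction target.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'pitcher':   [10, 10, 10, 10, 20, 20],
    'outcomeHR': [0, 1, 0, 1, 0, 0],
    'b_stands':  ['L', 'R', 'R', 'L', 'R', 'L'],
    'p_throws':  ['R', 'R', 'R', 'R', 'L', 'L'],
})

# Rolling home run rate allowed by each pitcher, aligned back to the pitch rows.
toy['pitcherHR'] = (toy.groupby('pitcher')['outcomeHR']
                       .rolling(window=3, min_periods=2).mean()
                       .reset_index(0, drop=True))

# 1 when batter and pitcher share a handedness, as in the cell above.
toy['platoon'] = np.where(toy['b_stands'] == toy['p_throws'], 1, 0)

# Optional (assumption): the rate as it stood before the current pitch.
toy['pitcherHR_prior'] = toy.groupby('pitcher')['pitcherHR'].shift(1)
print(toy)
```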

In [15]:
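#create Stat Feat for Join
#shift PA back by one so each row joins to the swing/take features computed through the batter's previous PA
stat['PA'] = stat['PA'] - 1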
statResult = stat.loc[stat['PAind'] == 1, ['game_date', 'PA', 'batter', 'b_stands', \
                                           'pitcher', 'p_throws', 'outcome1B', 'outcome2B', \
                                           'outcome3B', 'outcomeHR']]

#join all
statFeat = pd.merge(stat[['PA', 'batter', 'mean1B', 'mean2B', 'mean3B', 'meanHR', 'meanBB', 'meanK', 'meanHBP', \
                         'pitcherHR', 'pitcherK', 'pitcherGB', 'pitcherFB', 'pitcherLD', \
                          'platoon']], \
                      swingDf, on = ['PA', 'batter'], how = 'left')
statFeat = pd.merge(statFeat, takeDf, on = ['PA', 'batter'], how = 'left')
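
The effect of subtracting 1 from `PA` before the join is easiest to see on a toy example. The frames and values below are made up, and `swing_foul` stands in for any rolling feature. After the shift, the outcome of a batter's PA k+1 is paired with the features that were computed through PA k.

```python
import pandas as pd

# Outcomes, one row per completed plate appearance.
results = pd.DataFrame({'batter': [1, 1, 1],
                        'PA': [1, 2, 3],
                        'outcomeHR': [0, 1, 0]})

# Rolling features as of the end of each plate appearance.
features = pd.DataFrame({'batter': [1, 1, 1],
                         'PA': [1, 2, 3],
                         'swing_foul': [0.40, 0.45, 0.50]})

# Shift the outcome side back by one PA, then join on the shared keys.
results['PA'] = results['PA'] - 1
joined = results.merge(features, on=['PA', 'batter'], how='left')
print(joined)
```

The first plate appearance has no earlier features to join to, so its feature columns come back as NaN and would need to be dropped or imputed before modeling.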

In [16]:
statFeat.to_csv('statFeatures.csv')
statResult.to_csv('statResults.csv')