## Potential question
What kinds of pitches are fouled off?
Features that might matter: ball/strike count, pitch type, pitch location, pitch velocity, etc.
One option: treat this as an unsupervised k-means clustering problem

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
data = pd.read_csv('edited_pitch_table.csv')

In [None]:
data.head()

In [None]:
list(data)     # Returns all column headers

In [None]:
data.shape
# 64 features
# 12,134 samples (pitches) in this three-day period

### Features that might matter
pit_hand_cd, bat_hand_cd

pa_ball_ct, pa_strike_ct

outs_ct

pitch_seq

start_bases_cd

event_outs_ct

pitch_res (this is the one that denotes foul or not)

all of the pitch characterization variables that follow in the column list (except sv_id and type_conf); I don't yet know what most of those features represent

In [None]:
#Split dataset into fouls and not-fouls
#Should I split into train/test before or after this step?
fouls = data[data['pitch_res'] == 'F']
notfouls = data[data['pitch_res'] != 'F']
print('Fouls: ', fouls.shape, 'Not Fouls: ', notfouls.shape)

2,050 pitches were fouled off and 10,084 were not, so the dataset is unbalanced.
Possible approaches:
1. Focus on fouled-off pitches and use k-means clustering to identify the types of situations that lead to them
2. Try to classify pitches as fouled or not fouled, then investigate feature importance

I'd start with the second approach, perhaps following up with the first.
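
For reference, the first approach (clustering only the fouled-off pitches) could look roughly like the sketch below. It runs on synthetic data rather than the real table: the column names `px`, `pz`, and `pa_strike_ct` come from the dataset, but the values, the `fouls_demo` frame, and the choice of three clusters are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fouled-off pitches; values are random and
# only meant to make the sketch runnable.
rng = np.random.default_rng(0)
fouls_demo = pd.DataFrame({
    'px': rng.normal(0.0, 0.8, 200),
    'pz': rng.normal(2.5, 0.9, 200),
    'pa_strike_ct': rng.integers(0, 3, 200),
})

# k-means is distance-based, so put the features on a common scale first.
scaled = StandardScaler().fit_transform(fouls_demo)

# k=3 is an arbitrary choice; in practice several values of k should be compared.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
fouls_demo['cluster'] = kmeans.fit_predict(scaled)

# Per-cluster feature means give a rough description of each group.
cluster_summary = fouls_demo.groupby('cluster').mean()
print(cluster_summary)
```

Scaling matters here because count-like columns and location columns live on different ranges, and unscaled k-means would be dominated by whichever has the largest spread.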

### Random forest classifier
Keeping in mind that this is currently an unbalanced dataset.

There are also some categorical (string) variables in here. How will they be handled?

Poorly, it turns out: the classifier can't be fit on string columns, so they need to be encoded first

In [None]:
# Which features are categorical?
data.dtypes    # X isn't defined until later, so inspect the raw table here
# Pitcher hand, batter hand, pitch sequence, pitch type, pitch type sequence

In [3]:
# Try using pd.get_dummies()?
# Could even skip the LabelEncoder step completely?
catcols = ['pit_hand_cd', 'bat_hand_cd', 'pitch_type']
hots = pd.get_dummies(data, columns=catcols, prefix=catcols)
hots.head()
# Seems to work!

Unnamed: 0,retro_game_id,year,st_fl,regseason_fl,playoffs_fl,game_type,game_type_des,game_id,home_team_id,home_team_lg,...,pitch_type_FC,pitch_type_FF,pitch_type_FO,pitch_type_FS,pitch_type_FT,pitch_type_IN,pitch_type_KC,pitch_type_KN,pitch_type_SI,pitch_type_SL
0,HOU201606010,2016,F,T,F,R,Regular Season,447654,hou,AL,...,0,1,0,0,0,0,0,0,0,0
1,HOU201606010,2016,F,T,F,R,Regular Season,447654,hou,AL,...,1,0,0,0,0,0,0,0,0,0
2,HOU201606010,2016,F,T,F,R,Regular Season,447654,hou,AL,...,0,1,0,0,0,0,0,0,0,0
3,HOU201606010,2016,F,T,F,R,Regular Season,447654,hou,AL,...,0,1,0,0,0,0,0,0,0,0
4,HOU201606010,2016,F,T,F,R,Regular Season,447654,hou,AL,...,1,0,0,0,0,0,0,0,0,0


In [4]:
# Some rows have missing values, so drop them (not sure yet which columns they are in)
hots = hots.dropna()
# 20 rows dropped (12,134 -> 12,114). get_dummies() does not create NaN,
# so these rows already had missing values in the raw data
hots.shape

(12114, 78)
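
To find out where the missing values actually are, something like the following could be run before `dropna()`. This is a sketch on a small made-up frame (`demo` and its values are hypothetical); on the real table the same calls would be applied to `data` or `hots`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with a few holes in it.
demo = pd.DataFrame({
    'px': [0.1, np.nan, -0.4, 0.9],
    'pz': [2.2, 2.8, np.nan, 1.9],
    'pitch_type': ['FF', 'SL', 'FF', None],
})

# Missing values per column, showing only columns that have any.
na_counts = demo.isna().sum()
print(na_counts[na_counts > 0])

# The rows that contain at least one missing value.
rows_with_na = demo[demo.isna().any(axis=1)]
print(rows_with_na)

# get_dummies() turns a missing category into all-zero dummy columns rather
# than NaN, so only the numeric holes remain for dropna() to act on.
encoded = pd.get_dummies(demo, columns=['pitch_type'])
print(encoded.isna().sum().sum(), 'missing values after get_dummies')
print(len(encoded.dropna()), 'rows survive dropna()')
```

One consequence worth noting: a row whose only missing field is a one-hot-encoded categorical is not dropped, it simply ends up with zeros in every dummy column for that variable.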

In [5]:
# Split data into X and y
# y should probably be a new binary column ('foul or not', 0 or 1). For now, just get the pipeline working with pitch_res.
y = hots['pitch_res']
# X needs to exclude old encoded columns, but also stuff like the gameID, teamIDs, etc...
dropcols = [
    'retro_game_id',
    'year',
    'st_fl',
    'regseason_fl',
    'playoffs_fl',
    'game_type',
    'game_type_des',
    'game_id',
    'home_team_id',
    'home_team_lg',
    'away_team_id',
    'away_team_lg',
    'interleague_fl',
    'park_name',
    'park_lock',
    'pitch_seq',
    'pa_terminal_fl',
    'pa_event_cd',
    'pitch_res',
    'pitch_des',
    'pitch_id',
    'pitch_type_seq',
    'sv_id'
]
X = hots.drop(dropcols, axis=1)    # Needs to have a list of columns passed
print('X: ', X.shape, 'y: ', y.shape)

X:  (12114, 55) y:  (12114,)
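
The binary "foul or not" target mentioned in the cell above could be built roughly as follows. This is a sketch on synthetic data: only the `'F'` code is taken from the dataset, the other `pitch_res` codes are placeholders, and `stratify=` and `class_weight='balanced'` are suggested ways of handling the imbalance, not something the notebook has tried yet.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pitch table. Only 'F' (foul) is a real code
# from the data; the other result codes are placeholders.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    'px': rng.normal(0.0, 0.8, n),
    'pz': rng.normal(2.5, 0.9, n),
    'pitch_res': rng.choice(['B', 'S', 'F', 'X', 'C'], size=n,
                            p=[0.35, 0.10, 0.17, 0.18, 0.20]),
})

# Binary target: 1 if the pitch was fouled off, 0 otherwise.
y_binary = (demo['pitch_res'] == 'F').astype(int)
X_demo = demo.drop(columns=['pitch_res'])

# stratify keeps the foul rate (roughly) equal in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_binary, test_size=0.33, random_state=42, stratify=y_binary)

# class_weight='balanced' reweights classes inversely to their frequency.
forest = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42)
forest.fit(X_tr, y_tr)
print('Foul rate train/test:', y_tr.mean(), y_te.mean())
```

Building the target on the full table and then splitting with `stratify` also answers the earlier question about ordering: the fouls and not-fouls do not need to be separated into two frames before the train/test split.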


In [6]:
# Train/test split, with original dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
#X_train.dtypes

In [7]:
trainingforest = RandomForestClassifier()
trainingforest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [8]:
trainingforest.score(X_test, y_test)
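# 5 possible outcomes right now (will change to foul T/F later), so ~0.52 beats uniform guessing (0.2), but not great
# Caveat: the outcomes are probably not equally common, so always predicting the most frequent one may score well above 0.2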

0.51775887943971988
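
A chance level of 0.2 assumes all five outcomes are equally likely. A fairer reference point is the accuracy of always predicting the most common outcome, which scikit-learn's `DummyClassifier` computes directly. The sketch below uses made-up labels and class frequencies; on the real data it would be fit on `X_train, y_train` and scored on `X_test, y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with unequal class frequencies (placeholder codes and rates).
rng = np.random.default_rng(0)
labels = rng.choice(['B', 'C', 'F', 'S', 'X'], size=2000,
                    p=[0.40, 0.20, 0.17, 0.13, 0.10])

# DummyClassifier ignores the features, so a column of zeros is enough here.
features = np.zeros((len(labels), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features, labels)
baseline_acc = baseline.score(features, labels)
print('Majority-class baseline accuracy:', baseline_acc)
```

If the forest's score is not clearly above this baseline, it has learned little beyond the class frequencies.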

In [22]:
# Feature importances
colsX = pd.Series(X_test.columns.values)
imps = pd.concat([colsX, pd.Series(trainingforest.feature_importances_)], axis=1).reset_index()
imps.columns = ['index', 'feature', 'importance']
imps.sort_values(by='importance', ascending=False)   # Sorted by importance, highest first
# Some of these features are somewhat opaque to me at the moment, particularly x/px/x0/ax/pfx_x etc.

Unnamed: 0,index,feature,importance
35,35,zone,0.092044
20,20,px,0.059363
12,12,x,0.054957
21,21,pz,0.054341
13,13,y,0.043297
27,27,vz0,0.031992
24,24,z0,0.029702
22,22,x0,0.027987
6,6,pa_strike_ct,0.026807
37,37,spin_rate,0.026596
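
The same ranking can be built more compactly as a `pd.Series` indexed by feature name. The sketch below uses a synthetic classification problem (`make_classification`, with hypothetical `feat_i` column names) so that it runs on its own; with the notebook's objects it would use `trainingforest.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, just to have a fitted forest.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_arr)

# Index the importances by column name and sort, highest first.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.head(10))
```

These impurity-based importances sum to 1 across features, so each value can be read as a share of the total.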
