## Project Description

MLB Advanced Media, according to a job description that caught my interest, was looking to develop insights into the predictability of a hit based on data acquired through its Statcast tool. Statcast is a high-speed, high-accuracy device that tracks ball and player movements.

The findings would be used by analysts and commentators during game broadcasts. The problem statement for the specific prediction I undertook is:

Based on the ballistics of the pitch and of the ball hit into play, what is the likelihood that it results in a hit?

## Notebook Description

12\. Using one of the stronger performing models from prior steps, attempt to **tune hyperparameters** to improve model performance.

**Model:** 
`RandomForestClassifier()`

**Hyperparameters tuned:** 
n_estimators, criterion

**Results:**
Best parameters: n_estimators = 9, criterion = 'gini'
Mean cross-validated accuracy (10-fold) = ~77.6%

### Initialize packages and read in pickled data

In [1]:
# ! pip install scrapy
# ! pip install psycopg2
# ! pip install sqlalchemy
# ! pip install missingno --quiet
# ! pip install scipy

In [2]:
%run __init__.py

In [3]:
cd ..

/home/jovyan


In [4]:
df_model = pd.read_pickle('data/df_model.p')

In [5]:
df_model.shape

(127052, 88)

### Resampling by Down-Sampling `no_hit` class

In [6]:
df_model_hit = df_model[df_model['hit_flag']==True]
df_model_no_hit = df_model[df_model['hit_flag']==False]
df_model_hit.shape, df_model_no_hit.shape

((41459, 88), (85593, 88))

In [7]:
df_model_no_hit_rs = df_model_no_hit.sample(df_model_hit.shape[0])

In [8]:
df_model_no_hit_rs.shape

(41459, 88)

In [9]:
df_model_rs = pd.concat([df_model_hit, df_model_no_hit_rs], axis=0)

In [10]:
df_model_rs.shape

(82918, 88)
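
The `sample` call above draws a different random subset on every run, so downstream scores can shift slightly between runs. A minimal sketch of a reproducible version on a toy frame with the same `hit_flag` column; the helper name `downsample_majority`, the `random_state` value, and the toy numbers are illustrative assumptions and not part of the original notebook:

```python
import pandas as pd

def downsample_majority(df, flag_col="hit_flag", random_state=42):
    """Down-sample the larger class so both classes have equal row counts."""
    positives = df[df[flag_col]]
    negatives = df[~df[flag_col]]
    # Order the two classes by size; the larger one is sampled down.
    minority, majority = sorted([positives, negatives], key=len)
    majority_rs = majority.sample(n=len(minority), random_state=random_state)
    return pd.concat([minority, majority_rs], axis=0)

# Toy data: 3 hits and 7 non-hits (hypothetical values).
toy = pd.DataFrame({
    "ev_mph": [95.0, 101.2, 88.4, 70.1, 65.3, 99.9, 80.0, 77.7, 91.5, 60.2],
    "hit_flag": [True, True, True, False, False, False, False, False, False, False],
})
balanced = downsample_majority(toy)
print(balanced["hit_flag"].value_counts())
```

Fixing the seed keeps the balanced sample identical across reruns, which makes later model comparisons easier to interpret.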

### Set up target and predictors

In [11]:
df_model_rs.drop('player_id', axis=1, inplace=True)

In [12]:
target = df_model_rs['hit_flag']
predictors = df_model_rs.drop('hit_flag', axis=1)

In [13]:
predictors.columns

Index(['mph', 'ev_mph', 'dist', 'spin_rate', 'launch_angle', 'zone_1.0',
       'zone_11.0', 'zone_12.0', 'zone_13.0', 'zone_14.0', 'zone_2.0',
       'zone_3.0', 'zone_4.0', 'zone_5.0', 'zone_6.0', 'zone_7.0', 'zone_8.0',
       'zone_9.0', 'zone_unknown', 'ab_count_0-0', 'ab_count_0-1',
       'ab_count_0-2', 'ab_count_1-0', 'ab_count_1-1', 'ab_count_1-2',
       'ab_count_2-0', 'ab_count_2-1', 'ab_count_2-2', 'ab_count_3-0',
       'ab_count_3-1', 'ab_count_3-2', 'inning_Bot 1', 'inning_Bot 10',
       'inning_Bot 11', 'inning_Bot 12', 'inning_Bot 13', 'inning_Bot 14',
       'inning_Bot 15', 'inning_Bot 16', 'inning_Bot 17', 'inning_Bot 18',
       'inning_Bot 19', 'inning_Bot 2', 'inning_Bot 3', 'inning_Bot 4',
       'inning_Bot 5', 'inning_Bot 6', 'inning_Bot 7', 'inning_Bot 8',
       'inning_Bot 9', 'inning_Top 1', 'inning_Top 10', 'inning_Top 11',
       'inning_Top 12', 'inning_Top 13', 'inning_Top 14', 'inning_Top 15',
       'inning_Top 16', 'inning_Top 17', 'inning_Top 18

Box-Cox requires strictly positive values, so I'll start this workflow by applying a `MinMaxScaler` to my data.

### `MinMaxScaler`

In [14]:
df_model_proc_all = predictors.copy()

In [15]:
min_max = MinMaxScaler(feature_range=(1E-10,1))

In [16]:
df_model_mm = pd.DataFrame(min_max.fit_transform(df_model_proc_all), 
                           index=df_model_proc_all.index, 
                           columns=df_model_proc_all.columns)
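
To illustrate why the lower bound is set to `1E-10` rather than left at the default of 0: a default `MinMaxScaler` maps each column's minimum to 0, which Box-Cox cannot accept. A small sketch on made-up values (the column names are borrowed from the notebook, the numbers are not):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two of the notebook's columns.
toy = pd.DataFrame({"mph": [78.0, 85.5, 92.1, 99.4],
                    "launch_angle": [-12.0, 3.0, 18.0, 41.0]})

default_scaled = MinMaxScaler().fit_transform(toy)
shifted_scaled = MinMaxScaler(feature_range=(1e-10, 1)).fit_transform(toy)

print(default_scaled.min(axis=0))   # column minima sit at zero
print(shifted_scaled.min(axis=0))   # column minima sit at a tiny positive floor
print((shifted_scaled > 0).all())   # every value is now valid Box-Cox input
```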

### Skew-Normalize Features

#### `box_cox`

In [17]:
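from scipy.stats import boxcox

def box_cox(predictors):
    '''Return a dataframe with each column of `predictors` Box-Cox transformed to reduce skew.'''
    # Reuse the input index so rows stay aligned with the target
    df_model_bc = pd.DataFrame(index=predictors.index)
    for col in predictors.columns:
        # boxcox returns the transformed values and the fitted lambda
        transformed, lmbda = boxcox(predictors[col].values)
        df_model_bc[col] = transformed

    return df_model_bc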
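
For reference, the standard one-parameter Box-Cox transform that `scipy.stats.boxcox` applies, with $\lambda$ chosen by maximum likelihood for each column when none is supplied, is:

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \ln x, & \lambda = 0
\end{cases}
\qquad x > 0
```

Because the function discards each fitted `lmbda`, the same transform cannot be re-applied to new data later; keeping the lambdas would be needed for that.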

In [18]:
df_model_skewnorm = box_cox(df_model_mm)

  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))


In [19]:
df_model_skewnorm.head(3)

| unique_id | mph | ev_mph | dist | spin_rate | launch_angle | zone_1.0 | zone_11.0 | zone_12.0 | zone_13.0 | zone_14.0 | ... | full_pitch_Knuckle-curve | full_pitch_Knuckleball | full_pitch_Pitch out | full_pitch_Screwball | full_pitch_Slider | full_pitch_Two-Seam Fastball | full_pitch_Unidentified | pitch_rollup_fastball | pitch_rollup_offspeed | pitch_rollup_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 434378-29 | -0.098396 | -0.121813 | -0.193546 | -0.295169 | -0.333682 | -102205600.0 | -2491220000.0 | -36599900000.0 | -334457100.0 | -25237370.0 | ... | -3.316738e+20 | -7.360592e+99 | -1.341306e+154 | -1.340828e+154 | -2667.748417 | -1406.596871 | -6.131121e+130 | 0.0 | -67.029032 | -1.697548e+116 |
| 434378-38 | -0.108279 | -0.116197 | -0.507745 | -0.318682 | -0.458085 | -102205600.0 | 0.0 | -36599900000.0 | -334457100.0 | -25237370.0 | ... | -3.316738e+20 | -7.360592e+99 | -1.341306e+154 | -1.340828e+154 | -2667.748417 | -1406.596871 | -6.131121e+130 | 0.0 | -67.029032 | -1.697548e+116 |
| 434378-56 | -0.243138 | -0.255841 | -0.576823 | -0.238371 | -0.353044 | -102205600.0 | -2491220000.0 | -36599900000.0 | -334457100.0 | -25237370.0 | ... | -3.316738e+20 | -7.360592e+99 | -1.341306e+154 | -1.340828e+154 | -2667.748417 | -1406.596871 | -6.131121e+130 | -10.65764 | 0.0 | -1.697548e+116 |


### Create dataframe with only selected features

In [20]:
df_model = df_model_skewnorm.copy()

In [21]:
df_model.shape, target.shape

((82918, 86), (82918,))

### Build Pipeline

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [23]:
# scaler = StandardScaler()
# logreg_feats_est = LogisticRegression(penalty='l1')
# sfm = SelectFromModel(logreg_feats_est, threshold='mean')
# knn = KNeighborsClassifier()

In [30]:
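pipe = Pipeline([
    ('scaler', StandardScaler()),
    # solver='liblinear' supports the L1 penalty; it was the default solver in older scikit-learn releases
    ('sfm', SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'), threshold='mean')),
    ('rand_forest', RandomForestClassifier())
])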

In [34]:
pipe_params = {
    'rand_forest__n_estimators':range(5,11),
    'rand_forest__criterion':['gini', 'entropy']
}

In [35]:
rf_gs = GridSearchCV(pipe, param_grid=pipe_params, cv=10)

Need to figure out capacity issue
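
If the capacity issue is compute time, it may help to count the fits before launching the search. A rough sketch, assuming the grid and `cv=10` from the cells above; each fit re-runs the whole pipeline, including the L1 logistic regression inside `SelectFromModel`:

```python
from sklearn.model_selection import ParameterGrid

pipe_params = {
    'rand_forest__n_estimators': range(5, 11),
    'rand_forest__criterion': ['gini', 'entropy'],
}
cv_folds = 10

# Number of distinct parameter combinations in the grid
n_candidates = len(ParameterGrid(pipe_params))
# Every candidate is fit once per fold, plus one final refit on all the data
n_fits = n_candidates * cv_folds + 1
print(n_candidates, n_fits)
```

Passing `n_jobs=-1` to `GridSearchCV` spreads these fits across the available cores, which is one common way to ease this kind of bottleneck.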

In [36]:
rf_gs.fit(df_model, target)

  x = um.multiply(x, x, out=x)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('sfm', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=N...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'rand_forest__n_estimators': range(5, 11), 'rand_forest__criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [37]:
rf_gs.best_params_

{'rand_forest__criterion': 'gini', 'rand_forest__n_estimators': 9}

In [38]:
rf_gs.best_score_

0.77634530500012056
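
`best_score_` is the mean cross-validated accuracy of the winning combination, so on its own it does not show how close the other candidates were. A minimal sketch of inspecting the full grid through `cv_results_`, run here on synthetic data from `make_classification` with a reduced grid and `cv=3` purely to keep it quick (none of these values come from the notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Statcast features (not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rand_forest', RandomForestClassifier(random_state=0)),
])
params = {
    'rand_forest__n_estimators': [5, 9],
    'rand_forest__criterion': ['gini', 'entropy'],
}
gs = GridSearchCV(pipe, param_grid=params, cv=3)
gs.fit(X, y)

# One row per parameter combination, best-ranked first.
results = (pd.DataFrame(gs.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(results)
```

Comparing `mean_test_score` against `std_test_score` across rows gives a sense of whether the best combination is meaningfully ahead or within fold-to-fold noise.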