## Project Description

MLB Advanced Media, according to a job description that intrigued me, was looking to develop insights into the predictability of a hit based on data acquired through its Statcast tool. Statcast is a high-speed, high-accuracy tracking system that follows ball and player movements.

The findings of this task would be used by analysts and commentators during game broadcasts. The problem statement for the specific prediction I undertook is:

Based on the ballistics of the pitch and the ball hit into play, what is the likelihood it results in a hit?

## Notebook Description

10\. **Resample** classes so our target classes (hit, no_hit) are balanced.

- Resample target class
- Apply `MinMaxScaler()` to data so all values are strictly positive
- Deskew data using `box_cox()`
- Standardize data using `StandardScaler()`
- Using `SelectFromModel()` with a `LogisticRegression()` estimator and an L1 penalty, select important features
- Train model on deskewed, normalized, and feature optimized data
- Define function to train and test a model using specified predictors and targets
- Models used:
    - K Nearest Neighbors
    - Logistic Regression
    - Decision Tree Classifier
    - Random Forest Classifier
- Results:
    - Test scores drop slightly after resampling the target classes. This suggests the earlier accuracy scores were partly inflated by the class imbalance rather than by predictive skill. While the models don't score as high, the results are more representative of how they'll perform on out-of-sample data.

|   Model Name   |   Test Score  |   Train Score   |
| -----------|:---------------:|--------------:|
| K Nearest Neighbors | 0.7792 | 0.8456 |
| Logistic Regression | 0.6989 | 0.6943 |
| Decision Tree Classifier | 0.7407 | 0.9998 |
| Random Forest Classifier | 0.7783 | 0.9838 |

### Initialize packages and read in pickled data

In [2]:
# ! pip install scrapy
# ! pip install psycopg2
# ! pip install sqlalchemy
# ! pip install missingno --quiet
# ! pip install scipy

In [3]:
%run __init__.py

In [4]:
cd ..

/home/jovyan


In [5]:
df_model = pd.read_pickle('data/df_model.p')

In [6]:
df_model.shape

(127052, 88)

### Resampling by Down-Sampling `no_hit` class

In [7]:
df_model_hit = df_model[df_model['hit_flag']==True]
df_model_no_hit = df_model[df_model['hit_flag']==False]
df_model_hit.shape, df_model_no_hit.shape

((41459, 88), (85593, 88))

In [50]:
df_model_no_hit_rs = df_model_no_hit.sample(df_model_hit.shape[0])

In [51]:
df_model_no_hit_rs.shape

(41459, 88)

In [52]:
df_model_rs = pd.concat([df_model_hit, df_model_no_hit_rs], axis=0)

In [53]:
df_model_rs.shape

(82918, 88)
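
Because `sample()` is called without a seed, the down-sampled subset changes on every run. A minimal sketch of a reproducible variant, using a small made-up frame in place of `df_model` (the helper name `downsample_majority` and the seed value are assumptions, not part of the original notebook):

```python
import pandas as pd

def downsample_majority(df, flag_col, random_state=42):
    """Down-sample the larger class of a boolean column to the size of the smaller one."""
    positives = df[df[flag_col]]
    negatives = df[~df[flag_col]]
    # Order the two groups by size so the smaller one is kept whole
    minority, majority = sorted([positives, negatives], key=len)
    majority_rs = majority.sample(n=len(minority), random_state=random_state)
    return pd.concat([minority, majority_rs], axis=0)

# Toy data standing in for df_model: 3 hits, 7 non-hits
toy = pd.DataFrame({'hit_flag': [True] * 3 + [False] * 7, 'ev_mph': range(10)})
balanced = downsample_majority(toy, 'hit_flag')
print(balanced['hit_flag'].value_counts())
```

Fixing `random_state` makes the resampled scores comparable across runs, at the cost of always discarding the same majority-class rows.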

### Set up target and predictors

In [54]:
df_model_rs.drop('player_id', axis=1, inplace=True)

In [55]:
target = df_model_rs['hit_flag']
predictors = df_model_rs.drop('hit_flag', axis=1)

Box-Cox requires strictly positive values, so I'll start this workflow by applying a `MinMaxScaler` with a lower bound just above zero to my data.

### `MinMaxScaler`

In [56]:
df_model_proc_all = predictors.copy()

In [57]:
min_max = MinMaxScaler(feature_range=(1E-10,1))

In [58]:
df_model_mm = pd.DataFrame(min_max.fit_transform(df_model_proc_all), 
                           index=df_model_proc_all.index, 
                           columns=df_model_proc_all.columns)
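
Box-Cox is only defined for strictly positive inputs, which is why the lower bound of `feature_range` sits slightly above zero rather than at zero. A small sketch on made-up values (the array is illustrative only and not taken from the Statcast data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature that includes negative and zero values
raw = np.array([[-3.0], [0.0], [2.0], [5.0]])

# Default range [0, 1]: the smallest value maps to 0, which Box-Cox cannot handle
scaled_default = MinMaxScaler().fit_transform(raw)

# Range [1e-10, 1]: the smallest value maps to a tiny positive number instead
scaled_positive = MinMaxScaler(feature_range=(1e-10, 1)).fit_transform(raw)

print(scaled_default.min(), scaled_positive.min())
```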

### Skew-Normalize Features

#### `box_cox`

In [59]:
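from scipy.stats import boxcox

def box_cox(predictors):
    '''Return a dataframe with each column Box-Cox transformed to reduce skew'''
    # Start from the input index so rows stay aligned with the target
    df_model_bc = pd.DataFrame(index=predictors.index)
    for col in predictors.columns:
        # boxcox returns the transformed values and the fitted lambda
        transformed, lmbda = boxcox(predictors[col])
        df_model_bc[col] = transformed
    
    return df_model_bc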

In [60]:
df_model_skewnorm = box_cox(df_model_mm)

  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))


In [61]:
df_model_skewnorm.head(3)

| unique_id | mph | ev_mph | dist | spin_rate | launch_angle | zone_1.0 | zone_11.0 | zone_12.0 | zone_13.0 | zone_14.0 | ... | full_pitch_Knuckle-curve | full_pitch_Knuckleball | full_pitch_Pitch out | full_pitch_Screwball | full_pitch_Slider | full_pitch_Two-Seam Fastball | full_pitch_Unidentified | pitch_rollup_fastball | pitch_rollup_offspeed | pitch_rollup_other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 434378-29 | -0.098449 | -0.121862 | -0.193561 | -0.291016 | -0.336621 | -123726900.0 | -2451574000.0 | -37621600000.0 | -306825700.0 | -28132490.0 | ... | -1.039342e+21 | -2.004817e+95 | -1.341372e+154 | -1.340828e+154 | -2639.70659 | -1399.03279 | -2.030953e+130 | 0.0 | -66.682923 | -2.275311e+114 |
| 434378-38 | -0.108344 | -0.116241 | -0.507854 | -0.31475 | -0.460872 | -123726900.0 | 0.0 | -37621600000.0 | -306825700.0 | -28132490.0 | ... | -1.039342e+21 | -2.004817e+95 | -1.341372e+154 | -1.340828e+154 | -2639.70659 | -1399.03279 | -2.030953e+130 | 0.0 | -66.682923 | -2.275311e+114 |
| 434378-56 | -0.243565 | -0.256109 | -0.576965 | -0.233709 | -0.35595 | -123726900.0 | -2451574000.0 | -37621600000.0 | -306825700.0 | -28132490.0 | ... | -1.039342e+21 | -2.004817e+95 | -1.341372e+154 | -1.340828e+154 | -2639.70659 | -1399.03279 | -2.030953e+130 | -10.691163 | 0.0 | -2.275311e+114 |
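
The very large magnitudes in the `zone_*` and `full_pitch_*` columns appear to come from applying Box-Cox to one-hot columns, which after scaling hold only the two values `1e-10` and `1`. A hedged sketch of one alternative, which transforms only columns with more than two distinct values and passes the rest through unchanged (the helper name `box_cox_continuous` and the toy frame are assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

def box_cox_continuous(df):
    """Box-Cox transform columns with more than two distinct values; copy the rest as-is."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        if df[col].nunique() > 2:
            # boxcox returns the transformed values and the fitted lambda
            transformed, _lmbda = boxcox(df[col].to_numpy())
            out[col] = transformed
        else:
            out[col] = df[col]  # binary (one-hot) column: leave untouched
    return out

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # Skewed, strictly positive stand-in for a continuous feature
    'ev_mph': rng.lognormal(size=200),
    # Stand-in for a one-hot column after MinMaxScaler(feature_range=(1e-10, 1))
    'zone_1.0': np.where(rng.integers(0, 2, size=200) == 1, 1.0, 1e-10),
})
result = box_cox_continuous(toy)
print(result.describe())
```

Since `StandardScaler()` is applied afterwards, the extreme values are rescaled either way; the sketch mainly avoids numerical warnings and keeps the intermediate frame readable.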


### Standardize Features

**`StandardScaler()` on All Features**

I switched from the `MinMaxScaler()` used in prior notebooks to `StandardScaler()`. I had originally run `StandardScaler()` in those notebooks and later tried `MinMaxScaler()` instead. The switch didn't change the predictive scores, but `MinMaxScaler()` led the feature selection to flag some odd features as important, so I decided to stick with `StandardScaler()`.

In [62]:
standardized = (StandardScaler().fit_transform(df_model_skewnorm))
df_standardized = pd.DataFrame(standardized, columns=df_model_skewnorm.columns, index=df_model_skewnorm.index)

In [63]:
df_standardized.shape

(82918, 86)

In [64]:
target.shape

(82918,)

### Feature Selection

#### `SelectFromModel` with L1 penalty estimator

In [65]:
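# solver='liblinear' supports the L1 penalty (it was the default solver when this was run)
sfm = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'), threshold='mean')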
sfm.fit(df_standardized, target)

SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
        prefit=False, threshold='mean')

In [66]:
sfm_feats = np.where(sfm.get_support())[0]
sfm_feats

array([ 1,  2,  4,  8,  9, 18])

In [67]:
columns = list(df_standardized.columns)

sfm_feats_names = []
for i in sfm_feats:
    sfm_feats_names.append(columns[i])
    
sfm_feats_names

['ev_mph', 'dist', 'launch_angle', 'zone_13.0', 'zone_14.0', 'zone_unknown']
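
The index-to-name loop can also be written by indexing the column labels with the boolean support mask directly. A small self-contained sketch on synthetic data (the data and column names are made up for illustration; `solver='liblinear'` is specified because it supports the L1 penalty):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=['a', 'b', 'c', 'd', 'e'])
# Synthetic target that depends only on columns 'a' and 'c'
y = (X['a'] + 2 * X['c'] + rng.normal(scale=0.1, size=500)) > 0

sfm = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'),
                      threshold='mean')
sfm.fit(X, y)

# get_support() returns a boolean mask aligned with the columns
selected = list(X.columns[sfm.get_support()])
print(selected)
```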

### Create dataframe with only selected features

In [68]:
df_slim = df_standardized[sfm_feats_names]

In [69]:
df_slim.shape, target.shape

((82918, 6), (82918,))

### `train_test_split` the data

In [70]:
X_train, X_test, y_train, y_test = train_test_split(df_slim, target, test_size=.2, stratify=target)

### K Neighbors Classifier - manual modeling

In [71]:
knn = KNeighborsClassifier()

In [72]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [73]:
y_pred = knn.predict(X_test)

In [74]:
print(accuracy_score(y_test, y_pred))

0.775747708635


In [75]:
knn.score(X_test, y_test)

0.77574770863482878

In [76]:
print(f1_score(y_test, y_pred))

0.775137553661


### Define `run_benchmark` function to automate modeling

In [77]:
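def run_benchmark(model, model_name, predictors, target):
    '''Fit model on a stratified train/test split and return its scores as a dict'''
    # Default split sizes (75% train, 25% test), stratified on the target
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, stratify=target)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # score() is mean accuracy for classifiers; kappa corrects accuracy for chance agreement
    return {'train_score' : model.score(X_train, y_train), 
            'test_score' : model.score(X_test, y_test),
            'kappa_score': cohen_kappa_score(y_pred, y_test),
            'model_name' : model_name }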

# credit to Joshua Cook

### Run Models on Resampled, Skew-Normalized, and Standardized Data

I also report `cohen_kappa_score()`, which corrects accuracy for chance agreement, to check that the class imbalance was handled properly in this step.
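
Cohen's kappa rescales the observed agreement p_o by the agreement p_e expected from the label frequencies alone: kappa = (p_o - p_e) / (1 - p_e). A minimal pure-Python sketch of that formula for illustration (the function name `kappa` and the toy labels are assumptions; the notebook itself uses `sklearn.metrics.cohen_kappa_score`):

```python
from collections import Counter

def kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of matching labels (plain accuracy)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: probability both label the same class independently
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy balanced labels: 6 of 8 predictions agree
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(kappa(y_true, y_pred))
```

With balanced classes and roughly balanced predictions, p_e is close to 0.5, so kappa is approximately 2 * accuracy - 1. That matches the scores reported below, for example 2 * 0.7792 - 1 ≈ 0.558 for K Nearest Neighbors.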

### K Neighbors Classifier

In [78]:
from sklearn.metrics import cohen_kappa_score

In [79]:
knn_output = run_benchmark(KNeighborsClassifier(n_jobs=7), 
                               'kneighbors',
                               df_slim, 
                               target)

In [80]:
knn_output

{'kappa_score': 0.55832127351664251,
 'model_name': 'kneighbors',
 'test_score': 0.77916063675832126,
 'train_score': 0.84556506078343086}

### Logistic Regression

In [81]:
log_reg_output = run_benchmark(LogisticRegression(n_jobs=7), 
                               'logistic regression',
                               df_slim, 
                               target)

In [82]:
log_reg_output

{'kappa_score': 0.39787747226242165,
 'model_name': 'logistic regression',
 'test_score': 0.69893873613121082,
 'train_score': 0.69431401556570405}

### Decision Tree

In [83]:
dtree_output = run_benchmark(DecisionTreeClassifier(), 
                             'decision tree',
                             df_slim, 
                             target)

In [84]:
dtree_output

{'kappa_score': 0.48142788229618905,
 'model_name': 'decision tree',
 'test_score': 0.74071394114809452,
 'train_score': 0.99980703672734295}

### Random Forest Classifier

In [85]:
rand_forest_output = run_benchmark(RandomForestClassifier(n_jobs=7), 
                                  'random forest', 
                                  df_slim, 
                                  target)

In [86]:
rand_forest_output

{'kappa_score': 0.55668113844669564,
 'model_name': 'random forest',
 'test_score': 0.77834056922334782,
 'train_score': 0.98377500482408187}

### Show benchmark models side-by-side

In [88]:
output = [
    knn_output,
    log_reg_output,
    dtree_output, 
    rand_forest_output
]

pd.DataFrame(output)

|   | kappa_score | model_name | test_score | train_score |
| --- | --- | --- | --- | --- |
| 0 | 0.558321 | kneighbors | 0.779161 | 0.845565 |
| 1 | 0.397877 | logistic regression | 0.698939 | 0.694314 |
| 2 | 0.481428 | decision tree | 0.740714 | 0.999807 |
| 3 | 0.556681 | random forest | 0.778341 | 0.983775 |


Test scores drop slightly after resampling the target classes. This suggests the earlier accuracy scores were partly inflated by the class imbalance rather than by predictive skill. While the models don't score as high, the results are more representative of how they'll perform on out-of-sample data.
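
One way to put these scores in context is the accuracy of always predicting the majority class, computed from the class counts shown earlier (41,459 hits and 85,593 non-hits). A short sketch of that arithmetic:

```python
# Class counts taken from the shapes printed earlier in the notebook
n_hit, n_no_hit = 41459, 85593

# Accuracy of always predicting the majority class (no_hit) on the original data
baseline_before = max(n_hit, n_no_hit) / (n_hit + n_no_hit)

# After down-sampling no_hit to match hit, both classes have n_hit rows
baseline_after = n_hit / (2 * n_hit)

print(round(baseline_before, 4), baseline_after)
```

Against a baseline of 0.5, the resampled test accuracies of roughly 0.70 to 0.78 reflect skill beyond class frequency, whereas on the imbalanced data an accuracy of about 0.67 was available with no model at all.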