## Project Description

MLB Advanced Media, according to a job description that intrigued me, was looking to develop insights into the predictability of a hit based on data acquired through its Statcast tool. Statcast is a high-speed, high-accuracy tracking system that follows ball and player movements.

The findings would be used by analysts and commentators during game broadcasts. The problem statement for the specific prediction I undertook is:

Based on the ballistics of the pitch and of the ball hit into play, what is the likelihood that it results in a hit?

## Notebook Description

7\. **Normalize** / **Standardize** the data, run the same models on it, and assess the scores

- Define function to train and test a model using specified predictors and targets
- Apply `MinMaxScaler()` to data
- Train model on normalized data
- Models used:
    - K Nearest Neighbors
    - Logistic Regression
    - Decision Tree Classifier
    - Random Forest Classifier
- Results:
    - While logistic regression, decision tree, and random forest perform the same with `MinMaxScaler` normalization as they did before, K Nearest Neighbors performs significantly worse. `MinMaxScaler` forces every feature into the range 0 to 1, so each feature carries comparable weight in the distance calculation that K Nearest Neighbors relies on.

    - Based on EDA, I believe there are some very unimportant features in the data that contribute more noise now that they are normalized. However, my hypothesis is that this will make them easier to weed out during feature selection later on.

|   Model Name   |   Test Score  |   Train Score   |
| -----------|:---------------:|--------------:|
| K Nearest Neighbors | 0.6662 | 0.7797 |
| Logistic Regression | 0.7209 | 0.7204 |
| Decision Tree Classifier | 0.7563 | 1.0 |
| Random Forest Classifier | 0.8011 | 0.9874 |

### Initialize packages and read in pickled data

In [1]:
%run __init__.py

In [2]:
cd ..

/home/jovyan


In [3]:
df_model = pd.read_pickle('data/df_model.p')

In [4]:
df_model.shape

(127052, 88)

### Set up df to run the `run_benchmark` function on standardized data and compare against un-standardized data

In [5]:
df_model.drop('player_id', axis=1, inplace=True)

In [6]:
target = df_model['hit_flag']
predictors = df_model.drop('hit_flag', axis=1)

In [7]:
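def run_benchmark(model, model_name, dataframe, target_col):
    """Fit `model` on a stratified train/test split and return its train and test scores."""
    # train_test_split is presumably made available by __init__.py
    target = dataframe[target_col]
    tmp_df = dataframe.drop(target_col, axis=1)
    # Stratify so both splits keep the same proportion of hits
    X_train, X_test, y_train, y_test = train_test_split(tmp_df, target, stratify=target)
    model.fit(X_train, y_train)
    # For scikit-learn classifiers, `score` returns mean accuracy
    return {'train_score' : model.score(X_train, y_train), 
            'test_score' : model.score(X_test, y_test), 
            'model_name' : model_name }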

# credit to Joshua Cook
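
Because `run_benchmark` calls `train_test_split` without a seed, each call draws a different split, so scores can shift slightly between runs. The sketch below is one possible variant, not the notebook's method: `run_benchmark_seeded` and its `random_state` argument are hypothetical names, and the data is a synthetic stand-in rather than the Statcast frame.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def run_benchmark_seeded(model, model_name, dataframe, target_col, random_state=42):
    """Hypothetical variant of run_benchmark that fixes the split seed."""
    target = dataframe[target_col]
    features = dataframe.drop(target_col, axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, stratify=target, random_state=random_state
    )
    model.fit(X_train, y_train)
    return {
        "train_score": model.score(X_train, y_train),
        "test_score": model.score(X_test, y_test),
        "model_name": model_name,
    }


# Synthetic stand-in for df_model (assumption: NOT the Statcast data)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
toy_df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
toy_df["hit_flag"] = y

# Two runs with the same seed use the same split and should agree
first = run_benchmark_seeded(LogisticRegression(), "logistic regression", toy_df, "hit_flag")
second = run_benchmark_seeded(LogisticRegression(), "logistic regression", toy_df, "hit_flag")
print(first == second)
```

With a fixed seed, differences between two benchmark tables can be attributed to the preprocessing rather than to a different random split.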

### Standardize Features

In [8]:
df_model_st_all = predictors.copy()
# df_model_st_num = predictors.copy()

**`MinMaxScaler()` on All Features**

In [9]:
standardized = MinMaxScaler().fit_transform(df_model_st_all)
df_standardized = pd.DataFrame(standardized, columns=df_model_st_all.columns, index=df_model_st_all.index)

In [10]:
df_standardized.shape

(127052, 86)

In [11]:
target.shape

(127052,)

In [12]:
df_model_st = pd.concat([df_standardized, target], axis=1)
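
In the cells above, `MinMaxScaler` is fit on every row before `run_benchmark` splits the data, so the minimum and maximum of the test rows influence the scaling. A common alternative is to put the scaler and the model in one scikit-learn pipeline, so the scaler is fit on the training rows only. The sketch below shows that pattern on synthetic data; it is an assumption about how the step could be restructured, not what the notebook does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: NOT the Statcast features)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The pipeline fits the scaler during .fit(), using training rows only
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
scaler = scaled_knn.named_steps["minmaxscaler"]
fit_on_train_only = np.allclose(scaler.data_min_, X_train.min(axis=0))
train_scaled = scaler.transform(X_train)

# Test rows are transformed with the training min/max before scoring
test_score = scaled_knn.score(X_test, y_test)
print(fit_on_train_only, round(test_score, 4))
```

Since `run_benchmark` only calls `fit` and `score` on whatever it receives, a pipeline like `scaled_knn` could in principle be passed to it together with the unscaled dataframe.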

### Run Models on Standardized Data

### K Neighbors Classifier

K Neighbors takes a long time to run, specifically at the scoring step, since prediction has to search the training set for each row's nearest neighbors.
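
The slow scoring is consistent with how nearest-neighbor models work: fitting mostly stores or indexes the training rows, while the neighbor search happens at prediction time. A rough way to see this is to time the two steps separately. The sketch below uses synthetic data with arbitrary sizes, so the timings say nothing about the actual Statcast run.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; sizes are arbitrary assumptions
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
knn = KNeighborsClassifier()

# Time the fit step
start = time.perf_counter()
knn.fit(X, y)
fit_seconds = time.perf_counter() - start

# Time the scoring step, which runs a neighbor search for every row
start = time.perf_counter()
accuracy = knn.score(X, y)
score_seconds = time.perf_counter() - start

print(f"fit: {fit_seconds:.4f}s, score: {score_seconds:.4f}s")
```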

In [19]:
knn_output = run_benchmark(KNeighborsClassifier(n_jobs=7),
                           'kneighbors',
                           df_model_st, 
                           'hit_flag')

In [20]:
knn_output

{'model_name': 'kneighbors',
 'test_score': 0.6662154078644964,
 'train_score': 0.77969125502418957}

### Logistic Regression

In [13]:
log_reg_output = run_benchmark(LogisticRegression(), 
                               'logistic regression',
                               df_model_st, 
                               'hit_flag')

In [14]:
log_reg_output

{'model_name': 'logistic regression',
 'test_score': 0.72093316122532503,
 'train_score': 0.72035596973417704}

### Decision Tree

In [15]:
dtree_output = run_benchmark(DecisionTreeClassifier(), 
                             'decision tree',
                             df_model_st, 
                             'hit_flag')

In [16]:
dtree_output

{'model_name': 'decision tree',
 'test_score': 0.75632024682807042,
 'train_score': 1.0}

### Random Forest Classifier

In [22]:
rand_forest_output = run_benchmark(RandomForestClassifier(), 
                                  'random forest', 
                                  df_model_st, 
                                  'hit_flag')

In [23]:
rand_forest_output

{'model_name': 'random forest',
 'test_score': 0.80108931775965742,
 'train_score': 0.98741722549297406}

### Show benchmark models side-by-side

In [24]:
output = [
    knn_output,
    log_reg_output,
    dtree_output,
    rand_forest_output
]

pd.DataFrame(output)

|   | model_name | test_score | train_score |
| --- | --- | --- | --- |
| 0 | kneighbors | 0.666215 | 0.779691 |
| 1 | logistic regression | 0.720933 | 0.720356 |
| 2 | decision tree | 0.75632 | 1.0 |
| 3 | random forest | 0.801089 | 0.987417 |
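
One way to read this table is to look at the gap between train and test score for each model, since a large gap suggests the model fits the training rows much better than unseen ones. The sketch below recomputes that gap from the scores reported above; `overfit_gap` is a hypothetical column name introduced only for this illustration.

```python
import pandas as pd

# Scores copied from the benchmark output above
results = pd.DataFrame([
    {"model_name": "kneighbors", "test_score": 0.666215, "train_score": 0.779691},
    {"model_name": "logistic regression", "test_score": 0.720933, "train_score": 0.720356},
    {"model_name": "decision tree", "test_score": 0.75632, "train_score": 1.0},
    {"model_name": "random forest", "test_score": 0.801089, "train_score": 0.987417},
])

# Hypothetical helper column: how far train accuracy exceeds test accuracy
results["overfit_gap"] = results["train_score"] - results["test_score"]

# Largest gap first
ranked = results.sort_values("overfit_gap", ascending=False).reset_index(drop=True)
print(ranked[["model_name", "overfit_gap"]])
```

On these numbers the decision tree shows the largest gap and logistic regression essentially none.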


While logistic regression, decision tree, and random forest perform the same with `MinMaxScaler` normalization as they did before, K Nearest Neighbors performs significantly worse. `MinMaxScaler` forces every feature into the range 0 to 1, so each feature carries comparable weight in the distance calculation that K Nearest Neighbors relies on.

Based on EDA, I believe there are some very unimportant features in the data that contribute more noise now that they are normalized. However, my hypothesis is that this will make them easier to weed out during feature selection later on.
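
As a toy illustration of one mechanism that could be behind the K Neighbors drop, the sketch below shows how min-max scaling can change which row counts as the nearest neighbor. The two columns are hypothetical (a wide-range measurement and a 0/1 flag) and are not taken from the Statcast data, so this is not a diagnosis of the actual features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: column 0 is a wide-range measurement, column 1 is a 0/1 flag
X = np.array([
    [100.0, 0.0],  # row 0: the query row
    [105.0, 1.0],  # row 1: close on column 0, different flag
    [300.0, 0.0],  # row 2: far on column 0, same flag
    [400.0, 1.0],  # row 3: far on both
])


def nearest_to_first(rows):
    """Return the index of the row closest (Euclidean) to row 0."""
    distances = np.linalg.norm(rows[1:] - rows[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_neighbor = nearest_to_first(X)
scaled_neighbor = nearest_to_first(MinMaxScaler().fit_transform(X))
print(raw_neighbor, scaled_neighbor)
```

Before scaling, the wide-range column dominates the distance and row 1 is nearest. After scaling, the flag counts as much as the measurement and row 2 becomes nearest. If a flag like this carries little signal, it now has far more say in the neighbor search than it did on the raw data.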