## Project Description

MLB Advanced Media, according to a job description that intrigued me, was looking to develop insights into the predictability of a hit based on data acquired through its Statcast tool. Statcast is a high-speed, high-accuracy tracking system that follows ball and player movements.

The findings would be used by analysts and commentators during game broadcasts. The problem statement for the specific prediction I undertook is:

Based on the ballistics of the pitch and of the ball hit into play, what is the likelihood that it results in a hit?

## Notebook Description

6\. Run **benchmark models** on the data without preprocessing, feature selection, or hyperparameter tuning

- Define a function to train and test a model using specified predictors and targets
- Train each model with no preprocessing, feature selection, or hyperparameter tuning to get a benchmark accuracy score
- Models used:
    - K Nearest Neighbors
    - Logistic Regression
    - Decision Tree Classifier
    - Random Forest Classifier
- Results:
    - Random Forest and K Nearest Neighbors performed best, each with ~80% accuracy on the test data. That beats the ~67% baseline from guessing no hit for everything, but there is room to improve.
    - *Note: The tree models score far higher on the train data than on the test data, which is a sign of overfitting, but their test scores are still close to those of the other two models.*

| Model Name | Test Score | Train Score |
|:-----------|:----------:|------------:|
| K Nearest Neighbors | 0.7922 | 0.8530 |
| Logistic Regression | 0.7196 | 0.7182 |
| Decision Tree Classifier | 0.7544 | 1.0000 |
| Random Forest Classifier | 0.7987 | 0.9871 |

### Initialize packages and read in pickled data

In [2]:
%run __init__.py

In [3]:
cd ..

/home/jovyan


In [4]:
df_model = pd.read_pickle('data/df_model.p')

In [5]:
df_model.shape

(127052, 88)

### Try predicting our target (hit / no-hit) with a few models

**Remember our baseline metric:** If we guessed hit for everything, we'd be right 32.6% of the time; if we guessed no hit for everything, we'd be right 67.4% of the time.
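The baseline above is simply the share of the majority class. As a minimal sketch, assuming a binary target column like the `hit_flag` used below, it could be computed as follows. The small `toy` frame is made-up illustration data, not the Statcast data.

```python
import pandas as pd

# Hypothetical toy data standing in for df_model; the values are made up.
toy = pd.DataFrame({'hit_flag': [0, 0, 1, 0, 1, 0]})

# Share of each class in the target column.
class_shares = toy['hit_flag'].value_counts(normalize=True)

# Always guessing the most common class is right this often.
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping should beat `baseline_accuracy` on held-out data.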

In [18]:
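def run_benchmark(model, model_name, dataframe, target_col):
    """Fit `model` on a stratified train/test split and return its accuracy scores."""
    # Separate the target column from the predictors
    target = dataframe[target_col]
    tmp_df = dataframe.drop(target_col, axis=1)
    # Default split (75% train / 25% test), stratified to preserve the hit / no-hit ratio
    X_train, X_test, y_train, y_test = train_test_split(tmp_df, target, stratify=target)
    model.fit(X_train, y_train)
    # For classifiers, .score() returns mean accuracy
    return {'train_score': model.score(X_train, y_train),
            'test_score': model.score(X_test, y_test),
            'model_name': model_name}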

# credit to Joshua Cook
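Because `run_benchmark` draws a new random split on every call, each model below is scored on a different test set. One possible variant, sketched here under assumptions, fixes the split with `random_state` and reuses it for every model. The helper name `run_benchmarks_shared_split` and the synthetic data from `make_classification` are hypothetical and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_benchmarks_shared_split(models, X, y, random_state=42):
    """Score every model on the same stratified train/test split.

    Hypothetical helper for illustration. `models` maps a display
    name to an unfitted scikit-learn estimator.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=random_state)
    results = []
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        results.append({'model_name': model_name,
                        'train_score': model.score(X_train, y_train),
                        'test_score': model.score(X_test, y_test)})
    return results


# Synthetic stand-in data; the notebook would pass df_model columns instead.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
results = run_benchmarks_shared_split(
    {'logistic regression': LogisticRegression(max_iter=1000),
     'decision tree': DecisionTreeClassifier(random_state=0)},
    X, y)
for row in results:
    print(row)
```

With a shared split, differences in test score reflect the models rather than the luck of the draw.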

### K Neighbors Classifier

In [27]:
knn_output = run_benchmark(KNeighborsClassifier(), 
                           'kneighbors', 
                           df_model.drop(['player_id'], axis=1), 
                           'hit_flag')

In [28]:
knn_output

{'model_name': 'kneighbors',
 'test_score': 0.79221106318672674,
 'train_score': 0.8530260575722276}

### Logistic Regression

In [19]:
log_reg_output = run_benchmark(LogisticRegression(), 
                               'logistic regression',
                               df_model.drop(['player_id'], 
                                         axis=1), 
                               'hit_flag')

In [20]:
log_reg_output

{'model_name': 'logistic regression',
 'test_score': 0.71961086799105878,
 'train_score': 0.71821511402155547}

### Decision Tree

In [21]:
dtree_output = run_benchmark(DecisionTreeClassifier(), 
                             'decision tree',
                             df_model.drop(['player_id'], 
                                         axis=1), 
                             'hit_flag')

In [22]:
dtree_output

{'model_name': 'decision tree',
 'test_score': 0.75443125649340426,
 'train_score': 1.0}

### Random Forest Classifier

In [30]:
rand_forest_output = run_benchmark(RandomForestClassifier(), 
                                  'random forest', 
                                  df_model.drop(['player_id'], 
                                               axis=1), 
                                  'hit_flag')

In [31]:
rand_forest_output

{'model_name': 'random forest',
 'test_score': 0.79872807984132477,
 'train_score': 0.98707091059828522}

### Show benchmark models side-by-side

In [32]:
output = [
    knn_output,
    log_reg_output,
    dtree_output, 
    rand_forest_output
]

pd.DataFrame(output)

|   | model_name | test_score | train_score |
|---|:-----------|-----------:|------------:|
| 0 | kneighbors | 0.792211 | 0.853026 |
| 1 | logistic regression | 0.719611 | 0.718215 |
| 2 | decision tree | 0.754431 | 1.0 |
| 3 | random forest | 0.798728 | 0.987071 |


Random Forest and K Nearest Neighbors performed best, each with ~80% accuracy on the test data. That beats the ~67% baseline from guessing no hit for everything, but there is room to improve.

*Note: The tree models score far higher on the train data than on the test data, which is a sign of overfitting, but their test scores are still close to those of the other two models.*
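One way to put the note above in numbers is the gap between train and test accuracy for each model. The sketch below uses the scores from the benchmark table; the `score_gap` key is a name introduced here for illustration.

```python
# Scores copied from the benchmark table above.
output = [
    {'model_name': 'kneighbors', 'test_score': 0.792211, 'train_score': 0.853026},
    {'model_name': 'logistic regression', 'test_score': 0.719611, 'train_score': 0.718215},
    {'model_name': 'decision tree', 'test_score': 0.754431, 'train_score': 1.0},
    {'model_name': 'random forest', 'test_score': 0.798728, 'train_score': 0.987071},
]

# A large positive gap means the model fits the training data
# much better than it fits unseen data.
for row in output:
    row['score_gap'] = round(row['train_score'] - row['test_score'], 4)

# List the models from largest to smallest gap.
ranked = sorted(output, key=lambda r: r['score_gap'], reverse=True)
for row in ranked:
    print(f"{row['model_name']:<20} {row['score_gap']:.4f}")
```

The decision tree and random forest show the largest gaps (about 0.25 and 0.19), while logistic regression scores almost identically on both sets.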