# Overview
1. Setup
2. Hyperparameter (HP) Definition
3. Description of HPs for a few common models (what the HPs do, how they affect the model, and the tradeoffs)
4. Build a simple random forest model with no tuning
5. Tuning Methods

Exercise: have attendees build a new model from scratch and incorporate HP tuning.

# Setup
Download data [here](https://www.kaggle.com/nickhould/craft-cans)

Perform the following steps using a terminal:

- Clone the repository: `git clone https://github.com/benattix/ml_tutorials.git`

- Change directory: `cd ml_tutorials`

- Create new conda environment with .yml file: `conda env create -f environment.yml`

- Activate the new environment: `source activate hptuning` (with conda 4.4 or later, `conda activate hptuning`)

- Launch a Jupyter notebook: `jupyter notebook`

# Definition
A hyperparameter is a parameter whose value is set before the learning process begins, in contrast to model parameters, which are learned from the training data.
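
To make the distinction concrete, here is a minimal sketch using scikit-learn. It assumes synthetic data from `make_classification` (not the beer dataset), and the names `X_demo` and `y_demo` are illustrative only. The values passed to the constructor are hyperparameters; the fitted trees and feature importances are what the model learns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed here purely for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameters: chosen by us before training starts.
model = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
hyperparams = model.get_params()

# Learned parameters: produced by fit(), e.g. the trees and their splits.
model.fit(X_demo, y_demo)
n_trees = len(model.estimators_)
importances = model.feature_importances_

print("Set before training:", hyperparams["n_estimators"], hyperparams["max_depth"])
print("Learned during training:", n_trees, "trees,", importances.shape[0], "importances")
```

Changing `n_estimators` or `max_depth` changes how the model is trained, but neither value is itself adjusted by `fit()`.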

## Example

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.ensemble import RandomForestClassifier

In [None]:
import lightgbm as lgb

In [None]:
df = pd.read_csv('data/beers.csv', index_col=0)
print(df.shape)
df.head()

In [None]:
# drop rows with a null 'ibu'
df = df.dropna(subset=['ibu'], axis=0)
print(df.shape)

In [None]:
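# Bag-of-words counts for one text column. Passing the whole DataFrame would
# only iterate over its column names, so vectorize a single column instead.
# (Choosing 'style' here is an assumption; 'name' would work the same way.)
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(df['style'].fillna(''))
X_text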

In [None]:
# split into features (X) and target (Y), dropping the free-text columns
X = df.drop(['ibu', 'name', 'style'], axis=1)
Y = df['ibu']

In [None]:
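# baseline random forest with default hyperparameters (no tuning)
# note: 'ibu' is numeric, so the classifier treats each distinct value as its own class
rf = RandomForestClassifier()
rf.fit(X, Y)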
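
Before tuning anything, it helps to record a baseline score to compare against. The sketch below shows one way to do that with `train_test_split` and `cross_val_score`, both imported earlier. It assumes synthetic data so that it runs on its own; `X_demo`, `y_demo`, and `baseline` are illustrative names, not part of the tutorial's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data, assumed for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hold out 20% of the rows for a final check after tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Untuned model: every hyperparameter is left at its default.
baseline = RandomForestClassifier(random_state=0)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="f1_weighted")
print("Baseline CV f1_weighted: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Fit on the full training split and score on the held-out rows.
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)
print("Baseline held-out accuracy: %.3f" % test_accuracy)
```

Any tuning method is then judged by how much it improves on the cross-validated baseline, with the held-out split kept aside until the end.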

# Tuning Methods
1. Manual Search
2. Grid Search
3. Random Search
4. Bayesian Search

(Describe each method, give a hands-on example of how to use it, and explain its pros and cons.)

## Manual Search
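Manual search means choosing a handful of hyperparameter settings by hand, scoring each one, and adjusting based on what you see. A minimal sketch follows, assuming synthetic data and a hand-picked list of candidate settings; the `candidates` values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, assumed for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hand-picked settings to try (hypothetical choices).
candidates = [
    {"n_estimators": 10, "max_depth": 3},
    {"n_estimators": 50, "max_depth": 3},
    {"n_estimators": 50, "max_depth": None},
]

results = []
for params in candidates:
    model = RandomForestClassifier(random_state=0, **params)
    # Mean 5-fold cross-validated score for this setting.
    score = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1_weighted").mean()
    results.append((score, params))
    print("%s -> %.3f" % (params, score))

# Keep the setting with the highest mean score.
best_score, best_params = max(results, key=lambda r: r[0])
print("Best so far:", best_params)
```

This is quick and builds intuition, but it only explores the settings you think to try and is hard to reproduce systematically.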

## Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
### Random Forest
# 'auto' (same as 'sqrt' for classifiers) was deprecated in scikit-learn 1.1
# and later removed, so None (use all features) is tried instead
parameters = {'criterion': ['entropy', 'gini'],
              'max_features': ['sqrt', 'log2', None],
              'n_estimators': list(range(10, 30))}

rf = GridSearchCV(RandomForestClassifier(), parameters, cv=5, scoring='f1_weighted')
%time rf.fit(X, Y)

print('The best parameters are criterion = %s, max_features = %s, n_estimators = %d' \
      % (rf.best_params_['criterion'], rf.best_params_['max_features'], rf.best_params_['n_estimators']))
print('The best model score is %3.3f' % rf.best_score_)

In [None]:
### Light GBM
# LightGBM has its own hyperparameters; the grid values below are illustrative
parameters = {'num_leaves': [15, 31, 63],
              'learning_rate': [0.01, 0.1],
              'n_estimators': list(range(10, 30))}

gbm = GridSearchCV(lgb.LGBMClassifier(), parameters, cv=5, scoring='f1_weighted')
%time gbm.fit(X, Y)

print('The best parameters are num_leaves = %d, learning_rate = %s, n_estimators = %d' \
      % (gbm.best_params_['num_leaves'], gbm.best_params_['learning_rate'], gbm.best_params_['n_estimators']))
print('The best model score is %3.3f' % gbm.best_score_)

## Random Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV
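
Random search samples a fixed number of settings from the search space instead of trying every combination. The sketch below reuses the random forest search space from the grid search section, assuming synthetic data so that it runs on its own; `X_demo`, `y_demo`, and the choice of `n_iter=10` are illustrative.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, assumed for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Lists are sampled uniformly; randint(10, 30) draws integers from 10 to 29.
param_distributions = {
    "criterion": ["entropy", "gini"],
    "max_features": ["sqrt", "log2", None],
    "n_estimators": randint(10, 30),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,           # number of sampled settings to evaluate
    cv=5,
    scoring="f1_weighted",
    random_state=0,      # makes the sampled settings reproducible
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best score: %.3f" % search.best_score_)
```

The equivalent grid has 2 x 3 x 20 = 120 combinations, while this run evaluates only 10, so the cost is set by `n_iter` rather than by the size of the search space. The tradeoff is that the best combination may never be sampled.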

## Bayesian Search
This section uses the [BayesianOptimization](https://github.com/fmfn/BayesianOptimization) package.

In [None]:
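from bayes_opt import BayesianOptimization

# Toy objective to maximize (a placeholder; in practice, return a model's CV score)
def black_box_function(x, y):
    return -x ** 2 - (y - 1) ** 2 + 1

pbounds = {'x': (2, 4), 'y': (-3, 3)}  # search bounds for each argument

optimizer = BayesianOptimization(
    f=black_box_function,
    pbounds=pbounds,
    random_state=1,
)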

In [None]:
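# 2 random exploration points, then 3 Bayesian optimization steps
optimizer.maximize(
    init_points=2,
    n_iter=3,
)
print(optimizer.max)  # best target value found and the parameters that produced it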