# Hyperparameter tuning in Python

#### Hyperparameters and Parameters
* New, complex algorithms have many hyperparameters
* It becomes increasingly important to learn how to efficiently find optimal combinations, as this search will likely take up a large portion of your time
* Often it is quite easy to simply run Scikit Learn functions on the default settings or perhaps code from a tutorial or book without really digging under the hood
* However, what lies underneath is of vital importance to good model building

### Parameters
* **Parameters:** 
    * Components of the final model that are learned through the modeling process
    * Crucially, **you do not set these manually.** In fact, you can't.
    * The algorithm will discover parameters for you (they are learned during the modeling process).
    
* To know what parameters an algorithm will produce, you need to:
    * 1. Know a bit about the algorithm itself and how it works.
    * 2. Consult the sklearn documentation to see where the parameter is stored in the returned object.
        * Parameters are found under the 'Attributes' section - *not* the 'Parameters' section!
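
To make the distinction concrete, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` and a `LogisticRegression` estimator, both chosen only for illustration. `get_params()` lists the hyperparameters you set, while the learned parameters only appear as trailing-underscore attributes after fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

log_reg = LogisticRegression(C=0.5, max_iter=500)

# Hyperparameters: set by us before fitting
hyperparams = log_reg.get_params()
print(hyperparams["C"])

# Parameters: they do not exist until the model has been fitted
has_coef_before = hasattr(log_reg, "coef_")
log_reg.fit(X, y)
has_coef_after = hasattr(log_reg, "coef_")

print(has_coef_before, has_coef_after)
print(log_reg.coef_.shape)  # one coefficient per feature
```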

* **Parameters in Random Forest:**
* Parameters are in the node decisions:
    * which feature to split on
    * what value to split at
    * **To view an individual tree of a Random Forest:**
    
```
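from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(max_depth=2)
rf_clf.fit(X_train, y_train)

# estimators_ is the list of fitted trees; take the one at index 7
chosen_tree = rf_clf.estimators_[7]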
```

#### Extracting Node Decisions
* For example, we can pull out details of the left, second-from-top node of the above isolated tree as follows:

```
# Get the column it split on
split_column = chosen_tree.tree_.feature[1]
split_column_name = X_train.columns[split_column]

# Get the level it split on
split_value = chosen_tree.tree_.threshold[1]
print("This node split on feature {}, at a value of {}".format(split_column_name, split_value))
```

* **Exercise:** Extract the coefficient parameter (found in the coef_ attribute), zip it up with the original column names, and see which variables had the largest positive effect on the target variable.

```
# Create a list of original variable names from the training DataFrame
original_variables = X_train.columns

# Extract the coefficients of the logistic regression estimator
model_coefficients = log_reg_clf.coef_[0]

# Create a dataframe of the variables and coefficients & print it out
coefficient_df = pd.DataFrame({"Variable" : original_variables, "Coefficient": model_coefficients})
print(coefficient_df)

# Print out the top 3 positive variables
top_three_df = coefficient_df.sort_values(by='Coefficient', axis=0, ascending=False)[0:3]
print(top_three_df)
```

```
# Extract the 7th (index 6) tree from the random forest
chosen_tree = rf_clf.estimators_[6]

# Visualize the graph using the provided image
imgplot = plt.imshow(tree_viz_image)
plt.show()

# Extract the parameters and level of the top (index 0) node
split_column = chosen_tree.tree_.feature[0]
split_column_name = X_train.columns[split_column]
split_value = chosen_tree.tree_.threshold[0]

# Print out the feature and level
print("This node split on feature {}, at a value of {}".format(split_column_name, split_value))
```

#### Hyperparameters Overview
* Hyperparameters are something that you set before the modeling process begins
* The algorithm does not learn the value of hyperparameters during the modeling process. This is the crucial difference between hyperparameters and parameters: whether you set the value, or whether the algorithm learns it and informs you.
* Some hyperparameters are more important than others
* Some hyperparameters will not help model performance; they relate to computational decisions or to what information is retained for analysis, for example:
* `n_jobs` : how many cores to use (speeds up computation)
* `random_state` : the seed for the random number generator (makes results reproducible)
* `verbose` : whether to print out information as the modeling occurs

* `rf_new_predictions = rf_clf_new.fit(X_train, y_train).predict(X_test)`


#### Hyperparameter Values
* Begin automating the work
* Some hyperparameters are likely better to start your tuning with than others
* *Which* values to try for hyperparameters?
    * This will be specific to the algorithm and to the hyperparameter itself
    * However, there do exist some best practice guidelines and tips
    
#### Top tips for deciding ranges of values to try for different hyperparameters:

* **What values NOT to set (as they may conflict):** 
    * some values of the hyperparameter `penalty` conflict with some values of the hyperparameter `solver`
    * Some conflicts will not result in an error, but may result in a model construction we had not anticipated: for example `ElasticNet` with the `normalize` hyperparameter
    * Close inspection of the sklearn documentation is important
    * Beware of setting "silly" values for different algorithms, for example:
        * Random forest with low number of trees
            * Would you consider it a forest if it had only 2 trees?
        * 1 Neighbor in KNN algorithm
            * Averaging the vote of 1 person doesn't sound very robust
        * Increasing a hyperparameter by a very small amount is unlikely to greatly improve a model 
            * One more tree in a forest, for example, isn't likely to have a large impact
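
As an illustration of a conflicting pair, the sketch below assumes a `LogisticRegression` on synthetic data (an assumption for demonstration only). The `lbfgs` solver does not support an L1 penalty, so sklearn rejects the combination when the model is fitted, while `liblinear` accepts it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# The 'lbfgs' solver does not support an L1 penalty
conflicting_model = LogisticRegression(penalty="l1", solver="lbfgs")

conflict_detected = False
try:
    conflicting_model.fit(X, y)
except ValueError as err:
    conflict_detected = True
    print("Conflict:", err)

# 'liblinear' does support L1, so this combination fits without error
compatible_model = LogisticRegression(penalty="l1", solver="liblinear")
compatible_model.fit(X, y)
print(conflict_detected)
```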
        
* **Try a for loop to iterate through options:**

```
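from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

neighbors_list = [3, 5, 10, 20, 50, 75]
accuracy_list = []  # one accuracy per value of n_neighbors
for test_number in neighbors_list:
    model = KNeighborsClassifier(n_neighbors=test_number)
    predictions = model.fit(X_train, y_train).predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    accuracy_list.append(accuracy)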
```
* Store the results in a DataFrame to view the effect of this hyperparameter on the accuracy of the model:

```
results_df = pd.DataFrame({'neighbors': neighbors_list, 'accuracy':accuracy_list})
print(results_df)
```
* Printing the DataFrame in this way shows that (it appears) adding more than 20 neighbors does not help.
* A common tool used to analyze the impact of a single hyperparameter on an end result is called a **learning curve.**

```
neighbors_list = list(range(5, 500, 5))
accuracy_list = []
for test_number in neighbors_list:
    model = KNeighborsClassifier(n_neighbors = test_number)
    predictions = model.fit(X_train, y_train).predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    accuracy_list.append(accuracy)
results_df = pd.DataFrame({'neighbors':neighbors_list, 'accuracy': accuracy_list})

plt.plot(results_df['neighbors'], results_df['accuracy'])
# Add the labels and title
plt.gca().set(xlabel='n_neighbors', ylabel='Accuracy', title='Accuracy for different n_neighbors')
plt.show()
```
* One thing to be aware of is that Python's `range` function only accepts integer steps, which matters for hyperparameters that take decimal values.
    * A handy trick uses NumPy's **`np.linspace(start, end, num)`**:
        * It creates `num` values evenly spread over the interval from `start` to `end` (both endpoints are included by default)
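
A minimal sketch of the difference, with illustrative start and end values: `range` raises a `TypeError` when given a decimal step, whereas `np.linspace` asks for the number of points rather than a step size.

```python
import numpy as np

# range() rejects a decimal step
try:
    list(range(0, 1, 0.1))
    range_failed = False
except TypeError:
    range_failed = True

# np.linspace takes the number of points instead of a step size
learn_rates = np.linspace(0.1, 1.0, num=10)
print(range_failed)
print(learn_rates)
```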

```
# Set the learning rates & accuracies list
learn_rates = np.linspace(0.01, 2, num=30)
accuracies = []

# Create the for loop
for learn_rate in learn_rates:
    # Create the model, predictions & save the accuracies as before
    model = GradientBoostingClassifier(learning_rate=learn_rate)
    predictions = model.fit(X_train, y_train).predict(X_test)
    accuracies.append(accuracy_score(y_test,predictions))

# Plot results    
plt.plot(learn_rates, accuracies)
plt.gca().set(xlabel='learning_rate', ylabel='Accuracy', title='Accuracy for different learning_rates')
plt.show()
```

### Grid Search
* To tune two or more hyperparameters at once, we can use nested for loops.
* First, define a function that builds and scores a model:

```
def gbm_grid_search(learn_rate, max_depth):
    model = GradientBoostingClassifier(
            learning_rate=learn_rate,
            max_depth=max_depth)
    predictions = model.fit(X_train, y_train).predict(X_test)
    return([learn_rate, max_depth, accuracy_score(y_test, predictions)])
```

* Now we can loop through our lists of hyperparameters and call our function:

```
results_list = []

for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate, max_depth))

results_df = pd.DataFrame(results_list, columns=['learning_rate', 'max_depth', 'accuracy'])
print(results_df)
```
* We have a nested loop so we can test all values of our first hyperparameter for all values of our second hyperparameter
* Importantly, the number of models to build is the **product** of the number of values tested for each hyperparameter, so it grows **exponentially, not linearly,** with the number of hyperparameters.
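
The growth in the number of models can be checked with a few lines of arithmetic. The sketch below uses hypothetical candidate values; `models_to_fit` is an illustrative helper, not part of sklearn.

```python
from math import prod

# Hypothetical candidate values for four hyperparameters
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.5],
    "max_depth": [2, 4, 6, 8],
    "subsample": [0.4, 0.6, 0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}

def models_to_fit(grid, cv_folds=1):
    """Number of model fits needed to try every combination in the grid."""
    return prod(len(values) for values in grid.values()) * cv_folds

print(models_to_fit(grid))               # 4 * 4 * 4 * 2 = 128
print(models_to_fit(grid, cv_folds=10))  # 1280 fits with 10-fold cross-validation
```

Adding a fifth hyperparameter with four candidate values would multiply both numbers by four again.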

```
def gbm_grid_search(learn_rate, max_depth, subsample, max_features):
    model = GradientBoostingClassifier(
            learning_rate=learn_rate,
            max_depth=max_depth,
            subsample=subsample,
            max_features=max_features)
    predictions = model.fit(X_train, y_train).predict(X_test)
    # Return every hyperparameter so each result lines up with the DataFrame columns below
    return([learn_rate, max_depth, subsample, max_features, accuracy_score(y_test, predictions)])

# Start from an empty list so earlier three-column results are not mixed in
results_list = []
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        for subsample in subsample_list:
            for max_features in max_features_list:
                results_list.append(gbm_grid_search(learn_rate, max_depth, subsample, max_features))
results_df = pd.DataFrame(results_list, columns=['learning_rate', 'max_depth', 'subsample', 'max_features', 'accuracy'])
print(results_df)
```

* Safe to say, we cannot keep nesting forever, as our code becomes complex and inefficient
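
One way to avoid ever-deeper nesting, sketched here with the standard library only, is to let `itertools.product` enumerate the combinations. The grid values and the helper name `all_combinations` are hypothetical.

```python
from itertools import product

# Hypothetical candidate values; add a key to tune another hyperparameter
grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 4, 6],
    "subsample": [0.5, 1.0],
}

def all_combinations(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

combinations = list(all_combinations(grid))
print(len(combinations))
print(combinations[0])
```

Because the dictionary keys match the estimator's argument names, each combination could be unpacked into a model, for example `GradientBoostingClassifier(**combination)`, inside a single loop.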

#### Grid Search
* **Grid Search has a number of advantages:**
    * It's programmatic
    * It saves many lines of code
    * It is guaranteed to find the best model within the grid you specify 
        * But, obviously, if you specify a poor grid with silly or conflicting values, you won't get a good score.
    * It is an easy methodology to explain compared to some other more complex ones.
* **Grid search also has a number of disadvantages:**
    * It is **very computationally expensive!**
    * It is **uninformed**. The results of one model don't inform the creation of the next model
    * (We will cover informed methods later)

### GridSearch with sklearn
Steps in a Grid Search:
* 1. Select an algorithm (sometimes called an "estimator") whose hyperparameters we will tune
* 2. Define which hyperparameters we will tune
* 3. Define a range of values for each hyperparameter
* 4. Set a cross-validation scheme
* 5. Define a score function so we can decide which square on our grid was "the best"
* 6. Include extra useful information or functions

#### GridSearchCV Object Inputs:
* **`estimator`** : our algorithm
* **`param_grid`** : sets which hyperparameters and values to test; must be a dictionary. Dictionary keys must be hyperparameter names, and the values a list of values to test.
* **`cv`** : choice of how to undertake cross-validation; you could specify different cross-validation types here, but simply providing an integer k will create a k-fold scheme (stratified when the estimator is a classifier)
* **`scoring`** : which scoring function is used to evaluate the model's performance; you can use your own custom metric, or one available from sklearn's `metrics` module. See all available metrics using: `sorted(metrics.get_scorer_names())` (older sklearn versions exposed the same list as `sorted(metrics.SCORERS.keys())`)
* **`refit`** : if set to `True`, the best hyperparameter combination is used to refit the model on the whole training set, so the GridSearchCV object can be used as an estimator directly. This is handy when you don't need to save the best hyperparameters and train another model yourself.
* **`n_jobs`** : assists with parallel execution; you can effectively 'split up' your work and have many models being created at the same time. This is possible because the results of one model do not affect the next one. Be careful using all your cores for modelling if you want to do other work, however. You can check how many cores you have available, which determines how many models you can run in parallel, with:
    * `import os`
    * `print(os.cpu_count())`
    
* **`return_train_score`** : logs statistics about the training runs that were undertaken. This can be useful for plotting and understanding test vs training set performance (and hence bias-variance tradeoff). While informative, this is computationally expensive and will not assist in finding the best model.
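
As a hedged illustration of the `scoring` input, the sketch below wraps a metric with `make_scorer` and passes it to a small grid search. The synthetic data, the decision tree estimator, and the choice of an F2 score are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Wrap a metric function so GridSearchCV can use it as a scorer
f2_scorer = make_scorer(fbeta_score, beta=2)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring=f2_scorer,
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```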

```
import os
print(os.cpu_count())  # printed 8 in the original notebook session
```


#### Building a GridSearchCV object:

```
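from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [2, 4, 6, 8], 'min_samples_leaf': [1, 2, 4, 6]}

# 'sqrt' is what max_features='auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
rf_class = RandomForestClassifier(criterion='entropy', max_features='sqrt')

grid_rf_class = GridSearchCV(estimator=rf_class,
                             param_grid=param_grid,
                             scoring='accuracy',
                             n_jobs=4,
                             cv=10,
                             refit=True,
                             return_train_score=True)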
```
* With `refit = True`, we can directly use the GridSearchCV object as an estimator. That means we can fit it to our data and make predictions, just like any other sklearn estimator!

```
grid_rf_class.fit(X_train, y_train)

grid_rf_class.predict(X_test)
```

```
# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

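# Create the parameter grid
# ('auto' was removed from max_features in scikit-learn 1.3; for classifiers it meant 'sqrt', so 'log2' is used here as the second option)
param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['sqrt', 'log2']}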

# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True, return_train_score=True)
print(grid_rf_class)
```
                            
    

#### Understanding a grid search output
* The properties of the GridSearchCV object can be categorized into three different groups:
* **A results log:**
    * `cv_results_`
* **The best results:**
    * `best_index_`
    * `best_params_`
    * `best_score_`
* **Extra information:**
    * `scorer_`
    * `n_splits_`
    * `refit_time_`

* Properties are accessed using the dot notation: `grid_search_object.property`.

#### The `.cv_results_` property
* A dictionary that we can read into a pandas DataFrame to explore:
    * `cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)`
    * The `time` columns refer to the time it took to fit and score the model.
    * The `param_` columns contain the value of each hyperparameter that was used in the model (one column per hyperparameter).
    * The `params` column contains a dictionary of all the hyperparameters from the preceding `param_` columns.
        * Remember that each row in this DataFrame is about one model
        * We need to use `pd.set_option` to ensure we don't truncate the results we are printing when using the `params` column: `pd.set_option("display.max_colwidth", None)` (older pandas versions used `-1` instead of `None`)
    * The `test_score` columns contain the score on the held-out fold for each of our cross-validation folds, as well as some summary statistics
    * `rank_test_score` ranks the models by `mean_test_score` from best to worst
    * Extracting the best row:
        * `best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]`
    * The `test_score` columns are then repeated as `train_score` columns
        * **Note:** If we had not set `return_train_score` to `True`, the training scores would not be included
        * There is also no ranking column for the training scores, as we only care about test set performance
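
The exploration described above might look like the following sketch. It assumes synthetic data and a small decision tree grid so that it runs quickly; the column names are the ones sklearn writes into `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    scoring="accuracy",
    cv=3,
    return_train_score=True,
)
grid.fit(X, y)

cv_results_df = pd.DataFrame(grid.cv_results_)

# One row per model: 3 * 2 = 6 combinations
print(cv_results_df.shape[0])

# Keep the most useful columns and sort from best to worst
summary = cv_results_df[
    ["params", "mean_test_score", "mean_train_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)

best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]
```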
#### The best grid square
* `best_params_`, the dictionary of parameters that gave the best score
* `best_score_`, the actual best score
* `best_index_`, the index of the row in `cv_results_` that was the best (a row whose `rank_test_score` is 1).
* The `best_estimator_` property is an estimator built using the best parameters from the grid search (available when `refit=True`)
    * Because it is an estimator, we can use it to predict on our test set

* We can also use the GridSearchCV object itself directly as an estimator
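
A short sketch of these properties in use, again on synthetic data with an illustrative decision tree grid. It also checks that predicting with the search object gives the same result as predicting with `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=3,
    refit=True,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)

# best_index_ points at the winning row of cv_results_
best_mean = grid.cv_results_["mean_test_score"][grid.best_index_]

# Predicting with the search object delegates to best_estimator_
same_predictions = np.array_equal(
    grid.predict(X_test), grid.best_estimator_.predict(X_test)
)
print(same_predictions)
```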

#### Extra information:
* These properties are rarely needed, but may be important if you construct your grid search differently
* `scorer_` : what scorer function was used on the held-out data
* `n_splits_` : how many cross-validation splits were used
* `refit_time_` : how many seconds it took to refit the best model on the whole dataset

## Random Search 

Why does Random Search work?
* Bergstra & Bengio (2012): "This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid."
* **Two main reasons:**
* 1. Not every hyperparameter is important.
* 2. A little trick of probability.

* With relatively few trials we can get close to a maximum score with a relatively high probability.
* A grid search may spend lots of time in a "bad area" as it covers everything exhaustively.
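
The "trick of probability" can be made concrete with a little arithmetic. As an assumption for illustration, suppose the "good" region is the best 5% of all combinations. Each random trial misses it with probability 0.95, so n independent trials all miss with probability 0.95^n. The function names below are hypothetical.

```python
from math import ceil, log

def prob_hit_top_region(n_trials, top_fraction=0.05):
    """Probability that at least one of n random trials lands in the top region."""
    return 1 - (1 - top_fraction) ** n_trials

def trials_needed(top_fraction=0.05, confidence=0.95):
    """Smallest number of random trials that reaches the requested probability."""
    return ceil(log(1 - confidence) / log(1 - top_fraction))

for n in (10, 30, 60):
    print(n, round(prob_hit_top_region(n), 3))

print(trials_needed())  # 59
```

Under these assumptions, about 59 random trials are enough for a 95% chance of landing in the top 5%, however large the grid is.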

* Some important notes about Random Search:
* **The maximum is still only as good as the grid you set!**
* Remember that to compare this fairly with grid search, you need the same modeling 'budget' (the same number of models built).

#### Random Search in sklearn
* Very similar to sklearn's GridSearchCV process:
    * 1. Decide on an algorithm/estimator
    * 2. Define which hyperparameters we will tune
    * 3. Define a range of values for each hyperparameter
    * 4. Set a cross-validation scheme
    * 5. Define a score function
    * 6. Include extra useful information or functions
    
* There is only one additional step when undertaking a random search:
    * **7. Decide how many samples to take (then sample)**

* Two key differences in the object's inputs:
    * **`n_iter`**: the number of samples for the random search to take from your grid.
    * **`param_distributions`**: slightly different from `param_grid`, as it optionally lets you set a distribution to sample from; if you just give a list, every value in it has an equal chance of being chosen (it is sampled uniformly).
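
To illustrate passing distributions rather than lists, the sketch below uses `scipy.stats` distributions with sklearn's `ParameterSampler`, which draws combinations without fitting any model. The ranges are assumptions chosen to mirror the learning rate and leaf-size example in these notes.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Distributions instead of lists (ranges chosen only for illustration)
param_distributions = {
    "learning_rate": uniform(loc=0.001, scale=1.999),  # continuous on [0.001, 2.0]
    "min_samples_leaf": randint(1, 51),                # integers 1..50
}

# ParameterSampler draws combinations the way RandomizedSearchCV does internally
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))

for sample in samples:
    print(sample)
```

The same `param_distributions` dictionary can be passed unchanged to `RandomizedSearchCV`.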
    
```
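import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

learn_rate_list = np.linspace(0.001, 2, 150)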
min_samples_leaf_list = list(range(1, 51))

parameter_grid = {
        'learning_rate' : learn_rate_list,
        'min_samples_leaf' : min_samples_leaf_list}

number_models = 10

random_GBM_class = RandomizedSearchCV(
                        estimator = GradientBoostingClassifier(),
                        param_distributions = parameter_grid,
                        n_iter = number_models,
                        scoring = 'accuracy',
                        n_jobs=4, 
                        cv=10,
                        refit=True,
                        return_train_score = True)
random_GBM_class.fit(X_train, y_train)
```
#### Analyze the output

```
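# Pull out the hyperparameter values that the random search actually sampled
rand_x = list(random_GBM_class.cv_results_['param_learning_rate'])
rand_y = list(random_GBM_class.cv_results_['param_min_samples_leaf'])

# Make sure we set the limits of X and Y appropriately
x_lims = [np.min(learn_rate_list), np.max(learn_rate_list)]
y_lims = [np.min(min_samples_leaf_list), np.max(min_samples_leaf_list)]

# Plot the sampled combinations (learning rate on x, min_samples_leaf on y)
plt.scatter(rand_x, rand_y, c='blue')
plt.gca().set(xlabel='learn_rate', ylabel='min_samples_leaf', title='Random Search Hyperparameters',
              xlim=x_lims, ylim=y_lims)
plt.show()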
```

```
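# Create the parameter grid
# ('auto' was removed from max_features in scikit-learn 1.3; for classifiers it meant 'sqrt', so 'log2' is used here as the second option)
param_grid = {'max_depth': list(range(5,26)), 'max_features': ['sqrt', 'log2']}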

# Create a random search object
random_rf_class = RandomizedSearchCV(
    estimator = RandomForestClassifier(n_estimators=80),
    param_distributions = param_grid, n_iter = 5,
    scoring='roc_auc', n_jobs=4, cv = 3, refit=True, return_train_score = True )

# Fit to the training data
random_rf_class.fit(X_train, y_train)

# Print the values used for both hyperparameters
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])
```
