# Algorithm Selection and Hyperparameter Tuning with `scikit-learn`

This chapter contains code examples for model selection and hyperparameter tuning with `scikit-learn`.



In [None]:
%matplotlib inline

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import pandas



In [31]:
seaborn.set_style("ticks")
plt.rcParams["axes.grid"] = True

## Model Selection Tools in `scikit-learn`

### Parameter Search

Even a relatively simple ML algorithm, such as the *decision tree algorithm*, has a large number of parameters that we can set before it sees the training data (its *hyperparameters*). All of these parameters can potentially influence the performance of the learned model. Which parameters to tweak is a matter of understanding the algorithm and understanding the data.

Recalling the section on **model complexity**, we conclude that the **depth of a decision tree** (i.e. the maximum number of steps from the root to a leaf) is an important parameter. The shallower the tree, the fewer criteria it can check before arriving at a prediction, which risks _underfitting_. Conversely, the deeper the tree, the higher the risk of _overfitting_.
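To make the effect of tree depth concrete, the following minimal sketch trains one tree per depth and compares accuracy on the training split with accuracy on a held-out split. It assumes that the iris data bundled with `scikit-learn` (`load_iris`) is an acceptable stand-in for the chapter's dataset; the variable names and the 70/30 split are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the chapter's dataset.
X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Train one tree per depth and record (training accuracy, held-out accuracy).
depth_scores = {}
for depth in range(1, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    depth_scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A very shallow tree scores poorly even on the training data, while training accuracy keeps rising with depth. Held-out accuracy is the number to watch when judging whether extra depth still helps.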



In [4]:
from sklearn.tree import DecisionTreeClassifier

In [5]:
DecisionTreeClassifier?

Init signature:
DecisionTreeClassifier(
    criterion='gini',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    min_impurity_split


There is only one way to really know the optimal depth: **experiment with different parameter values and measure performance**. Fortunately, `scikit-learn` has helpful tools that make this possible in a few lines of code:

- [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html): Tries out every combination of parameters from a given "grid" and evaluates each one using cross-validation.
- [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html): Randomly samples a fixed number of the possible parameter combinations, which is useful when the search space is too large to search exhaustively.
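The randomized variant is used much like the grid search. As a minimal sketch, the example below samples 10 of the 90 possible combinations of three decision tree parameters. The parameter ranges, `n_iter=10`, `cv=5`, and the use of `load_iris` as data are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the chapter's dataset.
X_demo, y_demo = load_iris(return_X_y=True)

random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    # 9 * 5 * 2 = 90 possible combinations (illustrative ranges)
    param_distributions={
        "max_depth": list(range(1, 10)),
        "min_samples_leaf": list(range(1, 6)),
        "criterion": ["gini", "entropy"],
    },
    n_iter=10,       # evaluate only 10 randomly sampled combinations
    cv=5,            # 5-fold cross-validation for each candidate
    random_state=0,  # make the sampling reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(len(random_search.cv_results_["params"]))  # number of candidates evaluated
```

The cost of the search is controlled by `n_iter` rather than by the size of the search space, which is the main reason to prefer it for large spaces.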

In [21]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [36]:
from sklearn.metrics import precision_score, make_scorer

In [49]:
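# Exhaustively try max_depth = 1, ..., 9 and score each value with cross-validation.
param_search = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid={
        "max_depth": range(1, 10)
    },
    # Micro-averaged precision counts true and false positives over all classes.
    scoring=make_scorer(precision_score, average="micro")
)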

In [50]:
import datascience101

In [51]:
data = datascience101.datasets.read_iris()

In [52]:
data.head()

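| Unnamed: 0 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |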


In [53]:
X, y = data[data.columns.difference(["species"])], data["species"]

In [54]:
param_search.fit(X, y)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': range(1, 10)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn',
       scoring=make_scorer(precision_score, average=micro), verbose=0)

The fitted search object will tell you the best parameters found in the experiments:

In [55]:
param_search.best_params_

{'max_depth': 4}
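Beyond the single best setting, a fitted search object also keeps the cross-validation scores of every candidate in its `cv_results_` attribute. The sketch below rebuilds a comparable search so that it runs on its own. It assumes `load_iris` data, the default scoring, and `cv=5`, so its numbers need not match the output shown above.

```python
import pandas
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the chapter's dataset.
X_demo, y_demo = load_iris(return_X_y=True)

demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 10))},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate; a DataFrame makes it readable.
results = pandas.DataFrame(demo_search.cv_results_)
summary = results[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(summary.sort_values("rank_test_score"))
print(demo_search.best_score_)  # mean cross-validated score of the best candidate
```

Looking at the full table helps judge whether the best depth is clearly better than its neighbours or only marginally so.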

And conveniently, the fitted search estimator is already able to make predictions using the best model found:

In [56]:
y_pred = param_search.predict(X)
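Predictions on the same data that was used for the search give an optimistic picture of model quality. A common alternative, sketched here under the assumption that `load_iris` data and a 70/30 split are acceptable stand-ins, is to run the search on a training split only and evaluate the refitted best model on data it has never seen.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the chapter's dataset.
X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

holdout_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 10))},
    cv=5,
)
# The search only ever sees the training split.
holdout_search.fit(X_train, y_train)

# predict() delegates to best_estimator_, the model refitted with the best parameters.
y_test_pred = holdout_search.predict(X_test)
test_precision = precision_score(y_test, y_test_pred, average="micro")
print(holdout_search.best_params_, test_precision)
```

Because `refit=True` is the default, `holdout_search.predict` and `holdout_search.best_estimator_.predict` return the same predictions.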

### Exercise: Algorithm Search

**Rather than tuning the parameters of one algorithm, we can also use the search tools to try out different types of algorithms. This can be done using a `Pipeline`: the name of a pipeline stage serves as a parameter name, and the candidate estimators are its values. Try it out!**

In [58]:
from sklearn.pipeline import Pipeline

In [60]:
# TODO: your code here
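One possible approach, offered as a sketch rather than a reference solution: build a one-step pipeline and list alternative estimators as values for the step name. The step name `"classifier"`, the three candidate algorithms, their parameter ranges, and the use of `load_iris` data are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the chapter's dataset.
X_demo, y_demo = load_iris(return_X_y=True)

# The initial estimator is only a placeholder; the search replaces it.
pipeline = Pipeline([("classifier", DecisionTreeClassifier())])

algorithm_search = GridSearchCV(
    estimator=pipeline,
    # A list of grids: each dict swaps in one algorithm and tunes its own parameters.
    param_grid=[
        {
            "classifier": [DecisionTreeClassifier(random_state=0)],
            "classifier__max_depth": [2, 4, 6],
        },
        {
            "classifier": [KNeighborsClassifier()],
            "classifier__n_neighbors": [3, 5, 7],
        },
        {
            "classifier": [LogisticRegression(max_iter=1000)],
        },
    ],
    cv=5,
)
algorithm_search.fit(X_demo, y_demo)

print(algorithm_search.best_params_)
print(type(algorithm_search.best_estimator_.named_steps["classifier"]).__name__)
```

Using a list of dictionaries keeps each algorithm's parameters separate, so the search never combines, say, `max_depth` with a k-nearest-neighbours classifier.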

---
_This notebook is licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). Copyright © 2018 [Point 8 GmbH](https://point-8.de)_

