# Scikit-learn

Scikit-learn provides higher-level parallelism through joblib.
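To give a feel for what joblib does under the hood, here is a minimal sketch of its `Parallel` / `delayed` pattern used directly, outside scikit-learn. The toy workload (square roots of small integers) and the choice of two workers are assumptions made purely for illustration.

```python
from math import sqrt

from joblib import Parallel, delayed

# Dispatch each sqrt call to a pool of 2 workers.
# The results come back in the same order as the inputs.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(8))
print(results)
```

Scikit-learn estimators wrap this same mechanism, so in practice you rarely need to call joblib yourself.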

## Use parallelism during training

Let's create a dataset.

In [21]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100000, n_features=10,
                          n_informative=4, n_redundant=0,
                          random_state=0, shuffle=False)

Then we instantiate a random forest classifier.

In [22]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)

Train the classifier

In [23]:
%%time
clf.fit(X, y)

CPU times: user 10.1 s, sys: 2.49 ms, total: 10.1 s
Wall time: 10.1 s


Print the score (the mean accuracy on the training data).

In [24]:
clf.score(X,y)

0.73151

Print the signature of `RandomForestClassifier`. You will notice a parameter named `n_jobs`.

In [25]:
?RandomForestClassifier

Init signature:
RandomForestClassifier(
    n_estimators=100,
    *,
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    ...

Let's retrain the classifier, this time using more CPU cores.

In [26]:
%%time

# TODO


In [27]:
%%time
# n_jobs=-1 asks joblib to use all available CPU cores
clf = RandomForestClassifier(max_depth=2, random_state=0, n_jobs=-1)
clf.fit(X, y)

CPU times: user 22.6 s, sys: 163 ms, total: 22.8 s
Wall time: 2.23 s


In [28]:
clf.score(X,y)

0.73151
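The score is the same as in the serial run, which suggests that `n_jobs` changes how the trees are built but not which trees are built. The sketch below checks this on a smaller dataset so that it runs quickly; the names `X_small`, `y_small`, `serial` and `parallel`, as well as the reduced sample and tree counts, are assumptions for illustration and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Smaller dataset than above so the comparison is fast
X_small, y_small = make_classification(n_samples=2000, n_features=10,
                                       n_informative=4, n_redundant=0,
                                       random_state=0, shuffle=False)

# Same hyperparameters and seed, only n_jobs differs
serial = RandomForestClassifier(n_estimators=20, max_depth=2,
                                random_state=0, n_jobs=1).fit(X_small, y_small)
parallel = RandomForestClassifier(n_estimators=20, max_depth=2,
                                  random_state=0, n_jobs=-1).fit(X_small, y_small)

# With a fixed random_state, both forests should end up with the same importances
same_importances = np.allclose(serial.feature_importances_,
                               parallel.feature_importances_)
print(same_importances)
```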

## Use parallelism during hyperparameter search

Hyperparameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. The `sklearn.model_selection` module provides several search strategies, along with two helper classes (`ParameterGrid` and `ParameterSampler`) that generate the candidate parameter sets:
 * `sklearn.model_selection.GridSearchCV`: Exhaustive search over specified parameter values for an estimator.
 * `sklearn.model_selection.HalvingGridSearchCV`: Search over specified parameter values with successive halving.
 * `sklearn.model_selection.ParameterGrid`: Grid of parameters with a discrete number of values for each.
 * `sklearn.model_selection.ParameterSampler`: Generator on parameters sampled from given distributions.
 * `sklearn.model_selection.RandomizedSearchCV`: Randomized search on hyperparameters.
 * `sklearn.model_selection.HalvingRandomSearchCV`: Randomized search on hyperparameters with successive halving.
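
As an illustration of one of the alternatives listed above, here is a minimal sketch of `RandomizedSearchCV`, which samples a fixed number of candidates instead of trying every combination. The small dataset, the parameter distributions and `n_iter=4` are assumptions chosen only to keep the example fast; they are not taken from the original notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small illustrative dataset (hypothetical sizes)
X_small, y_small = make_classification(n_samples=500, n_features=10,
                                       n_informative=4, n_redundant=0,
                                       random_state=0, shuffle=False)

param_distributions = {
    'n_estimators': randint(10, 30),  # integers drawn from [10, 30)
    'max_depth': [2, 4, 6],           # lists are sampled uniformly
}

# n_iter sets how many candidates are sampled; n_jobs=-1 uses all cores
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=4,
    random_state=0,
    n_jobs=-1,
)
random_search.fit(X_small, y_small)
print(random_search.best_params_)
```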

Try to perform a hyperparameter optimization with `GridSearchCV` and the following parameter grid:
```python
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, 6],
}
```
Look at the parameters of `GridSearchCV` and try to use more CPU cores.
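
Before running the search, it can help to see how many candidates the grid contains. A small sketch using `ParameterGrid` (the variable name `candidates` is just for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, 6],
}

# Expand the grid into the full list of parameter combinations
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 2 values x 3 values = 6 combinations
for params in candidates:
    print(params)
```

With the default 5-fold cross-validation of `GridSearchCV`, each candidate is fitted five times, so this grid amounts to 30 independent fits that can be spread across cores.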

In [29]:
from sklearn.model_selection import GridSearchCV

In [None]:
%%time
estimator = RandomForestClassifier(random_state=0)

# TODO

In [48]:
%%time
estimator = RandomForestClassifier(random_state=0)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, 6],
}

# n_jobs=-1 runs the cross-validation fits on all available CPU cores
grid_search = GridSearchCV(estimator, param_grid, verbose=2, n_jobs=-1)
grid_search.fit(X, y)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
CPU times: user 25.3 s, sys: 365 ms, total: 25.7 s
Wall time: 1min 40s


In [49]:
grid_search.best_estimator_

In [50]:
grid_search.best_score_

0.89659
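
To see which combination won and how the other candidates compared, the fitted search object also exposes `best_params_` and `cv_results_`. The sketch below runs on a smaller dataset and a reduced grid so that it finishes quickly; `X_small`, `y_small`, `search` and the grid values are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small illustrative dataset (hypothetical sizes)
X_small, y_small = make_classification(n_samples=1000, n_features=10,
                                       n_informative=4, n_redundant=0,
                                       random_state=0, shuffle=False)

# Reduced grid so the example runs in a few seconds
small_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      small_grid, n_jobs=-1)
search.fit(X_small, y_small)

# Winning combination, then the mean cross-validated score of every candidate
print(search.best_params_)
for params, mean_score in zip(search.cv_results_['params'],
                              search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 4))
```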