# Scikit-learn

Scikit-learn provides higher-level parallelism through joblib.
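
To give an idea of what joblib does underneath, here is a minimal sketch that runs a function on several inputs in parallel. The choice of `sqrt` and of two workers is only illustrative; it is not taken from scikit-learn's internals.

```python
from math import sqrt

from joblib import Parallel, delayed

# Apply sqrt to each input using 2 workers.
# Results are returned in the same order as the inputs.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(5))
print(results)
```

Scikit-learn estimators that accept an `n_jobs` argument dispatch their work in a similar way, so you rarely need to call joblib directly.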

## Use parallelism during training

Let's create a dataset.

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100000, n_features=10,
                          n_informative=4, n_redundant=0,
                          random_state=0, shuffle=False)

Then we instantiate a random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)

Train the classifier

In [None]:
%%time
clf.fit(X, y)

Print the score (the mean accuracy on the training data).

In [None]:
clf.score(X,y)

Display the signature and docstring of `RandomForestClassifier`. You will notice a parameter named `n_jobs`.

In [None]:
?RandomForestClassifier

Let's retrain the classifier, this time using more CPU cores.

In [None]:
%%time

# TODO

In [None]:
clf.score(X,y)
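
As a rough illustration of the effect of `n_jobs`, the sketch below fits the same forest with one worker and then with all available cores (`n_jobs=-1`). It uses a smaller dataset than the one above so that it runs quickly; the names `X_demo`, `y_demo` and `scores` are introduced only for this example, and the timings will depend on your machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Smaller dataset than in the notebook (assumption, to keep the demo fast)
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     n_informative=4, n_redundant=0,
                                     random_state=0, shuffle=False)

scores = {}
for n_jobs in (1, -1):  # 1 = single worker, -1 = all available cores
    clf_demo = RandomForestClassifier(max_depth=2, random_state=0,
                                      n_jobs=n_jobs)
    start = time.perf_counter()
    clf_demo.fit(X_demo, y_demo)
    elapsed = time.perf_counter() - start
    scores[n_jobs] = clf_demo.score(X_demo, y_demo)
    print(f"n_jobs={n_jobs}: fit in {elapsed:.2f} s, "
          f"score={scores[n_jobs]:.3f}")
```

On a dataset this small the parallel fit may not be faster, because starting the workers has a cost of its own. The benefit usually shows up with more samples or more trees.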

## Use parallelism during hyperparameter search

Hyperparameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. The `sklearn.model_selection` module provides several hyperparameter search tools:
 * `sklearn.model_selection.GridSearchCV`: Exhaustive search over specified parameter values for an estimator.
 * `sklearn.model_selection.HalvingGridSearchCV`: Search over specified parameter values with successive halving.
 * `sklearn.model_selection.ParameterGrid`: Grid of parameters with a discrete number of values for each.
 * `sklearn.model_selection.ParameterSampler`: Generator on parameters sampled from given distributions.
 * `sklearn.model_selection.RandomizedSearchCV`: Randomized search on hyperparameters.
 * `sklearn.model_selection.HalvingRandomSearchCV`: Randomized search on hyperparameters with successive halving.

The two `Halving*` classes are experimental: you need to run `from sklearn.experimental import enable_halving_search_cv` before importing them.

Try to perform a hyperparameter optimization with `GridSearchCV` and the following parameter grid:
```python
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, 6],
}
```
Look at the parameters of `GridSearchCV` and try to use more CPU cores.
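
To see why a grid search is a good candidate for parallelism, the sketch below uses `ParameterGrid` to enumerate the combinations in this grid and counts the number of model fits. The `n_folds = 5` value assumes the default `cv` of `GridSearchCV` in recent scikit-learn versions (0.22 and later).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, 6],
}

# Every combination of the listed values is one candidate to evaluate
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

n_folds = 5  # assumed: default number of cross-validation folds
n_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

Each of these fits is independent of the others, so they can be distributed over several cores.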

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
%%time
estimator = RandomForestClassifier(random_state=0)

# TODO

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.best_score_
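
For reference, here is one possible way to run such a search in parallel, sketched on a much smaller dataset so that it finishes quickly. The names `X_small`, `y_small` and `grid_search_demo`, the dataset size and `cv=3` are choices made for this example only; they are not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small dataset (assumption, to keep the demo fast)
X_small, y_small = make_classification(n_samples=1000, n_features=10,
                                       n_informative=4, n_redundant=0,
                                       random_state=0, shuffle=False)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, 6],
}

# n_jobs=-1 asks GridSearchCV to evaluate candidates on all available cores
grid_search_demo = GridSearchCV(RandomForestClassifier(random_state=0),
                                param_grid, cv=3, n_jobs=-1)
grid_search_demo.fit(X_small, y_small)

print(grid_search_demo.best_params_)
print(grid_search_demo.best_score_)
```

Here the parallelism is set on the search rather than on the forest, so each worker fits one candidate on one fold at a time. Setting `n_jobs` on both levels at once can oversubscribe the CPU.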