# scikit-learn

[scikit-learn](http://scikit-learn.org/stable/index.html) is a free software machine learning library for the Python programming language. The [Getting Started guide](https://scikit-learn.org/stable/getting_started.html) illustrates some of the main features that `scikit-learn` provides.

`scikit-learn` provides many built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its `fit` method.

Below is an example of fitting a Random Forest classifier to some data.

In [4]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)

# two samples and three features
X = [[1, 2, 3],
     [11, 12, 13]]

# class label of each sample
y = [0, 1]

clf.fit(X, y)

RandomForestClassifier(random_state=0)

The `fit` method generally accepts two inputs:

1. The samples matrix `X`, where the size of `X` is typically `(n_samples, n_features)`.
2. The target values `y` which are real numbers for regression or integers (or any other discrete set of values) for classification. `y` is usually a one-dimensional array where the `i`th entry corresponds to the target of the `i`th sample of `X`.

Both `X` and `y` are usually expected to be `NumPy` arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
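
As a rough illustration of these input conventions, the sketch below fits a classifier and a regressor on the same small NumPy samples matrix, then fits the classifier again on a SciPy sparse copy. The data and the variable names (`X_demo`, `y_class`, `y_reg`, and so on) are made up for this example, and `LogisticRegression` and `LinearRegression` are used only because they are simple estimators.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LinearRegression, LogisticRegression

# hypothetical samples matrix of shape (n_samples, n_features) = (4, 2)
X_demo = np.array([[0.0, 1.0],
                   [1.0, 0.0],
                   [2.0, 1.0],
                   [3.0, 0.0]])

# classification: one discrete target per sample
y_class = np.array([0, 0, 1, 1])
# regression: one real-valued target per sample
y_reg = np.array([0.5, 1.5, 2.5, 3.5])

dense_clf = LogisticRegression().fit(X_demo, y_class)
reg = LinearRegression().fit(X_demo, y_reg)

# this classifier also accepts a SciPy sparse matrix as X
sparse_clf = LogisticRegression().fit(csr_matrix(X_demo), y_class)

print(X_demo.shape, y_class.shape, y_reg.shape)
```

Not every estimator supports sparse input, so it is worth checking the documentation of the estimator in question before relying on it.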

Once the estimator is fitted, it can be used for predicting target values of new data.

In [9]:
clf.predict(
    [[4, 5, 6], [14, 15, 16]]
)

array([0, 1])

## Transformers and pre-processors

Machine learning workflows typically include a pre-processing step that transforms or imputes data. In `scikit-learn`, pre-processors and transformers follow the same API as the estimator objects (they all inherit from the same `BaseEstimator` class). Transformer objects don't have a `predict` method; instead, they have a `transform` method that outputs a newly transformed sample matrix.

In [10]:
from sklearn.preprocessing import StandardScaler

X = [[0, 15],
     [1, -10]]

# fit computes each feature's mean and standard deviation;
# transform then scales the data using those values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
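
To make the split between `fit` and `transform` more concrete, here is a small sketch that inspects what the scaler learned and then applies it to a sample that was not part of the fitting data. The names `X_fit`, `X_new`, `scaler`, and `same` are chosen for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[0, 15],
                  [1, -10]], dtype=float)

scaler = StandardScaler().fit(X_fit)

# statistics learned during fit, one value per feature
print(scaler.mean_)   # [0.5 2.5]
print(scaler.scale_)  # [ 0.5 12.5]

# transform reuses those statistics on data not seen during fit;
# this sample sits exactly at the feature means, so it maps to zeros
X_new = np.array([[0.5, 2.5]])
print(scaler.transform(X_new))

# fit_transform is shorthand for fit followed by transform on the same data
same = StandardScaler().fit_transform(X_fit)
```

Keeping the learned statistics on the fitted object is what lets the same scaling be applied later to test data.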

## Chaining pre-processors and estimators

Transformers and estimators (predictors) can be combined into a single object called a `Pipeline`. The pipeline offers the same API as a regular estimator: it can be fitted with `fit` and used for prediction with `predict`.

In the following example, the Iris dataset is split into training and test sets, a pipeline is fitted on the training set, and its accuracy score is computed on the test set.

In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

# we can now use it like any other estimator
# accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test, pipe.predict(X_test))

0.9736842105263158
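
One way to see what the pipeline does is to write the same two steps out by hand. The sketch below is an illustration only: it uses its own variable names (`Xi_train`, `manual_scaler`, `iris_pipe`, and so on) and compares the predictions of the manual version with those of an equivalent pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, random_state=0
)

# manual version: fit the scaler on the training data only
manual_scaler = StandardScaler().fit(Xi_train)
manual_clf = LogisticRegression().fit(manual_scaler.transform(Xi_train), yi_train)
# then reuse the fitted scaler on the test data before predicting
manual_pred = manual_clf.predict(manual_scaler.transform(Xi_test))

# pipeline version: the same two steps behind a single fit/predict
iris_pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_pred = iris_pipe.fit(Xi_train, yi_train).predict(Xi_test)

print(np.array_equal(manual_pred, pipe_pred))
```

Besides being shorter, the pipeline version makes it harder to fit the scaler on test data by accident.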

## Model evaluation

`scikit-learn` provides many tools for model evaluation, in particular for cross-validation. The example below shows how to perform 5-fold cross-validation using the `cross_validate` helper. Note that it is also possible to manually iterate over the folds, select a different data splitting strategy, and use custom scoring functions.

In [13]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

# default is 5-fold CV
result = cross_validate(lr, X, y)
# R^2 score on each test fold (the default metric for regressors)
result['test_score']

array([1., 1., 1., 1., 1.])
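
The three options mentioned above (manual iteration over folds, a different splitting strategy, and a custom scoring function) might look roughly like the following. This is a sketch under assumptions: the noisy synthetic data, the choice of `ShuffleSplit`, and mean absolute error as the custom metric are all picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, ShuffleSplit, cross_validate

X_cv, y_cv = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# 1. iterate over the folds by hand
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_cv):
    model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))  # R^2

# for a regressor, this should match the default 5-fold cross_validate call
default_scores = cross_validate(LinearRegression(), X_cv, y_cv)["test_score"]
print(np.allclose(fold_scores, default_scores))

# 2. a different splitting strategy and 3. a custom scoring function
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
cv_result = cross_validate(
    LinearRegression(),
    X_cv,
    y_cv,
    cv=ShuffleSplit(n_splits=3, test_size=0.25, random_state=0),
    scoring=mae_scorer,
)
print(cv_result["test_score"])  # negated MAE, so higher is still better
```

Passing `greater_is_better=False` makes the scorer return the negated error, which keeps the convention that larger scores are better.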

## Automatic parameter searches

All estimators have parameters that can be tuned, and the generalisation power of an estimator often depends critically on a few of them. Quite often, it is not clear what the exact values of these parameters should be, since they depend on the data. `scikit-learn` provides tools to automatically find the best parameter combinations (via cross-validation).

In the following example, we randomly search over the parameter space of a Random Forest with a `RandomizedSearchCV` object. A `RandomForestRegressor` has an `n_estimators` parameter that determines the number of trees in the forest and a `max_depth` parameter that determines the maximum depth of each tree. When the search is over, the `RandomizedSearchCV` behaves as a `RandomForestRegressor` that has been fitted with the best set of parameters.

In [15]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# define the parameter space that will be searched over
# (randint excludes its upper bound)
param_distributions = {
    'n_estimators': randint(1, 5),  # integers from 1 to 4
    'max_depth': randint(5, 10)     # integers from 5 to 9
}

# create a searchCV object and fit it to the data
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0
)

search.fit(X_train, y_train)

search.best_params_

{'max_depth': 9, 'n_estimators': 4}

The search object now acts like a normal random forest estimator fitted with the best parameters found:

In [16]:
search.score(X_test, y_test)

0.735363411343253
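
A fitted search object also exposes what it found. The sketch below looks at a few of those attributes. To stay fast and avoid a dataset download, it runs on a small synthetic regression problem instead of the California housing data. That substitution is an assumption of this example, as are the variable names.

```python
import math

from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# small synthetic data, used here only to keep the sketch quick and offline
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=0
)

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(1, 5),
        "max_depth": randint(5, 10),
    },
    n_iter=5,
    random_state=0,
)
demo_search.fit(Xs_train, ys_train)

# the forest refitted on the whole training set with the best parameters,
# and the mean cross-validated score those parameters achieved
best_forest = demo_search.best_estimator_
print(demo_search.best_params_, demo_search.best_score_)

# score on the search object delegates to the refitted best estimator
search_r2 = demo_search.score(Xs_test, ys_test)
forest_r2 = best_forest.score(Xs_test, ys_test)
print(math.isclose(search_r2, forest_r2))

# cv_results_ holds one entry per sampled parameter combination
n_candidates = len(demo_search.cv_results_["params"])
print(n_candidates)
```

With the default `scoring=None`, both `score` calls report the regressor's R² on the held-out data.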

## Next steps

See the [User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide) for details on all available tools.