## Intro

+ Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. 
+ It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.


## Fitting and predicting: estimator basics

+ Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. 
+ Each estimator can be fitted to some data using its fit method.
+ The fit method generally accepts 2 inputs:
    + The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features as columns.
    + The target values y, which are real numbers for regression tasks or integers for classification (or any other discrete set of values).
+ Once the estimator is fitted, it can be used to predict target values for new data.
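
A minimal sketch of the fit/predict cycle described above. The toy data are made up for illustration, and LogisticRegression is an arbitrary choice of estimator; any other classifier should be usable the same way.

```python
from sklearn.linear_model import LogisticRegression

# Made-up toy data: 2 samples (rows), 3 features (columns)
X = [[1, 2, 3],
     [11, 12, 13]]
y = [0, 1]  # one class label per sample

clf = LogisticRegression()
clf.fit(X, y)  # learn from the samples matrix X and the targets y

# Predict on the training samples, then on new, unseen samples
pred_train = clf.predict(X)
pred_new = clf.predict([[4, 5, 6], [14, 15, 16]])
print(pred_train, pred_new)
```

Note that X has shape (n_samples, n_features) = (2, 3) and y has one entry per row of X.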


## Transformers and pre-processors

+ A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.
+ The transformer objects don’t have a predict method but rather a transform method that outputs a newly transformed sample matrix X
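
As a small illustration of the transform method, here is a sketch using StandardScaler on made-up data. It rescales each column to zero mean and unit variance.

```python
from sklearn.preprocessing import StandardScaler

# Made-up data: 2 samples, 2 features on very different scales
X = [[0, 15],
     [1, -10]]

scaler = StandardScaler().fit(X)   # learn per-column mean and standard deviation
X_scaled = scaler.transform(X)     # returns a new, rescaled sample matrix

print(X_scaled)
print(X_scaled.mean(axis=0))       # each column is now centered on 0
```

The scaler exposes fit and transform, but no predict method.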

## Pipelines: chaining pre-processors and estimators

+ Transformers and estimators (predictors) can be combined together into a single unifying object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

In [9]:
# create a pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression())


In [10]:
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [11]:
# fit the whole pipeline
pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
# use it like any other estimator; accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, pipe.predict(X_test))

0.9736842105263158
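
To make clearer what the pipeline does internally, the following sketch compares it with the equivalent manual steps on the same iris split. The variable names manual_pred, pipe_pred and same are introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: fit the scaler on the training data only,
# then reuse the fitted scaler on the test data
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline version: the same two steps behind a single fit/predict API
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

same = bool((manual_pred == pipe_pred).all())
print(same)
```

Besides being shorter, the pipeline version keeps the scaler from ever being fitted on test data by accident.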

## Model evaluation
+ Here we briefly show how to run a 5-fold cross-validation procedure using the cross_validate helper.

In [12]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

result = cross_validate(lr, X, y) # defaults to a 5-fold CV
result['test_score'] # r_squared is high because the dataset is easy

array([1., 1., 1., 1., 1.])
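
The same call can be written with the number of folds and the metric spelled out. This is a sketch on the same synthetic data; the explicit cv=5 and scoring="r2" arguments are added here for illustration and are assumed to match the defaults used above for a regressor.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)

# Explicit 5-fold CV scored with R^2
result = cross_validate(LinearRegression(), X, y, cv=5, scoring="r2")

print(sorted(result.keys()))       # timing arrays plus the per-fold test scores
mean_r2 = result["test_score"].mean()
print(mean_r2)
```

The returned dictionary holds one array entry per fold for each key.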

## Automatic parameter searches

+ All estimators have parameters (often called hyper-parameters in the literature) that can be tuned
+ Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation)

In [13]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

In [14]:
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)




In [25]:
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1,5),
                      'max_depth': randint(5,10)}

In [26]:
# now create a search CV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                             n_iter=5,
                             param_distributions=param_distributions,
                             random_state=0)

In [28]:
search.fit(X_train, y_train)
RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': ...,
                                        'n_estimators': ...},
                   random_state=0)
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [29]:
search.score(X_test, y_test)

0.735363411343253
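
For experimenting without downloading the housing data, here is a sketch of the same search on a small synthetic dataset. The make_regression settings are assumptions chosen only to keep the run fast, so the scores will not match the numbers above.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic regression problem (illustrative settings, not from the notes)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same parameter space as above
param_distributions = {"n_estimators": randint(1, 5),
                       "max_depth": randint(5, 10)}

search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train, y_train)

print(search.best_params_)                 # best sampled combination
print(search.best_score_)                  # its mean cross-validated score
print(len(search.cv_results_["params"]))   # number of sampled candidates
test_r2 = search.score(X_test, y_test)     # the fitted search scores like an estimator
```

After fitting, the search object behaves like a regular estimator refitted with the best parameters found.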