## Introduction to sklearn
### University of Virginia
### Programming for Data Science
### Last Updated: November 8, 2021
---  

### PREREQUISITES
- variables
- data types
- pandas

### SOURCES 
- [sklearn Introduction](https://scikit-learn.org/stable/index.html)
- [R-Squared in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
- [sklearn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

### OBJECTIVES
- Introduce some basic functionality of the `sklearn` package
- Illustrate how to fit a regression model with `sklearn`
- Illustrate how to prepare data and fit a regression model with `sklearn` using Pipelines

### CONCEPTS
- `sklearn` interface
- Pipeline

---

## I. Introduction to sklearn 

The `scikit-learn` package, imported as `sklearn`, provides a broad set of tools for machine learning.  
It also includes a `Pipeline` object that simplifies building and persisting models.  
This notebook gives a very brief demo of `sklearn`. You are encouraged to explore further.

## II. Preprocess Data and Fit a Regression Model

Import the `sklearn` dataset and functionality

In [90]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score # R-squared
from sklearn.pipeline import Pipeline

import pandas as pd

Fetch dataset

In [91]:
housing = fetch_california_housing()

In [92]:
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------

Inspect what's in the object

In [93]:
housing.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

Look at the target variable:

In [94]:
housing.target

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [95]:
len(housing.target)

20640

`housing.data` contains the feature matrix (one row per observation, one column per feature)

In [96]:
housing.data

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

In [97]:
housing.data.shape

(20640, 8)

Split the data into 60% train and 40% test sets with the helper function `train_test_split`. Setting `random_state` makes the split reproducible:

In [98]:
x_train, x_test, y_train, y_test = train_test_split(housing.data, housing.target, train_size = 0.6, random_state=314)

Show the dataset sizes

In [99]:
len(x_train), len(x_test), len(y_train), len(y_test)

(12384, 8256, 12384, 8256)
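To make the behavior of the split more concrete, here is a minimal sketch on placeholder arrays with the same shape as the housing data. The arrays `x_demo` and `y_demo` are assumptions for illustration (the values are arbitrary). The sketch suggests that a fixed `random_state` reproduces the same split and that feature rows stay paired with their targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the housing data (values are arbitrary)
x_demo = np.arange(20640 * 8, dtype=float).reshape(20640, 8)
y_demo = np.arange(20640, dtype=float)

# Same arguments as the call above: 60% train, fixed seed
x_a, x_b, y_a, y_b = train_test_split(
    x_demo, y_demo, train_size=0.6, random_state=314
)
# Repeating the call with the same seed selects the same rows
x_a2, _, _, _ = train_test_split(
    x_demo, y_demo, train_size=0.6, random_state=314
)

split_sizes = (len(x_a), len(x_b))
same_split = np.array_equal(x_a, x_a2)
# By construction x_demo[i, 0] == 8 * y_demo[i], so this checks row alignment
rows_aligned = np.array_equal(x_a[:, 0], 8 * y_a)
print(split_sizes, same_split, rows_aligned)
```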

We will do the following:
- scale the x_train and x_test data using `StandardScaler`
- train a regression model on (x_train, y_train)
- predict on x_test
- evaluate performance against y_test
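The four steps above can be sketched end to end as follows. This is an illustration on synthetic data, not the housing data: `x_demo`, `y_demo`, and the other `*_demo` and `xd_*` names are assumptions introduced here. One point worth noting is that the scaler is fit on the training data only and then reused, via `transform`, on the test data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=10.0, scale=4.0, size=(500, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, train_size=0.6, random_state=314
)

# 1. Scale: learn means and standard deviations from the training data only
scaler_demo = StandardScaler()
xd_train_s = scaler_demo.fit_transform(xd_train)
xd_test_s = scaler_demo.transform(xd_test)  # transform, not fit_transform

# 2. Train a regression model on the scaled training data
reg_demo = LinearRegression().fit(xd_train_s, yd_train)

# 3. Predict on the scaled test data
yd_test_pred = reg_demo.predict(xd_test_s)

# 4. Evaluate against the held-out targets (order: y_true, y_pred)
test_r2 = r2_score(yd_test, yd_test_pred)
print(test_r2)
```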

---

### Instantiate StandardScaler object and fit it

In [100]:
scaler = StandardScaler()
x_train_s = scaler.fit_transform(x_train)

Look at the scaled training data. The column means are zero up to floating-point error and the column standard deviations are 1, so the data is now centered and scaled.

In [101]:
x_train_s

array([[-5.41768936e-01,  8.97274139e-01, -1.85640214e-02, ...,
        -4.28501066e-02, -2.20840635e-01,  5.77185876e-02],
       [-4.42312682e-01,  1.85379671e+00, -1.21232764e-01, ...,
         1.15088564e-02,  1.75535469e+00, -9.08412716e-01],
       [-1.21121241e+00, -9.36060779e-01, -7.45820655e-01, ...,
         1.34336061e-01,  7.71939953e-01, -4.52879304e-01],
       ...,
       [ 2.88703976e-01, -1.09548121e+00,  2.45801399e-01, ...,
         6.25970319e-02, -7.68743129e-01,  1.03886748e+00],
       [-4.89745665e-01,  2.04617870e-02, -3.46951864e-01, ...,
        -8.33657386e-02,  4.53500897e-01, -1.16371166e+00],
       [ 1.44241788e-01,  2.59592428e-01,  8.92070260e-04, ...,
        -6.70406776e-02, -1.40562124e+00,  1.25411953e+00]])

In [68]:
x_train_s.mean(axis=0)

array([-4.39298549e-15, -1.02774522e-16, -1.06658763e-14, -5.55920742e-15,
       -1.52942545e-17, -7.05635754e-16,  9.44804814e-15, -8.41492753e-14])

In [27]:
x_train_s.std(axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1.])
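As a check on what `StandardScaler` computes, the sketch below compares its output with a manual z-score: subtract each column's mean and divide by its standard deviation. The array `x_demo` is synthetic and stands in for `x_train`; it is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train (an assumption; not the housing data)
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

scaler_demo = StandardScaler()
x_demo_s = scaler_demo.fit_transform(x_demo)

# Manual z-score per column; np.std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses
x_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

matches_manual = np.allclose(x_demo_s, x_manual)
print(matches_manual)
```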

### Instantiate Linear Regression model and fit it

The order of the arguments matters: features first, then target, as in `fit(x, y)`.

In [102]:
reg = LinearRegression().fit(x_train_s, y_train)

Show the predictor coefficients (one per feature, in the order of `housing.feature_names`)

In [103]:
reg.coef_

array([ 0.81403101,  0.11883616, -0.260123  ,  0.31025271, -0.00178077,
       -0.04600269, -0.91689468, -0.88930004])

Show the intercept

In [104]:
reg.intercept_

2.060088804909497
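To see how the coefficients and intercept combine, the sketch below rebuilds the predictions of a fitted `LinearRegression` by hand as `X @ coef_ + intercept_`. The data are synthetic and the names `x_demo`, `y_demo`, and `reg_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(200, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

reg_demo = LinearRegression().fit(x_demo, y_demo)

# A linear model predicts intercept + sum of (coefficient * feature value)
manual_pred = x_demo @ reg_demo.coef_ + reg_demo.intercept_

same_as_predict = np.allclose(manual_pred, reg_demo.predict(x_demo))
print(same_as_predict)
```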

Predict on the training set (use `x_train_s` to predict `y_train`). Performance measured this way is optimistic, because the model has already seen these data during fitting.

In [105]:
y_train_pred = reg.predict(x_train_s)

Measure R-squared on the training set. Note that the argument order is `y_true`, then `y_pred`:

In [70]:
print(r2_score(y_train, y_train_pred))

0.6039404342150474
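The sketch below computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score`. It also suggests why argument order matters: swapping the two arrays changes the total sum of squares and therefore the result. The small arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Small made-up arrays (an assumption for illustration)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

r2_correct_order = r2_score(y_true, y_pred)
r2_swapped_order = r2_score(y_pred, y_true)     # wrong order, different value
print(r2_manual, r2_correct_order, r2_swapped_order)
```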


---

#### TRY FOR YOURSELF
Do the following:

- Import your own dataset
- Use `sklearn` to split the data, scale it, fit a model, and compute a measure of model fit.

---

## III. Set up a Pipeline to Prep Data and Fit a Regression Model

Using a pipeline in machine learning offers several benefits:

- It keeps all of the steps together (data processing and the algorithm). The benefit grows as complexity increases.
- The pipeline can be tuned during fitting (e.g., k-fold cross-validation can be included, and the best model returned).
- The fitted model can be saved, loaded, and reused.
- The model can be applied to new data for scoring (inference).
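The tuning, saving, loading, and scoring points can be illustrated with a short sketch. It is one possible approach, not the only one: it assumes `GridSearchCV` for the k-fold tuning and `joblib` for persistence, runs on synthetic data from `make_regression`, and uses a hypothetical file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
x_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, bias=500.0, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", LinearRegression()),
])

# Tune one parameter of the "reg" step with 5-fold cross-validation
search = GridSearchCV(
    demo_pipe, param_grid={"reg__fit_intercept": [True, False]}, cv=5
)
search.fit(x_demo, y_demo)
best_pipe = search.best_estimator_  # refit on all of x_demo with the best setting

# Save the fitted pipeline, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipe.joblib")  # hypothetical file name
    joblib.dump(best_pipe, model_path)
    loaded_pipe = joblib.load(model_path)

# Apply the loaded pipeline to rows standing in for new data
new_rows = x_demo[:3]
loaded_preds = loaded_pipe.predict(new_rows)
round_trip_ok = np.allclose(loaded_preds, best_pipe.predict(new_rows))
print(search.best_params_, round_trip_ok)
```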

Operationally, the `Pipeline` object takes a list of tuples, where each tuple has two elements: (1) the step's name and (2) the transformer or algorithm.

The step's name is arbitrarily set by the user.
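Although the names are arbitrary, they are how fitted steps are retrieved later and how a step's parameters are addressed. The sketch below uses the made-up names `my_scaler` and `my_model` on synthetic data; both the names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
x_demo = rng.normal(loc=3.0, scale=2.0, size=(150, 2))
y_demo = 4.0 * x_demo[:, 0] - x_demo[:, 1] + rng.normal(scale=0.1, size=150)

# "my_scaler" and "my_model" are arbitrary labels chosen for this sketch
named_pipe = Pipeline([
    ("my_scaler", StandardScaler()),
    ("my_model", LinearRegression()),
])
named_pipe.fit(x_demo, y_demo)

# Retrieve the fitted steps by name
fitted_scaler = named_pipe.named_steps["my_scaler"]
fitted_model = named_pipe.named_steps["my_model"]
print(fitted_scaler.mean_)  # per-column means learned from x_demo
print(fitted_model.coef_)   # coefficients on the scaled features

# Names also prefix parameters: "<step name>__<parameter>"
named_pipe.set_params(my_model__fit_intercept=False)
```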

The example below replicates the work from section II. It is expected to give identical results.

Note that a model can include many preprocessing steps (even custom Transformer steps), and the pipeline keeps things very tidy and manageable.
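As a hedged illustration of a pipeline with more than one preprocessing step, the sketch below chains a log transform, a scaler, and a regression. The use of `FunctionTransformer` with `np.log1p` is an assumption chosen for this example, as are the synthetic data and the step names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data whose target depends on the log of the features (an assumption)
rng = np.random.default_rng(3)
x_demo = rng.uniform(1.0, 100.0, size=(300, 2))
y_demo = (
    2.0 * np.log1p(x_demo[:, 0])
    - np.log1p(x_demo[:, 1])
    + rng.normal(scale=0.05, size=300)
)

multi_pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),  # simple custom transformation
    ("scaler", StandardScaler()),
    ("reg", LinearRegression()),
])

# One fit call runs every step in order on the raw features
multi_pipe.fit(x_demo, y_demo)
r2_multi = multi_pipe.score(x_demo, y_demo)
print(r2_multi)
```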

---

Define the pipeline:

In [106]:
pipe = Pipeline([
                 ('scaler', StandardScaler()), 
                 ('reg', LinearRegression())
                ])

Fit the model to the training data:

In [107]:
pipe.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('reg',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

Predict based on training data:

In [108]:
pipe.predict(x_train)

array([1.8550803 , 1.04750294, 0.81125812, ..., 1.82417837, 2.37188604,
       2.37648592])

Let's compare this to the predictions from the original regression model without the pipeline. They should match, and they do.

In [109]:
reg.predict(x_train_s)

array([1.8550803 , 1.04750294, 0.81125812, ..., 1.82417837, 2.37188604,
       2.37648592])

Measure the score, which for a regression model is the R-squared. Again, this matches the original regression model.

In [110]:
pipe.score(x_train, y_train)

0.6039404342150474
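The same `score` call can be pointed at held-out data. In the sketch below, which uses synthetic data and assumed names such as `demo_pipe` and `xd_test`, the pipeline applies the scaling learned from the training set to the test set before predicting, and the result agrees with an explicit `r2_score` call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(4)
x_demo = rng.normal(loc=-2.0, scale=5.0, size=(400, 3))
y_demo = x_demo @ np.array([0.5, 1.0, -1.5]) + rng.normal(scale=1.0, size=400)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, train_size=0.6, random_state=314
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", LinearRegression()),
])
demo_pipe.fit(xd_train, yd_train)

# score() scales xd_test with the training statistics, predicts,
# and returns R-squared
test_score = demo_pipe.score(xd_test, yd_test)

# The same quantity computed explicitly
test_r2 = r2_score(yd_test, demo_pipe.predict(xd_test))
print(test_score, test_r2)
```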

---  

#### TRY FOR YOURSELF
Do the following:

- Using the dataset you imported earlier, build a pipeline to scale the data, fit a model, and compute a measure of model fit.

---