# Linear regression with sklearn API

Setup:

1. Dataset: California housing
2. Linear regression API: `LinearRegression`
3. Training: `fit` (normal equation) and `cross_validate` (normal equation with cross-validation).
4. Evaluation: `score` ($R^2$ score) and `cross_val_score` with different scoring parameters.
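
As a rough illustration of what "normal equation" means in the list above, the sketch below solves $(X^T X)\,w = X^T y$ directly with NumPy and compares the result with `LinearRegression`. The data is synthetic and the names `X_aug`, `w_normal` and `w_sklearn` are made up for this example. scikit-learn solves the same least-squares problem, though not necessarily by forming $X^T X$ explicitly, so this shows the idea rather than the library's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the housing dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so that the first weight acts as the intercept w_0.
X_aug = np.column_stack([np.ones(len(X)), X])

# Normal equation: solve (X^T X) w = X^T y for w.
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# The same problem solved through the sklearn API.
model = LinearRegression().fit(X, y)
w_sklearn = np.concatenate([[model.intercept_], model.coef_])

print("normal equation:", w_normal)
print("sklearn        :", w_sklearn)
print("match:", np.allclose(w_normal, w_sklearn))
```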

We will also diagnose the model with `learning_curve` and learn how to examine the learned model, i.e. its weight vector.

In [9]:
#Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


In [3]:
np.random.seed(306)
plt.style.use('seaborn-v0_8')  # the 'seaborn' style name was renamed in Matplotlib 3.6

We will use `ShuffleSplit` cross-validation with:
* 10 random splits (`n_splits`), and
* 20% of the examples set aside as test examples (`test_size`) in each split.

In [5]:
shuffle_split_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
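
To make the behaviour of `ShuffleSplit` concrete, here is a small sketch on a toy array of 100 rows (the array and the names `X_toy`, `split_sizes` and `test_counts` are assumptions for this example only). Each split is drawn independently, so a sample can land in the test part of several splits, unlike in k-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 100 samples, 2 features (illustrative only).
X_toy = np.arange(200).reshape(100, 2)

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Sizes of the train and test parts for every split.
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X_toy)]
print(split_sizes)

# Count how often each sample is used for testing across the 10 splits.
test_counts = np.zeros(len(X_toy), dtype=int)
for _, test_idx in cv.split(X_toy):
    test_counts[test_idx] += 1

# 10 splits x 20 test samples = 200 test slots for only 100 samples,
# so some samples must appear in more than one test part.
print("max times a sample was in a test part:", test_counts.max())
```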

## **Step 1**: Load the dataset

In [7]:
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)

print("shape of feature matrix: ", features.shape)
print("shape of label matrix: ", labels.shape)

shape of feature matrix:  (20640, 8)
shape of label matrix:  (20640,)


Sanity check

In [8]:
assert(features.shape[0] == labels.shape[0])

## **Step 2**: Data exploration

## **Step 3**: Preprocessing and model building

### 3.1 Train test split

In [11]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, random_state=42)

print("# training samples: ", train_features.shape[0])
print("# test samples: ", test_features.shape[0])

# training samples:  15480
# test samples:  5160


Sanity checks

In [12]:
assert (train_features.shape[0] == train_labels.shape[0])
assert (test_features.shape[0] == test_labels.shape[0])

### 3.2 Pipeline: Preprocessing + Model

The `Pipeline` object we are going to use has two components:
1. `StandardScaler`
2. `LinearRegression`

In [15]:
lin_reg_pipeline = Pipeline([("feature_scaling", StandardScaler( )),
                            ("lin_reg", LinearRegression())])

lin_reg_pipeline.fit(train_features, train_labels)
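
The sketch below shows, on synthetic data, what the pipeline does on our behalf: the scaler is fitted on the training data only, and the same transformation is reused at prediction time. The variable names (`X_train`, `X_new`, `manual_model`) are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(1)
scales = [1.0, 10.0, 100.0]
X_train = rng.normal(loc=5.0, scale=scales, size=(300, 3))
y_train = X_train @ np.array([2.0, 0.3, -0.01]) + rng.normal(scale=0.5, size=300)
X_new = rng.normal(loc=5.0, scale=scales, size=(20, 3))

# Pipeline: scaling and regression are fitted together.
pipe = Pipeline([("feature_scaling", StandardScaler()),
                 ("lin_reg", LinearRegression())])
pipe.fit(X_train, y_train)

# Manual equivalent: fit the scaler on the training data, then reuse it.
scaler = StandardScaler().fit(X_train)
manual_model = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_new))

print("same predictions:", np.allclose(pipe.predict(X_new), manual_pred))
```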

Now that we have trained the model, let's check the learned (estimated) weight vector.

In [17]:
print("Intercept (w_0):", lin_reg_pipeline[-1].intercept_) # intercept through intercept_
print("Weight vector (w_1,....., w_m):", lin_reg_pipeline[-1].coef_) # rest of the weights through coef_

Intercept (w_0): 2.0703489205426377
Weight vector (w_1,....., w_m): [ 0.85210815  0.12065533 -0.30210555  0.34860575 -0.00164465 -0.04116356
 -0.89314697 -0.86784046]
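
A raw array of weights is hard to read, so one option is to pair each weight with its feature name. The sketch below does this on a synthetic `DataFrame` with hypothetical column names (`feat_a`, `feat_b`, `feat_c`); with the housing data, the column names of the training features could play the same role. Because the features are standardized, the magnitudes are roughly comparable across features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names (illustrative only).
rng = np.random.default_rng(2)
X_df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["feat_a", "feat_b", "feat_c"])
y = 3.0 * X_df["feat_a"] - 1.0 * X_df["feat_b"] + rng.normal(scale=0.1, size=500)

pipe = Pipeline([("feature_scaling", StandardScaler()),
                 ("lin_reg", LinearRegression())]).fit(X_df, y)

# Label each learned weight with the feature it belongs to.
weights = pd.Series(pipe[-1].coef_, index=X_df.columns)

# Order the weights by absolute value, largest first.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```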


## **Step 4**: Model Evaluation

### `score`

* $R^2$ score 

In [18]:
# Evaluate model performance on the test set.
test_score = lin_reg_pipeline.score(test_features, test_labels)
print("Model performance on test set: ", test_score)

train_score = lin_reg_pipeline.score(train_features, train_labels)
print("Model performance on train set: ", train_score)

Model performance on test set:  0.5910509795491352
Model performance on train set:  0.609873031052925


The `r2` score is only about 0.6 on both the training and the test set, and the two values are close to each other $=>$ the model is underfitting.
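
For reference, the $R^2$ score returned by `score` is $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below recomputes it by hand on synthetic data and compares it with `score`; the names `ss_res`, `ss_tot` and `r2_manual` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, fairly noisy data (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=2.0, size=400)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Residual sum of squares and total sum of squares.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("manual R^2:", r2_manual)
print("score()   :", model.score(X, y))
```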

### Cross validated score(`cross_val_score`)

* Calculates the chosen metric (here the negative mean squared error, via `scoring`) on different splits through cross-validation

In [20]:
lin_reg_score = cross_val_score(lin_reg_pipeline,
                                train_features,
                                train_labels,
                                scoring = 'neg_mean_squared_error',
                                cv = shuffle_split_cv)

# This will print 10 different scores, one for each split
print(lin_reg_score)

# We can take the mean and standard deviation of the score and report it.
print(f"\nScore of linear regression model on the test set: \n"
        f"{lin_reg_score.mean():.3f} +/- {lin_reg_score.std():.3f}")

[-0.50009976 -0.52183352 -0.55931218 -0.52110499 -0.56059203 -0.50510767
 -0.52386194 -0.54775518 -0.5007161  -0.54713448]

Score of linear regression model on the test set: 
-0.529 +/- 0.022
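
The scores above are negative because scikit-learn scorers follow a "greater is better" convention, so error metrics are reported with a flipped sign. A minimal sketch of converting them back to errors, using synthetic data from `make_regression` (an assumption for this example, not the housing data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=cv)

# Flip the sign to get the mean squared error, then take the root for RMSE.
mse = -neg_mse
rmse = np.sqrt(mse)

print(f"MSE : {mse.mean():.3f} +/- {mse.std():.3f}")
print(f"RMSE: {rmse.mean():.3f} +/- {rmse.std():.3f}")
```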


Other available `scoring` parameters:
* `explained_variance`
* `max_error`
* `neg_mean_absolute_error`
* `neg_root_mean_squared_error`
* `neg_mean_squared_log_error`
* `neg_median_absolute_error`
* `neg_mean_absolute_percentage_error`
* `r2`
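
As a sketch of how these names are used, the loop below passes a few of them to `cross_val_score` on synthetic data. Only three of the listed names are tried here, and the dictionary `summary` is a made-up helper for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

summary = {}
for scoring in ["r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"]:
    scores = cross_val_score(LinearRegression(), X, y, scoring=scoring, cv=cv)
    summary[scoring] = (scores.mean(), scores.std())
    print(f"{scoring}: {scores.mean():.3f} +/- {scores.std():.3f}")
```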

### Cross Validation

* `cross_validate` gives access to models trained in each fold along with some other statistics

In [21]:
lin_reg_cv_results = cross_validate(lin_reg_pipeline,
                                    train_features,
                                    train_labels,
                                    cv = shuffle_split_cv,
                                    scoring="neg_mean_squared_error",
                                    return_train_score=True,
                                    return_estimator=True)
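
The dictionary returned by `cross_validate` can be unpacked as in the following sketch, which uses synthetic data and made-up names (`results`, `train_error`, `test_error`, `coefs`). With `return_estimator=True`, the fitted pipeline from each split is available under the `estimator` key, so the learned weights can be compared across splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
pipe = Pipeline([("feature_scaling", StandardScaler()),
                 ("lin_reg", LinearRegression())])

results = cross_validate(pipe, X, y, cv=cv,
                         scoring="neg_mean_squared_error",
                         return_train_score=True,
                         return_estimator=True)
print(sorted(results.keys()))

# Scores are negated MSE values, so flip the sign to get errors.
train_error = -results["train_score"]
test_error = -results["test_score"]
print(f"train MSE: {train_error.mean():.3f} +/- {train_error.std():.3f}")
print(f"test MSE : {test_error.mean():.3f} +/- {test_error.std():.3f}")

# One row of weights per split, taken from the last pipeline step.
coefs = np.array([est[-1].coef_ for est in results["estimator"]])
print("weights per split:", coefs.shape)
```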

In [22]:
lin_reg_cv_results

{'fit_time': array([0.01699948, 0.00999665, 0.00800085, 0.00799918, 0.00933242,
        0.00699949, 0.00798869, 0.00769305, 0.00799775, 0.00883651]),
 'score_time': array([0.00199914, 0.00199866, 0.00200057, 0.00100088, 0.00300717,
        0.00199842, 0.00200725, 0.00143361, 0.00101113, 0.00200868]),
 'estimator': [Pipeline(steps=[('feature_scaling', StandardScaler()),
                  ('lin_reg', LinearRegression())]),
  Pipeline(steps=[('feature_scaling', StandardScaler()),
                  ('lin_reg', LinearRegression())]),
  Pipeline(steps=[('feature_scaling', StandardScaler()),
                  ('lin_reg', LinearRegression())]),
  Pipeline(steps=[('feature_scaling', StandardScaler()),
                  ('lin_reg', LinearRegression())]),
  Pipeline(steps=[('feature_scaling', StandardScaler()),
                  ('lin_reg', LinearRegression())]),
  Pipeline(steps=[('feature_scaling', StandardScaler()),
                  ('lin_reg', LinearRegression())]),
  Pipeline(steps=[('featu