# We will be looking at various Linear Models

## Linear Regression/Least Squares

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from sklearn.model_selection import train_test_split

In [4]:
from sklearn.linear_model import LinearRegression

X, y = mglearn.datasets.make_wave(n_samples=60)       # One-dimensional (single-feature) dataset from the mglearn module

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)      # Splitting the data
lr = LinearRegression().fit(X_train, y_train)                                   # Building the model

In [5]:
# Let us now check the slope and intercept coefficients
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))

lr.coef_: [0.39390555]
lr.intercept_: -0.031804343026759746


For higher-dimensional data, coef_ is a NumPy array with one entry per input feature.
The intercept is always a single real number.
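To make the shapes concrete, here is a minimal sketch that fits a least-squares model on a made-up three-feature dataset using plain NumPy. The data, true_w and true_b are illustrative assumptions; the recovered values only play the role of lr.coef_ and lr.intercept_.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))            # 50 samples, 3 features (synthetic)
true_w = np.array([1.5, -2.0, 0.5])     # made-up slopes, one per feature
true_b = 0.7                            # made-up intercept
y = X @ true_w + true_b                 # noiseless target for clarity

# Append a column of ones so the last fitted weight acts as the intercept.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

coef = solution[:-1]        # plays the role of lr.coef_, one value per feature
intercept = solution[-1]    # plays the role of lr.intercept_, a single number

print("coef shape:", coef.shape)
print("coef:", coef)
print("intercept:", intercept)
```

With three features the coefficient array has three entries, while the intercept stays a scalar.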

In [6]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.67
Test set score: 0.66


As we can see, the R^2 score is low on both the training set (0.67) and the test set (0.66), and the two values are close together, which suggests underfitting rather than overfitting. With a single feature, a linear model is very restricted, so it may simply be too simple for this data. For low-dimensional data like this, we should consider more flexible models.
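For regressors, score returns the coefficient of determination R^2. As a rough sketch of what that number means, it can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name r2_score_manual and the toy values are assumptions for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give the best possible score.
perfect = r2_score_manual(y_true, y_true)

# Always predicting the mean of y_true gives a score of zero.
baseline = r2_score_manual(y_true, np.full_like(y_true, y_true.mean()))

print("perfect:", perfect)
print("baseline:", baseline)
```

A score of 0.67 therefore means the model removes about two thirds of the squared error of the constant mean predictor.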

#### Let us see how the model performs on higher dimensional data.

Let’s take a look at how LinearRegression performs on a more complex dataset, the extended Boston Housing dataset.
Remember that this dataset has 506 samples and 104 features: the 13 original features plus 91 derived features (products of pairs of the original features).

In [11]:
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

In [12]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.95
Test set score: 0.61


The large gap between the training set score (0.95) and the test set score (0.61) is a clear sign that the linear model is overfitting on this higher-dimensional data.
So we should now consider a model that allows us to control complexity.

## Ridge Regression

##### Ridge Regression uses L2 regularization: it adds the squared L2 norm of w as a penalty term, which pushes w toward small magnitudes.

##### $\alpha \|w\|_2^2 + \sum_i (\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the prediction and $y_i$ is the actual value for sample $i$
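The shrinking effect of the L2 penalty can be sketched with the closed-form minimizer of the objective above, w = (XᵀX + alpha·I)⁻¹ Xᵀy. This is a simplified illustration on synthetic data: it has no intercept term, and ridge_closed_form is a hypothetical helper, not the solver scikit-learn uses.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize alpha * ||w||^2 + ||Xw - y||^2 (no intercept, for simplicity)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 10))                                   # synthetic features
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=30)    # synthetic target

alphas = [0.1, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]

for a, n in zip(alphas, norms):
    print("alpha = {:>6}: ||w|| = {:.3f}".format(a, n))
```

The printed norms fall as alpha grows, which is the "small magnitude w" effect described above.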

We will continue using the same Boston Housing dataset.

In [13]:
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.75


The training set score is lower than with Linear Regression (0.89 versus 0.95), suggesting that the model overfits less.
Also, the test set score has increased (0.75 versus 0.61), suggesting that the model generalizes better.

###### We can control how strongly the model regularizes through alpha, the coefficient of the L2 penalty term.

Let us try alpha = 10 and alpha = 0.1

In [14]:
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge10.score(X_test, y_test)))

Training set score: 0.79
Test set score: 0.64


In [15]:
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

Training set score: 0.93
Test set score: 0.77


We can see that alpha = 0.1 works best of the three values tried on this dataset (test score 0.77, versus 0.75 for alpha = 1 and 0.64 for alpha = 10).
The best alpha differs from dataset to dataset.
The higher the alpha, the stronger the regularization and the smaller the magnitude of w.
The default alpha is 1.0.
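Rather than trying values one cell at a time, the same comparison can be sketched as a loop over candidate alphas. The dataset below is synthetic and the list of alphas is an arbitrary assumption, so the scores will not match the Boston results above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 60))                          # synthetic: 120 samples, 60 features
y = X @ rng.normal(size=60) + rng.normal(scale=5.0, size=120)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]                  # candidate values, chosen arbitrarily
results = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    results.append((alpha,
                    model.score(X_train, y_train),
                    model.score(X_test, y_test)))

for alpha, train_r2, test_r2 in results:
    print("alpha = {:>6}: train R^2 = {:.2f}, test R^2 = {:.2f}".format(
        alpha, train_r2, test_r2))

# Pick the alpha with the highest held-out score among the candidates.
best_alpha = max(results, key=lambda r: r[2])[0]
print("best alpha:", best_alpha)
```

The training score can only go down as alpha increases, while the held-out score typically peaks somewhere in between.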

## LASSO

##### LASSO uses L1 regularization: it adds the L1 norm of w as a penalty term, which leads to a sparse w (some coefficients are exactly zero).

##### $\alpha \|w\|_1 + \sum_i (\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the prediction and $y_i$ is the actual value for sample $i$
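Why the L1 penalty produces exact zeros can be sketched on individual coefficients. In the simplified case where each coefficient can be treated independently, an L1 penalty acts like soft-thresholding, while an L2 penalty only rescales. The function name, the input values and the threshold are assumptions for illustration, and constant factors are ignored.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each value toward zero by `threshold`, clipping at exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

z = np.array([3.0, -0.5, 1.0, -2.5])    # made-up unpenalized coefficients

# L1-style shrinkage: small coefficients are set exactly to zero.
l1_shrunk = soft_threshold(z, 1.0)

# L2-style shrinkage: every coefficient is scaled down but stays nonzero.
l2_shrunk = z / (1.0 + 1.0)

print("L1-style:", l1_shrunk, "nonzero:", np.count_nonzero(l1_shrunk))
print("L2-style:", l2_shrunk, "nonzero:", np.count_nonzero(l2_shrunk))
```

Coefficients whose magnitude is at or below the threshold vanish under the L1-style rule, which is the source of the sparsity.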

We will continue using the same Boston Housing dataset with 104 features.

In [16]:
from sklearn.linear_model import Lasso
lasso = Lasso().fit(X_train, y_train)

print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

Training set score: 0.29
Test set score: 0.21
Number of features used: 4


Performance is very poor on both the training and the test set, which indicates underfitting: the model is too simple. Only 4 of the 104 features are used.

###### We can reduce alpha, the coefficient of the L1 penalty term, to weaken the regularization and reduce underfitting.

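When we reduce alpha, we should also increase max_iter, the maximum number of iterations the solver is allowed to run. Otherwise the fit may stop before it has converged, and scikit-learn will issue a warning.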
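The effect of max_iter can be inspected through the fitted model's n_iter_ attribute and by checking for a ConvergenceWarning. The sketch below uses a small synthetic dataset; fit_lasso and the *_demo names are hypothetical helpers, and whether a warning appears depends on the data.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(80, 40))                  # synthetic features
w_demo = np.zeros(40)
w_demo[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]            # only 5 informative features
y_demo = X_demo @ w_demo + rng.normal(scale=0.5, size=80)

def fit_lasso(alpha, max_iter):
    """Fit Lasso; return iterations used and whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = Lasso(alpha=alpha, max_iter=max_iter).fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model.n_iter_, warned

for max_iter in (5, 100000):
    n_iter, warned = fit_lasso(alpha=0.0001, max_iter=max_iter)
    print("max_iter = {:>6}: iterations used = {}, convergence warning = {}".format(
        max_iter, n_iter, warned))
```

If the number of iterations used equals max_iter, the solver most likely hit the limit rather than reaching its tolerance.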

In [17]:
# we increase the default setting of "max_iter",
# otherwise the model would warn us that we should increase max_iter.
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)

print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso001.coef_ != 0)))

Training set score: 0.90
Test set score: 0.77
Number of features used: 33


This is a significant improvement in performance. The test set score is comparable to Ridge regression (alpha = 0.1), while using only 33 of the features.

In [18]:
lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)

print("Training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso00001.coef_ != 0)))

Training set score: 0.95
Test set score: 0.64
Number of features used: 96


With an even lower alpha, the penalty term becomes negligible and the model behaves almost like Linear Regression: the training set score rises to 0.95 while the test set score drops to 0.64, so the model is overfitting.

##### In practice, Ridge (L2) is usually the first choice between the two. However, if you have a large number of features and expect only a few of them to be important, you should consider using Lasso.
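The "few important features" case can be sketched on synthetic data where only 5 of 50 features actually influence the target. The dataset, the alpha values and the variable names are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(42)
n_samples, n_features = 100, 50
X_sparse = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:5] = [4.0, -3.0, 2.5, -2.0, 1.5]            # only the first 5 features matter
y_sparse = X_sparse @ true_w + rng.normal(scale=0.5, size=n_samples)

ridge_demo = Ridge(alpha=1.0).fit(X_sparse, y_sparse)
lasso_demo = Lasso(alpha=0.1, max_iter=100000).fit(X_sparse, y_sparse)

# Count the coefficients that are not exactly zero, as in the cells above.
ridge_nonzero = int(np.sum(ridge_demo.coef_ != 0))
lasso_nonzero = int(np.sum(lasso_demo.coef_ != 0))

print("Ridge features used: {}".format(ridge_nonzero))
print("Lasso features used: {}".format(lasso_nonzero))
```

Ridge keeps every coefficient nonzero, whereas Lasso drops most of the uninformative features, which also makes the resulting model easier to interpret.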