<a href="https://colab.research.google.com/github/franciscosalido/AIML/blob/master/Ridge_and_Lasso_Regularization_Methods_in_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regularization Method (Shrinkage Method)**

A high-dimensional dataset with many variables and parameters exposes us to the **Curse of Dimensionality**: the number of observations is not sufficient to cover all possible combinations of the independent variables.

We can resort to dimensionality reduction techniques, such as transforming the data with PCA and discarding the principal components with the smallest eigenvalues. This can be a laborious process before we find the right number of important components.

Instead, we can employ regularization (shrinkage) methods.

Regularisation helps us to deal with the problem of overfitting by reducing the weight given to a particular feature *x*. This allows us to retain more features while not giving undue weight to one in particular.

 Regularisation is mediated by a parameter λ, as can be seen in the cost function:


$$J(\theta)=\frac{1}{2m}\Big(\sum^{m}_{i=1}\big(h_{\theta}(x^{(i)})-y^{(i)}\big)^2\Big)+\frac{\lambda}{2m}\Big(\sum^{n}_{j=1}\theta^2_j\Big)$$

The first term is essentially the mean-squared-error term, whilst the additive term multiplies the sum of the square of the parameters (θ) by λ over 2m, where m is the number of training examples. 

 **The Penalty or Lambda Factor (λ) is the tuning parameter that decides how much we want to penalize the flexibility of our model.**

Since the objective is to minimise J(θ), i.e. to find the θ that attains the minimum of J(θ), a large λ forces the values of θj to be small in order to achieve a minimum value.

The flexibility of a model grows with the magnitude of its coefficients, so minimizing the function above requires these coefficients to stay small. This is how the Ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, but not the intercept (written θ0 in the cost function above and β0 below). The intercept is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
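To make the effect of λ concrete, here is a minimal NumPy sketch of the cost function above and of its closed-form minimizer. The synthetic data, the helper names `ridge_cost` and `ridge_fit`, and the list of λ values are assumptions made for illustration; they are not part of the notebook's dataset.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) from the formula above; theta[0] is the intercept and is not penalized."""
    m = len(y)
    residuals = X @ theta - y
    mse_term = (residuals @ residuals) / (2 * m)
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return mse_term + penalty

def ridge_fit(X, y, lam):
    """Closed-form minimizer of J: solve (X^T X + lam * D) theta = X^T y."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# Synthetic data (assumption): 50 rows, an intercept column plus 3 predictors
rng = np.random.default_rng(0)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
y = X @ np.array([2.0, 3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=m)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
slope_norms = []
for lam in lambdas:
    theta = ridge_fit(X, y, lam)
    slope_norms.append(float(np.linalg.norm(theta[1:])))
    print(f"lambda={lam:7.1f}  |slopes|={slope_norms[-1]:.4f}  J={ridge_cost(theta, X, y, lam):.4f}")
```

The norm of the slope coefficients should shrink as λ grows, while the intercept is free to move because it is excluded from the penalty.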

## Regression:

This is a form of regression that constrains (regularizes, or shrinks) the coefficient estimates towards zero. In other words, **this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.**

A simple linear regression relation looks like this. Here Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X).

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

The fitting procedure involves a loss function, known as residual sum of squares or RSS. The coefficients are chosen, such that they minimize this loss function.

![alt text](https://miro.medium.com/max/908/1*DY3-IaGcHjjLg7oYXx1O3A.png)
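As a rough illustration of the least-squares fit, the sketch below computes the RSS for a small synthetic problem and checks that moving away from the fitted coefficients increases it. The data and the helper name `rss` are assumptions for this example only.

```python
import numpy as np

# Synthetic data (assumption): y depends linearly on two predictors plus noise
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])  # column of ones carries the intercept beta_0

def rss(beta, A, y):
    """Residual sum of squares for coefficient vector beta."""
    residuals = y - A @ beta
    return float(residuals @ residuals)

# Least squares picks the beta that minimizes the RSS
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
rss_hat = rss(beta_hat, A, y)

# Nudging one coefficient away from the solution can only raise the RSS
rss_perturbed = rss(beta_hat + np.array([0.0, 0.1, 0.0]), A, y)
print(beta_hat, rss_hat, rss_perturbed)
```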

Now, this will adjust the coefficients based on your training data. If there is noise in the training data, then the estimated coefficients won’t generalize well to the future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.


### 1. Ridge Regression (Tikhonov Regularization)

Ridge regression is similar to linear regression in that the objective is to find the best-fit surface. The difference is in the way the best coefficients are found. Unlike linear regression, where the optimization function is the SSE (the RSS above), Ridge regression uses a slightly different one.

![alt text](https://miro.medium.com/max/1106/1*CiqZ8lhwxi5c4d1nV24w4g.png)

The image above shows ridge regression, where the RSS is modified by adding the shrinkage quantity. The coefficients are now estimated by minimizing this function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The flexibility of a model grows with the magnitude of its coefficients, so minimizing the function above requires these coefficients to stay small. This is how the Ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, but not the intercept β0. The intercept is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.

When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares. However, as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. As can be seen, selecting a good value of λ is critical. Cross validation comes in handy for this purpose. The penalty used by this method is the squared L2 norm of the coefficients, which is why ridge regression is also known as L2 regularization.

The coefficients that are produced by the standard least squares method are scale equivariant, i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c. Therefore, regardless of how the predictor is scaled, the product of predictor and coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and therefore we need to standardize the predictors, or bring them to the same scale, before performing ridge regression. The formula used to do this is given below.

![alt text](https://miro.medium.com/max/562/1*6KRAdbf-CApFPR7gASZaSA.png)
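The scale sensitivity of ridge regression can be checked numerically. The sketch below is a NumPy illustration under stated assumptions: synthetic data, a hand-written closed-form ridge on centered data (the helpers `ridge_slopes`, `predict` and `standardize` are hypothetical names), and standardization that also centers each column and divides by the population standard deviation.

```python
import numpy as np

def ridge_slopes(X, y, lam):
    """Ridge slopes on centered data; centering handles the unpenalized intercept."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

def predict(X, y, lam, X_new):
    """Fit on (X, y) and predict for X_new, expressed in the same units as X."""
    beta = ridge_slopes(X, y, lam)
    return y.mean() + (X_new - X.mean(axis=0)) @ beta

def standardize(X):
    """Center each column and divide by its population standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=60)
X_rescaled = X * np.array([1000.0, 1.0])  # change the units of the first predictor only

# lam = 0 is least squares: predictions do not depend on the units
ols_gap = np.max(np.abs(predict(X, y, 0.0, X) - predict(X_rescaled, y, 0.0, X_rescaled)))
# lam > 0: the penalty now treats the two columns differently
ridge_gap = np.max(np.abs(predict(X, y, 50.0, X) - predict(X_rescaled, y, 50.0, X_rescaled)))
# Standardizing first removes the dependence on units
Z, Z_rescaled = standardize(X), standardize(X_rescaled)
std_gap = np.max(np.abs(predict(Z, y, 50.0, Z) - predict(Z_rescaled, y, 50.0, Z_rescaled)))
# Very large lam pushes the slopes towards zero
tiny_slopes = ridge_slopes(X, y, 1e9)

print(ols_gap, ridge_gap, std_gap, tiny_slopes)
```

The least-squares predictions should agree up to rounding, the ridge predictions should differ noticeably between the two unit choices, and after standardization the difference should vanish again.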


### 2. Lasso Regression

![alt text](https://miro.medium.com/max/1094/1*tHJ4sSPYV0bDr8xxEdiwXA.png)

Lasso is another variation, in which the above function is minimized. It is clear that this variation differs from ridge regression only in how it penalizes large coefficients: it uses |βj| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.

Let's take a look at the above methods from a different perspective. Ridge regression can be thought of as solving an equation where the sum of squares of the coefficients is less than or equal to s, and the Lasso as solving one where the sum of the moduli of the coefficients is less than or equal to s. Here, s is a constant that exists for each value of the shrinkage factor λ. These equations are also referred to as constraint functions.

Consider a problem with 2 parameters. According to the above formulation, ridge regression is expressed by β1² + β2² ≤ s. This implies that the ridge regression coefficients are the point with the smallest RSS (loss function) among all points that lie within the circle given by β1² + β2² ≤ s.

Similarly, for the lasso, the equation becomes |β1| + |β2| ≤ s. This implies that the lasso coefficients are the point with the smallest RSS (loss function) among all points that lie within the diamond given by |β1| + |β2| ≤ s.
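The link between the penalty strength and the budget s can be illustrated on a two-parameter problem. The sketch below fits scikit-learn's `Ridge` and `Lasso` on synthetic data for a few penalty values and records the resulting s for each; the data and the alpha grids are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic two-predictor problem (assumption): one strong and one weak effect
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge_alphas = [1.0, 10.0, 100.0, 1000.0]
lasso_alphas = [0.01, 0.1, 0.5, 1.0]

# For ridge, s is the sum of squared coefficients at each penalty value
ridge_models = [Ridge(alpha=a).fit(X, y) for a in ridge_alphas]
ridge_s = [float(np.sum(m.coef_ ** 2)) for m in ridge_models]

# For lasso, s is the sum of absolute coefficients at each penalty value
lasso_models = [Lasso(alpha=a).fit(X, y) for a in lasso_alphas]
lasso_s = [float(np.sum(np.abs(m.coef_))) for m in lasso_models]

for a, s, m in zip(ridge_alphas, ridge_s, ridge_models):
    print(f"ridge alpha={a:7.2f}  s={s:.4f}  coef={m.coef_}")
for a, s, m in zip(lasso_alphas, lasso_s, lasso_models):
    print(f"lasso alpha={a:7.2f}  s={s:.4f}  coef={m.coef_}")
```

A larger penalty corresponds to a smaller budget s in both cases. With the largest lasso penalty in this sketch, the weak coefficient is expected to land exactly on zero (a corner of the diamond), whereas the ridge coefficients shrink but stay non-zero.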

## Import Libraries

In [4]:
import io
import numpy as np # Import the numerical linear algebra library
import pandas as pd # Import the data processing library
import matplotlib.pyplot as plt # Import the pyplot interface from the Matplotlib library
import seaborn as sns # Import a statistical data visualization library

# Render matplotlib plots inline in the notebook
%matplotlib inline 


In [0]:
# Import some Preprocessing Libraries
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

# Import Linear Regression Machine Learning Library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso


## Mount Google Drive and Load Data


In [8]:
# Mount the google drive and set the path to the datasets
from google.colab import drive
drive.mount(r'/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
root_path = r'/content/drive/My Drive/AIML/'  # path to the project folder
!ls /content/drive/My\ Drive/AIML/. # list the contents of the project folder

In [0]:
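# Load the car mpg dataset from the mounted drive
mpg_df = pd.read_csv(r'/content/drive/My Drive/AIML/car-mpg.csv')
# Drop the car name, which is a label rather than a predictor
mpg_df = mpg_df.drop('car_name', axis= 1)
# Map the numeric origin codes to region names, then one-hot encode them
mpg_df['origin'] = mpg_df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
mpg_df = pd.get_dummies(mpg_df, columns= ['origin'])
# Treat '?' as missing and fill missing values with the column median
mpg_df = mpg_df.replace('?', np.nan)
mpg_df = mpg_df.apply(lambda x: x.fillna(x.median()), axis= 0)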

In [34]:
mpg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   mpg             398 non-null    float64
 1   cyl             398 non-null    int64  
 2   disp            398 non-null    float64
 3   hp              398 non-null    object 
 4   wt              398 non-null    int64  
 5   acc             398 non-null    float64
 6   yr              398 non-null    int64  
 7   car_type        398 non-null    int64  
 8   origin_america  398 non-null    uint8  
 9   origin_asia     398 non-null    uint8  
 10  origin_europe   398 non-null    uint8  
dtypes: float64(3), int64(4), object(1), uint8(3)
memory usage: 26.2+ KB
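Note that `hp` is still listed with dtype `object` in the output above, because the column was read as strings. A possible alternative, shown here only as a sketch on a tiny made-up frame (the `toy` dataframe and its values are assumptions, not rows from `car-mpg.csv`), is to convert the column with `pd.to_numeric` so that `'?'` becomes `NaN` and the column ends up numeric before the median is taken.

```python
import pandas as pd

# Tiny made-up frame (assumption) that mimics an 'hp' column containing '?'
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 25.0],
    "hp": ["130", "?", "70", "95"],
})

# errors='coerce' turns anything non-numeric (here '?') into NaN
toy["hp"] = pd.to_numeric(toy["hp"], errors="coerce")

# Fill the gap with the median of the remaining values (70, 95, 130)
toy["hp"] = toy["hp"].fillna(toy["hp"].median())

print(toy.dtypes)
print(toy)
```

Applied to the notebook's data, this would change the dtype reported for `hp`, so the `info()` output above would no longer match exactly.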


### Separate Independent and Dependent Variables

In [0]:
# Copy all the predictor variables into the 'df' dataframe. Since 'mpg' is the dependent variable, we drop it
df = mpg_df.drop('mpg', axis= 1)
y = mpg_df['mpg']

# Scale all the columns of df. This produces a numpy array, which we wrap back into a dataframe
df_scaled = scale(df)
df_scaled = pd.DataFrame(df_scaled, columns= df.columns) # ideally the training and test sets should be scaled separately

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, test_size= 0.30, random_state= 1)
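In the cells above, `scale` standardizes all rows before the split, so the test rows contribute to the means and standard deviations. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that pattern with `StandardScaler` on synthetic data; the data and the variable names (`X_tr`, `X_te`, and so on) are assumptions and do not replace the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors (assumption) on an arbitrary, non-standardized scale
rng = np.random.default_rng(4)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=100)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

scaler = StandardScaler().fit(X_tr)   # means and stds come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training columns end up with mean 0 and unit standard deviation, while the test columns are only approximately standardized, which is the intended behaviour.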

### Fit a Simple Linear Model

In [43]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

for idx, col_name in enumerate(X_train.columns):
  print('The coefficient for {} is {}'.format(col_name, regression_model.coef_[idx]))

The coefficient for cyl is 2.5059518049385026
The coefficient for disp is 2.5357082860560514
The coefficient for hp is -1.7889335736325294
The coefficient for wt is -5.551819873098727
The coefficient for acc is 0.11485734803440747
The coefficient for yr is 2.9318465482116087
The coefficient for car_type is 2.977869737601945
The coefficient for origin_america is -0.583295529016598
The coefficient for origin_asia is 0.34749313804322646
The coefficient for origin_europe is 0.3774164680868858


### Create a Regularized RIDGE Model

In [60]:
ridge = Ridge(alpha= 0.3) # alpha = lambda or penalty factor
ridge.fit(X_train, y_train)
print('Ridge Model Coefficients are {}'.format(ridge.coef_))

Ridge Model Coefficients are [ 2.47057467  2.44494419 -1.78573889 -5.47285499  0.10115618  2.92319984
  2.94492098 -0.57949986  0.34667456  0.37344909]


### Create a regularized LASSO Model

In [61]:
lasso = Lasso(alpha= 0.1) # alpha = lambda or penalty factor
lasso.fit(X_train, y_train)
print('Lasso Model Coefficients are {}'.format(lasso.coef_))

Lasso Model Coefficients are [ 1.10693517  0.         -0.71587138 -4.2127655  -0.          2.73245903
  1.66333749 -0.63587683  0.          0.        ]


##### Observe that several of the coefficients have become 0, which effectively drops those dimensions from the model
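To read off which dimensions a lasso fit keeps, the coefficients can be paired with the column names. The following is a minimal sketch on synthetic data; the feature names `f0` to `f4`, the data, and the alpha value are hypothetical and chosen so that only two features carry signal.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): only f0 and f3 influence the response
rng = np.random.default_rng(5)
names = ["f0", "f1", "f2", "f3", "f4"]
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

model = Lasso(alpha=0.3).fit(X, y)

# Features with a non-zero coefficient stay in the model, the rest are dropped
kept = [name for name, coef in zip(names, model.coef_) if coef != 0.0]
dropped = [name for name, coef in zip(names, model.coef_) if coef == 0.0]

print("kept:", kept)
print("dropped:", dropped)
```

With the notebook's variables, the same pairing would be `zip(X_train.columns, lasso.coef_)`.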

### Score Comparison

In [62]:
print('Linear Regression Model Train Data Score {}'.format (regression_model.score(X_train, y_train)))
print('Linear Regression Model Test Data Score {}'.format (regression_model.score(X_test, y_test)))

Linear Regression Model Train Data Score 0.8343770256960538
Linear Regression Model Test Data Score 0.8513421387780067


In [63]:
print('Ridge Regulated Regression Model Train Data Score {}'.format (ridge.score(X_train, y_train)))
print('Ridge Regulated Regression Model Test Data Score {}'.format (ridge.score(X_test, y_test)))

Ridge Regulated Regression Model Train Data Score 0.8343617931312616
Ridge Regulated Regression Model Test Data Score 0.8518882171608504


In [64]:
print('Lasso Regulated Regression Model Train Data Score {}'.format (lasso.score(X_train, y_train)))
print('Lasso Regulated Regression Model Test Data Score {}'.format (lasso.score(X_test, y_test)))

Lasso Regulated Regression Model Train Data Score 0.8211445134781438
Lasso Regulated Regression Model Test Data Score 0.8577234201035426
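The alpha values used above (0.3 for Ridge, 0.1 for Lasso) are fixed by hand. Since cross-validation is the usual way to choose the penalty, here is a minimal sketch using scikit-learn's `RidgeCV` and `LassoCV`. The synthetic data and the candidate grid are assumptions; on the notebook's data the selected values would differ.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data (assumption): 8 predictors, only the first two matter
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Candidate penalties on a log scale (assumed grid)
candidate_alphas = np.logspace(-3, 2, 30)

# Each estimator runs 5-fold cross-validation and keeps the best alpha
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=candidate_alphas, cv=5, max_iter=10000).fit(X, y)

print("ridge alpha:", ridge_cv.alpha_, "R^2:", ridge_cv.score(X, y))
print("lasso alpha:", lasso_cv.alpha_, "R^2:", lasso_cv.score(X, y))
```

As with the scores printed above, `score` returns the coefficient of determination R².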


## Polynomial Models

Generate **polynomial models** that reflect the non-linear interaction between some dimensions.

In [71]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree= 2, interaction_only= True)

df_poly = poly.fit_transform(df_scaled)
X_train, X_test, y_train, y_test = train_test_split(df_poly, y, test_size= 0.30, random_state= 42)
X_train.shape,  X_test.shape, y_train.shape, y_test.shape 

((278, 56), (120, 56), (278,), (120,))
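The 56 columns in the shapes above follow directly from the settings `degree=2, interaction_only=True` applied to 10 predictors: a constant column, the 10 original features, and one product per pair of distinct features. A short arithmetic check (the variable names are only illustrative):

```python
from math import comb

n_features = 10                  # predictors left after dropping 'mpg'
bias = 1                         # constant column, added by default (include_bias=True)
pairwise = comb(n_features, 2)   # products x_i * x_j with i < j, no squares
total = bias + n_features + pairwise

print(total)  # 56
```

The constant column comes first in the output, which is why the first coefficient of the fits on these features is zero or very close to it.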

Simple non-regularized linear model on the polynomial features

In [73]:
regression_model.fit(X_train, y_train)
print(regression_model.coef_)

[-5.23950483e-14 -3.94993722e+11 -8.20434379e+00 -4.69185280e-01
 -2.32071184e+00 -1.28082586e+00  3.07808348e+00 -5.25926031e+11
 -8.62777747e+10  1.21097021e+12 -1.09325969e+12 -7.37633999e+00
  3.50008965e-01  2.47619630e+00  3.83912468e+00 -1.46136581e+00
 -1.26549529e+12 -2.23653548e+12 -8.31918613e+11 -7.94068368e+11
 -2.92164023e-01  3.44442803e+00 -2.58710596e+00  4.32897432e+00
 -9.11761477e+00 -3.35471076e+11 -2.76484737e+11 -2.63905363e+11
 -2.67759610e+00 -8.71451317e-01 -1.33400916e+00 -2.24010192e+00
  9.80660240e+10  8.08229405e+10  7.71456960e+10  4.30419922e-01
 -1.25904846e+00  3.12756348e+00 -3.18139406e+11 -2.62200518e+11
 -2.50271041e+11  7.22656250e-01  1.18246460e+00 -1.14740573e+11
 -9.45655810e+10 -9.02630802e+10  6.16333008e-01  1.38717803e+11
  1.14326861e+11  1.09125271e+11 -1.23128664e+11  4.92754853e+11
  4.70335722e+11 -3.52819618e+11 -2.12206643e+12  1.03661363e+12]


In [76]:
print('Linear Regression Model Train Data Score {}'.format (regression_model.score(X_train, y_train)))
print('Linear Regression Model Test Data Score {}'.format (regression_model.score(X_test, y_test)))

Linear Regression Model Train Data Score 0.9056705596501557
Linear Regression Model Test Data Score 0.8603406378551786


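**Curse of Dimensionality:** The number of observations is very small compared to the possible combinations in 56 dimensions. This is an **overfitted model**: the train score (0.906) is clearly higher than the test score (0.860), the fitted surface has many sharp peaks and valleys, and the **high variance** shows in the magnitude of the coefficients, several of which are on the order of 10^10 to 10^12.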
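One situation in which unregularized coefficients can become very large is when some features are nearly copies of each other. Whether that is what happens in the 56-column matrix above is an assumption that is not verified here. The sketch below reproduces the effect on synthetic data with two almost identical predictors and compares `LinearRegression` with `Ridge(alpha=0.3)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption): x2 is almost an exact copy of x1
rng = np.random.default_rng(7)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.3).fit(X, y)

# Least squares can trade a huge positive weight against a huge negative one;
# the ridge penalty makes that trade expensive and keeps the weights moderate
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS norm:", np.linalg.norm(ols.coef_), "Ridge norm:", np.linalg.norm(ridge.coef_))
```

The ridge coefficient vector always has a norm no larger than the least-squares one, and in a near-collinear setting like this the difference is typically several orders of magnitude.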

In [74]:
ridge = Ridge(alpha= 0.3)
ridge.fit(X_train, y_train)
print('Ridge Model Coefficients are {}'.format(ridge.coef_))

Ridge Model Coefficients are [ 0.          4.1730544  -4.92375054 -0.65366193 -3.53027896 -0.88649648
  2.99041772  0.95685766 -0.25333115  0.17254399  0.14126084 -3.9927363
 -0.09261544  1.26809072  3.26426667 -0.97210318  1.12193971 -2.31727258
  6.18727189 -3.53652651 -0.40360966  2.43100791 -1.58121257  3.21235912
 -5.58873458  1.60249757 -3.62902077  1.76494052 -1.66615775 -0.7325558
 -1.32513928 -1.80418966 -0.95746559  1.07989925  0.08573661  0.17096678
 -0.83126145  1.01410662  0.13992751  0.55430355 -0.7585981   0.70834673
  1.29761506 -0.4378339  -0.08808269  0.64884671  0.58921745 -0.3734776
  0.30467064  0.15556392 -0.87866259  3.44156205 -2.48867051 -0.09698357
 -0.06558065 -0.15000736]


In [77]:
print('Ridge Regulated Regression Model Train Data Score {}'.format (ridge.score(X_train, y_train)))
print('Ridge Regulated Regression Model Test Data Score {}'.format (ridge.score(X_test, y_test)))

Ridge Regulated Regression Model Train Data Score 0.9034731201596709
Ridge Regulated Regression Model Test Data Score 0.8882737907483466


In [75]:
lasso = Lasso(alpha= 0.1) # alpha = lambda or penalty factor
lasso.fit(X_train, y_train)
print('Lasso Model Coefficients are {}'.format(lasso.coef_))

Lasso Model Coefficients are [ 0.         -0.         -0.54242836 -1.29616132 -4.76592876  0.
  2.8145911   0.         -0.45337005  0.          0.         -0.
 -0.          0.          0.18889447 -0.          0.         -0.
  0.03814912 -0.         -0.          1.15059122  0.         -0.
 -0.         -0.          0.         -0.          0.         -0.
 -0.36807608 -0.          0.          0.         -0.         -0.
 -0.3070781  -0.28268409  0.         -0.         -0.          0.3007103
  0.         -0.7383446   0.          0.14313782  0.         -0.20919617
  0.          0.          0.         -0.          0.         -0.
 -0.         -0.        ]


In [78]:
print('Lasso Regulated Regression Model Train Data Score {}'.format (lasso.score(X_train, y_train)))
print('Lasso Regulated Regression Model Test Data Score {}'.format (lasso.score(X_test, y_test)))

Lasso Regulated Regression Model Train Data Score 0.8732985269048128
Lasso Regulated Regression Model Test Data Score 0.9162242105645322
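To compare the six fits side by side, the scores printed in the cells above can be collected into one table. The numbers below are copied from those outputs and rounded to four decimals; the table layout and column names are assumptions. Keep in mind that the original and polynomial experiments use different `random_state` values for the split, so the two groups are not evaluated on the same test rows.

```python
import pandas as pd

# R^2 scores copied from the outputs above, rounded to four decimals
scores = pd.DataFrame(
    [
        ("linear", "original", 0.8344, 0.8513),
        ("ridge", "original", 0.8344, 0.8519),
        ("lasso", "original", 0.8211, 0.8577),
        ("linear", "polynomial", 0.9057, 0.8603),
        ("ridge", "polynomial", 0.9035, 0.8883),
        ("lasso", "polynomial", 0.8733, 0.9162),
    ],
    columns=["model", "features", "train_r2", "test_r2"],
)

# A large positive gap between train and test suggests overfitting
scores["gap"] = scores["train_r2"] - scores["test_r2"]

print(scores.sort_values("test_r2", ascending=False).to_string(index=False))
```

In this table the unregularized linear model on polynomial features has the largest train-test gap, and the lasso on polynomial features has the highest test score.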
