# Ridge Regression

Ridge regression, also known as L2 regularization, is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By penalizing large coefficients, ridge regression accepts a small bias in exchange for lower variance, which helps reduce overfitting.
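The variance problem can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the unemployment dataset: the sample size, noise levels, seed, `alpha=1.0`, and the helper `fit_once` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility


def fit_once(model):
    """Draw a fresh sample with two nearly identical predictors and return the fitted coefficients."""
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)  # almost a copy of x1 (strong multicollinearity)
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=50)          # true coefficients are (1, 1)
    return model.fit(X, y).coef_


# Refit each model on 200 independent samples and compare how much the coefficients move.
ols_coefs = np.array([fit_once(LinearRegression()) for _ in range(200)])
ridge_coefs = np.array([fit_once(Ridge(alpha=1.0)) for _ in range(200)])

print("OLS coefficient std:  ", ols_coefs.std(axis=0))
print("Ridge coefficient std:", ridge_coefs.std(axis=0))
```

Across repeated samples the least squares coefficients swing widely, while the ridge coefficients stay much closer together.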

In [82]:
# importing the modules.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [65]:
#reading the csv data.

# Here: Unemployment Data is taken for data analysis.

f = pd.read_csv('unemployment.csv')

In [66]:
f.info()

# Every column has non-null data.
# 478 Rows, 7 Columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  478 non-null    int64  
 1   date        478 non-null    object 
 2   pce         478 non-null    float64
 3   pop         478 non-null    int64  
 4   psavert     478 non-null    float64
 5   uempmed     478 non-null    float64
 6   unemploy    478 non-null    int64  
dtypes: float64(3), int64(3), object(1)
memory usage: 26.3+ KB


In [67]:
f.describe()

# mean - 6997.177824 (unemployment)

,Unnamed: 0,pce,pop,psavert,uempmed,unemploy
count,478.0,478.0,478.0,478.0,478.0,478.0
mean,239.5,3654.230962,246348.939331,6.72113,7.124059,6997.177824
std,138.130976,2609.656755,30126.735749,3.476889,1.640329,1859.035642
min,1.0,507.8,198712.0,-3.0,4.0,2685.0
25%,120.25,1272.45,220094.25,4.0,5.8,6052.5
50%,239.5,3082.45,242515.5,7.6,6.9,7187.5
75%,358.75,5474.15,272277.25,9.5,8.375,8250.25
max,478.0,9705.0,301913.0,14.6,12.3,12051.0


In [68]:
# Feature and Target Selection

X = f.iloc[:,:6]
y = f['unemploy']

## Data Cleaning

In [69]:
# Change the index value to Unnamed:0

X

,Unnamed: 0,date,pce,pop,psavert,uempmed
0,1,1967-06-30,507.8,198712,9.8,4.5
1,2,1967-07-31,510.9,198911,9.8,4.7
2,3,1967-08-31,516.7,199113,9.0,4.6
3,4,1967-09-30,513.3,199311,9.8,4.9
4,5,1967-10-31,518.5,199498,9.7,4.7
...,...,...,...,...,...,...
473,474,2006-11-30,9478.5,301070,-1.1,7.3
474,475,2006-12-31,9540.3,301296,-0.9,8.1
475,476,2007-01-31,9610.6,301481,-1.0,8.1
476,477,2007-02-28,9653.0,301684,-0.7,8.5


In [70]:
X = X.set_index('Unnamed: 0')
X

Unnamed: 0,date,pce,pop,psavert,uempmed
1,1967-06-30,507.8,198712,9.8,4.5
2,1967-07-31,510.9,198911,9.8,4.7
3,1967-08-31,516.7,199113,9.0,4.6
4,1967-09-30,513.3,199311,9.8,4.9
5,1967-10-31,518.5,199498,9.7,4.7
...,...,...,...,...,...
474,2006-11-30,9478.5,301070,-1.1,7.3
475,2006-12-31,9540.3,301296,-0.9,8.1
476,2007-01-31,9610.6,301481,-1.0,8.1
477,2007-02-28,9653.0,301684,-0.7,8.5


## Feature Scaling

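We have to standardize the data. This is done because the units of the variables differ significantly and may influence the modeling process. For ridge regression in particular, the penalty acts on the size of the coefficients, so predictors measured on different scales would be penalized unevenly. To prevent this, we rescale each predictor to zero mean and unit variance.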
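For reference, `sklearn.preprocessing.scale` standardizes each column to zero mean and unit variance; mapping values into the range 0 to 1 is what `MinMaxScaler` does. The sketch below contrasts the two on a tiny array whose values are made up for illustration and are not taken from `unemployment.csv`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, scale

# Illustrative values only: two columns on very different scales.
demo = np.array([[500.0, 198000.0],
                 [1000.0, 240000.0],
                 [9500.0, 300000.0]])

standardized = scale(demo)                       # (x - column mean) / column std
min_maxed = MinMaxScaler().fit_transform(demo)   # (x - column min) / (column max - column min)

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
print(min_maxed.min(axis=0), min_maxed.max(axis=0))
```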


In [71]:
# When you get the following error: 
# ValueError: could not convert string to float: '1967-06-30'
# discard string columns 

X = X.drop(['date'],axis=1)
X

Unnamed: 0,pce,pop,psavert,uempmed
1,507.8,198712,9.8,4.5
2,510.9,198911,9.8,4.7
3,516.7,199113,9.0,4.6
4,513.3,199311,9.8,4.9
5,518.5,199498,9.7,4.7
...,...,...,...,...
474,9478.5,301070,-1.1,7.3
475,9540.3,301296,-0.9,8.1
476,9610.6,301481,-1.0,8.1
477,9653.0,301684,-0.7,8.5


In [72]:
c = X.columns
X = pd.DataFrame(scale(X))
X.columns=c
X.columns

# rescaling the features so that each column has zero mean and unit variance

Index(['pce', 'pop', 'psavert', 'uempmed'], dtype='object')

In [73]:
X

# after scaling: each column has mean 0 and standard deviation 1

,pce,pop,psavert,uempmed
0,-1.206951,-1.582875,0.886452,-1.601391
1,-1.205762,-1.576262,0.886452,-1.479337
2,-1.203537,-1.569550,0.656120,-1.540364
3,-1.204841,-1.562971,0.886452,-1.357282
4,-1.202846,-1.556758,0.857661,-1.479337
...,...,...,...,...
473,2.234152,1.818265,-2.251819,0.107372
474,2.257859,1.825775,-2.194236,0.595590
475,2.284825,1.831922,-2.223027,0.595590
476,2.301090,1.838667,-2.136653,0.839699


## Model Training

In [74]:
# holdout-validation method

# splits the data into training and test dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=40)

In [75]:
# check the shape of the Training and Test Data

print(X_train.shape); print(X_test.shape)

(382, 4)
(96, 4)
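One possible variation is to fit the scaler on the training split only, so that no test-set statistics enter preprocessing. The sketch below does this with a scikit-learn pipeline on synthetic data; the array sizes, coefficients, and seed are assumptions for illustration, and this is not the workflow used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary fixed seed
X_demo = rng.normal(loc=100.0, scale=20.0, size=(120, 4))   # synthetic predictors
y_demo = X_demo @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=120)

# Same holdout settings as above: 20% test data, random_state=40.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=40)

# The pipeline learns the scaling statistics from X_tr only,
# then applies the same transformation when predicting on X_te.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.01))
pipe.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(pipe.score(X_te, y_te))  # R-squared on the held-out split
```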


## Model Building - Linear Regression

In [76]:
regressor = LinearRegression()

regressor.fit(X_train, y_train)

LinearRegression()

In [78]:
# Predict with Training and Test dataset

predict_train = regressor.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,predict_train)))
print(r2_score(y_train, predict_train))


751.2331536698186
0.8338622166918239


In [80]:
predict_test = regressor.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,predict_test))) 
print(r2_score(y_test, predict_test))

795.9051763384305
0.8266727213127318


 The cells above print the evaluation metrics, RMSE and R-squared, on the training and test sets.<br>
 ```R-squared is above 0.8 on both sets (0.834 on training, 0.827 on test), which indicates a reasonably good fit```
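To make the two metrics concrete, the sketch below computes RMSE and R-squared by hand and compares them with the scikit-learn functions used above. The target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the target.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares).
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```

RMSE is expressed in the units of the target, so only R-squared can be read as a proportion of variance explained.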

## Model Building - Ridge Regression

Ridge regression adds a penalty term to the least squares loss. The penalty is proportional to the sum of the squared coefficients, and its strength is controlled by the parameter alpha.

```Loss function = OLS + alpha * summation (squared coefficient values)```
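Minimizing this loss has a closed-form solution, which the sketch below evaluates with NumPy and compares against `Ridge`. The synthetic data, the seed, and `alpha = 10.0` are assumptions for illustration; centering the data stands in for the unpenalized intercept that `Ridge` fits by default.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)  # arbitrary fixed seed
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 10.0

# Center predictors and target so the intercept drops out of the problem.
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Closed-form ridge solution: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(3), Xc.T @ yc)

# Reference fits: scikit-learn's Ridge and plain least squares (alpha = 0).
w_sklearn = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
w_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

print(w_closed)
print(w_sklearn)
print(np.linalg.norm(w_closed), np.linalg.norm(w_ols))  # ridge norm is the smaller one
```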

In [89]:
ridge = Ridge(alpha=0.01)

# Select a suitable alpha value.

In [87]:
## A low alpha value can lead to over-fitting,
## a high alpha value can lead to under-fitting.
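One common way to choose alpha is cross-validation over a grid of candidates. The sketch below uses scikit-learn's `RidgeCV` on synthetic data; the grid, the data, and the seed are assumptions for illustration, not values tuned for the unemployment dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)  # arbitrary fixed seed
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(size=200)

# Hypothetical grid: 13 values spaced evenly on a log scale from 0.001 to 1000.
candidate_alphas = np.logspace(-3, 3, 13)

# RidgeCV evaluates every candidate by cross-validation and keeps the best one.
search = RidgeCV(alphas=candidate_alphas).fit(X_demo, y_demo)

print(search.alpha_)  # the selected penalty strength
print(search.coef_)   # coefficients refit with that alpha
```

With the notebook's data, the same call could be made on `X_train` and `y_train`, and the selected value passed to `Ridge(alpha=...)`.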

In [91]:
ridge.fit(X_train, y_train)
# fits the model to the training data.

Ridge(alpha=0.01)

In [94]:
predict_train_ridge = ridge.predict(X_train)
# predict on the training data

print(np.sqrt(mean_squared_error(y_train,predict_train_ridge)))
print(r2_score(y_train, predict_train_ridge))

751.2341729834667
0.8338617658421136


In [95]:
predict_test_ridge= ridge.predict(X_test)
# predict on the test data

print(np.sqrt(mean_squared_error(y_test,predict_test_ridge))) 
print(r2_score(y_test, predict_test_ridge))

795.835897465616
0.8267028942434855
