## Regularization in Regression

* **Regularization is a technique that adds a penalty to the loss function as model complexity increases.**


* As model complexity increases, so does the risk of overfitting.


* Overfitting happens when the model learns the noise in the data as well as the signal.


* An overfit model therefore tends to perform very well on the training data and underperform on test or real-world data.


* To obtain a parsimonious (less complex) model, we employ regularization techniques:


    1. L1 regularization, or Lasso


    2. L2 regularization, or Ridge


    3. Elastic Net regularization
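
To make the link between complexity and overfitting concrete, here is a minimal sketch on synthetic data (not the economics dataset used below). It fits a degree-1 and a degree-9 polynomial to the same noisy linear signal. The data, the seed, and the `rmse` helper are assumptions made for illustration only. The more flexible model always fits the training data at least as well; whether it does worse on held-out data depends on the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear signal plus noise (illustrative only).
x_train = np.linspace(-1, 1, 15)
y_train = 2.0 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 2.0 * x_test + rng.normal(scale=0.3, size=x_test.size)


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


results = {}
for degree in (1, 9):
    # Least-squares polynomial fit; a higher degree means a more complex model.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = {
        "train_rmse": rmse(y_train, np.polyval(coefs, x_train)),
        "test_rmse": rmse(y_test, np.polyval(coefs, x_test)),
    }
    print(degree, results[degree])
```

Comparing `train_rmse` with `test_rmse` for each degree shows how much of the training fit carries over to unseen data.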

# Build a regression model to predict unemployment in an economy

In [1]:
# import libraries

import numpy as np
import pandas as pd

In [2]:
# import data

data=pd.read_csv("economics.csv")

In [3]:
data.head()

|   | date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|---|
| 0 | 1967-07-01 | 507.4 | 198712 | 12.5 | 4.5 | 2944 |
| 1 | 1967-08-01 | 510.5 | 198911 | 12.5 | 4.7 | 2945 |
| 2 | 1967-09-01 | 516.3 | 199113 | 11.7 | 4.6 | 2958 |
| 3 | 1967-10-01 | 512.9 | 199311 | 12.5 | 4.9 | 3143 |
| 4 | 1967-11-01 | 518.1 | 199498 | 12.5 | 4.7 | 3066 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574 entries, 0 to 573
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      574 non-null    object 
 1   pce       574 non-null    float64
 2   pop       574 non-null    int64  
 3   psavert   574 non-null    float64
 4   uempmed   574 non-null    float64
 5   unemploy  574 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 27.0+ KB


In [6]:
data.describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| pce | 574.0 | 4843.510453 | 3579.287206 | 507.4 | 1582.225 | 3953.55 | 7667.325 | 12161.5 |
| pop | 574.0 | 257189.381533 | 36730.801593 | 198712.0 | 224896.0 | 253060.0 | 290290.75 | 320887.0 |
| psavert | 574.0 | 7.936585 | 3.124394 | 1.9 | 5.5 | 7.7 | 10.5 | 17.0 |
| uempmed | 574.0 | 8.610105 | 4.108112 | 4.0 | 6.0 | 7.5 | 9.1 | 25.2 |
| unemploy | 574.0 | 7771.557491 | 2641.960571 | 2685.0 | 6284.0 | 7494.0 | 8691.0 | 15352.0 |


### Data Dictionary

* psavert - personal saving rate


* pce - personal consumption expenditure (billions of USD)


* uempmed - median duration of unemployment (weeks)


* unemploy - number of unemployed (thousands)


* pop - population (thousands)

In [8]:
data.columns

Index(['date', 'pce', 'pop', 'psavert', 'uempmed', 'unemploy'], dtype='object')

In [9]:
target=data['unemploy']
features=data[['pce', 'pop', 'psavert', 'uempmed']]

In [10]:
# scale each feature by its column maximum so that all values lie in (0, 1]
features=features/features.max()

In [11]:
features.describe() # summary after max scaling (this is not standardization)

|   | pce | pop | psavert | uempmed |
|---|---|---|---|---|
| count | 574.0 | 574.0 | 574.0 | 574.0 |
| mean | 0.398266 | 0.801495 | 0.466858 | 0.341671 |
| std | 0.294313 | 0.114466 | 0.183788 | 0.16302 |
| min | 0.041722 | 0.619258 | 0.111765 | 0.15873 |
| 25% | 0.130101 | 0.700857 | 0.323529 | 0.238095 |
| 50% | 0.325087 | 0.788627 | 0.452941 | 0.297619 |
| 75% | 0.630459 | 0.904651 | 0.617647 | 0.361111 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |
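
Dividing each column by its maximum is max scaling; standardization instead subtracts the mean and divides by the standard deviation. The sketch below contrasts the two on a small made-up matrix (the values and the three-row/one-row split are assumptions for illustration). It also computes the scaling statistics from the training rows only, a common precaution so that no information from the test rows leaks into preprocessing. The notebook itself scales the full dataset before splitting.

```python
import numpy as np

# Hypothetical feature matrix: rows are observations, columns are features.
X = np.array([[500.0, 12.5],
              [1000.0, 10.0],
              [2000.0, 5.0],
              [4000.0, 2.5]])

# Illustrative split: first three rows for training, last row for testing.
X_train, X_test = X[:3], X[3:]

# Max scaling, using statistics from the training rows only.
col_max = X_train.max(axis=0)
X_train_max = X_train / col_max
X_test_max = X_test / col_max  # can exceed 1 when a test value is above the training max

# Standardization (z-scores): zero mean and unit standard deviation per column.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(X_train_max)
print(X_test_max)
print(X_train_std)
```

Either transformation puts the features on comparable scales, which matters for regularized models because the penalty acts on coefficient magnitudes.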


In [12]:
X=features
y=target

In [13]:
# split the data into training and test sets

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.2,random_state=42)

In [14]:
# build regression model

from sklearn.linear_model import LinearRegression

lm=LinearRegression()

In [15]:
lm.fit(X_train,y_train)

LinearRegression()

In [16]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [17]:
# Training performance - RMSE and R-squared on the training data

y_pred_train = lm.predict(X_train)

print('Training Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_train,y_pred_train))}')
print(f'R-Squared : {(r2_score(y_train,y_pred_train))}')


# Testing performance - RMSE and R-squared on the test data

y_pred_test = lm.predict(X_test)

print('\n\nTesting Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_test,y_pred_test))}')
print(f'R-Squared : {(r2_score(y_test,y_pred_test))}')

print(f'\n\nCoefficients : {lm.coef_}')

Training Accuracy
RMSE : 985.028574875737
R-Squared : 0.8510879906820179


Testing Accuracy
RMSE : 1001.916307593938
R-Squared : 0.8854186379903035


Coefficients : [-19146.90350229  56970.37011106   5107.93739571  13551.16810809]
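
The same train/test evaluation is repeated for every model in this notebook, so it could be wrapped in a helper. The sketch below is one possible version, written with NumPy only so that it is self-contained. `regression_report` and the `MeanModel` stand-in are hypothetical names introduced here, not part of the notebook or of scikit-learn. R-squared is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np


def regression_report(model, X_train, y_train, X_test, y_test):
    """Return RMSE and R-squared on the training and test splits."""
    report = {}
    for split, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
        y = np.asarray(y, dtype=float)
        y_pred = np.asarray(model.predict(X), dtype=float)
        ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
        ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
        report[split] = {
            "rmse": float(np.sqrt(ss_res / y.size)),
            "r2": float(1.0 - ss_res / ss_tot),
        }
    return report


class MeanModel:
    """Hypothetical stand-in model that always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Tiny made-up dataset, used only to exercise the helper.
X_tr = np.arange(6, dtype=float).reshape(-1, 1)
y_tr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X_te = np.array([[6.0], [7.0]])
y_te = np.array([7.0, 8.0])

report = regression_report(MeanModel().fit(X_tr, y_tr), X_tr, y_tr, X_te, y_te)
print(report)
```

Any fitted object with a `predict` method should work in place of `MeanModel`, for example `regression_report(lm, X_train, y_train, X_test, y_test)`.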


In [18]:
data['unemploy'].min()

2685

In [19]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet


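### Ridge Regression or L2-Norm

**LOSS FUNCTION = OLS loss (sum of squared residuals) + α × (sum of squared coefficient values)**

* α (alpha) is the regularization strength: it controls how heavily large coefficients are penalized.


* If α = 0, the penalty vanishes and Ridge reduces to ordinary least squares (OLS).


* A low α gives weak regularization, so the model can still overfit.


* A high α shrinks the coefficients strongly and can lead to underfitting.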
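
The effect of α can be checked directly, because the ridge loss has a closed-form minimizer, w = (XᵀX + αI)⁻¹Xᵀy. The sketch below uses synthetic, centered data (an assumption made so that no intercept term is needed) and a hypothetical helper `ridge_coefficients`. It is not how scikit-learn's `Ridge` is implemented, and `Ridge` additionally fits an unpenalized intercept by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data so that no intercept is needed (illustrative only).
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y -= y.mean()


def ridge_coefficients(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


w_ols = np.linalg.lstsq(X, y, rcond=None)[0]          # plain least squares
w_zero = ridge_coefficients(X, y, alpha=0.0)          # no penalty
w_medium = ridge_coefficients(X, y, alpha=10.0)
w_large = ridge_coefficients(X, y, alpha=1000.0)

for name, w in [("OLS", w_ols), ("alpha=0", w_zero),
                ("alpha=10", w_medium), ("alpha=1000", w_large)]:
    print(name, w, np.linalg.norm(w))
```

With α = 0 the result matches least squares, and the coefficient norm shrinks as α grows.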


In [20]:
rr=Ridge(alpha=0.01)

In [21]:
rr.fit(X_train,y_train)

# Training performance - RMSE and R-squared on the training data

y_pred_train = rr.predict(X_train)

print('Training Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_train,y_pred_train))}')
print(f'R-Squared : {(r2_score(y_train,y_pred_train))}')


# Testing performance - RMSE and R-squared on the test data

y_pred_test = rr.predict(X_test)

print('\n\nTesting Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_test,y_pred_test))}')
print(f'R-Squared : {(r2_score(y_test,y_pred_test))}')

print(f'\n\nLM Coefficients : {lm.coef_}')
print(f'\n\nRR Coefficients : {rr.coef_}')

Training Accuracy
RMSE : 988.4714111995353
R-Squared : 0.8500452277864494


Testing Accuracy
RMSE : 999.568234828448
R-Squared : 0.8859550702424106


LM Coefficients : [-19146.90350229  56970.37011106   5107.93739571  13551.16810809]


RR Coefficients : [-17406.11927015  51531.47457744   4509.20141652  13654.41294602]


### Lasso Regression or L1-Norm

* Lasso - Least absolute shrinkage and selection operator


**LOSS FUNCTION = OLS loss (sum of squared residuals) + α × (sum of absolute values of the coefficients)**
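
The "selection" in the name comes from the way the absolute-value penalty can set coefficients exactly to zero. In the special case of orthonormal feature columns, the lasso solution is the OLS solution passed through a soft-thresholding operator. The sketch below shows only that operator; the coefficient values and the threshold are made up. Note also that scikit-learn's `Lasso` divides the squared-error term by twice the number of samples, so its `alpha` is not on the same scale as the `alpha` of `Ridge`.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Shrink each value toward zero by `threshold`; values within it become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


# Hypothetical OLS coefficients for four features.
ols_coefs = np.array([3.0, -0.4, 0.05, -2.0])

lasso_like = soft_threshold(ols_coefs, threshold=0.5)
print(lasso_like)
print("non-zero coefficients:", np.count_nonzero(lasso_like))
```

Coefficients smaller in magnitude than the threshold are dropped, while the larger ones are kept but shrunk.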


In [22]:
lsm = Lasso(alpha=0.001)

lsm.fit(X_train,y_train)

Lasso(alpha=0.001)

In [23]:
# Training performance - RMSE and R-squared on the training data

y_pred_train = lsm.predict(X_train)

print('Training Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_train,y_pred_train))}')
print(f'R-Squared : {(r2_score(y_train,y_pred_train))}')


# Testing performance - RMSE and R-squared on the test data

y_pred_test = lsm.predict(X_test)

print('\n\nTesting Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_test,y_pred_test))}')
print(f'R-Squared : {(r2_score(y_test,y_pred_test))}')

print(f'\n\nLM Coefficients : {lm.coef_}')
print(f'\n\nRR Coefficients : {rr.coef_}')
print(f'\n\nLSM Coefficients : {lsm.coef_}')


Training Accuracy
RMSE : 985.0285793909925
R-Squared : 0.8510879893168276


Testing Accuracy
RMSE : 1001.9110813423041
R-Squared : 0.8854198333585475


LM Coefficients : [-19146.90350229  56970.37011106   5107.93739571  13551.16810809]


RR Coefficients : [-17406.11927015  51531.47457744   4509.20141652  13654.41294602]


LSM Coefficients : [-19144.90151453  56964.19811383   5107.24474545  13551.21639753]


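### Elastic Net Regression

Combines the L1 and L2 penalties. Called with no arguments, scikit-learn's `ElasticNet()` uses `alpha=1.0` and `l1_ratio=0.5`, a far heavier penalty than the α values used for Ridge and Lasso above, which is consistent with the strongly shrunken coefficients and low R-squared reported below.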
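
How the two penalties are mixed can be written out explicitly. The sketch below follows the parameterization documented for scikit-learn's `ElasticNet`, where the penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper name `elastic_net_penalty` and the example coefficients are assumptions made for illustration.

```python
import numpy as np


def elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Penalty term added to the squared-error loss for a coefficient vector."""
    coefs = np.asarray(coefs, dtype=float)
    l1 = np.sum(np.abs(coefs))   # lasso part: sum of absolute values
    l2 = np.sum(coefs ** 2)      # ridge part: sum of squares
    return float(alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2)


coefs = [3.0, -4.0]  # made-up coefficients: L1 norm 7, squared L2 norm 25

pure_l1 = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=1.0)  # lasso-only penalty
pure_l2 = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.0)  # ridge-only penalty
mixed = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5)    # equal mix

print(pure_l1, pure_l2, mixed)
```

Setting `l1_ratio=1` leaves only the lasso term and `l1_ratio=0` only the ridge term, while `alpha` scales the overall strength of the penalty.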

In [24]:
en = ElasticNet()

en.fit(X_train,y_train)

ElasticNet()

In [25]:
# Training performance - RMSE and R-squared on the training data

y_pred_train = en.predict(X_train)

print('Training Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_train,y_pred_train))}')
print(f'R-Squared : {(r2_score(y_train,y_pred_train))}')


# Testing performance - RMSE and R-squared on the test data

y_pred_test = en.predict(X_test)

print('\n\nTesting Accuracy')

print(f'RMSE : {np.sqrt(mean_squared_error(y_test,y_pred_test))}')
print(f'R-Squared : {(r2_score(y_test,y_pred_test))}')

print(f'\n\nLM Coefficients : {lm.coef_}')
print(f'\n\nRR Coefficients : {rr.coef_}')
print(f'\n\nLSM Coefficients : {lsm.coef_}')
print(f'\n\nEN Coefficients : {en.coef_}')


Training Accuracy
RMSE : 2331.4927412156394
R-Squared : 0.16574234437779267


Testing Accuracy
RMSE : 2690.6795263008275
R-Squared : 0.17362905746232427


LM Coefficients : [-19146.90350229  56970.37011106   5107.93739571  13551.16810809]


RR Coefficients : [-17406.11927015  51531.47457744   4509.20141652  13654.41294602]


LSM Coefficients : [-19144.90151453  56964.19811383   5107.24474545  13551.21639753]


EN Coefficients : [ 697.72334777  284.66132963 -207.46899075  594.21973384]
