# **Regularization in Regression**

#### **What is Regularization?**


In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.

#### **Why do we need Regularization?**

Regularization is used in machine learning models to reduce overfitting, i.e. the situation where a model performs noticeably better on the training data than on unseen test data.

#### **Regularization in Regression**
There are mainly two types of regularization techniques:


- Lasso Regression
- Ridge Regression

#### **Lasso Regression**
A regression model that uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L).

#### **Ridge Regression**
A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function (L).
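
The two penalties can be written side by side. The notation here is an assumption made for illustration: $L$ is the unregularized loss function mentioned above, $\beta_j$ are the $p$ model coefficients, and $\lambda \ge 0$ is the regularization strength (scikit-learn exposes it as the `alpha` parameter).

```latex
L_{\text{lasso}} = L + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\qquad
L_{\text{ridge}} = L + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

With $\lambda = 0$ both expressions reduce to the plain loss $L$; larger values of $\lambda$ penalize large coefficients more heavily.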


- The dataset used here is available on Kaggle: [Real estate price prediction](https://www.kaggle.com/quantbruce/real-estate-price-prediction).


# Data Exploration

In [None]:
import pandas as pd

In [None]:
data=pd.read_csv('/content/Real estate.csv')
data.head()

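|   | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|----|---------------------|--------------|----------------------------------------|---------------------------------|-------------|--------------|----------------------------|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |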


In [None]:
data.shape

(414, 8)

In [None]:
data.columns

Index(['No', 'X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


In [None]:
data.nunique()

No                                        414
X1 transaction date                        12
X2 house age                              236
X3 distance to the nearest MRT station    259
X4 number of convenience stores            11
X5 latitude                               234
X6 longitude                              232
Y house price of unit area                270
dtype: int64

In [None]:
data.isnull().sum()

No                                        0
X1 transaction date                       0
X2 house age                              0
X3 distance to the nearest MRT station    0
X4 number of convenience stores           0
X5 latitude                               0
X6 longitude                              0
Y house price of unit area                0
dtype: int64

In [None]:
data.isnull().sum().sum()

0

In [None]:
x=data.drop('Y house price of unit area',axis=1)
y=data['Y house price of unit area']

In [None]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y=train_test_split(x,y,test_size=0.3, random_state=2)

Working with Linear Regression:

In [None]:
from sklearn.linear_model import LinearRegression
reg= LinearRegression().fit(train_x, train_y)

In [None]:
reg.score(test_x,test_y)

0.49855205778389894

In [None]:
reg.score(train_x,train_y)

0.6221172943838935

From the two outputs above, the R² score on the test data is about 0.50, which is lower than the score of about 0.62 on the training data.

This gap suggests that the model is overfitting the training data.
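
The value returned by `score` for scikit-learn regressors is the coefficient of determination R², not a percentage accuracy. As a minimal sketch of what that number measures, the helper below (`r2_score_manual`, a hypothetical name, applied to made-up toy values rather than the real-estate data) computes it by hand:

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals between targets and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the targets.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values, not taken from the real-estate data.
y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r2_score_manual(y_true, y_true)        # predictions equal the targets
baseline = r2_score_manual(y_true, [25.0] * 4)   # always predict the mean

print(perfect, baseline)
```

A model that predicts every target exactly scores 1, and one that always predicts the mean of the targets scores 0, so a test score near 0.50 sits roughly halfway between those two reference points.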

To deal with such overfitting, sklearn provides L1 regularization, also known as LASSO regression.

**LASSO regression**

In [34]:
#importing the model, initializing it with parameters and fitting the training data into the model.
from sklearn import linear_model
reg1=linear_model.Lasso(alpha=1, max_iter=50, tol=0.1)
reg1.fit(train_x, train_y)

Lasso(alpha=1, copy_X=True, fit_intercept=True, max_iter=50, normalize=False,
      positive=False, precompute=False, random_state=None, selection='cyclic',
      tol=0.1, warm_start=False)

In [35]:
reg1.score(test_x,test_y)

0.4484234946328681

In [36]:
reg1.score(train_x,train_y)

0.5896955339745802

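Here, the training and test scores are about 0.14 apart, roughly the same gap as with plain linear regression (about 0.12), and both scores are slightly lower. With `alpha=1`, Lasso has not narrowed the gap on this split.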
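
One way Lasso differs from plain linear regression is that its L1 penalty can set some coefficients exactly to zero, whereas an L2 penalty only shrinks them. The sketch below illustrates this on synthetic data (an assumption made for illustration: ten random features of which only the first two affect the target; it does not use the real-estate data, and the `alpha=0.5` value is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration, not the real-estate set):
# only the first two of ten features actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit both regularized models with the same (arbitrary) penalty strength.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count how many coefficients each model keeps away from exactly zero.
n_nonzero_lasso = int(np.count_nonzero(lasso.coef_))
n_nonzero_ridge = int(np.count_nonzero(ridge.coef_))

print("non-zero coefficients, Lasso:", n_nonzero_lasso)
print("non-zero coefficients, Ridge:", n_nonzero_ridge)
```

Inspecting `reg1.coef_` in the same way would show which of the real-estate features the fitted Lasso model kept.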

To try to improve on this, we can use L2 regularization, also known as Ridge regression.

**Ridge Regression**

In [38]:
#importing the model, initializing it with parameters and fitting the training data into the model.
from sklearn.linear_model import Ridge
reg2=Ridge(alpha=1, max_iter=50, tol=0.1)
reg2.fit(train_x, train_y)

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=50, normalize=False,
      random_state=None, solver='auto', tol=0.1)

In [39]:
reg2.score(test_x,test_y)

0.461237787836873

In [40]:
reg2.score(train_x,train_y)

0.6029112827375744

**Here, we can see that with L2 regularization the test and training scores (about 0.46 and 0.60) are higher than with Lasso, although both remain slightly below those of plain linear regression.**
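
The scores above come from a single choice of `alpha=1`. As a rough sketch of how the penalty strength changes the training and test scores, the loop below fits Ridge for a few `alpha` values. It uses synthetic data from `make_regression` as a stand-in (an assumption; the alpha values are arbitrary, and the notebook's own `train_x`, `test_x`, `train_y`, `test_y` could be substituted):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the real-estate set).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=20.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

alphas = [0.1, 10.0, 1000.0]  # arbitrary illustrative values
results = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    # Record R^2 on the training split and on the held-out split.
    results.append((alpha, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for alpha, train_r2, test_r2 in results:
    print(f"alpha={alpha:>7}: train R2={train_r2:.3f}, "
          f"test R2={test_r2:.3f}, gap={train_r2 - test_r2:.3f}")
```

The training score can only go down as `alpha` grows, since the penalty pulls the fit away from the least-squares solution; whether the test score improves depends on the data, which is why `alpha` is usually chosen by comparing several values.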