## Regularisation

Regularisation seeks to address a few common model issues by:
- minimising model complexity
- penalising the loss function
- reducing model overfitting

It does this at the cost of some additional bias, and it requires a search for the **optimal penalty hyperparameter**.

There are three main types of regularisation:
1. L1 regularisation: adds a penalty proportional to the sum of the absolute values of the coefficients. It limits the size of the coefficients and can yield sparse models in which some coefficients become exactly zero. A feature whose coefficient is driven to zero is effectively dropped from the model.
   1. LASSO Regression
2. L2 regularisation: adds a penalty proportional to the sum of the squared coefficients. All coefficients are shrunk towards zero but, unlike L1, it does not necessarily eliminate any of them.
   1. Ridge Regression
3. Combining L1 and L2: adds a mixing parameter (often written alpha; `l1_ratio` in sklearn) that sets the ratio between the L1 and L2 penalties.
   1. Elastic Net
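
For reference, the three penalties can be written as follows. This is a sketch in a common textbook notation rather than something taken from these notes: RSS is the residual sum of squares, $\beta_j$ are the $p$ coefficients, $\lambda \ge 0$ is the penalty strength and $\alpha \in [0, 1]$ is the L1/L2 mixing ratio. Libraries differ in how they scale each term, so the constants below are an assumption.

```latex
\begin{aligned}
\text{L1 (LASSO):} \quad & \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{L2 (Ridge):} \quad & \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic Net:} \quad & \mathrm{RSS} + \lambda \Big( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \Big)
\end{aligned}
```

Setting $\alpha = 1$ in the last line recovers the L1 penalty and $\alpha = 0$ recovers the L2 penalty.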


These regularisation approaches come with a cost:
- they introduce an additional hyperparameter that needs to be tuned
- this hyperparameter is a multiplier on the penalty term that sets the strength of the penalty.


## Feature Scaling

Feature scaling provides many benefits to our ML process.
Some ML models that rely on distance metrics **require** scaling to perform well.

The main idea: scaling improves the convergence of steepest descent algorithms, which are not scale invariant.
What this means is that when features are measured in different units, their scales can differ widely, so during training the coefficients of some features are updated much more strongly than others purely because of the difference in scale. To avoid that, we rescale the features before training so that the coefficient updates are not proportional to the scale of the features.
- It also impacts the interpretability of the coefficients: they are fitted on the scaled features, so they can no longer be read directly in the units of the original unscaled features.
- Having said that, there are some ML algorithms where scaling has no impact (for instance decision trees, random forests, etc.), essentially the algorithms in which gradient descent does not play any role.

### Methods of feature scaling

There are two main ways to scale features:
1. Standardisation: rescales each feature to have a mean of 0 and a standard deviation of 1
2. Normalisation: rescales all values of each feature to lie between 0 and 1

In order to apply these methods to our data, we can use the scalers from sklearn (for example `StandardScaler` and `MinMaxScaler`).
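
The two rescalings can be sketched by hand and checked against sklearn. This is a minimal illustration on a made-up toy array (the values are not from the Advertising data); `StandardScaler` and `MinMaxScaler` are the sklearn classes for the two methods.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made up for illustration): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 900.0]])

# Standardisation: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalisation (min-max): map each column onto the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The sklearn scalers give the same results
assert np.allclose(X_std, StandardScaler().fit_transform(X))
assert np.allclose(X_norm, MinMaxScaler().fit_transform(X))

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_norm.min(axis=0), X_norm.max(axis=0))
```

After either rescaling, both columns live on a comparable scale even though the raw values differ by two orders of magnitude.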

When we use a scaler from sklearn, calling `fit()` only calculates the statistical properties of the data needed for scaling (for example the mean and standard deviation); the data is rescaled only when we call `transform()`.
- Because of this very important distinction, whenever we do feature scaling we must call `fit()` on the training data only. We do not want to assume anything about the test data when we are training.
  - If we accidentally include the test data in the `fit()` call, we cause something called **data leakage**.
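
The `fit()` / `transform()` distinction can be seen in a small sketch. The data here is synthetic and made up for illustration; the point is only that the scaler's stored statistics come from the training rows and are then reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from the training rows only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

# The stored statistics come from the training set, not from the full data
assert np.allclose(scaler.mean_, X_train.mean(axis=0))

print(X_train_scaled.mean(axis=0))  # approximately 0 in every column
print(X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The test-set means are not exactly zero, and that is expected: the test data is treated as unseen, so nothing about it influences the scaling.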

## Cross Validation

Cross validation is a more advanced set of methods for splitting data into training and testing sets. 

The idea behind a train/test split is to train on the majority of the data and test on unseen data. This restricts us to training the model on only a portion of the dataset. Also, with a small test set, we might not get to evaluate the model's performance on a wide variety of data. Is there a way to achieve both of the following?
- Train on all the data
- Evaluate on all the data

One smart way to do this is:
- Divide the dataset into K equal parts (folds), and hold out one part (1/K of the data) as the test set.
  - We first train the model on the remaining K-1 parts and measure its performance on the held-out part.
  - Then we shift the 1/K window to the next part: that part becomes the new test set, and the previous test part rejoins the training set.
  - We do this for K iterations and average the error, which gives a more reliable estimate of how the model will perform in general (its true performance).
- This is called **K-fold cross validation**.
  - A common choice of K is 10.
  - The largest possible value of K is nrows (the number of rows), which is called **leave one out** cross validation.
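
The procedure above can be sketched with sklearn's `KFold`. This is a minimal illustration on synthetic data (made up here, with K=5 to keep the output short), not the notebook's own workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

K = 5
kfold = KFold(n_splits=K, shuffle=True, random_state=42)

fold_errors = []
times_tested = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kfold.split(X):
    # Train on K-1 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_errors.append(mean_absolute_error(y[test_idx], preds))
    times_tested[test_idx] += 1

cv_mae = np.mean(fold_errors)
print(f"MAE per fold: {np.round(fold_errors, 3)}")
print(f"Average MAE over {K} folds: {cv_mae:.3f}")
```

Every row ends up in the test fold exactly once, so all the data is used for both training and evaluation. `sklearn.model_selection.cross_val_score` wraps the same loop in a single call.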


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
df = pd.read_csv('Advertising.csv')

# Features: every column except the leftover index column and the target
X = df.drop(['Unnamed: 0', 'sales'], axis=1)
y = df['sales']


In [6]:
from sklearn.preprocessing import PolynomialFeatures

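# Create all polynomial and interaction terms up to degree 3; include_bias=False omits the constant column
poly_converter = PolynomialFeatures(degree=3, include_bias=False)
poly_features = poly_converter.fit_transform(X)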

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=42)

In [11]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training data only, to avoid data leakage
scaler.fit(X_train)

In [12]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Now that we have our scaled data, we can take a closer look at one of the previously mentioned regularisation methods, L2 (Ridge Regression).

### Ridge Regression
As discussed earlier, Ridge regression is a regularisation technique that helps reduce the potential for overfitting to the training data.
- It does this by adding a penalty term to the error that is based on the squared values of the coefficients.
- The original problem of linear regression was to minimise the residual sum of squares (RSS) in order to fit a linear equation such as y=mx+c.
  - The idea behind Ridge regression is to add a penalty term proportional to the sum of squares of the coefficients. Ridge regression seeks to minimise the entire error term RSS+Penalty. The penalty is also called the **shrinkage penalty**.
  - The **lambda coefficient** that multiplies this penalty decides how severe the penalty should be. (In sklearn this is the `alpha` argument.)
  - In short, Ridge regression does not change the form of the model (the equation of the line); it only changes the quantity we minimise. The only difference is in the coefficients found, which are shrunk to help avoid overfitting.
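
To make "minimise RSS+Penalty" concrete, here is a sketch that solves the Ridge problem in closed form and compares it with sklearn's `Ridge`. The data is synthetic and `ridge_closed_form` is a hypothetical helper written for this illustration; it assumes the intercept is left unpenalised, which is handled by centring the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

def ridge_closed_form(X, y, lam):
    """Minimise RSS + lam * sum(coef**2), leaving the intercept unpenalised."""
    Xc = X - X.mean(axis=0)  # centring removes the intercept from the problem
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)

norms = {}
for lam in (0.1, 10.0, 1000.0):
    coef = ridge_closed_form(X, y, lam)
    sk_coef = Ridge(alpha=lam).fit(X, y).coef_
    assert np.allclose(coef, sk_coef)  # same minimiser as sklearn's Ridge
    norms[lam] = float(np.linalg.norm(coef))
    print(lam, np.round(coef, 3), round(norms[lam], 3))
```

The model is still a plain linear equation in every case; only the coefficients change, and their overall size shrinks as the penalty strength grows.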

In [16]:
from sklearn.linear_model import Ridge

# alpha is the penalty strength (the lambda described above)
ridge_model = Ridge(alpha=10)

ridge_model.fit(X_train, y_train)
test_predictions = ridge_model.predict(X_test)

In [17]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, test_predictions)
mse = mean_squared_error(y_test, test_predictions)
rmse = np.sqrt(mse)

In [23]:
from sklearn.linear_model import RidgeCV
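# Valid `scoring` strings can be listed with sklearn.metrics.get_scorer_names()
# (the SCORERS dict is no longer available in recent scikit-learn releases)

# Try several alpha values and keep the one with the best cross-validated score
ridge_cv = RidgeCV(alphas=(0.1, 1.0, 10), scoring='neg_mean_absolute_error')
ridge_cv.fit(X_train, y_train)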

In [24]:
ridge_cv.alpha_

0.1

In [26]:
y_preds_ridgeCV = ridge_cv.predict(X_test)

rCV_mae = mean_absolute_error(y_test, y_preds_ridgeCV)
rCV_mse = mean_squared_error(y_test, y_preds_ridgeCV)
rCV_rmse = np.sqrt(rCV_mse)

print(rCV_rmse, rCV_mae)

0.5945136671825971 0.46671241131481794


### Lasso Regression
*(An L1 regularisation method)*

As discussed above, L1 adds a penalty term proportional to the **absolute value** of the coefficients.
This limits the size of the coefficients. The key difference from L2 regularisation, and the most distinctive feature of L1 regularisation, is that
- **it creates sparse models where some coefficients can become exactly zero** (when the tuning parameter lambda is sufficiently large)
- Models generated by the LASSO are generally much easier to interpret.


**LASSO**: Least Absolute Shrinkage and Selection Operator
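
The sparsity effect can be illustrated with a small sketch. The data is synthetic (made up here so that only 3 of the 10 features carry signal) and the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data, made up for illustration: only 3 of 10 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

n_nonzero = {}
for alpha in (0.01, 0.5, 100.0):
    model = Lasso(alpha=alpha).fit(X, y)
    # Count how many coefficients survived the L1 penalty
    n_nonzero[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {n_nonzero[alpha]} non-zero coefficients")
```

With a small penalty most coefficients stay non-zero; as the penalty grows, more of them are set exactly to zero, and a sufficiently large penalty zeroes out every coefficient.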

In [29]:
from sklearn.linear_model import LassoCV

# Use plain Lasso() when the alpha value to use is already known.

# Here eps is the ratio of the min to the max alpha value tried, and n_alphas is the number of alpha values between them.
# max_iter (left at its default here) caps the number of iterations the solver may take to converge.
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)

lasso_cv_model.fit(X_train, y_train)

In [30]:
lasso_cv_model.alpha_

0.4924531806474871

In [32]:
y_preds = lasso_cv_model.predict(X_test)

mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)

print(mae, rmse)

0.681145634283798 1.034912736547873


The performance of the model has dropped compared with the L2 regularisation, but as we see below, most of the coefficients are zero, which means the model uses only two features for its predictions and is a lot more interpretable. The model complexity has also been severely reduced.

We may be able to improve the performance by searching over more alpha values. We can do this by lowering the epsilon value, which extends the search to smaller alphas, and by increasing the maximum number of iterations so that the solver has room to converge.
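
The effect of epsilon on the search can be checked directly. This sketch uses synthetic data (made up for illustration, not the Advertising data) and relies on the fitted `alphas_` attribute of `LassoCV`, which holds the grid of alpha values that were tried.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

grids = {}
chosen = {}
for eps in (0.1, 0.001):
    # A larger max_iter gives the solver more room to converge at small alphas
    model = LassoCV(eps=eps, max_iter=100_000, cv=5).fit(X, y)
    grids[eps] = model.alphas_
    chosen[eps] = model.alpha_
    ratio = model.alphas_.min() / model.alphas_.max()
    print(f"eps={eps}: min/max alpha ratio={ratio:.4f}, chosen alpha={model.alpha_:.4f}")
```

A smaller `eps` stretches the grid towards weaker penalties, which lets cross validation pick a smaller alpha if that scores better.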

In [33]:
print(lasso_cv_model.coef_)

[0.97675148 0.         0.         0.         3.8148913  0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.        ]


### Elastic Net
*(L1 and L2 Regularisation)*

As discussed above, this combines L1 and L2 approaches.

In [34]:
from sklearn.linear_model import ElasticNetCV

# l1_ratio is the L1/L2 mix: 1 is pure L1 (Lasso), values near 0 lean towards L2 (Ridge)
elasticNet_model = ElasticNetCV(l1_ratio=[.1,.5,.7,.9,.95,.99,1], eps=0.001, n_alphas=100, max_iter=1000000)

elasticNet_model.fit(X_train, y_train)

In [39]:
y_preds = elasticNet_model.predict(X_test)

In [40]:
mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)

print(mae, rmse)

0.5123045552899853 0.6308043049172916


In [41]:
elasticNet_model.coef_

array([ 5.15048089,  0.4274257 ,  0.29684446, -4.53337994,  3.38937185,
       -0.4288993 ,  0.        ,  0.        ,  0.        ,  1.17891049,
       -0.        ,  0.        ,  0.16706037, -0.        ,  0.        ,
        0.        ,  0.11083672,  0.        ,  0.06155549])

It's great to understand conceptually what each regularisation method does.
The tip is:
- It is usually a good default to use **ElasticNet** with a list of L1 ratio values that spans from leaning heavily towards L2 to pure L1, and then let K-fold cross validation choose the penalty mix that performs best.
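
As a sketch of how to see which mix was chosen, the fitted `ElasticNetCV` exposes `l1_ratio_` and `alpha_`. The data below is synthetic and made up for illustration, so the chosen values say nothing about the Advertising data.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Candidate mixes, from mostly L2 (0.1) to pure L1 (1)
l1_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1]
model = ElasticNetCV(l1_ratio=l1_ratios, cv=5, max_iter=100_000).fit(X, y)

print("chosen l1_ratio:", model.l1_ratio_)
print("chosen alpha:", model.alpha_)
```

A chosen `l1_ratio_` of 1 means cross validation preferred a pure Lasso penalty; a value near the low end means it preferred something closer to Ridge.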

## Recap

- We have data; we look at the shape and type of the data (categorical, continuous, etc.)
  - We clean the data so that there are no missing values, etc.
  - We can plot the data and see if there are any patterns (if the number of features is limited)
- We perform a train/test split
  - We standardise the features, fitting the scaler on the training data only (this is the only scaling method we have used so far)
- We use vanilla linear regression first, and look at the performance metrics MAE, MSE and RMSE.
  - We also plot the residuals against y_test
  - If we see any patterns there, we can think of polynomial regression
- We can try out polynomial regression using different combinations of higher-order features and their interaction terms using PolynomialFeatures.
  - We again train a linear regression model using the new set of features.
  - We compare the results with the earlier linear regression; if there is still hope of improving the results without overfitting, we can take a look at regularisation methods.
- As a default, go with the Elastic Net approach and let cross validation decide which penalty mix works better.