## Regularisation

Regularisation adds a penalty term to the loss function in order to address a few common model issues:
- limiting model complexity
- reducing model overfitting

It does this at a price: it introduces some additional bias, and it requires a search for the **optimal penalty hyperparameter**.

There are three main types of Regularisation:
1. L1 Regularisation: adds a penalty proportional to the sum of the absolute values of the coefficients. It limits the size of the coefficients and yields sparse models in which some coefficients become exactly zero, so the corresponding features are effectively dropped from the model.
   1. LASSO Regression
2. L2 Regularisation: adds a penalty proportional to the sum of the squared coefficients. All coefficients are shrunk towards zero (by the same factor only in special cases, such as orthonormal features), but none is necessarily eliminated as with L1.
   1. Ridge Regression
3. Combining L1 and L2: adds both penalties, with a mixing parameter that decides the ratio between L1 and L2. This parameter is often called alpha; scikit-learn calls it `l1_ratio` and uses `alpha` for the overall penalty strength.
   1. Elastic Net
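
As a rough illustration of the three penalties above, the sketch below fits scikit-learn's `Lasso`, `Ridge` and `ElasticNet` to synthetic data and counts how many coefficients each model sets to exactly zero. The data, the `alpha` values and the `l1_ratio` value are arbitrary assumptions chosen for the example, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features, only the first 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# `alpha` is the penalty strength in scikit-learn; the values here are arbitrary.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
# `l1_ratio` sets the mix of penalties: 1.0 is pure L1, 0.0 is pure L2.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

for name, model in [("lasso", lasso), ("ridge", ridge), ("elastic net", enet)]:
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: {n_zero} coefficients are exactly zero")
```

On data like this, the L1-penalised models would be expected to zero out uninformative features, while Ridge only shrinks them.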


These regularisation approaches come with a cost:
- each introduces an additional hyperparameter that needs to be tuned
- this hyperparameter is a multiplier on the penalty term that decides the strength of the penalty.
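
One common way to tune the penalty strength is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` with `Ridge`; the synthetic data, the candidate `alpha` grid and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Candidate penalty strengths; the grid is an arbitrary choice.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation scores each candidate on held-out folds.
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
```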


## Feature Scaling

Feature Scaling provides many benefits to our ML process.
Some ML models that rely on distance metrics (for example k-nearest neighbours) **require** scaling to perform well.

The main idea: scaling improves the convergence of steepest descent (gradient descent) algorithms, which are not scale invariant.
What this essentially means is that when features are measured in different units, their scales can differ widely, and the coefficients of some features are then affected more than others during training purely because of those scale differences. To avoid that, we rescale the features before training so that the coefficients are affected equally and their updates are not proportional to the scale of the features.
- Scaling also impacts the interpretability of the coefficients: they are fitted on the scaled features, so they can no longer be read directly in the units of the original unscaled features.
- Having said that, there are some ML algorithms where scaling has little or no impact (for instance decision trees, random forests, etc.). These models split on feature thresholds rather than relying on gradient descent or distance metrics.
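
The convergence point above can be sketched with plain gradient descent on a least-squares loss. The helper `gd_loss`, the synthetic data and the step-size rule are assumptions made for this example; it compares the loss reached after the same number of steps on raw and on standardised features.

```python
import numpy as np

def gd_loss(X, y, n_steps=100):
    """Run plain gradient descent on the squared-error loss and return the final loss."""
    n = len(y)
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2) / 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)            # feature on a unit scale
x2 = rng.normal(size=500) * 100.0    # feature on a scale 100x larger
X_raw = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.03 * x2 + rng.normal(scale=0.1, size=500)
y = y - y.mean()  # centre the target so no intercept is needed

# Standardise each column: subtract its mean, divide by its standard deviation.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

loss_raw = gd_loss(X_raw, y)
loss_std = gd_loss(X_std, y)
print(f"loss after 100 steps, raw features:    {loss_raw:.4f}")
print(f"loss after 100 steps, scaled features: {loss_std:.4f}")
```

With raw features the safe step size is dictated by the large-scale feature, so the coefficient of the small-scale feature barely moves; after standardisation both coefficients converge at a similar rate.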

### Methods of feature scaling

There are two main ways to scale features:
1. Standardisation: rescales each feature to have a mean of 0 and a standard deviation of 1
2. Normalisation (min-max scaling): rescales each feature so that its values lie between 0 and 1
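
A minimal sketch of both methods, using scikit-learn's `StandardScaler` and `MinMaxScaler` on a small made-up array (the data values are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardisation: (x - mean) / std, computed per column.
X_standardised = StandardScaler().fit_transform(X)

# Normalisation (min-max): (x - min) / (max - min), computed per column.
X_normalised = MinMaxScaler().fit_transform(X)

print(X_standardised.mean(axis=0), X_standardised.std(axis=0))
print(X_normalised.min(axis=0), X_normalised.max(axis=0))
```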

To apply these methods to our data, we can use the scalers from sklearn (for example `StandardScaler` and `MinMaxScaler`).

When we use an sklearn scaler and call its `fit()` method, we are only calculating the statistical properties of the data that are needed for feature scaling. The data is rescaled only when we call the `transform()` method.
- Because of this very important distinction, whenever we do feature scaling we should `fit()` on the training data only, and then `transform()` both the training and the test data with the fitted scaler. We do not want to assume anything about the test data when we are training.
  - If we accidentally use the test data in the `fit()` call, that causes something called **data leakage.**
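
The fit-on-train, transform-both pattern described above can be sketched as follows. The random data, the split proportions and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data (an assumption for illustration): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
scaler.fit(X_train)                         # statistics come from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuses the training mean and std
```

Because the scaler never sees the test set during `fit()`, the scaled test features will generally not have a mean of exactly 0, which is expected.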

## Cross Validation
