# ___Regularization___
-------------------------

In [2]:
# regularization is used to mitigate overfitting in ML models
# there are in fact several approaches to reduce overfitting in ML models

# 1) get more training data
# 2) leverage feature engineering, cherrypick the most pertinent features and try culling noise and redundancy
# 3) apply regularization

In [3]:
# one drawback of feature selection is that we might lose some valuable information in the training data
# it'd be better if we could control how much influence each feature exerts on the prediction, so we could selectively minimize the influence
# of less relevant predictors

In [4]:
# imagine the following model,

# ___$f(x) = 23.76 x_0 - 879.35 x_2 + 35.987 x_3^2 - 5.76 x_4^{4.564} + 564.763$___

In [5]:
# when we use feature selection, we remove a feature completely
# this is equivalent to setting its coefficient to 0.00
# e.g., if we wanted to remove the feature x_4, we'd change its coefficient from -5.76 to 0.00

# ___$f(x) = 23.76 x_0 - 879.35 x_2 + 35.987 x_3^2 - 0.00 x_4^{4.564} + 564.763$___
# ___$f(x) = 23.76 x_0 - 879.35 x_2 + 35.987 x_3^2 + 564.763$___
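
# a minimal sketch of the equivalence above, assuming NumPy and made-up (hypothetical) positive input values
# the names `model`, `model_without_x4` and `x` are illustrative only; zeroing the x_4 coefficient gives the same prediction as dropping the term

```python
import numpy as np

def model(x, w4=-5.76):
    """Evaluate the example model; x is indexed x[0]..x[4] (x[1] is unused)."""
    return (23.76 * x[0] - 879.35 * x[2] + 35.987 * x[3] ** 2
            + w4 * x[4] ** 4.564 + 564.763)

def model_without_x4(x):
    """The same model with the x_4 term removed entirely."""
    return 23.76 * x[0] - 879.35 * x[2] + 35.987 * x[3] ** 2 + 564.763

# hypothetical inputs, kept positive so the fractional power stays real-valued
x = np.array([1.0, 0.0, 0.5, 2.0, 1.5])

full_prediction = model(x)                # x_4 contributes with weight -5.76
zeroed_prediction = model(x, w4=0.0)      # coefficient of x_4 set to 0.00
dropped_prediction = model_without_x4(x)  # x_4 removed from the formula

print(full_prediction, zeroed_prediction, dropped_prediction)
```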

In [6]:
# regularization helps us minimize the influence of such features without completely eliminating them!
# regularization forces the model to shrink the coefficients without requiring them to be set to 0.00

In [7]:
# regularization can be applied to both the weights and the bias term, but conventionally it is used mainly for the weight terms!

In [8]:
# imagine a situation where we have 2 polynomial models,

# ___$f_{\overrightarrow{w}, b}(\overrightarrow{x}) = w_1x_1 + w_2x_2^2 + w_3x_3^3 + b  \rightarrow (1)$___
# ___$f_{\overrightarrow{w}, b}(\overrightarrow{x}) = w_1x_1 + w_2x_2^2 + w_3x_3^3 + w_4x_4^{4.15} + w_5x_5^{6.54} + b \rightarrow (2)$___

In [9]:
# let's say that model (1) makes poor predictions due to underfitting and model (2) makes poor predictions due to overfitting
# this means the features x1, x2, x3 alone are not enough
# but adding x4 and x5 is too much

In [10]:
# we need to come up with a strategy that could minimize the influence of x4 and x5 in our predictions!
# consider the following cost function,

# ___$j(\overrightarrow{w}, b) = \frac{1}{2N} \sum_{i = 1}^{N} (f_{\overrightarrow{w}, b}(\overrightarrow{x_i}) - y_i)^2 + 1000 w_4^2 + 4587 w_5^2$___

In [12]:
# this cost function taxes the model whenever w_4 and w_5 are large: these weights get squared and multiplied by very large numbers,
# so the cost becomes very high
# minimizing the cost therefore pushes the model towards smaller values for w_4 and w_5!
# hence, the model will settle for very small yet non-zero weights for these higher order polynomial terms!
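
# a rough sketch of the targeted penalty above, assuming NumPy and random (made-up) data
# the names `targeted_penalty`, `targeted_cost`, `w_small` and `w_large` are hypothetical, and the columns of X are assumed to already hold the transformed features

```python
import numpy as np

def targeted_penalty(w):
    """Penalty on w_4 and w_5 only; with 0-based indexing these are w[3] and w[4]."""
    return 1000 * w[3] ** 2 + 4587 * w[4] ** 2

def targeted_cost(X, y, w, b):
    """Halved mean squared error plus the targeted penalty."""
    n = len(y)
    residuals = X @ w + b - y
    return np.sum(residuals ** 2) / (2 * n) + targeted_penalty(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # made-up feature matrix with 5 columns
y = rng.normal(size=50)        # made-up targets

w_small = np.array([1.0, 1.0, 1.0, 0.01, 0.01])  # tiny w_4 and w_5
w_large = np.array([1.0, 1.0, 1.0, 1.0, 1.0])    # large w_4 and w_5

print(targeted_penalty(w_small), targeted_penalty(w_large))
print(targeted_cost(X, y, w_small, 0.0), targeted_cost(X, y, w_large, 0.0))
```

# the penalty grows with the square of the weight, so a weight 100 times larger is taxed 10,000 times more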

In [13]:
# in practice, we usually do not know in advance which features to penalize, so models do not cherrypick features for penalization,
# the coefficients of all features are penalized during gradient descent!

In [15]:
# imagine a model with M features and N records,

# ___$j(\overrightarrow{w}, b) = \frac{1}{2N} \sum_{i = 1}^{N} (f_{\overrightarrow{w}, b}(\overrightarrow{x_i}) - y_i)^2 + \frac{\lambda}{2N} \sum_{k = 1}^M w_k^2$___

In [19]:
# lambda is called the regularization parameter
# similar to the learning rate alpha, lambda needs to be chosen by the user

In [20]:
# both the squared error term and the regularization term get scaled by 1/(2N)
# using the same denominator keeps the two terms on a comparable scale
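
# a minimal sketch of the regularized cost for a linear model, assuming NumPy
# the function name `regularized_cost`, the argument name `lam` (for lambda) and the tiny dataset are made up for illustration

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Squared error cost plus an L2 penalty on the weights, both scaled by 1/(2N)."""
    n = X.shape[0]                                # N records
    residuals = X @ w + b - y                     # f(x_i) - y_i for every record
    data_term = np.sum(residuals ** 2) / (2 * n)
    reg_term = lam / (2 * n) * np.sum(w ** 2)     # the bias b is left out of the penalty
    return data_term + reg_term

# made-up data that the weights fit exactly, so only the penalty contributes
X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([1.0, 2.0])
b = 0.0
y = X @ w + b

cost_without_reg = regularized_cost(X, y, w, b, lam=0.0)
cost_with_reg = regularized_cost(X, y, w, b, lam=4.0)
print(cost_without_reg, cost_with_reg)
```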

In [18]:
# in the above equation, we chose not to penalize the bias term!

# ___$\frac{\lambda}{2N} \sum_{k = 1}^M w_k^2$___

In [21]:
# the above is called the regularization term
# when lambda is 0, there will be no regularization and the model will fall back to overfitting the data
# with an extremely high lambda, all the weights will be reduced to near zero values and the model may start to underfit the data!
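
# a rough sketch of how lambda affects the learned weights, assuming NumPy, a plain linear model and synthetic (made-up) data
# `fit_regularized_gd`, its default `alpha` and `n_iters`, and the lambda values tried here are hypothetical choices, not tuned recommendations

```python
import numpy as np

def fit_regularized_gd(X, y, lam, alpha=0.05, n_iters=2000):
    """Batch gradient descent on the L2-regularized squared error cost."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(n_iters):
        residuals = X @ w + b - y
        grad_w = X.T @ residuals / n + (lam / n) * w  # penalty adds (lambda / N) * w_k
        grad_b = residuals.mean()                     # bias is not penalized
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_w = np.array([3.0, -2.0, 0.5])                   # made-up ground truth
y = X @ true_w + 1.0 + rng.normal(scale=0.1, size=100)

# fit once per lambda and record the weights and the size of the weight vector
weights = {lam: fit_regularized_gd(X, y, lam)[0] for lam in (0.0, 10.0, 1000.0)}
norms = {lam: np.linalg.norm(w) for lam, w in weights.items()}
for lam, norm in norms.items():
    print(f"lambda = {lam:>6}: ||w|| = {norm:.3f}")
```

# the weight vector shrinks as lambda grows, but the weights stay non-zero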