# ___Regularized Linear Regression___
------------------

In [1]:
# this is the cost function that we have previously deduced for regularized regression,

# ___$j(\overrightarrow{w},b) = \frac{1}{2N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i)^2 + \frac{\lambda}{2N} \sum_{p = 1}^M w_p^2$___

In [3]:
# N - number of records
# M - number of features
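
# a minimal sketch of the cost function above, assuming NumPy and a feature matrix X of shape (N, M)
# the name regularized_cost and its argument order are illustrative assumptions, not something defined earlier in these notes

```python
import numpy as np


def regularized_cost(X, y, w, b, _lambda):
    """Squared-error cost with an L2 penalty on the weights (b is not penalized)."""
    N = X.shape[0]                                   # number of records
    residuals = X @ w + b - y                        # f_{w,b}(x_i) - y_i for every record
    error_term = np.sum(residuals ** 2) / (2 * N)    # (1 / 2N) * sum of squared errors
    penalty = (_lambda / (2 * N)) * np.sum(w ** 2)   # (lambda / 2N) * sum of w_p^2
    return error_term + penalty
```

# with _lambda = 0 this reduces to the ordinary (non-regularized) linear regression cost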

In [4]:
# gradient descent for regularized linear regression uses exactly the same update rule as gradient descent for linear regression
# but we'll be using a different (regularized) cost function

# ___$w_p = w_p - \alpha \cdot \frac{\partial}{\partial{w_p}} j(\overrightarrow{w}, b) $___
# ___$b = b - \alpha \cdot \frac{\partial}{\partial{b}} j(\overrightarrow{w}, b) $___

In [6]:
# repeat until convergence!

# ___$w_p = w_p - \alpha \frac{\partial}{\partial{w_p}} \{\frac{1}{2N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i)^2 + \frac{\lambda}{2N} \sum_{p = 1}^M w_p^2\} $___
# ___$b = b - \alpha \frac{\partial}{\partial{b}} \{\frac{1}{2N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i)^2 + \frac{\lambda}{2N} \sum_{p = 1}^M w_p^2\} $___

In [7]:
# after solving the derivatives (x_{i,p} is the p-th feature of the i-th record),

# ___$w_p = w_p - \alpha (\frac{1}{N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i) x_{i,p} + \frac{\lambda}{N} w_p) $___
# ___$b = b - \alpha \frac{1}{N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i) $___
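
# the two update rules above can be sketched as a single vectorized step, assuming NumPy and X of shape (N, M)
# gradient_step is a hypothetical helper name; all M weights are updated at once instead of looping over p

```python
import numpy as np


def gradient_step(X, y, w, b, alpha, _lambda):
    """One simultaneous gradient descent update of w and b for the regularized cost."""
    N = X.shape[0]
    residuals = X @ w + b - y                        # f_{w,b}(x_i) - y_i
    dj_dw = X.T @ residuals / N + (_lambda / N) * w  # data gradient + regularization gradient
    dj_db = residuals.sum() / N                      # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```

# "repeat until convergence" then amounts to calling this in a loop until the cost stops decreasing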

In [11]:
# we could also rewrite the regularized weight update term as follows,

# ___$w_p = w_p - \alpha \frac{1}{N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i) x_{i,p} - \alpha \frac{\lambda}{N} w_p $___
# ___$w_p = w_p - \alpha \frac{\lambda}{N} w_p - \alpha \frac{1}{N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i) x_{i,p} $___
# ___$w_p = w_p(1  - \alpha \frac{\lambda}{N}) - \alpha \frac{1}{N}\sum_{i = 1}^N (f_{\overrightarrow{w},b}(\overrightarrow{x_i}) - y_i) x_{i,p} $___

In [12]:
# the rightmost term in the last equation is the update term used in non-regularized linear regression!
# hence, the only change is that instead of subtracting this term from w_p itself, we are subtracting it from w_p(1 - (alpha x lambda / N))
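
# a quick numerical check that the original and the rearranged weight updates agree, assuming NumPy
# both_update_forms is a hypothetical helper written only for this comparison

```python
import numpy as np


def both_update_forms(X, y, w, b, alpha, _lambda):
    """Return the new weights computed with the original and the rearranged update."""
    N = X.shape[0]
    # update term of non-regularized linear regression
    plain_term = alpha * (X.T @ (X @ w + b - y)) / N
    original_form = w - plain_term - alpha * (_lambda / N) * w
    rearranged_form = w * (1 - alpha * _lambda / N) - plain_term
    return original_form, rearranged_form


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)
w_demo = rng.normal(size=3)

form_a, form_b = both_update_forms(X_demo, y_demo, w_demo, 0.5, 0.1, 2.0)
print(np.allclose(form_a, form_b))
```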

In [13]:
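# the factor (1 - alpha x lambda / N) dictates how strongly the weight is shrunk on every update
# since alpha, lambda and N are all positive, this factor is always < 1.0 (it can never exceed 1.0)
# if it is very close to 1.0, then the weight is barely reduced before the derivative is subtracted
# the further it falls below 1.0, the more the weight shrinks on every iteration; the impact compounds and the weight is penalized more!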

In [14]:
# say that our training dataset has 1_000_000_000 records
N = 1_000_000_000

# and our alpha of choice is 0.0175
alpha = 0.0175

# and we use a regularization parameter 5.0254
_lambda = 5.0254

In [15]:
1 - (alpha * _lambda / N)

0.9999999999120555

In [22]:
# so instead of subtracting the derivative from w_p, we'll be subtracting it from,

f"{(1 - (alpha * _lambda / N)):.10f} w_p"

'0.9999999999 w_p'

In [23]:
# which introduces a tiny additional reduction of the weight on every step (roughly 8.8e-11 of its value with these settings)

In [37]:
# with a higher lambda and alpha, the shrinkage per step becomes much larger

alpha = 0.25
_lambda = 1_000_000

1 - (alpha * _lambda / N)

0.99975
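
# because the factor is applied on every iteration, its effect compounds over the course of training
# a rough sketch of that compounding, ignoring the derivative term entirely (cumulative_shrinkage is a hypothetical helper)

```python
def cumulative_shrinkage(alpha, _lambda, N, n_steps):
    """Factor by which a weight is scaled after n_steps updates, ignoring the derivative term."""
    return (1 - alpha * _lambda / N) ** n_steps


# same settings as the cell above: alpha = 0.25, lambda = 1_000_000, N = 1_000_000_000
for n_steps in (1, 1_000, 100_000):
    print(n_steps, cumulative_shrinkage(0.25, 1_000_000, 1_000_000_000, n_steps))
```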