# What is Ridge Regression?

## Linear Regression and Least Squares

When we perform Linear Regression, we fit a line to the existing data by minimizing the total squared error.
- This scheme works very well if our training and testing data are sufficiently large
- But if our training data is rather small, driving the training error as low as possible can lead to overfitting
- To reduce this overfitting we can use Ridge Regression

## How can Ridge Regression help with this overfitting of training data?

Consider a simple dataset with just a single input feature x. It can be represented by the equation:

y = w + bx, where w is the intercept and b is the slope (a single 1x1 value)

In Linear Regression,
we choose the values of b and w that minimize the (total squared residual)

In Ridge Regression,
we choose the values of b and w that minimize the (total squared residual + lambda*slope^2), where the slope is b
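
Written out for n training points (x_i, y_i), the two objectives above look as follows. The index i and the count n are notation introduced here for illustration; w, b, and lambda are the same quantities as in the text.

$$
\text{Linear Regression:}\quad \min_{w,\,b}\ \sum_{i=1}^{n} \bigl(y_i - (w + b\,x_i)\bigr)^2
$$

$$
\text{Ridge Regression:}\quad \min_{w,\,b}\ \sum_{i=1}^{n} \bigl(y_i - (w + b\,x_i)\bigr)^2 + \lambda\, b^2
$$

Only the slope b appears in the penalty; the intercept w is left unpenalized.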

- Here the term lambda*slope^2 acts as a source of bias.
- The use of this bias term can be understood through a simple example:
    Assume we have just 2 points in our training set
    - Linear Regression would then give us a straight line passing exactly through both of these points.
    - Suppose the equation of this line is y = 3 + 2x
    - In the same scenario, Ridge Regression would give us a line with a smaller slope, which is close to both points but does not pass directly through them
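
The two-point scenario can be tried out directly. The sketch below is only an illustration: the points (1, 5) and (2, 7) are an assumption, chosen so that they lie exactly on y = 3 + 2x, and alpha=1.0 is an arbitrary choice of lambda.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two assumed training points that lie exactly on y = 3 + 2x.
X = np.array([[1.0], [2.0]])
y = np.array([5.0, 7.0])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # scikit-learn's alpha plays the role of lambda

print(f"linear: slope = {ols.coef_[0]:.3f}, intercept = {ols.intercept_:.3f}")
print(f"ridge:  slope = {ridge.coef_[0]:.3f}, intercept = {ridge.intercept_:.3f}")

# Residuals on the training points: zero for least squares, non-zero for ridge.
ols_residuals = y - ols.predict(X)
ridge_residuals = y - ridge.predict(X)
print("linear residuals:", ols_residuals)
print("ridge residuals: ", ridge_residuals)
```

The least-squares line recovers slope 2 and intercept 3 and has zero residuals, while the ridge line has a smaller slope and misses both points by a small amount.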
    

## How does this slope-based term add bias?

In general, if the slope of a line is large, small changes in the value of x (the input) result in large changes in the value of y (the output).

Applying the same logic:
- If our line is overfitting the data, it generally means that the line is too dependent on changes in the inputs
- By adding this extra positive term to the error calculation, we show a preference for lines with a smaller slope and thus less dependence on the inputs
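
For the single-feature model y = w + bx, this preference for smaller slopes can be made explicit. The following is a short worked result, assuming the intercept w is not penalized; the sample means x̄ and ȳ are notation introduced here. Setting the derivatives of (total squared residual + lambda*b^2) with respect to w and b to zero gives:

$$
w = \bar{y} - b\,\bar{x}, \qquad
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2 + \lambda}
$$

With lambda = 0 this is the ordinary least-squares slope. Any lambda > 0 enlarges the denominator and therefore pulls the slope toward zero.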

### What is lambda?

NOTE: lambda is a non-negative value (0 <= lambda < +infinity). With lambda = 0, Ridge Regression reduces to ordinary Linear Regression.

We can reason about lambda much as we did for the slope:
- Because we are adding a non-negative value to our final error score, the goal becomes minimizing the error including this additional term
- If we increase lambda, for example from 1 to 2, the bias we are introducing increases further.
- This means that as lambda increases, the slope of the final line decreases
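
A small numerical sketch of this effect follows. It uses the closed-form slope for the single-feature case with an unpenalized intercept, b = Sxy / (Sxx + lambda); the helper name `ridge_slope` and the toy data are hypothetical and exist only for this illustration.

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Slope of the single-feature ridge fit with an unpenalized intercept."""
    xc = x - x.mean()  # centered inputs
    yc = y - y.mean()  # centered targets
    return float(xc @ yc / (xc @ xc + lam))

# Hypothetical toy data, roughly following y = 3 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

lambdas = [0.0, 1.0, 2.0, 10.0, 100.0]
slopes = [ridge_slope(x, y, lam) for lam in lambdas]
for lam, slope in zip(lambdas, slopes):
    print(f"lambda = {lam:6.1f} -> slope = {slope:.4f}")
```

The printed slopes shrink steadily as lambda grows, approaching zero without ever reaching it.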



### How can we determine the best value of lambda (a hyperparameter)?

Lambda is tuned like any other hyperparameter:
- We can use cross-validation to find the best value of lambda for our problem
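
As a rough sketch of what such a search can look like, the loop below scores a few candidate lambda values with 5-fold cross-validation and keeps the best one. The synthetic data, the candidate grid, and the choice of mean squared error as the score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: 30 samples, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=30)

candidate_lambdas = np.logspace(-3, 3, 7)
mean_scores = []
for lam in candidate_lambdas:
    # 5-fold CV; scores are negated MSE, so higher (closer to 0) is better.
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    mean_scores.append(scores.mean())

best_lambda = candidate_lambdas[int(np.argmax(mean_scores))]
print(f"best lambda = {best_lambda}")
```

The value selected depends on the data and on the grid, so in practice the grid is usually chosen on a logarithmic scale and refined around the best candidate.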

## Where can we use Ridge Regression?

Ridge Regression can be used for routine Linear Regression cases and also for Classification.
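
For the classification case, scikit-learn provides `RidgeClassifier`, which according to its documentation converts binary targets to {-1, 1} and then treats the problem as a regression task. A minimal sketch on hypothetical one-feature data:

```python
from sklearn.linear_model import RidgeClassifier

# Hypothetical toy data: one feature, two classes split around x = 1.5.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = RidgeClassifier(alpha=1.0).fit(X, y)  # alpha plays the role of lambda

# Predict the class of one point on each side of the split.
predictions = clf.predict([[0.5], [2.5]])
print(predictions)
```

The predicted class is determined by the sign of the fitted regression output, so the point below the split is assigned class 0 and the point above it class 1.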

## Source:  https://www.youtube.com/watch?v=Q81RR3yKn30

# Example

In [3]:
from sklearn import linear_model

In [4]:
# NOTE: alpha below is lambda
reg = linear_model.Ridge(alpha=.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

Ridge(alpha=0.5)

In [5]:
print(f'coefficients = {reg.coef_}, intercept = {reg.intercept_}')

coefficients = [0.34545455 0.34545455], intercept = 0.13636363636363638


## Finding the best lambda value for our problem

In [6]:
import numpy as np

In [8]:
reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))

In [9]:
print(f'alpha = {reg.alpha_}, coefficients = {reg.coef_}, intercept = {reg.intercept_}')

alpha = 0.01, coefficients = [0.47146402 0.47146402], intercept = 0.052357320099276905


# Further Reading

- http://cbcl.mit.edu/publications/ps/MIT-CSAIL-TR-2007-025.pdf
- https://www.mit.edu/~9.520/spring07/Classes/rlsslides.pdf