# Ridge

With only a few points in each dimension, the straight line fitted by linear regression follows those points as closely as it can, so noise in the observations causes large variance, as shown in the first plot. The slope of the fitted line can change considerably from one fit to the next because of that noise.

Ridge regression minimizes a penalised version of the least-squares objective. The penalty shrinks the values of the regression coefficients. Despite the few data points in each dimension, the slope of the prediction is much more stable, and the variance of the fitted line is greatly reduced compared with standard linear regression.
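The contrast between the two estimators can be checked numerically. The following is a minimal sketch, not the notebook's own plot: it assumes two training points on the line y = x, a noise level of 0.1, and `alpha=0.1`, all chosen only for illustration. It refits each model on noisy copies of the inputs and compares how much the fitted slope moves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two training points on the line y = x (illustrative values, not from the notebook).
X_train = np.array([[0.5], [1.0]])
y_train = np.array([0.5, 1.0])

def slope_spread(model, n_repeats=200, noise=0.1):
    """Refit `model` on noisy copies of X_train; return the std of the fitted slope."""
    slopes = []
    for _ in range(n_repeats):
        X_noisy = X_train + noise * rng.normal(size=X_train.shape)
        slopes.append(model.fit(X_noisy, y_train).coef_[0])
    return float(np.std(slopes))

ols_spread = slope_spread(LinearRegression())
ridge_spread = slope_spread(Ridge(alpha=0.1))

print(f"OLS slope std:   {ols_spread:.3f}")
print(f"Ridge slope std: {ridge_spread:.3f}")
```

Under these assumptions the ridge slope should vary far less than the ordinary least-squares slope, at the cost of being biased toward zero.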

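Ridge regression minimizes the objective RSS + α × (sum of squares of the coefficients), where RSS is the residual sum of squares. Here, α (alpha) is the parameter that balances the emphasis given to minimizing the RSS against minimizing the sum of squares of the coefficients. α can take various values: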

    α = 0:
        The objective becomes the same as in simple linear regression.
        We’ll get the same coefficients as simple linear regression.
    α = ∞:
        The coefficients will be zero. Why? Because of the infinite weight on the squares of the coefficients, any nonzero coefficient will make the objective infinite.
    0 < α < ∞:
        The magnitude of α decides the weight given to the two parts of the objective.
        The coefficients will be shrunk toward zero, with an overall magnitude between 0 and that of simple linear regression.
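The three regimes above can be sketched with scikit-learn's `Ridge`. The synthetic data, the true weights, and the specific alpha values are assumptions made for illustration. A tiny alpha stands in for α = 0 (the scikit-learn documentation advises using `LinearRegression` rather than `Ridge` with `alpha=0`), and a very large alpha stands in for α = ∞.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 3 features with known true weights.
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

ols_coef = LinearRegression().fit(X, y).coef_

def ridge_coef(alpha):
    """Fit ridge regression with the given penalty and return its coefficients."""
    return Ridge(alpha=alpha).fit(X, y).coef_

coef_tiny = ridge_coef(1e-10)  # stands in for alpha = 0
coef_mid = ridge_coef(10.0)    # 0 < alpha < infinity
coef_huge = ridge_coef(1e10)   # stands in for alpha = infinity

for name, coef in [("OLS", ols_coef), ("alpha=1e-10", coef_tiny),
                   ("alpha=10", coef_mid), ("alpha=1e10", coef_huge)]:
    print(f"{name:>12}: {np.round(coef, 4)}")
```

The tiny-alpha coefficients should match ordinary least squares, the huge-alpha coefficients should be essentially zero, and the intermediate fit should have a coefficient norm between the two.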


https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

In [3]:
from sklearn import linear_model

import numpy as np
import pandas as pd
import os
import sys
import plotly.graph_objects as go
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from erudition.learning.modules.GeneralizedLinearModels.helper import helper

pd.options.display.float_format = '{:,.2g}'.format
np.random.seed(42) 

# Prepare the data

The data generated below resembles a sine curve, but not exactly, because of the added noise. We’ll use it as an example to test different scenarios in this notebook. Let’s try to estimate the sine function using polynomial regression with powers of x from 1 to 19. To do that, we add a column to the dataframe for each power from 2 to 19 (the `x` column itself is power 1). This can be accomplished using the following code:

In [11]:
x = np.array([i*np.pi/180 for i in range(60,300,4)])
y = np.sin(x) + np.random.normal(0,0.15,len(x))

df = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])

# Add one column per power 2..19; the 'x' column already holds power 1
for i in range(2,20):
    colname = 'x_%d'%i
    df[colname] = df['x']**i

# Plot results

In [13]:
data = []

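# Noisy observations as yellow markers, built as a plain Plotly trace
data.append(go.Scatter(x=df.x, y=df.y, name='observations', mode='markers', marker=dict(color='yellow')))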

#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,20)]
ind = ['model_pow_%d'%i for i in range(1,20)]
coef_matrix = pd.DataFrame(index=ind, columns=col)

for i in range(2,20):
    plot, res = helper.regression(df, i, linear_model.LinearRegression())
    data.append(plot)
    coef_matrix.iloc[i-1,0:i+2] = res

fig = go.Figure(data=data)
fig.show()
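The `helper.regression` function used above comes from a local module that is not shown here. The sketch below is a hypothetical stand-in, not the author's implementation: it assumes the helper fits the model on the columns `x`, `x_2`, ..., `x_<power>` and returns the RSS, the intercept, and the coefficients, which is the row shape that `coef_matrix.iloc[i-1,0:i+2]` expects. The name `fit_power_model` is invented for illustration, and the plot trace that the real helper also returns is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_power_model(df, power, model):
    """Hypothetical stand-in: fit `model` on powers 1..power of x.

    Returns [rss, intercept, coef_x_1, ..., coef_x_<power>].
    """
    predictors = ['x'] + ['x_%d' % i for i in range(2, power + 1)]
    model.fit(df[predictors], df['y'])
    y_pred = model.predict(df[predictors])
    rss = float(np.sum((y_pred - df['y']) ** 2))
    return [rss, float(model.intercept_)] + [float(c) for c in model.coef_]

# Small demo frame built the same way as the notebook's df (powers up to 3 only).
rng = np.random.default_rng(42)
x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])
demo = pd.DataFrame({'x': x, 'y': np.sin(x) + rng.normal(0, 0.15, len(x))})
for i in range(2, 4):
    demo['x_%d' % i] = demo['x'] ** i

row_linear = fit_power_model(demo, 1, LinearRegression())
row_cubic = fit_power_model(demo, 3, LinearRegression())
print("power 1:", np.round(row_linear, 3))
print("power 3:", np.round(row_cubic, 3))
```

Each returned row has `power + 2` entries, and because the models are nested, the RSS cannot increase as the power grows.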

# View Coefficient Matrix

In [14]:
coef_matrix

model,rss,intercept,coef_x_1,coef_x_2,coef_x_3,coef_x_4,coef_x_5,coef_x_6,coef_x_7,coef_x_8,...,coef_x_10,coef_x_11,coef_x_12,coef_x_13,coef_x_14,coef_x_15,coef_x_16,coef_x_17,coef_x_18,coef_x_19
model_pow_1,,,,,,,,,,,...,,,,,,,,,,
model_pow_2,3.6,1.9,-0.61,-0.0018,,,,,,,...,,,,,,,,,,
model_pow_3,1.1,-1.2,3.2,-1.4,0.15,,,,,,...,,,,,,,,,,
model_pow_4,1.1,-1.7,4.0,-1.8,0.25,-0.0086,,,,,...,,,,,,,,,,
model_pow_5,1.1,-1.9,4.3,-2.0,0.33,-0.021,0.0008,,,,...,,,,,,,,,,
model_pow_6,1.1,-5.3,13.0,-10.0,4.5,-1.1,0.15,-0.008,,,...,,,,,,,,,,
model_pow_7,1.1,-13.0,35.0,-37.0,21.0,-7.2,1.4,-0.15,0.0065,,...,,,,,,,,,,
model_pow_8,1.1,38.0,-140.0,210.0,-170.0,80.0,-23.0,4.1,-0.4,0.016,...,,,,,,,,,,
model_pow_9,1.0,98.0,-370.0,580.0,-510.0,270.0,-94.0,21.0,-2.8,0.22,...,,,,,,,,,,
model_pow_10,1.0,170.0,-670.0,1100.0,-1100.0,660.0,-270.0,72.0,-13.0,1.5,...,0.0031,,,,,,,,,
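
The coefficient magnitudes in the table grow sharply with the polynomial degree, which is the behaviour the ridge penalty is meant to control. The following sketch compares the coefficient norm of an unpenalised fit and a ridge fit at degree 10 on data built like the notebook's. Several choices here are assumptions for illustration: the features are standardized with `StandardScaler` (so the numbers are not directly comparable to the raw-scale table above), `alpha=1e-2` is arbitrary, and the random draws differ from the notebook's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

# Same construction as the notebook's data: noisy sine on 60..296 degrees.
x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])
y = np.sin(x) + rng.normal(0, 0.15, len(x))
X = x.reshape(-1, 1)

def coef_norm(estimator, degree=10):
    """Fit `estimator` on standardized powers 1..degree of x; return the L2 norm of its coefficients."""
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        estimator,
    )
    pipe.fit(X, y)
    return float(np.linalg.norm(pipe.steps[-1][1].coef_))

ols_norm = coef_norm(LinearRegression())
ridge_norm = coef_norm(Ridge(alpha=1e-2))

print(f"OLS coefficient norm:   {ols_norm:.3g}")
print(f"Ridge coefficient norm: {ridge_norm:.3g}")
```

Under these assumptions the ridge coefficients should have a much smaller norm than the unpenalised ones, while both models are fitted to the same degree-10 features.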
