# Hands On - Ridge Regressions for more accurate Bordeaux Equations

## Import Data & Check Structure



In [1]:
import pandas as pd
wine = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/wine_regression.csv")

## Separate Features and Targets

In [2]:
X = wine.drop("price", axis = 1)
X = X.drop("vintage", axis = 1)
y = wine["price"]

# Compare Linear and Ridge Regression

## Linear Regression

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.12, shuffle=False, random_state=11)
linreg.fit(X_train,y_train)

y_pred = linreg.predict(X_test)
mean_squared_error(y_test, y_pred) ** 0.5 # RMSE; recent scikit-learn releases no longer accept squared=False

17.394945061631525

## Ridge Regression

The only change we have to make is to import Ridge instead of LinearRegression. The train/test split and the error metric stay the same.

Ridge Regression has a penalty parameter called alpha. The higher alpha, the stronger the regularization. The penalty acts on the size of the coefficients, so a larger alpha pushes harder against overfitting and against the unstable coefficients that highly correlated features tend to produce.
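
Written out, the objective that Ridge minimizes in scikit-learn's formulation is the squared error plus alpha times the squared size of the coefficients. The symbols are notation introduced here for illustration: $w$ for the coefficients, $b$ for the intercept, $x_i$ and $y_i$ for the features and price of observation $i$, and $p$ for the number of features.

$$
\min_{w,\,b} \; \sum_{i=1}^{n} \left(y_i - b - x_i^\top w\right)^2 \;+\; \alpha \sum_{j=1}^{p} w_j^2
$$

The intercept $b$ is not penalized. The larger $\alpha$ is, the more a large coefficient costs, so the fitted coefficients are pulled toward zero.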

If we set alpha = 0, Ridge Regression is equivalent to Linear Regression (no penalty).
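
The equivalence at alpha = 0 can be checked by hand with a few lines of NumPy. The following is a minimal sketch, not scikit-learn's actual implementation: `ridge_closed_form` is a hypothetical helper that solves the penalized normal equations directly, and the data is synthetic (two strongly correlated features made up for illustration), not the wine data.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - b - Xw||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centering X and y takes the intercept out of the problem,
    # so the penalty is applied to the slopes only.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Normal equations with alpha added on the diagonal.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data (made up for illustration): x2 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)
x2 = x1 + 0.05 * rng.normal(size=40)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(size=40)

w_ols, _ = ridge_closed_form(X_demo, y_demo, alpha=0.0)
w_ridge, _ = ridge_closed_form(X_demo, y_demo, alpha=20.0)

# Ordinary least squares via lstsq, to compare against alpha = 0.
A = np.column_stack([np.ones(len(y_demo)), X_demo])
beta_ols = np.linalg.lstsq(A, y_demo, rcond=None)[0]

print("alpha=0  coefficients:", w_ols)
print("lstsq    coefficients:", beta_ols[1:])
print("alpha=20 coefficients:", w_ridge)
```

With alpha = 0 the two sets of coefficients agree up to floating-point error, while alpha = 20 yields a coefficient vector of smaller overall size.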

In [4]:
from sklearn.linear_model import Ridge #import ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

ridgereg = Ridge(alpha=0) # set alpha to 0 > equivalent to linear model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.12, shuffle=False, random_state=11)
ridgereg.fit(X_train,y_train)

y_pred = ridgereg.predict(X_test)
mean_squared_error(y_test, y_pred) ** 0.5 # RMSE; recent scikit-learn releases no longer accept squared=False

17.394945061631613

## Compare Regression Coefficients

Let's define a small helper function that helps us interpret the regression coefficients!


In [5]:
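def get_coefs(model, X_train):
    """Return a table with one row per feature and its fitted coefficient."""
    # The first column holds the feature names (pandas labels it 0)
    coef_table = pd.DataFrame(list(X_train.columns)).copy()
    # Append the coefficients as the last column, named "Coefs"
    coef_table.insert(len(coef_table.columns), "Coefs", model.coef_.transpose())
    return coef_table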

Regression Coefficients for the Linear Regression:

In [6]:
get_coefs(linreg, X_train)

|   | 0 | Coefs |
|---|---|-------|
| 0 | winter.rain | 0.043272 |
| 1 | harvest.rain | -0.058125 |
| 2 | grow.temp | 21.353799 |
| 3 | harvest.temp | 8.214618 |
| 4 | purchasing.power | -0.002755 |
| 5 | age | 0.138745 |


The Ridge Regression with alpha = 0 yields identical regression coefficients:

In [7]:
get_coefs(ridgereg, X_train)

|   | 0 | Coefs |
|---|---|-------|
| 0 | winter.rain | 0.043272 |
| 1 | harvest.rain | -0.058125 |
| 2 | grow.temp | 21.353799 |
| 3 | harvest.temp | 8.214618 |
| 4 | purchasing.power | -0.002755 |
| 5 | age | 0.138745 |


## Let's Penalize our Regression: Increasing alpha to 20

In [8]:
ridgereg = Ridge(alpha=20) # raise alpha from 0 to 20 (alpha must be >= 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.12, shuffle=False, random_state=11)
ridgereg.fit(X_train,y_train)

y_pred = ridgereg.predict(X_test)
mean_squared_error(y_test, y_pred) ** 0.5 # RMSE; recent scikit-learn releases no longer accept squared=False

10.374894628961144

By increasing alpha, we put a stronger penalty on large coefficients, which mostly affects variables that are highly correlated with other variables and that drive overfitting. The result is a more accurate model: the RMSE on the test set drops from 17.39 to 10.37!

In [9]:
get_coefs(ridgereg, X_train)

|   | 0 | Coefs |
|---|---|-------|
| 0 | winter.rain | 0.039519 |
| 1 | harvest.rain | -0.0208 |
| 2 | grow.temp | 8.979948 |
| 3 | harvest.temp | 6.976016 |
| 4 | purchasing.power | -0.001612 |
| 5 | age | 0.554995 |


The regression coefficients have changed quite drastically! Look at grow.temp: its coefficient has decreased from 21.35 in the linear regression to 8.98 in the Ridge Regression model!
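
Changes like this are easier to spot when both sets of coefficients sit in one table. The sketch below is one possible way to do that: `compare_coefs` is a hypothetical helper (not part of the notebook above), and the demo runs on synthetic data with made-up feature names so that it is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

def compare_coefs(models, feature_names):
    """Return one row per feature and one coefficient column per fitted model."""
    return pd.DataFrame(
        {name: model.coef_ for name, model in models.items()},
        index=list(feature_names),
    )

# Synthetic stand-in data (made up for illustration, not the wine data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(30, 3)), columns=["f1", "f2", "f3"])
y_demo = 2.0 * X_demo["f1"] - 1.0 * X_demo["f2"] + rng.normal(size=30)

table = compare_coefs(
    {
        "linear": LinearRegression().fit(X_demo, y_demo),
        "ridge_alpha_20": Ridge(alpha=20).fit(X_demo, y_demo),
    },
    X_demo.columns,
)
# Difference between the penalized and unpenalized coefficient per feature.
table["change"] = table["ridge_alpha_20"] - table["linear"]
print(table)
```

With the objects from this notebook, the call would presumably look like `compare_coefs({"linear": linreg, "ridge": ridgereg}, X_train.columns)`.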

# Can you find a better alpha for increasing model performance?
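
One way to approach this is to loop over a handful of candidate alphas and compare the test RMSE of each fit. The following is a minimal sketch under stated assumptions: `rmse_by_alpha` is a hypothetical helper, the candidate values are arbitrary, and the demo uses synthetic data split without shuffling so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def rmse_by_alpha(X_train, y_train, X_test, y_test, alphas):
    """Fit one Ridge model per alpha and return {alpha: test RMSE}."""
    scores = {}
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        y_pred = model.predict(X_test)
        scores[alpha] = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return scores

# Synthetic stand-in data (made up for illustration, not the wine data).
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(30, 6))
y_demo = X_demo @ rng.normal(size=6) + rng.normal(scale=2.0, size=30)
# Keep the row order, like shuffle=False in the notebook.
X_tr, X_te, y_tr, y_te = X_demo[:26], X_demo[26:], y_demo[:26], y_demo[26:]

candidate_alphas = [0.1, 1, 5, 10, 20, 50, 100]  # arbitrary candidates
scores = rmse_by_alpha(X_tr, y_tr, X_te, y_te, candidate_alphas)
best_alpha = min(scores, key=scores.get)

for alpha, rmse in scores.items():
    print(f"alpha={alpha:>5}: RMSE={rmse:.3f}")
print("best alpha on this split:", best_alpha)
```

With the notebook's own split, the call would be `rmse_by_alpha(X_train, y_train, X_test, y_test, candidate_alphas)`. Keep in mind that an alpha picked on the test set makes the reported test error look better than it is; a separate validation split or cross-validation gives a fairer estimate.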