# Ridge Regression

Ridge regression is a linear regression model that adds an L2 penalty to the least-squares loss to reduce overfitting. The penalty is the sum of the squared coefficients multiplied by the regularization parameter alpha. Fitting the model means minimizing the sum of squared residuals plus this penalty.

The ridge regression objective is:

$$\text{minimize} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \right)$$

where: 
- $y_i$ is the true value of the target variable for the $i$-th observation
- $\hat{y}_i$ is the predicted value of the target variable for the $i$-th observation
- $\beta_j$ is the coefficient of the $j$-th feature
- $n$ is the number of observations
- $p$ is the number of features
- $\alpha$ is the regularization parameter
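
To make the objective concrete, the sketch below evaluates it with NumPy for two candidate coefficient vectors on a tiny toy dataset. The helper name `ridge_objective`, the data, and the candidate values are illustrative assumptions, not part of any library. The intercept is passed separately and left unpenalized, matching the sum over $j = 1, \dots, p$ in the formula.

```python
import numpy as np

def ridge_objective(X, y, coef, intercept, alpha):
    """Sum of squared residuals plus alpha times the sum of squared coefficients."""
    residuals = y - (X @ coef + intercept)
    return np.sum(residuals ** 2) + alpha * np.sum(coef ** 2)

# Toy data (assumed for illustration): y = 1*x1 + 2*x2 + 3 with no noise
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0]) + 3.0

# Candidate 1: the exact generating coefficients (zero residuals, larger penalty)
obj_exact = ridge_objective(X, y, np.array([1.0, 2.0]), 3.0, alpha=1.0)

# Candidate 2: shrunken coefficients (some residual error, smaller penalty)
obj_shrunk = ridge_objective(X, y, np.array([0.8, 1.4]), 4.5, alpha=1.0)

print("Objective with exact coefficients:   ", obj_exact)
print("Objective with shrunken coefficients:", obj_shrunk)
```

With alpha = 1 the shrunken candidate attains the lower objective even though its residuals are no longer zero, which is the trade-off the penalty introduces.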

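The regularization parameter alpha controls the strength of the penalty. A larger alpha shrinks the coefficients more strongly toward zero, which reduces the variance of the model and helps prevent overfitting; if alpha is too large, the model can underfit. With alpha = 0 the objective reduces to ordinary least squares.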
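
The effect of alpha can be seen by fitting the same data with increasing penalty strength. This is a minimal sketch on synthetic data; the data-generating coefficients and the grid of alpha values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed for illustration): 3 features with known coefficients plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: coef={np.round(model.coef_, 3)}, L2 norm={coef_norms[-1]:.3f}")
```

The L2 norm of the coefficient vector decreases as alpha grows, and for very large alpha the coefficients are close to zero.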

**Regularization** is especially important when the number of features is large relative to the number of observations, a setting that is prone to overfitting. Ridge regression is also particularly useful when features are highly correlated, because the penalty stabilizes the coefficient estimates and reduces the variance of the model.
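
The stabilizing effect under correlated features can be illustrated with a small simulation. The sketch below assumes two nearly identical synthetic features and refits ordinary least squares and ridge on many freshly simulated datasets; the sample size, noise levels, and alpha = 1.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
ols_coefs, ridge_coefs = [], []

# Refit both models on many freshly simulated datasets (synthetic, for illustration)
for _ in range(200):
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)   # nearly a copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(scale=1.0, size=50)

    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X, y).coef_)

# Spread of each coefficient across the simulated datasets
ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print("Std of OLS coefficients across datasets:  ", ols_std)
print("Std of ridge coefficients across datasets:", ridge_std)
```

The ordinary least squares coefficients swing widely from one dataset to the next because the two features are almost interchangeable, while the ridge coefficients vary far less.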

In [None]:

from sklearn.linear_model import Ridge
import numpy as np

# Build a small toy dataset where y = 1*x1 + 2*x2 + 3 exactly
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3

# Ridge regression model
clf = Ridge(alpha=1.0)
clf.fit(X, y)

# Show the fitted coefficients and intercept
print('Coefficient:', clf.coef_)
print('Intercept:', clf.intercept_)

Coefficient: [0.8 1.4]
Intercept: 4.5
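
The printed values can be cross-checked against the closed-form ridge solution. The sketch below assumes the standard formulation with an unpenalized intercept, so the features and target are centered first and the coefficients are $\beta = (X_c^\top X_c + \alpha I)^{-1} X_c^\top y_c$. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0]) + 3.0
alpha = 1.0

# Center features and target so the intercept is not penalized
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Solve (Xc^T Xc + alpha * I) beta = Xc^T yc
beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
intercept = y_mean - X_mean @ beta

# Compare against scikit-learn's fit on the same data
model = Ridge(alpha=alpha).fit(X, y)
matches = np.allclose(beta, model.coef_) and np.isclose(intercept, model.intercept_)
print("Closed-form coefficients:", beta, "intercept:", intercept)
print("Matches sklearn:", matches)
```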


# Comparing Linear Regression and Ridge Regression

In [22]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler

In [23]:
# Load the data
titanic = pd.read_csv('titanic.csv')

# Select the target and the features used for modeling
data = ['survived', 'pclass', 'sex', 'age', 'fare']
df = titanic[data].copy()  # explicit copy avoids SettingWithCopyWarning on assignment

# Fill missing ages with the mean age
df['age'] = df['age'].fillna(df['age'].mean())

# split into X and y
X = df.drop('survived', axis=1)
y = df['survived']

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


In [24]:
# create a pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column groups: numerical features pass through unchanged, 'sex' is one-hot encoded
categorical_features = ['sex']
numerical_features = ['pclass', 'age', 'fare']

preprocessor = ColumnTransformer(transformers=[('num', 'passthrough', numerical_features),
                                               ('cat', OneHotEncoder(), categorical_features)])

# Linear regression pipeline
lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', LinearRegression())])

# Ridge regression pipeline
ridge_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('regressor', Ridge(alpha=1.0))])

# Train and evaluate the linear regression model
lr_pipeline.fit(X_train, y_train)
y_pred = lr_pipeline.predict(X_test)
lr_mse = mean_squared_error(y_test, y_pred)
lr_r2 = r2_score(y_test, y_pred)
lr_mae = mean_absolute_error(y_test, y_pred)
lr_mape = mean_absolute_percentage_error(y_test, y_pred)
lr_rmse = np.sqrt(lr_mse)

# Train and evaluate the ridge regression model
ridge_pipeline.fit(X_train, y_train)
y_pred = ridge_pipeline.predict(X_test)
ridge_mse = mean_squared_error(y_test, y_pred)
ridge_r2 = r2_score(y_test, y_pred)
ridge_mae = mean_absolute_error(y_test, y_pred)
ridge_mape = mean_absolute_percentage_error(y_test, y_pred)
ridge_rmse = np.sqrt(ridge_mse)

print('Linear Regression MSE:', lr_mse)
print('Ridge Regression MSE:', ridge_mse)

print('Linear Regression R2:', lr_r2)
print('Ridge Regression R2:', ridge_r2)

print('Linear Regression MAE:', lr_mae)
print('Ridge Regression MAE:', ridge_mae)

print('Linear Regression MAPE:', lr_mape)
print('Ridge Regression MAPE:', ridge_mape)

print('Linear Regression RMSE:', lr_rmse)
print('Ridge Regression RMSE:', ridge_rmse)

Linear Regression MSE: 0.13704587957535952
Ridge Regression MSE: 0.13706266431159014
Linear Regression R2: 0.4214641597530838
Ridge Regression R2: 0.42139330339820036
Linear Regression MAE: 0.2886488550842549
Ridge Regression MAE: 0.2890570390376864
Linear Regression MAPE: 696608086638462.2
Ridge Regression MAPE: 697368985288119.1
Linear Regression RMSE: 0.3701970820729946
Ridge Regression RMSE: 0.3702197513796234
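
The very large MAPE values above are expected for a 0/1 target. MAPE divides each absolute error by the magnitude of the true value, and scikit-learn replaces a zero denominator with a tiny positive constant instead of raising an error, so every passenger with `survived == 0` contributes an enormous term. A minimal sketch with made-up values (not the Titanic data) shows the effect:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Made-up binary targets and predictions, for illustration only
y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0.2, 0.8, 0.4, 0.6])

mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)

print("MAE: ", mae)    # stays on the scale of the target
print("MAPE:", mape)   # dominated by the rows where y_true is 0
```

For this reason MSE, RMSE, MAE and R2 are the informative metrics in the comparison above, where the two models perform almost identically.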
