# Mini Project: Linear Regression

Written by Adam Ten Hoeve  
COMP 4448 - Data Science Tools 2  
Summer 2021

In [7]:
# Load Required Libraries
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge

Find a dataset from an online source with at least three input variables. One suggested source, which you are not required to use, is https://vincentarelbundock.github.io/Rdatasets/articles/data.html. The data should contain one output variable of interest, and all the data used for analysis should be continuous.

Load the data and clean it as necessary, standardize it, and split it into training and test sets using an appropriate split ratio.

In [2]:
# Load in the data
df = pd.read_csv("aids.csv", header=0)
# Clean the columns
# Remove the extra index column
df.drop("Unnamed: 0", axis=1, inplace=True)
# Drop missing values
df.dropna(inplace=True)
# One-hot encode the categorical quarter column (dropping the first level)
df = pd.get_dummies(df, columns=["quarter"], drop_first=True)
# Standardize the continuous predictors (note: the scaler is fit on the full data set, before the train/test split)
scaler = StandardScaler()
df[["delay", "time", "year"]] = scaler.fit_transform(df[["delay", "time", "year"]])
df.head()

,year,delay,dud,time,y,quarter_2,quarter_3,quarter_4
0,-1.722508,-1.560948,0,-1.687055,2,0,1,0
1,-1.722508,-1.405372,0,-1.687055,6,0,1,0
2,-1.722508,-1.172007,0,-1.687055,0,0,1,0
3,-1.722508,-0.938643,0,-1.687055,1,0,1,0
4,-1.722508,-0.705279,0,-1.687055,1,0,1,0


In [3]:
# Split into features and response
X = df.drop("y", axis=1)
y = df["y"]
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(399, 7)
(399,)
(171, 7)
(171,)
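
The cells above fit the scaler on every row before splitting, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering, assuming a small synthetic frame (`demo`, with made-up values) in place of aids.csv. All variable names ending in `_demo` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the continuous columns (values are made up)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "delay": rng.uniform(0, 40, size=200),
    "time": rng.integers(1, 39, size=200).astype(float),
    "y": rng.poisson(5, size=200).astype(float),
})

X_demo = demo[["delay", "time"]]
y_demo = demo["y"]

# Split first, so the test rows never influence the scaler
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_scaled = scaler_demo.transform(X_tr_demo)
X_te_scaled = scaler_demo.transform(X_te_demo)

print(X_tr_scaled.mean(axis=0))  # approximately 0 for each column
print(X_te_scaled.mean(axis=0))  # near 0, but generally not exactly 0
```

With this ordering the test set is scaled with training statistics, which mirrors how the model would see genuinely new data.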


Construct a linear regression model using the ordinary least squares method by applying the LinearRegression() constructor in sklearn, and measure the training and test performance of this model using mean squared error (MSE).

In [4]:
# Create the linear regression model and fit to the training set
lr = LinearRegression().fit(X_train, y_train)

# Predict on the training set
train_preds = lr.predict(X_train)
# Determine the MSE on the training set
n_train = len(X_train)
train_mse = np.sum((y_train - train_preds)**2) / n_train
print("MSE on the training set:", train_mse)

# Predict on the test set
test_preds = lr.predict(X_test)
# Determine the MSE on the test set
n_test = len(X_test)
test_mse = np.sum((y_test - test_preds)**2) / n_test
print("MSE on the test set:", test_mse)

MSE on the training set: 387.6500819236873
MSE on the test set: 419.70556601556973
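
The MSE above is computed by hand as the sum of squared residuals divided by the number of rows. As a cross-check, sklearn's `mean_squared_error` computes the same quantity. The sketch below uses four made-up values (`y_true_demo`, `y_pred_demo`) rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed and predicted values, for illustration only
y_true_demo = np.array([2.0, 6.0, 0.0, 1.0])
y_pred_demo = np.array([1.0, 4.0, 1.0, 1.0])

# Manual formula, as used in the notebook: sum of squared residuals / n
manual_mse = np.sum((y_true_demo - y_pred_demo) ** 2) / len(y_true_demo)

# Library equivalent
sklearn_mse = mean_squared_error(y_true_demo, y_pred_demo)

print(manual_mse, sklearn_mse)  # both are 1.5: (1 + 4 + 1 + 0) / 4
```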


Fit a lasso regression on the data and check the training and test performance of the model using MSE. Use the default alpha (penalty constant).

In [8]:
# Create a Lasso regression model
lasso = Lasso().fit(X_train, y_train)

# Determine the MSE of the training set
train_preds_lasso = lasso.predict(X_train)
train_mse_lasso = np.sum((y_train - train_preds_lasso)**2) / n_train
print("Lasso Regression MSE on the training set:", train_mse_lasso)

# Determine the MSE of the test set
test_preds_lasso = lasso.predict(X_test)
test_mse_lasso = np.sum((y_test - test_preds_lasso)**2) / n_test
print("Lasso Regression MSE on the test set:", test_mse_lasso)

Lasso Regression MSE on the training set: 405.74373178664933
Lasso Regression MSE on the test set: 452.1139479370937
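
The lasso training MSE (405.74) is higher than the OLS training MSE (387.65). This is expected: OLS minimizes the training squared error by definition, so a penalized fit can only match or exceed it on the training set. The sketch below illustrates this, along with the coefficient shrinkage that the penalty produces, on synthetic data from `make_regression`. The data and the `_demo` names are assumptions for illustration, not the notebook's aids.csv results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: 7 features, only 3 of which drive the response
X_syn, y_syn = make_regression(
    n_samples=400, n_features=7, n_informative=3, noise=10.0, random_state=1
)

ols_demo = LinearRegression().fit(X_syn, y_syn)
lasso_demo = Lasso(alpha=1.0).fit(X_syn, y_syn)  # alpha=1.0 is the default

ols_train_mse = mean_squared_error(y_syn, ols_demo.predict(X_syn))
lasso_train_mse = mean_squared_error(y_syn, lasso_demo.predict(X_syn))

print("OLS training MSE:  ", ols_train_mse)
print("Lasso training MSE:", lasso_train_mse)
print("OLS coefficients:  ", np.round(ols_demo.coef_, 2))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 2))
print("Coefficients set exactly to zero by lasso:",
      int(np.sum(lasso_demo.coef_ == 0)))
```

Inspecting `lasso.coef_` in the notebook in the same way would show which of the seven predictors the default penalty shrinks or removes.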


Fit a ridge regression on the data and check the training and test performance of the model using MSE. Use the default alpha (penalty constant).

In [9]:
# Create a Ridge regression model
ridge = Ridge().fit(X_train, y_train)

# Determine the MSE of the training set
train_preds_ridge = ridge.predict(X_train)
train_mse_ridge = np.sum((y_train - train_preds_ridge)**2) / n_train
print("Ridge Regression MSE on the training set:", train_mse_ridge)

# Determine the MSE of the test set
test_preds_ridge = ridge.predict(X_test)
test_mse_ridge = np.sum((y_test - test_preds_ridge)**2) / n_test
print("Ridge Regression MSE on the test set:", test_mse_ridge)

Ridge Regression MSE on the training set: 387.68546727681166
Ridge Regression MSE on the test set: 420.2845439010731
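
Both models use the default alpha of 1.0, but the penalty is scaled differently in the two objectives. As documented for sklearn, with $w$ the coefficient vector and $n$ the number of training samples (symbols chosen here for illustration), the two estimators minimize:

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_2^2
$$

The lasso loss is averaged over the samples, while the ridge loss is an unaveraged sum. With 399 training rows, alpha = 1 is therefore a comparatively light penalty for ridge. That may explain why the ridge MSEs (387.69 train, 420.28 test) are almost identical to the OLS values, whereas the lasso MSEs differ more noticeably.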


Tune the alpha hyperparameter of the lasso and ridge regressions using any tuning technique of your choice. What is the best alpha value for the lasso regression, and what is the best alpha value for the ridge regression?

In [10]:
# Define a sequence of alpha values for the lasso and ridge regressions
param_grid = {"alpha": np.arange(0.05, 1, 0.05)}

# Use GridSearch to find the optimal alpha for Lasso Regression
grid_lasso = GridSearchCV(Lasso(), param_grid, scoring='neg_mean_squared_error')
grid_lasso.fit(X_train, y_train)
print("Best alpha for the lasso regression:", grid_lasso.best_params_)

# Use GridSearch to find the optimal alpha for Ridge Regression
grid_ridge = GridSearchCV(Ridge(), param_grid, scoring='neg_mean_squared_error')
grid_ridge.fit(X_train, y_train)
print("Best alpha for ridge regression:", grid_ridge.best_params_)

Best alpha for the lasso regression: {'alpha': 0.25}
Best alpha for ridge regression: {'alpha': 0.9500000000000001}
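
The best ridge alpha reported above is the largest value in the grid (the trailing digits in 0.9500000000000001 are floating-point round-off from `np.arange`). When the optimum sits on the edge of the grid, the true optimum may lie beyond it, so a wider search can be worthwhile. The following is a sketch under stated assumptions: it uses synthetic data and a log-spaced grid (`wide_grid`, a hypothetical choice), and also evaluates the refit best model on held-out data. It does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the notebook's training and test sets
X_syn, y_syn = make_regression(
    n_samples=300, n_features=7, noise=20.0, random_state=1
)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

# Log-spaced grid from 0.001 to 1000, covering several orders of magnitude
wide_grid = {"alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(Ridge(), wide_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_tr_syn, y_tr_syn)

best_alpha = search.best_params_["alpha"]
cv_mse = -search.best_score_  # the scorer returns negated MSE

# GridSearchCV refits the best model on the full training set by default
ridge_test_mse_syn = mean_squared_error(y_te_syn, search.predict(X_te_syn))

print("Best alpha:", best_alpha)
print("Cross-validated MSE at best alpha:", cv_mse)
print("Test MSE of the refit best model:", ridge_test_mse_syn)
```

If the selected alpha again falls on a grid boundary, the grid can be extended further in that direction and the search repeated.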
