# Production Model

For my production model, I chose ridge regression because it had the lowest RMSE of all the models I tried.

In principle, I could improve my ridge regression model by adjusting the grid of candidate alphas and the number of cross-validation folds. The cells below vary only the alpha grid and keep the number of folds at 5.

In [3]:
import pandas as pd
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import pickle
import numpy as np
np.random.seed(42)

%matplotlib inline

In [4]:
X = pd.read_pickle("../datasets/training_data_cleaned_X.pkl")
y = pd.read_pickle("../datasets/training_data_cleaned_y.pkl")
X_train_sc = np.load('../datasets/X_train_sc.npy')
X_test_sc = np.load('../datasets/X_test_sc.npy')
y_train = np.load('../datasets/y_train.npy')
y_test = np.load('../datasets/y_test.npy')

In [5]:
# Set up a list of ridge alphas to check.
r_alphas = np.logspace(0, 100, 100)
# Generates 100 exponents evenly spaced between 0 and 100,
# then converts them to alphas between 10^0 and 10^100.

# Cross-validate over our list of ridge alphas.
ridge_model = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)

# Fit the model; RidgeCV picks the best alpha by cross-validation.
ridge_model = ridge_model.fit(X_train_sc, y_train)

In [6]:
ridge_optimal_alpha = ridge_model.alpha_
ridge_optimal_alpha

1072.2672220103232

In [7]:
ridge_model_preds = ridge_model.predict(X_test_sc)
ridge_model_preds_train = ridge_model.predict(X_train_sc)
ridge_model_preds_r2 = r2_score(y_test, ridge_model_preds)
ridge_model_preds_train_r2 = r2_score(y_train, ridge_model_preds_train)
print(f"The r2 score of the training set is {ridge_model_preds_train_r2}.")
print(f"The r2 score of the test set is {ridge_model_preds_r2}.")

The r2 score of the training set is 0.8629541544419306.
The r2 score of the test set is 0.8950290182875826.


In [8]:
def RMSE(true, predicted):
    """Return the root mean squared error between true and predicted values."""
    diff = true - predicted
    squared_diff = np.square(diff)
    return np.mean(squared_diff)**0.5

RMSE(y_test, ridge_model.predict(X_test_sc))

25387.48557221171

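As you can see, widening the range of my alphas made my RMSE larger. A wider range does not make the model less accurate by itself, but with only 100 candidates spread over 100 orders of magnitude, neighboring alphas differ by roughly a factor of 10, so the grid is too coarse to pin down the best alpha.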
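To see how coarse that first grid is, here is a minimal sketch that rebuilds it with the same `np.logspace` call and prints the ratio between neighboring candidates. The variable names `wide_grid` and `wide_ratio` are mine, chosen only for illustration.

```python
import numpy as np

# Same grid as the first experiment: 100 alphas from 10^0 to 10^100.
wide_grid = np.logspace(0, 100, 100)

# On a log-spaced grid, every pair of neighbors has the same ratio.
wide_ratio = wide_grid[1] / wide_grid[0]

print(f"Neighboring alphas differ by a factor of about {wide_ratio:.2f}")
print("First five candidates:", wide_grid[:5])
```

The fourth candidate in this grid is the alpha that `RidgeCV` selected above, and the candidates on either side of it are about ten times smaller and ten times larger.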

Now let's try decreasing their range. 

In [9]:
# Set up a list of ridge alphas to check.
r_alphas = np.logspace(0, 1, 100)
# Generates 100 exponents evenly spaced between 0 and 1,
# then converts them to alphas between 10^0 and 10^1.

# Cross-validate over our list of ridge alphas.
ridge_model = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)

# Fit the model; RidgeCV picks the best alpha by cross-validation.
ridge_model = ridge_model.fit(X_train_sc, y_train)

In [10]:
ridge_optimal_alpha = ridge_model.alpha_
ridge_optimal_alpha

10.0

In [11]:
ridge_model_preds = ridge_model.predict(X_test_sc)
ridge_model_preds_train = ridge_model.predict(X_train_sc)
ridge_model_preds_r2 = r2_score(y_test, ridge_model_preds)
ridge_model_preds_train_r2 = r2_score(y_train, ridge_model_preds_train)
print(f"The r2 score of the training set is {ridge_model_preds_train_r2}.")
print(f"The r2 score of the test set is {ridge_model_preds_r2}.")

The r2 score of the training set is 0.8914105528878845.
The r2 score of the test set is 0.8997795421952295.


In [12]:
RMSE(y_test,ridge_model.predict(X_test_sc))

24806.371951665413

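Narrowing the range of my alphas also made my RMSE larger, although not by as much as widening it did.

It appears that I have to be more careful about how I adjust the range of my alpha values in order to get a lower RMSE. Note that the selected alpha of 10.0 is the largest value in the narrow grid, which suggests that cross-validation would have preferred a larger alpha than this grid allowed.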
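One possible next step is to search a mid-range grid and check whether the selected alpha sits on an edge of it. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_regression` as a stand-in for the training data loaded above, and the bounds 10^1 to 10^4 and the names `mid_alphas`, `demo_model`, and `at_edge` are hypothetical choices of mine, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the data used in this notebook).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, noise=25.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scale using statistics from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_sc, X_te_sc = scaler.transform(X_tr), scaler.transform(X_te)

# Hypothetical mid-range grid: 100 alphas from 10^1 to 10^4.
mid_alphas = np.logspace(1, 4, 100)
demo_model = RidgeCV(alphas=mid_alphas, scoring='r2', cv=5).fit(X_tr_sc, y_tr)

# If the chosen alpha is the first or last candidate,
# the grid probably needs to be extended in that direction.
at_edge = bool(np.isclose(demo_model.alpha_, mid_alphas[[0, -1]]).any())
print(f"Chosen alpha: {demo_model.alpha_:.2f}, on grid edge: {at_edge}")
```

With the real data, the same edge check could be run on `ridge_model.alpha_` after each fit to decide whether to shift or extend `r_alphas` before comparing RMSE values.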