# Final Model

This notebook takes a closer look at the best of the models we tested while building a predictor for the hourly count of riders.

In the cell below, we reconstruct that model and store it so we can evaluate its performance and compare its predictions against the actual values. During an initial exploration of the model, we noticed that it occasionally makes negative predictions. That is invalid behavior, since the value being predicted is a count. To address this, we replace negative predictions with 0. This raised $R^{2}$ slightly, from 0.8638 to 0.864, but more importantly it ensures that the model never returns an invalid value.
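
The effect of this adjustment can be illustrated with a few made-up numbers (the arrays and the `r_squared` helper below are hypothetical, not the notebook's data or code). Because every actual count is non-negative, moving a negative prediction up to 0 brings it closer to the actual value, so the squared error, and therefore $R^{2}$, cannot get worse.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from the bikeshare data)
actual = np.array([1.0, 3.0, 12.0, 40.0, 5.0])
raw_predictions = np.array([-2.5, 4.0, 10.0, 38.0, -0.5])

# Replace negative predictions with 0; non-negative ones are left unchanged
clipped = np.maximum(raw_predictions, 0)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_raw = r_squared(actual, raw_predictions)
r2_clipped = r_squared(actual, clipped)
print(r2_raw, r2_clipped)
```

The size of the improvement depends on how many predictions are negative and by how much, which is consistent with the small change reported above.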

In [2]:
import dask.dataframe as dd
import numpy as np
import featuretools
from sklearn.ensemble import GradientBoostingRegressor
from preprocessing import *
from feature_creation import *
from utils import PATH, SEED, update_df, get_train, get_holdout

# The plain GradientBoostingRegressor selected as the final model occasionally returned
# negative predictions for the count (~50 predictions). This subclass predicts as usual
# but replaces negative predictions with 0.

class NonNegativeGBR(GradientBoostingRegressor):
    def predict(self, X):
        predictions = super().predict(X)
        # Element-wise maximum with 0 replaces negative predictions with 0
        return np.maximum(predictions, 0)

# All the preprocessing steps
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]

# The feature creation steps that yielded the best results
fc = [commute_hours, weather_cluster, deep_features]

# The final features that had been selected by the Recursive Feature Elimination
final_features = ['yr', 'workingday', 'temp', 'atemp', 'hum', 'windspeed', 'commute_hours', 'seasons.STD(bikeshare_hourly.hum)', 'seasons.MAX(bikeshare_hourly.temp)', 'seasons.MAX(bikeshare_hourly.windspeed)', 'seasons.MIN(bikeshare_hourly.temp)', 'seasons.MEAN(bikeshare_hourly.temp)', 'seasons.MEAN(bikeshare_hourly.hum)', 'seasons.MEAN(bikeshare_hourly.windspeed)', 'mnth_2', 'mnth_3', 'mnth_4', 'mnth_5', 'mnth_6', 'mnth_8', 'mnth_12', 'hr_0', 'hr_1', 'hr_2', 'hr_3', 'hr_4', 'hr_5', 'hr_6', 'hr_7', 'hr_8', 'hr_9', 'hr_10', 'hr_11', 'hr_12', 'hr_13', 'hr_14', 'hr_15', 'hr_16', 'hr_17', 'hr_18', 'hr_19', 'hr_20', 'hr_21', 'hr_22', 'hr_23', 'weekday_4', 'weekday_5', 'weathersit_Heavy Precipitation', 'weather_cluster_1', 'weather_cluster_2']
data = dd.read_csv(PATH)
target = data["cnt"]

# Perform all the steps
preprocessed = update_df(data, pp)
with_created = update_df(preprocessed, fc)
dummified = dd.get_dummies(with_created)
with_selected = dummified.loc[:, final_features]

# The model with the chosen parameters from the Grid Search Cross Validation
model = NonNegativeGBR(n_estimators=750, learning_rate=0.1, random_state=SEED)

# Fit the model
model = model.fit(get_train(with_selected), get_train(target))

# Retrieve the score
score = model.score(get_holdout(with_selected), get_holdout(target))

score



AttributeError: 'NoneType' object has no attribute 'copy'

## Predictions vs. Actual Values over time

Next, we plot both the predicted and the actual values over time. We use one week for the example, since a longer window makes it difficult to see any patterns in the data. The rise and fall in the number of rides can clearly be seen in the chart below, and the predictions keep pace with these fluctuations as the hours progress. We were pleased to see how well the model captures the patterns that naturally exist in the number of rides taken per hour.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The model's predictions on the holdout
predictions = model.predict(get_holdout(with_selected))

# Predictions alongside the actual values
pred_actual = pd.DataFrame({"time_index": list(range(len(predictions))),
                            "predicted": predictions,
                            "actual": get_holdout(target)})

# Just look at one week (more than this is hard to see)
by_time = pred_actual[:7 * 24]  # one week of hourly rows (7 days x 24 hours)

# Convert data from tabular to transactional format
melted = pd.melt(by_time, value_vars=["predicted", "actual"], id_vars=["time_index"])

# Generate the graph
fig, ax = plt.subplots(figsize=(20, 10))
g = sns.scatterplot(data=melted, 
                    x="time_index",
                    y="value",
                    hue="variable",
                    palette={"predicted": "#a12113", "actual": "#ff6352"})
plt.title("Predictions and Actual Values vs. Time (One Week Period)")
plt.xlabel("Time Index")
plt.ylabel("Count of Rides")

handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[1:], labels=labels[1:])
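
The reshaping step used for the plot can be seen in isolation on a tiny example. This is a minimal sketch with made-up numbers (the `wide` frame is hypothetical); it shows how `pd.melt` turns one row per time index into one row per (time index, series) pair, which is the shape the `hue` argument of `sns.scatterplot` expects.

```python
import pandas as pd

# Hypothetical wide-format data: one row per time index
wide = pd.DataFrame({
    "time_index": [0, 1, 2],
    "predicted": [10.0, 25.0, 40.0],
    "actual": [12.0, 22.0, 45.0],
})

# Long format: the former column names go into "variable", their entries into "value"
long = pd.melt(wide, id_vars=["time_index"], value_vars=["predicted", "actual"])
print(long)
```

Each of the three time indices now appears twice, once for the predicted series and once for the actual series.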

## Predictions vs. Actual Values

Next, we draw a scatterplot of predicted values against actual values. The ideal result would be a perfect straight line, meaning that the prediction matches the actual value in every case. We don't quite get this, but the tight linear relationship between our predictions and the actual values is still visible. This is also the chart that led us to realize that our original model was making negative predictions. With the adjusted model, the lowest prediction is 0, which is exactly what should be the case when predicting a count.

In [None]:
# Plot of predictions vs. Actual

fig, ax = plt.subplots(figsize=(20, 10))
sns.scatterplot(data=pred_actual, x="predicted", y="actual", color="firebrick")
plt.title("Predictions vs. Actual Values")
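
One way to put a number on how tight the relationship in such a scatterplot is would be to compute the correlation between predictions and actual values, along with the mean absolute error. The sketch below uses made-up arrays (hypothetical values, not the notebook's holdout data), and these two metrics are our own illustrative choice rather than something computed in the original analysis.

```python
import numpy as np

# Hypothetical values for illustration only
actual = np.array([5.0, 20.0, 60.0, 150.0, 300.0])
predicted = np.array([8.0, 18.0, 66.0, 140.0, 310.0])

# Pearson correlation: how closely the points follow a straight line
correlation = np.corrcoef(predicted, actual)[0, 1]

# Mean absolute error: typical distance between prediction and actual, in rides
mean_abs_error = np.mean(np.abs(predicted - actual))

print(correlation, mean_abs_error)
```

A correlation close to 1 only says the points lie near some straight line; the mean absolute error complements it by measuring how far they sit from the ideal line where predicted equals actual.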

### Double-checking the updated model

Finally, we wanted to make sure that simply replacing negative values with zeros was a valid choice and that it was working properly. First, we check that there are no negative values in the predictions and find none. Next, we check how small the actual values get: the minimum of the actual values is 1, so a prediction of 0 is reasonable. Last, we look at the circumstances under which we predict 0 to see whether they make sense, and they do. The majority of the examples fall between midnight and 4 AM, and the exceptions appear to be exclusively on holidays, which also makes sense. This gives us confidence that the model is doing the right thing.

In [None]:
# Make sure we don't have any more negative predictions
np.min(predictions)

In [None]:
# How far off could a prediction of 0 be? Check the smallest actual count
np.min(target)

In [None]:
# Same check as before
predictions[predictions < 0]

In [None]:
# Out of curiosity, what were the hours where we predicted zero (previously a negative value or zero)?
get_holdout(data)[predictions==0]
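
The check on when zero predictions occur can also be summarized rather than read row by row. The following is a minimal sketch on made-up data: the `holdout` frame, its values, and the column names `hr` and `holiday` are assumptions for illustration and may not match the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical holdout rows and predictions (illustration only)
holdout = pd.DataFrame({
    "hr":      [0, 2, 3, 4, 8, 17, 3, 12],
    "holiday": [0, 0, 0, 0, 0, 0, 0, 1],
})
predictions = np.array([0.0, 0.0, 4.0, 0.0, 210.0, 480.0, 0.0, 0.0])

# Rows where the model predicted exactly zero
zero_rows = holdout[predictions == 0]

# How many zero predictions fall in each hour of the day
zeros_per_hour = zero_rows.groupby("hr").size()

# Share of zero predictions between midnight and 4 AM (inclusive)
overnight = zero_rows["hr"].between(0, 4)
share_overnight = overnight.mean()

# The remaining zero predictions, to inspect whether they fall on holidays
exceptions = zero_rows[~overnight]

print(zeros_per_hour)
print(share_overnight)
print(exceptions)
```

On the real holdout set, the same grouping would show directly whether the zero predictions cluster in the overnight hours and whether the exceptions are holidays.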