# Predicting Movie Rental Durations

A DVD rental company wants to know how many days customers will rent a DVD based on features of the movie and the rental. Our goal is to create a model that achieves a test MSE of less than 3.

We are predicting the rental length in days, `"return_date"` minus `"rental_date"`, which is a quantitative value, so we'll create, test, and compare regression models.
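
To make the target and the metric concrete, here is a minimal sketch using only the standard library. The two timestamps are taken from the first row of the `head()` output in this notebook; the `y_true` and `y_pred` values and the `mse` helper are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta
import math

# Timestamps from the first row of rental_info.csv shown in this notebook
rental = datetime(2005, 5, 25, 2, 54, 33)
returned = datetime(2005, 5, 28, 23, 40, 33)

# Dividing a timedelta by one day gives the duration in fractional days
rental_length_days = (returned - rental) / timedelta(days=1)
print(round(rental_length_days, 6))  # 3.865278


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted rental lengths, in days
y_true = [3.0, 5.0, 7.0]
y_pred = [4.0, 5.0, 5.0]
example_mse = mse(y_true, y_pred)  # (1 + 0 + 4) / 3

# An MSE of 3 (in days squared) corresponds to an RMSE of about 1.73 days
rmse_at_goal = math.sqrt(3)
```

Because MSE is measured in squared days, taking the square root is a convenient way to read the goal: a test MSE below 3 means a typical error of under about 1.73 days.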

The rental_info.csv file has the following variables:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD includes, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: Dummy variables for the movie's rating. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error as MSE

from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.utils.validation import column_or_1d
from sklearn.tree import DecisionTreeRegressor

In [2]:
rental_info = pd.read_csv('datasets/rental_info.csv') # 15,861 x 15
rental_info.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


### Data Cleaning

scikit-learn regressors require numeric features, so we must convert three columns:
- `rental_date`
- `return_date`
- `special_features`

We will create three new columns and then drop the original non-numeric ones:
- `rental_length_days` (y)
- `deleted_scenes` (dummy variable)
- `behind_the_scenes` (dummy variable)

In [3]:
rental_info['rental_length_days'] = (pd.to_datetime(rental_info['return_date']) - 
                                     pd.to_datetime(rental_info['rental_date'])) / pd.Timedelta(days=1)

rental_info['deleted_scenes'] = rental_info['special_features'].str.contains('Deleted Scenes').astype(int)
rental_info['behind_the_scenes'] = rental_info['special_features'].str.contains('Behind the Scenes').astype(int)

rental_info_clean = rental_info.drop(columns = ['rental_date', 'return_date', 'special_features'])

rental_info_clean.head()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
0,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,3.865278,0,1
1,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,2.836806,0,1
2,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,7.238889,0,1
3,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,2.1,0,1
4,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,4.045139,0,1
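
The dummy-variable step can be checked in isolation. This is a small sketch on a hypothetical `features` Series: only the first string matches the format seen in the data, and the others (including the missing value) are made up. `regex=False` and `na=False` are optional extras not used in the cell above; they treat the pattern as a literal string and map missing values to `False` so that `.astype(int)` does not fail.

```python
import pandas as pd

# Hypothetical special_features values; only the first matches the rows shown above
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers,Commentaries}',
    None,
])

# Literal substring match; missing values count as "feature not present"
deleted_scenes = features.str.contains('Deleted Scenes', regex=False, na=False).astype(int)
behind_the_scenes = features.str.contains('Behind the Scenes', regex=False, na=False).astype(int)

print(pd.DataFrame({'deleted_scenes': deleted_scenes,
                    'behind_the_scenes': behind_the_scenes}))
```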


### Train Test Split

We will now split our data, reserving 20% for the test set and setting random_state to 9. 

We also set `rental_length_days` as the target variable, y.

In [4]:
y = rental_info_clean[['rental_length_days']]
y = column_or_1d(y, warn = False)
X = rental_info_clean.drop(columns = ['rental_length_days'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
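
As a quick sanity check on the split sizes, the sketch below runs `train_test_split` on placeholder arrays with the same row count as the dataset (15,861, from the comment on the `read_csv` call). The arrays `X_demo` and `y_demo` are stand-ins, not the rental data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 15861                  # row count noted next to the read_csv call
X_demo = np.zeros((n_rows, 1))  # placeholder feature matrix
y_demo = np.zeros(n_rows)       # placeholder 1-D target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=9)

# With test_size=0.2 the test set gets ceil(0.2 * n_rows) rows
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

As a side note, selecting the target with single brackets, `rental_info_clean['rental_length_days']`, already yields a one-dimensional Series, so that would be an alternative to calling `column_or_1d`.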

### Regression - Decision Tree, Random Forest, Gradient Boosting

Let's try the following regressors:
- `Decision Tree` regressor with max_depth = 8 and min_samples_leaf = .13
- `Random Forest` regressor with n_estimators = 25
- `Gradient Boosting` regressor with max_depth = 4 and n_estimators=200
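
A note on `min_samples_leaf = .13`: when this parameter is a float, scikit-learn treats it as a fraction of the training rows, so each leaf must hold at least `ceil(0.13 * n_samples)` samples. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are made up, not the rental data).

```python
import math

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the rental dataset)
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(1000, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=1000)

tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=9)
tree.fit(X_demo, y_demo)

# Leaves are the nodes without children (children_left == -1)
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

min_required = math.ceil(0.13 * len(X_demo))  # 130 rows per leaf
print('Minimum rows required per leaf:', min_required)
print('Smallest leaf:', leaf_sizes.min(), '| number of leaves:', is_leaf.sum())
```

With roughly 12,700 training rows (80% of 15,861), the same setting would require about 1,650 rows per leaf, which caps the tree at around 7 leaves whatever `max_depth` allows. That is likely one reason the single decision tree is the weakest of the three models.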

In [5]:
# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8, min_samples_leaf=.13, random_state=9)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Compute y_pred
y_pred = dt.predict(X_test)

# Compute mse_dt
mse_dt = MSE(y_test, y_pred)

# Print mse_dt
print("Test set MSE of dt: {:.2f}".format(mse_dt))

# CV MSE
MSE_CV_dt = - cross_val_score(dt, X_train, y_train, cv=10, 
                                scoring='neg_mean_squared_error',
                                n_jobs=-1)
MSE_CV_dt_mean = MSE_CV_dt.mean()
print('Decision Tree CV MSE: {:.2f}'.format(MSE_CV_dt_mean))

Test set MSE of dt: 3.09
Decision Tree CV MSE: 3.05
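
The leading minus sign in front of `cross_val_score` is needed because scikit-learn scorers follow a "higher is better" convention, so `'neg_mean_squared_error'` returns the MSE with its sign flipped. A minimal sketch on synthetic data (the arrays and the small tree here are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the rental dataset)
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=200)

model = DecisionTreeRegressor(max_depth=3, random_state=9)

# One score per fold, each a negated MSE (so every value is <= 0)
neg_scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_mean_squared_error')

# Flip the sign to recover ordinary per-fold MSEs, then average them
fold_mse = -neg_scores
print('Per-fold MSE:', np.round(fold_mse, 2))
print('Mean CV MSE: {:.2f}'.format(fold_mse.mean()))
```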


In [6]:
# Instantiate rf
rf = RandomForestRegressor(n_estimators=25,
                           random_state=9)
            
# Fit rf to the training set    
rf.fit(X_train, y_train)

# Predict the test set labels
y_pred = rf.predict(X_test)

# Evaluate the test set MSE
mse_rf = MSE(y_test, y_pred)

# Print mse_rf
print('Test set MSE of rf: {:.2f}'.format(mse_rf))

# CV MSE
MSE_CV_rf = - cross_val_score(rf, X_train, y_train, cv=10, 
                                scoring='neg_mean_squared_error',
                                n_jobs=-1)
MSE_CV_rf_mean = MSE_CV_rf.mean()
print('Random Forest CV MSE: {:.2f}'.format(MSE_CV_rf_mean))

Test set MSE of rf: 1.83
Random Forest CV MSE: 1.85


In [7]:
# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4, 
                               n_estimators=200,
                               random_state=9)

# Fit gb to the training set
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)

# Compute MSE
mse_gb = MSE(y_test, y_pred)

# Print MSE
print('Test set MSE of gb: {:.3f}'.format(mse_gb))

# CV MSE
MSE_CV_gb = - cross_val_score(gb, X_train, y_train, cv=10, 
                                scoring='neg_mean_squared_error',
                                n_jobs=-1)
MSE_CV_gb_mean = MSE_CV_gb.mean()
print('Gradient Boosting CV MSE: {:.2f}'.format(MSE_CV_gb_mean))

Test set MSE of gb: 2.005
Gradient Boosting CV MSE: 1.94


### Comparing Results

In [9]:
print('Decision Tree CV MSE:     {:.2f}'.format(MSE_CV_dt_mean))
print('Random Forest CV MSE:     {:.2f}'.format(MSE_CV_rf_mean))
print('Gradient Boosting CV MSE: {:.2f}'.format(MSE_CV_gb_mean))

Decision Tree CV MSE:     3.05
Random Forest CV MSE:     1.85
Gradient Boosting CV MSE: 1.94


Without hyperparameter tuning, the Random Forest regressor performs best, with a CV MSE of 1.85 and a test MSE of 1.83. That meets our goal of a test MSE below 3. Woohoo!
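
The three evaluation cells above repeat the same steps, so they could also be written as one loop. The following is a sketch of that idea on synthetic data; the `evaluate` helper, the `models` dictionary, the reduced `n_estimators` for gradient boosting, and `cv=5` are all choices made here to keep the example small, not settings from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the rental dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=9)

models = {
    'Decision Tree': DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13,
                                           random_state=9),
    'Random Forest': RandomForestRegressor(n_estimators=25, random_state=9),
    'Gradient Boosting': GradientBoostingRegressor(max_depth=4, n_estimators=50,
                                                   random_state=9),
}


def evaluate(model, cv=5):
    """Return (test MSE, mean CV MSE) for one model."""
    # cross_val_score clones the model, so this does not fit `model` itself
    cv_mse = -cross_val_score(model, X_tr, y_tr, cv=cv,
                              scoring='neg_mean_squared_error').mean()
    test_mse = mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    return test_mse, cv_mse


results = {name: evaluate(model) for name, model in models.items()}
for name, (test_mse, cv_mse) in results.items():
    print('{:<18} test MSE: {:10.2f}   CV MSE: {:10.2f}'.format(
        name, test_mse, cv_mse))
```

Keeping the results in a dictionary makes it easy to compare test and CV MSE side by side, and to spot a model whose test error is much worse than its CV error.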

Now, let's see how hyperparameter tuning can improve this result.

### Hyperparameter Tuning

We will use grid search with cross-validation to search over the following Random Forest hyperparameters and find the combination with the lowest CV MSE:
- `n_estimators`: 25, 100, 350, 500
- `max_features`: log2, sqrt
- `min_samples_leaf`: 1, 2, 10, 30

In [12]:
params_rf = {'n_estimators': [25, 100, 350, 500],
             'max_features': ['log2', 'sqrt'],
             'min_samples_leaf': [1, 2, 10, 30]}

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

grid_rf.fit(X_train, y_train)

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute mse_test
mse_test = MSE(y_test, y_pred)

# Print mse_test
print('Test MSE of best model: {:.3f}'.format(mse_test))

# CV MSE
MSE_CV_rf_tuned = - cross_val_score(best_model, X_train, y_train, cv=10,
                                    scoring='neg_mean_squared_error',
                                    n_jobs=-1)
MSE_CV_rf_tuned_mean = MSE_CV_rf_tuned.mean()
print('Tuned Random Forest CV MSE: {:.2f}'.format(MSE_CV_rf_tuned_mean))

Fitting 3 folds for each of 32 candidates, totalling 96 fits
Test MSE of best model: 1.766
Tuned Random Forest CV MSE: 1.77
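
The "32 candidates, totalling 96 fits" line in the output follows directly from the grid: 4 × 2 × 4 combinations, each fitted on 3 folds. A small sketch that reproduces the count with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

params_rf = {'n_estimators': [25, 100, 350, 500],
             'max_features': ['log2', 'sqrt'],
             'min_samples_leaf': [1, 2, 10, 30]}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(params_rf))  # 4 * 2 * 4
cv_folds = 3
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 32 96
```

With the default `refit=True`, `GridSearchCV` then fits the winning combination once more on the full training set, which is the model returned by `best_estimator_`.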


### Best Model

Grid search found the best cross-validated performance with the following hyperparameters:
- `n_estimators` = 500
- `max_features` = log2
- `min_samples_leaf` = 1

This model achieved a cross-validation MSE of 1.77 and a test MSE of 1.766, meeting our goal of a test MSE below 3.
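
The winning hyperparameters can be read from a fitted `GridSearchCV` object through its `best_params_` attribute. The sketch below shows this on synthetic data with a deliberately small grid; `small_grid`, `X_demo`, and `y_demo` are assumptions chosen to keep the example fast, so the values it prints will not match the ones listed above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the rental dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=9)

# A reduced grid so the example runs quickly
small_grid = {'n_estimators': [10, 25],
              'max_features': ['log2', 'sqrt'],
              'min_samples_leaf': [1, 10]}

grid = GridSearchCV(estimator=RandomForestRegressor(random_state=9),
                    param_grid=small_grid,
                    scoring='neg_mean_squared_error',
                    cv=3)
grid.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE of the best candidate, so flip the sign
print('Best hyperparameters:', grid.best_params_)
print('Best CV MSE: {:.2f}'.format(-grid.best_score_))
```

In the notebook, the equivalent calls would be `grid_rf.best_params_` and `-grid_rf.best_score_`. Note that `best_score_` comes from the 3-fold search, so it would not necessarily equal the 10-fold CV MSE of 1.77 reported above.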