![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the number of days a customer will rent a DVD for. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
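
For reference, the MSE target is the average of the squared differences between actual and predicted rental days. The following is a minimal sketch on made-up numbers (the values and the names `y_true` and `y_hat` are illustrative assumptions, not taken from the dataset). It computes the metric by hand and compares it with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up rental lengths (days) and predictions, for illustration only
y_true = np.array([3, 5, 7, 2, 4])
y_hat = np.array([4.0, 4.0, 5.0, 2.0, 4.0])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# scikit-learn computes the same quantity
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```

Because the errors are squared, a few badly mispredicted rentals can push the metric past the threshold even when most predictions are close.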

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
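
To make the squared and dummy columns concrete, here is a small sketch of how columns like these could be built with pandas. The toy DataFrame `movies` and the choice of `"G"` as the dropped reference rating are assumptions for illustration. The source file ships with these columns already built and does not name the reference category.

```python
import pandas as pd

# Toy data, not from rental_info.csv; "G" as a rating is an assumption
movies = pd.DataFrame({
    "amount": [2.99, 4.99, 0.99],
    "rating": ["G", "PG-13", "R"],
})

# Squared feature, analogous to "amount_2"
movies["amount_2"] = movies["amount"] ** 2

# drop_first=True drops the first category in sorted order as the reference level
dummies = pd.get_dummies(
    pd.Series(["G", "PG", "R", "NC-17", "PG-13", "R"], name="rating"),
    drop_first=True,
    dtype=int,
)

print(movies)
print(dummies)
```

A row belonging to the reference category is the one where all four dummy columns are 0.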

In [242]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error




## Getting Started
The file was first imported as a DataFrame, and the data and its info were then inspected.

In [243]:
#Importing in the file
rental_df = pd.read_csv('rental_info.csv')
#Viewing the file and its info
print(rental_df.info())
rental_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB
None


Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


## Pre-processing
First, the data in the 'rental_date' and 'return_date' columns were converted to datetime objects. A new column, 'rental_length', was created to hold the total duration of each rental, and the number of whole days in each rental was then extracted into 'rental_length_days'. Next, two columns of dummy variables were created from the 'special_features' column, indicating whether the DVD includes Deleted Scenes or Behind the Scenes.

In [244]:
#Convert rental_date and return_date to datetime objects
rental_df['rental_date'] = pd.to_datetime(rental_df['rental_date'])
rental_df['return_date'] = pd.to_datetime(rental_df['return_date'])

#Calculate rental length and rental length in days
rental_df['rental_length'] = rental_df['return_date'] - rental_df['rental_date']
rental_df['rental_length_days'] = rental_df['rental_length'].dt.days

print(rental_df[['rental_date', 'return_date', 'rental_length_days']].head())

#Check all possible responses in the special_features column
print(rental_df['special_features'].unique())

#Create two columns of dummy variables for Deleted Scenes and Behind the Scenes
rental_df['deleted_scenes'] = rental_df['special_features'].str.contains('Deleted Scenes').astype('int')
rental_df['behind_the_scenes'] = rental_df['special_features'].str.contains('Behind the Scenes').astype('int')

rental_df.head()

                rental_date               return_date  rental_length_days
0 2005-05-25 02:54:33+00:00 2005-05-28 23:40:33+00:00                   3
1 2005-06-15 23:19:16+00:00 2005-06-18 19:24:16+00:00                   2
2 2005-07-10 04:27:45+00:00 2005-07-17 10:11:45+00:00                   7
3 2005-07-31 12:06:41+00:00 2005-08-02 14:30:41+00:00                   2
4 2005-08-19 12:30:04+00:00 2005-08-23 13:35:04+00:00                   4
['{Trailers,"Behind the Scenes"}' '{Trailers}'
 '{Commentaries,"Behind the Scenes"}' '{Trailers,Commentaries}'
 '{"Deleted Scenes","Behind the Scenes"}'
 '{Commentaries,"Deleted Scenes","Behind the Scenes"}'
 '{Trailers,Commentaries,"Deleted Scenes"}' '{"Behind the Scenes"}'
 '{Trailers,"Deleted Scenes","Behind the Scenes"}'
 '{Commentaries,"Deleted Scenes"}' '{Commentaries}'
 '{Trailers,Commentaries,"Behind the Scenes"}'
 '{Trailers,"Deleted Scenes"}' '{"Deleted Scenes"}'
 '{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}']


Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length,rental_length_days,deleted_scenes,behind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3 days 20:46:00,3,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 20:05:00,2,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7 days 05:44:00,7,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 02:24:00,2,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4 days 01:05:00,4,0,1
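
One detail worth noting in the step above: `.dt.days` keeps only the whole-day part of each timedelta, so a rental of 3 days 20:46:00 counts as 3 days. The sketch below reuses two date pairs from the output above and pairs them with feature strings from the list of unique values (the pairing and the name `toy` are illustrative). It contrasts whole days with fractional days, and shows the substring match with `regex=False`, an optional setting not used in the original cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-07-31 12:06:41+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-08-02 14:30:41+00:00"],
    # Feature strings paired with the dates for illustration only
    "special_features": ['{Trailers,"Behind the Scenes"}',
                         '{Commentaries,"Deleted Scenes"}'],
})
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

length = toy["return_date"] - toy["rental_date"]

# Whole days only: the partial day is discarded
toy["days_whole"] = length.dt.days

# Fractional days: keeps the partial day (86400 seconds per day)
toy["days_fractional"] = length.dt.total_seconds() / 86400

# Plain substring match; regex=False avoids treating the pattern as a regex
toy["deleted_scenes"] = (
    toy["special_features"].str.contains("Deleted Scenes", regex=False).astype(int)
)

print(toy[["days_whole", "days_fractional", "deleted_scenes"]])
```

Whether whole or fractional days make the better target depends on how the company counts a rental day. The notebook uses whole days.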


## Creating the Models
Before creating and testing various regression models, the dataset was split into X, the features, and y, the response variable. X and y were then split into training and testing data, with 80% of the data allocated to training and 20% allocated to testing. It was then time to create the models. The first model created was a simple linear regression model. Its mean squared error was calculated by comparing the test responses with the model's predicted responses, and was stored in the `model_mse` dictionary. The next two models created were Ridge and Lasso models. Five values of alpha were tested for each model, and the best mean squared error for each model was added to the `model_mse` dictionary. A Decision Tree Regressor, a Random Forest Regressor, and a Gradient Boosting Regressor were instantiated next. The Random Forest Regressor and Gradient Boosting Regressor were given an arbitrary n_estimators value of 200, and the Gradient Boosting Regressor was given an arbitrary max_depth of 2.
A for loop was used to fit all three tree-based models and add their respective mean squared errors to the `model_mse` dictionary. Every entry in `model_mse` met the criterion of a mean squared error of 3 or less, but the best performer was the Random Forest Regressor, with a mean squared error of approximately 2.03.
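
As a quick sanity check on the 80/20 split described above, the sketch below applies `train_test_split` to placeholder arrays with the same number of rows as the dataset (15861). The arrays `X_demo` and `y_demo` are stand-ins rather than the real features, so only the resulting sizes are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 15861  # row count reported by rental_df.info()

# Placeholder feature matrix and target with the right number of rows
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=9
)

# The test size is rounded up, so the split is not exactly 80/20
print(len(X_tr), len(X_te))
```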

In [245]:
#Create the feature dataset, X, and response variables, y
X = rental_df.drop(columns = ['rental_date', 'return_date', 'special_features', 'rental_length', 'rental_length_days' ])
y = rental_df['rental_length_days']
print(y)
X.head()

0        3
1        2
2        7
3        2
4        4
        ..
15856    6
15857    4
15858    9
15859    8
15860    6
Name: rental_length_days, Length: 15861, dtype: int64


Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,behind_the_scenes
0,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
1,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
2,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
3,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
4,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1


In [246]:
#Create an empty dictionary that will hold the MSE for each model
model_mse = {}
#Split the data up into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 9)

#Create a linear regression model
from sklearn.linear_model import LinearRegression
reg= LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

#Add the model's mean squared error to the model_mse dictionary
model_mse['Linear Regression'] = mean_squared_error(y_test, y_pred)

#Create a Ridge regression model, testing different alphas to find the best
from sklearn.linear_model import Ridge
ridge_scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge=Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    ridge_scores.append(mean_squared_error(y_test, y_pred))
print(ridge_scores)

#Append the model's mean square error to the model_mse dictionary
model_mse['Ridge with alpha 0.1'] = np.min(ridge_scores)

#Create and fit a Lasso model, testing different alphas
from sklearn.linear_model import Lasso
lasso_scores = []
for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    y_pred = lasso.predict(X_test)
    lasso_scores.append(mean_squared_error(y_test,  y_pred))
print(lasso_scores)

#Append the model's mean square error to the model_mse dictionary
model_mse['Lasso with alpha 0.01'] = np.min(lasso_scores)

#Instantiate a DecisionTreeRegressor, RandomForestRegressor, and GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
dt= DecisionTreeRegressor(random_state=9)

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
rf= RandomForestRegressor(n_estimators=200, random_state= 9)

gbr= GradientBoostingRegressor(n_estimators=200, max_depth=2, random_state= 9)

#Create a dictionary for each model and its name
tree_models = {'Decision Tree Regressor': dt, 
              'Random Forest Regressor' : rf,
              'Gradient Forest Regressor' : gbr}

#Loop through the tree_models dictionary to fit each model, predict y_pred, and calculate the mean squared error of each model, appending its value to the model_mse dictionary
for key, value in tree_models.items():
    value.fit(X_train, y_train)
    y_pred = value.predict(X_test)
    model_mse[key] = mean_squared_error(y_test, y_pred)
    
print(model_mse)


[2.9417273159308, 2.9417585460802016, 2.9420872318411377, 2.9467133542507025, 3.023538481960957]
[2.9502513833772244, 3.8056884092652106, 5.108705300902628, 5.696234753427892, 7.099959795332965]
{'Linear Regression': 2.9417238646975883, 'Ridge with alpha 0.1': 2.9417273159308, 'Lasso with alpha 0.01': 2.9502513833772244, 'Decision Tree Regressor': 2.1675004952579413, 'Random Forest Regressor': 2.027656844752081, 'Gradient Forest Regressor': 2.5056981906117826}
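
The dictionary keys for Ridge and Lasso above hard-code the winning alpha. One possible refinement, sketched here on synthetic data from `make_regression` (a stand-in, not the rental dataset), is to record which alpha produced the lowest MSE and build the key from it. The names `alphas`, `scores`, and `toy_mse` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, used only to keep the sketch self-contained
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=9)

alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
scores = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    scores.append(mean_squared_error(y_te, model.predict(X_te)))

# Index of the lowest test MSE, and the alpha that produced it
best_idx = int(np.argmin(scores))
best_alpha = alphas[best_idx]

# Build the dictionary key from the winning alpha instead of hard-coding it
toy_mse = {f"Ridge with alpha {best_alpha}": scores[best_idx]}
print(toy_mse)
```

The same pattern would apply to the Lasso loop. Note that picking alpha by test-set MSE, as both this sketch and the notebook do, makes the reported test score slightly optimistic.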


In [247]:
best_model = rf
best_mse = model_mse['Random Forest Regressor']
print(best_model, best_mse)

RandomForestRegressor(n_estimators=200, random_state=9) 2.027656844752081
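
The best model can also be picked programmatically rather than by reading the printed dictionary. This minimal sketch uses the MSE values copied verbatim from the output above, with the keys as printed. The names `best_name` and `meets_target` are illustrative.

```python
# MSE values copied from the printed model_mse output above
model_mse = {
    'Linear Regression': 2.9417238646975883,
    'Ridge with alpha 0.1': 2.9417273159308,
    'Lasso with alpha 0.01': 2.9502513833772244,
    'Decision Tree Regressor': 2.1675004952579413,
    'Random Forest Regressor': 2.027656844752081,
    'Gradient Forest Regressor': 2.5056981906117826,
}

# Name of the model with the lowest test MSE
best_name = min(model_mse, key=model_mse.get)
best_mse = model_mse[best_name]

# Check each model against the company's target of an MSE of 3 or less
meets_target = {name: mse <= 3 for name, mse in model_mse.items()}

print(best_name, round(best_mse, 2))
print(meets_target)
```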
