A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have asked you to try out some regression models for the task. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD includes, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.

# Importing modules + quick dataset look

In [167]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

In [143]:
df=pd.read_csv('rental_info.csv')

In [144]:
df.isna().any()

rental_date         False
return_date         False
amount              False
release_year        False
rental_rate         False
length              False
replacement_cost    False
special_features    False
NC-17               False
PG                  False
PG-13               False
R                   False
amount_2            False
length_2            False
rental_rate_2       False
dtype: bool

In [145]:
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [146]:
df.sample(5)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
13219,2005-08-01 10:30:04+00:00,2005-08-06 13:10:04+00:00,0.99,2006.0,0.99,68.0,23.99,"{""Behind the Scenes""}",0,1,0,0,0.9801,4624.0,0.9801
7821,2005-07-11 17:56:46+00:00,2005-07-18 16:53:46+00:00,4.99,2007.0,2.99,89.0,25.99,"{Trailers,Commentaries}",1,0,0,0,24.9001,7921.0,8.9401
7168,2005-08-22 06:57:06+00:00,2005-08-23 02:19:06+00:00,0.99,2009.0,0.99,128.0,9.99,"{Trailers,Commentaries,""Deleted Scenes""}",1,0,0,0,0.9801,16384.0,0.9801
3749,2005-08-21 13:16:19+00:00,2005-08-22 13:28:19+00:00,0.99,2004.0,0.99,168.0,10.99,"{Commentaries,""Deleted Scenes"",""Behind the Sce...",1,0,0,0,0.9801,28224.0,0.9801
11536,2005-08-23 10:19:12+00:00,2005-08-31 08:52:12+00:00,4.99,2008.0,2.99,152.0,9.99,"{Trailers,Commentaries,""Deleted Scenes"",""Behin...",0,0,1,0,24.9001,23104.0,8.9401


In [147]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


# Data cleaning

### Taking care of dates

In [148]:
#first, converting data type
df['rental_date']=pd.to_datetime(df['rental_date'])
df['return_date']=pd.to_datetime(df['return_date'])

# The hours and minutes will be dropped from these two columns below, but they
# carry useful information, so first compute how long each rental lasted
# (in fractional days). This becomes the prediction target.
df['days_rented']=df['return_date']-df['rental_date']
df['days_rented']=df['days_rented'].dt.total_seconds() / 86400

#now, leaving out the hours, minutes, etc
df['rental_date']=df['rental_date'].dt.strftime('%Y-%m-%d')
df['return_date']=df['return_date'].dt.strftime('%Y-%m-%d')
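
The duration step above can be checked in isolation. The sketch below reuses two rows from the `df.head()` output and compares the fractional-day calculation used in this notebook with the whole-day alternative `.dt.days`. The names `toy`, `days_fractional`, and `days_whole` are illustrative only and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Two rows copied from the df.head() output above
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-07-31 12:06:41+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-08-02 14:30:41+00:00"],
})
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

duration = toy["return_date"] - toy["rental_date"]

# Fractional days: hours and minutes survive as the decimal part
toy["days_fractional"] = duration.dt.total_seconds() / 86400
# Whole days: anything short of a full 24 hours is truncated
toy["days_whole"] = duration.dt.days

print(toy[["days_fractional", "days_whole"]])
```

Which of the two definitions is appropriate depends on how the company counts rental days. The fractional version keeps more information, while the whole-day version yields an integer target.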

In [149]:
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,days_rented
0,2005-05-25,2005-05-28,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3.865278
1,2005-06-15,2005-06-18,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2.836806
2,2005-07-10,2005-07-17,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7.238889
3,2005-07-31,2005-08-02,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2.1
4,2005-08-19,2005-08-23,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4.045139


### Changing datatypes

In [150]:
df['release_year']=df['release_year'].astype('int')

### Data visualization: Inspecting variables one by one

**Amount**

In [151]:
df.amount.describe()

count    15861.000000
mean         4.217161
std          2.360383
min          0.990000
25%          2.990000
50%          3.990000
75%          4.990000
max         11.990000
Name: amount, dtype: float64

In [152]:
px.histogram(df, x='amount', nbins=20)

**Release year**

In [153]:
df.release_year.value_counts(normalize=True)
# roughly balanced across years (each year accounts for about 11% to 17% of rentals)

release_year
2004    0.166761
2006    0.157178
2007    0.153395
2010    0.139903
2009    0.138516
2005    0.132715
2008    0.111531
Name: proportion, dtype: float64

**Rental rate**

In [154]:
df.rental_rate.describe()

count    15861.000000
mean         2.944101
std          1.649766
min          0.990000
25%          0.990000
50%          2.990000
75%          4.990000
max          4.990000
Name: rental_rate, dtype: float64

**Length**

In [155]:
df.length.describe()

count    15861.000000
mean       114.994578
std         40.114715
min         46.000000
25%         81.000000
50%        114.000000
75%        148.000000
max        185.000000
Name: length, dtype: float64

**Replacement cost**

In [156]:
df.replacement_cost.describe()

count    15861.000000
mean        20.224727
std          6.083784
min          9.990000
25%         14.990000
50%         20.990000
75%         25.990000
max         29.990000
Name: replacement_cost, dtype: float64

In [157]:
px.histogram(df,x='replacement_cost',nbins=100)

**Special features**

In [158]:
df.special_features.value_counts()

special_features
{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}                         916
{Trailers,"Delete

In [159]:
mlb= MultiLabelBinarizer()

In [160]:
# first, normalizing the values: lowercase, remove quotes and braces, replace spaces with underscores
df['special_features']=[c.lower().replace('"','').strip('{}').replace(' ','_') for c in df.special_features]

In [161]:
# second, splitting each value into a list of features, so the MultiLabelBinarizer can be applied
df['special_features']=df.special_features.str.split(',')

In [162]:
# third, creating a DataFrame with one 0/1 column per distinct feature found in 'special_features'
set_features=pd.DataFrame(mlb.fit_transform(df['special_features']),columns=mlb.classes_)

In [163]:
# and finally, merging those to my original df
df=df.join(set_features,how='left')

In [164]:
df=df.drop(['special_features','rental_date','return_date'], axis=1)
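
The encoding steps above can be reproduced on a small sample. The following sketch applies the same cleaning and `MultiLabelBinarizer` call to three values of the kind shown in the `value_counts()` output. The names `raw`, `labels`, `mlb_demo`, and `encoded` are illustrative, and the vectorized `.str` methods are used here in place of the list comprehension.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

raw = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers,Commentaries}',
])

# Same cleaning as above: lowercase, drop quotes and braces, spaces to underscores
cleaned = (
    raw.str.lower()
       .str.replace('"', '', regex=False)
       .str.strip('{}')
       .str.replace(' ', '_', regex=False)
)
labels = cleaned.str.split(',')

mlb_demo = MultiLabelBinarizer()
# Passing index=raw.index keeps the rows aligned if the result is joined back later
encoded = pd.DataFrame(
    mlb_demo.fit_transform(labels),
    columns=mlb_demo.classes_,
    index=raw.index,
)
print(encoded)
```

The join in the notebook works because `df` still has its default `RangeIndex`. If rows had been filtered or shuffled beforehand, setting the index explicitly as in the sketch would avoid misaligned rows.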

# Train-test split

In [165]:
x=df.drop('days_rented',axis=1)
y=df['days_rented']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.1,random_state=42)

# Feature selection

In [168]:
# Standardizing the features to zero mean and unit variance
# (StandardScaler was already imported at the top of the notebook)
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

#instantiating the model
lasso = Lasso(alpha=0.1, random_state=42)
#fitting model to training data
lasso.fit(x_train_scaled, y_train)

# Access the Lasso coefficients (fitted on the standardized features)
lasso_coef = lasso.coef_
# Perform feature selection by keeping only the columns with positive coefficients
# (note: this also drops features that Lasso kept with a negative coefficient)
x_lasso_train, x_lasso_test = x_train.iloc[:, lasso_coef > 0], x_test.iloc[:, lasso_coef > 0]

# Fit an OLS model on the Lasso-selected features
ols = LinearRegression()
ols = ols.fit(x_lasso_train, y_train)
y_test_pred = ols.predict(x_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)
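
The selection rule `lasso_coef > 0` keeps only features with a positive coefficient. A common alternative is to keep every feature whose coefficient Lasso did not shrink to zero, whatever its sign. The sketch below contrasts the two masks on synthetic data. The data, the target definition, and the names `X`, `y_demo`, and `lasso_demo` are assumptions made for illustration and are unrelated to `rental_info.csv`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Synthetic target: feature 0 raises it, feature 1 lowers it, features 2 and 3 are irrelevant
y_demo = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_scaled = StandardScaler().fit_transform(X)
lasso_demo = Lasso(alpha=0.1).fit(X_scaled, y_demo)

positive_mask = lasso_demo.coef_ > 0   # rule used in this notebook
nonzero_mask = lasso_demo.coef_ != 0   # alternative: every feature Lasso kept

print("coefficients :", np.round(lasso_demo.coef_, 3))
print("positive only:", positive_mask)
print("non-zero     :", nonzero_mask)
```

In this synthetic case the positive-only rule discards feature 1 even though it is strongly predictive. Whether that matters for the rental data depends on the signs of the fitted coefficients, which are not printed in this notebook.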

# Hyperparameter tuning

In [169]:
# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1,101,1),
            'max_depth':np.arange(1,11,1)}

In [170]:
# Create a random forest regressor
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                param_distributions=param_dist, 
                                cv=5, 
                                random_state=9)

# Fit the random search object to the data
rand_search.fit(x_train, y_train)

# Creating a variable for the best hyper param
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                        max_depth=hyper_params["max_depth"], 
                        random_state=9)
rf.fit(x_train,y_train)
rf_pred = rf.predict(x_test)
mse_random_forest= mean_squared_error(y_test, rf_pred)
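
By default `RandomizedSearchCV` samples 10 parameter combinations (`n_iter=10`), ranks them with the estimator's own score (R² for a regressor), and refits the best combination on the full training data (`refit=True`). The sketch below, on synthetic data, shows one possible variation: ranking candidates by MSE through `scoring="neg_mean_squared_error"` and reusing `best_estimator_` instead of building a second forest. The data and the names `search`, `best_rf`, and `demo_mse` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic regression problem, only to keep the example fast
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),
    param_distributions={
        "n_estimators": np.arange(10, 51, 10),
        "max_depth": np.arange(1, 6),
    },
    n_iter=5,                            # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",    # rank candidates by (negative) MSE
    random_state=9,
)
search.fit(X_tr, y_tr)

# refit=True (the default) means the best model is already trained on X_tr
best_rf = search.best_estimator_
demo_mse = mean_squared_error(y_te, best_rf.predict(X_te))
print(search.best_params_, demo_mse)
```

Fixing `random_state` on the estimator passed to the search makes the cross-validation scores reproducible, which the bare `RandomForestRegressor()` does not guarantee.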

In [171]:
# Random forest gives the lowest test MSE of the two models, so:
best_model = rf
best_mse = mse_random_forest
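
The final choice can also be made programmatically, together with a check against the company's target of an MSE of 3 or less. The helper `pick_best` below is a hypothetical sketch, and the numbers in `example` are placeholders for illustration, not results from this notebook.

```python
def pick_best(mse_by_model, threshold=3.0):
    """Return (name, mse, meets_threshold) for the model with the lowest MSE."""
    name = min(mse_by_model, key=mse_by_model.get)
    mse = mse_by_model[name]
    return name, mse, mse <= threshold


# Placeholder values for illustration only
example = {"model_a": 4.8, "model_b": 2.2}
print(pick_best(example))
```

In this notebook the dictionary would be built from `mse_lin_reg_lasso` and `mse_random_forest`.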