![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration in days. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
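
For reference, the mean squared error (MSE) is the average of the squared differences between actual and predicted rental lengths. A minimal sketch with made-up numbers (the arrays below are illustrative and not taken from `rental_info.csv`):

```python
import numpy as np

# Hypothetical actual and predicted rental lengths, in days
y_true = np.array([3, 5, 2, 7])
y_pred = np.array([4, 5, 1, 5])

# MSE: mean of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```

Because the errors are squared, MSE is measured in squared days; a target of 3 corresponds to a root mean squared error of roughly 1.73 days.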

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
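
The squared columns and rating dummies could have been built along the following lines. This is only a sketch on a toy frame: the `rating` column name is hypothetical, and it assumes the dropped reference rating is `G`, which the description above does not state.

```python
import pandas as pd

# Hypothetical raw data; the "rating" column is an assumption for illustration
raw = pd.DataFrame({
    "amount": [2.99, 4.99, 0.99],
    "rating": ["G", "PG-13", "R"],
})

# Squared term, analogous to "amount_2"
raw["amount_2"] = raw["amount"] ** 2

# One-hot encode the rating; drop_first=True drops the first category
# in sorted order ("G" here), which becomes the reference level
dummies = pd.get_dummies(raw["rating"], drop_first=True, dtype=int)
encoded = pd.concat([raw.drop(columns="rating"), dummies], axis=1)
print(encoded)
```

A row with all rating dummies equal to 0 then represents the reference rating.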

In [12]:
# Import essential packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below
from sklearn.model_selection import RandomizedSearchCV
import sklearn.linear_model as LM
import sklearn.tree as T
import sklearn.ensemble as E
import sklearn.preprocessing as PP
import matplotlib.pyplot as plt

In [13]:
# Read CSV file
rental_info = pd.read_csv('rental_info.csv', parse_dates=['rental_date', 'return_date'])
print(rental_info.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   rental_date       15861 non-null  datetime64[ns, UTC]
 1   return_date       15861 non-null  datetime64[ns, UTC]
 2   amount            15861 non-null  float64            
 3   release_year      15861 non-null  float64            
 4   rental_rate       15861 non-null  float64            
 5   length            15861 non-null  float64            
 6   replacement_cost  15861 non-null  float64            
 7   special_features  15861 non-null  object             
 8   NC-17             15861 non-null  int64              
 9   PG                15861 non-null  int64              
 10  PG-13             15861 non-null  int64              
 11  R                 15861 non-null  int64              
 12  amount_2          15861 non-null  float64            
 13  l

In [14]:
# Converting release_year to integer
rental_info['release_year'] = rental_info['release_year'].astype(int)

In [15]:
# Calculating rental_length_days by subtracting rental_date from return_date and keeping the whole days
rental_info['rental_length_days'] = (rental_info['return_date'] - rental_info['rental_date']).dt.days
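
Note that `.dt.days` returns only the whole-day component of each timedelta, so a rental returned after 23 hours counts as 0 days. A small sketch with hypothetical timestamps illustrates the difference from a fractional duration:

```python
import pandas as pd

# Hypothetical rental and return timestamps
df = pd.DataFrame({
    "rental_date": pd.to_datetime(["2005-05-25 02:54:33", "2005-05-25 10:00:00"], utc=True),
    "return_date": pd.to_datetime(["2005-05-28 23:40:33", "2005-05-26 09:00:00"], utc=True),
})

delta = df["return_date"] - df["rental_date"]

# .dt.days keeps only the whole-day component of each timedelta
df["rental_length_days"] = delta.dt.days

# For comparison: the same duration as fractional days
df["rental_length_frac"] = delta.dt.total_seconds() / 86400
print(df[["rental_length_days", "rental_length_frac"]])
```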

In [16]:
# Creating deleted_scenes and behind_the_scenes dummy features from special_features, then dropping special_features
rental_info['deleted_scenes'] = np.where(rental_info['special_features'].str.contains("Deleted Scenes"), 1, 0)
rental_info['behind_the_scenes'] = np.where(rental_info['special_features'].str.contains("Behind the Scenes"), 1, 0)
rental_info = rental_info.drop(['special_features'], axis=1)
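
The same dummies can be produced without `np.where` by casting the boolean result of `str.contains` to integers. The sketch below uses invented `special_features` strings; the exact format stored in the CSV is an assumption.

```python
import pandas as pd

# Hypothetical special_features strings; the real format in the CSV may differ
s = pd.Series(['{Trailers,"Deleted Scenes"}', '{"Behind the Scenes"}', '{Trailers}'])

# str.contains returns booleans; astype(int) turns them into 0/1 dummies.
# regex=False treats the pattern as a plain substring.
deleted_scenes = s.str.contains("Deleted Scenes", regex=False).astype(int)
behind_the_scenes = s.str.contains("Behind the Scenes", regex=False).astype(int)
print(deleted_scenes.tolist(), behind_the_scenes.tolist())
```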

In [17]:
# Initialising variables for features and target sets
X = rental_info.drop(['rental_date', 'return_date', 'rental_length_days'], axis=1)
y = rental_info['rental_length_days']
print(X.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   amount             15861 non-null  float64
 1   release_year       15861 non-null  int64  
 2   rental_rate        15861 non-null  float64
 3   length             15861 non-null  float64
 4   replacement_cost   15861 non-null  float64
 5   NC-17              15861 non-null  int64  
 6   PG                 15861 non-null  int64  
 7   PG-13              15861 non-null  int64  
 8   R                  15861 non-null  int64  
 9   amount_2           15861 non-null  float64
 10  length_2           15861 non-null  float64
 11  rental_rate_2      15861 non-null  float64
 12  deleted_scenes     15861 non-null  int64  
 13  behind_the_scenes  15861 non-null  int64  
dtypes: float64(7), int64(7)
memory usage: 1.7 MB
None


In [18]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
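
As a quick sanity check on the split sizes: with a fractional `test_size`, scikit-learn rounds the test share up and assigns the remaining rows to the training set. The arithmetic can be sketched without running the split itself:

```python
import math

n_samples = 15861   # rows reported by rental_info.info()
test_size = 0.2

# The test share is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```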

In [19]:
# Fitting a Lasso regression and extracting its coefficients for feature selection
lasso = LM.Lasso(alpha=0.3, random_state=9)
lasso_coef = lasso.fit(X_train, y_train).coef_

# Keeping only the features with positive Lasso coefficients in the training and test sets, then printing them
X_train_l = X_train.iloc[:, lasso_coef > 0]
X_test_l = X_test.iloc[:, lasso_coef > 0]
print(X_train_l, X_test_l)

       amount  amount_2  length_2
6682     2.99    8.9401    8100.0
8908     4.99   24.9001    2809.0
11827    6.99   48.8601   29241.0
6153     2.99    8.9401    5329.0
10713    5.99   35.8801   14884.0
...       ...       ...       ...
6200     1.99    3.9601   24649.0
501      6.99   48.8601   18225.0
6782     5.99   35.8801   31329.0
4444     2.99    8.9401   28561.0
8574     5.99   35.8801   27556.0

[12688 rows x 3 columns]        amount  amount_2  length_2
15067    4.99   24.9001   33856.0
3808     4.99   24.9001   32041.0
1015     4.99   24.9001    5329.0
12617    4.99   24.9001   29584.0
1711     4.99   24.9001    8281.0
...       ...       ...       ...
2828     2.99    8.9401   16641.0
8917     9.99   99.8001   17956.0
13592    0.99    0.9801   17956.0
7739     2.99    8.9401   19881.0
1768     3.99   15.9201    8649.0

[3173 rows x 3 columns]
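
The filter `lasso_coef > 0` keeps only features with positive coefficients, so any feature that Lasso retained with a negative coefficient is discarded as well. A common alternative is to keep every nonzero coefficient. The sketch below contrasts the two filters on synthetic data (the column names `a`, `b`, `c` are made up and unrelated to the rental data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: "a" raises the target, "b" lowers it, "c" is irrelevant
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + rng.normal(scale=0.1, size=500)

coef = Lasso(alpha=0.1).fit(X_demo, y_demo).coef_

positive_cols = X_demo.columns[coef > 0]   # the filter used in the cell above
nonzero_cols = X_demo.columns[coef != 0]   # also keeps negative coefficients
print(list(positive_cols), list(nonzero_cols))
```

Whether the nonzero filter would change the selected columns for the rental data depends on the fitted coefficients, which are not shown here.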


In [20]:
# Creating a dictionary of models to train and evaluate
models = {
    "lr": LM.LinearRegression(),
    'dtr': T.DecisionTreeRegressor(),
    'rfr': E.RandomForestRegressor()
}

# Creating parameter grids for each model
params = {
    'lr': {},
    'dtr': {
        'max_depth': np.arange(1, 8, 1),
        'min_samples_leaf': np.linspace(0.01, 0.5, 50)
    },
    'rfr': {
        'n_estimators': np.arange(1, 101, 1),
        'max_depth': np.arange(1, 11, 1),
    }
}

# Looping through the models and performing Randomized Search cross-validation on each model
for k, v in models.items():
    grid_model = RandomizedSearchCV(v,
                                    param_distributions=params[k],
                                    cv=5,
                                    random_state=9)
    
    # Training every model on the Lasso-selected features
    grid_model.fit(X_train_l, y_train)
    
    # Printing the best parameters for each model
    best_params = grid_model.best_params_
    print("For {}, the best parameters are {}".format(k, best_params))


For lr, the best parameters are {}
For dtr, the best parameters are {'min_samples_leaf': 0.01, 'max_depth': 5}
For rfr, the best parameters are {'n_estimators': 19, 'max_depth': 10}
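
Rather than copying the printed parameters into new estimators by hand, the fitted search object can hand back the tuned model directly through `best_estimator_`. The following is a sketch on synthetic data standing in for `X_train_l` and `y_train`; the `scoring` choice is an addition not present in the loop above.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X_train_l and y_train
rng = np.random.default_rng(9)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=300)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=9),
    param_distributions={"max_depth": np.arange(1, 8)},
    n_iter=5,
    cv=5,
    scoring="neg_mean_squared_error",  # scores are negated MSE, so higher is better
    random_state=9,
)
search.fit(X_demo, y_demo)

tuned_tree = search.best_estimator_   # already refit on all of X_demo
cv_mse = -search.best_score_          # flip the sign to get a plain MSE
print(search.best_params_, cv_mse)
```

Using `best_estimator_` avoids transcription mistakes between the search output and the models that are evaluated afterwards.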


In [21]:
# Creating a dictionary of tuned models to evaluate MSE scores
tuned_models = {
    'lr': LM.LinearRegression(),
    'dtr': T.DecisionTreeRegressor(max_depth=5, min_samples_leaf=0.01),
    'rfr': E.RandomForestRegressor(n_estimators=51, max_depth=10)  # n_estimators=51 differs from the 19 printed by the search above
}

# Empty list of MSE scores for every model
mse_results = []

# Looping through every model, fitting it on the full feature set, and printing its test MSE
for k, v in tuned_models.items():
    v.fit(X_train, y_train)
    y_pred = v.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print('For {}, MSE is {}'.format(k, mse))
    mse_results.append(mse)

For lr, MSE is 2.9417238646975883
For dtr, MSE is 2.5186285339846175
For rfr, MSE is 2.214202327737217
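
To read these scores in the target's own unit, the square root of each MSE gives an RMSE in days. The sketch below reuses the MSE values printed above and checks them against the company's threshold of 3:

```python
import math

# Test-set MSE values printed above
mse_by_model = {
    "lr": 2.9417238646975883,
    "dtr": 2.5186285339846175,
    "rfr": 2.214202327737217,
}

# RMSE is in days, the same unit as rental_length_days
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, mse in mse_by_model.items():
    print(f"{name}: RMSE = {rmse_by_model[name]:.2f} days, meets MSE <= 3: {mse <= 3}")
```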


In [22]:
# Selecting the model with the lowest test MSE and displaying its score
best_model = tuned_models['rfr']
best_mse = mse_results[2]
best_mse

2.214202327737217
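
The hard-coded key `'rfr'` and index `2` rely on reading the printed scores by eye. One way to make the selection automatic, sketched here with the MSE values printed earlier, is to store the scores in a dictionary keyed by model name and take the minimum:

```python
# Test-set MSE values printed earlier, keyed by model name
mse_by_model = {
    "lr": 2.9417238646975883,
    "dtr": 2.5186285339846175,
    "rfr": 2.214202327737217,
}

# min() with key=mse_by_model.get returns the name with the smallest MSE
best_name = min(mse_by_model, key=mse_by_model.get)
best_mse = mse_by_model[best_name]
print(best_name, best_mse)
```

In the notebook, `tuned_models[best_name]` would then give the corresponding fitted estimator.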