A DVD rental company requires assistance in predicting rental durations based on specified features. Regression models must be developed to forecast the number of rental days, aiming to achieve a Mean Squared Error (MSE) of 3 or less on a test dataset. Successful model implementation will facilitate inventory planning optimization for the company.

Requirements:

- Develop regression models for DVD rental duration prediction.
- Ensure models achieve a Mean Squared Error (MSE) of 3 or lower on the test dataset.
- Build the models with inventory management in mind, so that the predicted rental durations support the company's planning and operational decisions.
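
To make the MSE target concrete, the sketch below computes an MSE by hand and converts the threshold of 3 into an RMSE, which is in the same unit as the prediction (days). The toy values and the helper name `mean_squared_error` are made up for illustration and are not taken from `rental_info.csv`.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical rental durations (days) and predictions, for illustration only
y_true = [3, 2, 7]
y_pred = [4, 2, 5]

toy_mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
target_rmse = math.sqrt(3)                    # RMSE implied by the MSE target of 3

print(f"toy MSE = {toy_mse:.2f}")
print(f"an MSE of 3 corresponds to an RMSE of about {target_rmse:.2f} days")
```

Read this way, the requirement says that a typical prediction error should stay below roughly 1.7 days.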

The company provided the data in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: Dummy variables for the rating of the movie. Each column takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
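
The engineered columns described above can be reproduced on a toy frame. This is only a sketch: the rows are invented, and the assumption that the dropped reference rating is `"G"` is mine, since the source only says that the reference dummy was removed.

```python
import pandas as pd

# Toy rows for illustration; the values are not read from rental_info.csv
toy = pd.DataFrame({
    "rental_rate": [2.99, 0.99, 4.99],
    "length": [126.0, 90.0, 60.0],
})

# Squared terms, as in the "rental_rate_2" and "length_2" columns
toy["rental_rate_2"] = toy["rental_rate"] ** 2
toy["length_2"] = toy["length"] ** 2

# Rating dummies with the first category dropped as the reference level.
# ASSUMPTION: "G" is the reference rating; the source does not name it.
rating = pd.Series(
    pd.Categorical(["R", "G", "PG"], categories=["G", "NC-17", "PG", "PG-13", "R"])
)
rating_dummies = pd.get_dummies(rating, drop_first=True, dtype=int)

print(toy)
print(rating_dummies)
```

A row whose rating is the reference level has 0 in all four dummy columns, which is why no fifth column is needed.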

In [16]:
# Importing required modules
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error as MSE
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Creating a dataframe
rental_df = pd.read_csv('rental_info.csv', parse_dates=['rental_date', 'return_date'])

# Examining the dataframe for correct values, missing values and numerical distributions
print(rental_df.head())
print(rental_df.info())
print(rental_df.describe())
print(rental_df[['release_year']].head(3))
print(rental_df['special_features'].unique())
print(rental_df.isna().sum())

                rental_date               return_date  ...  length_2  rental_rate_2
0 2005-05-25 02:54:33+00:00 2005-05-28 23:40:33+00:00  ...   15876.0         8.9401
1 2005-06-15 23:19:16+00:00 2005-06-18 19:24:16+00:00  ...   15876.0         8.9401
2 2005-07-10 04:27:45+00:00 2005-07-17 10:11:45+00:00  ...   15876.0         8.9401
3 2005-07-31 12:06:41+00:00 2005-08-02 14:30:41+00:00  ...   15876.0         8.9401
4 2005-08-19 12:30:04+00:00 2005-08-23 13:35:04+00:00  ...   15876.0         8.9401

[5 rows x 15 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   rental_date       15861 non-null  datetime64[ns, UTC]
 1   return_date       15861 non-null  datetime64[ns, UTC]
 2   amount            15861 non-null  float64            
 3   release_year      15861 non-null  float64            
 4   renta

In [17]:
# Assigning the correct format to the release_year column
rental_df['release_year'] = rental_df['release_year'].astype('int')

# Creating a column for the number of rental days, which will be the dependent variable
rental_df['rental_length_days'] = rental_df['return_date'] - rental_df['rental_date']
rental_df['rental_length_days'] = rental_df['rental_length_days'].dt.days

# Verification of previous operations
print(rental_df[['rental_date', 'return_date', 'rental_length_days']].head())

# Creating dummy variables for "Behind the Scenes" and "Deleted Scenes"
rental_df['behind_the_scenes'] = np.where(rental_df['special_features'].str.contains('"Behind the Scenes"'), 1, 0)
rental_df['deleted_scenes'] = np.where(rental_df['special_features'].str.contains('"Deleted Scenes"'), 1, 0)

# Verification of previous operations
print(rental_df[['special_features', 'behind_the_scenes', 'deleted_scenes']])

# Deleting the original column from the dataframe
rental_df = rental_df.drop('special_features', axis=1)

                rental_date               return_date  rental_length_days
0 2005-05-25 02:54:33+00:00 2005-05-28 23:40:33+00:00                   3
1 2005-06-15 23:19:16+00:00 2005-06-18 19:24:16+00:00                   2
2 2005-07-10 04:27:45+00:00 2005-07-17 10:11:45+00:00                   7
3 2005-07-31 12:06:41+00:00 2005-08-02 14:30:41+00:00                   2
4 2005-08-19 12:30:04+00:00 2005-08-23 13:35:04+00:00                   4
                                      special_features  ...  deleted_scenes
0                       {Trailers,"Behind the Scenes"}  ...               0
1                       {Trailers,"Behind the Scenes"}  ...               0
2                       {Trailers,"Behind the Scenes"}  ...               0
3                       {Trailers,"Behind the Scenes"}  ...               0
4                       {Trailers,"Behind the Scenes"}  ...               0
...                                                ...  ...             ...
15856  {Trailers,"Delete
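
The two transformations in this cell can be checked in isolation on a small frame. The sketch below reuses two timestamp pairs from the output above; the second `special_features` string is invented for illustration. It also shows that `.dt.days` keeps only whole days, and it matches the feature names as literal text (`regex=False`) instead of as a regular expression, which is a variation on the cell above and not the original code.

```python
import pandas as pd

toy = pd.DataFrame({
    "rental_date": pd.to_datetime(
        ["2005-05-25 02:54:33", "2005-07-10 04:27:45"], utc=True
    ),
    "return_date": pd.to_datetime(
        ["2005-05-28 23:40:33", "2005-07-17 10:11:45"], utc=True
    ),
    # The second value is a made-up example string
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes","Behind the Scenes"}',
    ],
})

duration = toy["return_date"] - toy["rental_date"]
toy["rental_length_days"] = duration.dt.days                   # whole days only
toy["rental_length_exact"] = duration.dt.total_seconds() / 86400  # fractional days

# Literal substring match, converted from boolean to 0/1
toy["behind_the_scenes"] = (
    toy["special_features"].str.contains("Behind the Scenes", regex=False).astype(int)
)
toy["deleted_scenes"] = (
    toy["special_features"].str.contains("Deleted Scenes", regex=False).astype(int)
)

print(toy[["rental_length_days", "rental_length_exact",
           "behind_the_scenes", "deleted_scenes"]])
```

The first rental lasts 3 days and almost 21 hours, yet the target is 3, so the models predict completed days, not elapsed time.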

In [18]:
# Creating a dataframe with independent variables
X = rental_df.drop(['rental_length_days', 'rental_date', 'return_date'], axis=1)

# Creating a series with a dependent variable
y = rental_df['rental_length_days']

# Setting a random seed, for repeatability of results
random_seed = 9

# Splitting the data into training and test data in 80/20 proportion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)
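
As a quick check of what the 80/20 split produces, the sketch below splits a stand-in array of row indices with the same length as the dataset (15861 rows, per the `info()` output) and the same seed. The array `row_ids` is a hypothetical placeholder for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 15861               # number of entries reported by rental_df.info()
row_ids = np.arange(n_rows)  # stand-in for the real rows

train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=9)
print(len(train_ids), len(test_ids))

# The same seed gives the same partition on every run
train_again, test_again = train_test_split(row_ids, test_size=0.2, random_state=9)
print(np.array_equal(test_ids, test_again))
```

Because the test size is rounded up, the test set holds 3173 rows and the training set the remaining 12688.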

In [19]:
# Creating a dictionary of the models I will use
models = {
    'knr': KNeighborsRegressor(n_jobs=-1),
    'line_reg': LinearRegression(n_jobs=-1),
    'ridge': Ridge(random_state=random_seed),
    'lasso': Lasso(random_state=random_seed),
    'dtr': DecisionTreeRegressor(random_state=random_seed),
    'rfr': RandomForestRegressor(n_jobs=-1, random_state=random_seed),
    'gbr': GradientBoostingRegressor(random_state=random_seed)
}

# Creating a dictionary of the hyperparameters for the models
params = {
    'knr': {'n_neighbors': list(np.arange(2, 51))},
    'line_reg': {}, # LinearRegression has no hyperparameters to tune here
    'ridge': {'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]},
    'lasso': {'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]},
    'dtr': {'max_depth': list(np.arange(1, 51))},
    'rfr': {'max_depth': list(np.arange(1, 51)), 'n_estimators': [100, 200, 300, 400, 500, 600]},
    'gbr': {'n_estimators': [100, 200, 300, 400, 500, 600], 'learning_rate': [0.01, 0.05, 0.1, 0.5]}
}

# Initialization of variables of best model, best model name and best MSE value
best_model = None
best_model_name = ''
best_mse = float('inf')

# Some models (e.g. KNN, Ridge, Lasso) are sensitive to feature scale, while tree-based models are not.
# The features are standardized anyway so that every model is trained on the same input.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Creating a loop of enumerating models and their hyperparameters to get the best model and the best MSE value
for name, model in models.items():
    search = RandomizedSearchCV(model, params[name], n_iter=20, scoring='neg_mean_squared_error', cv=5, random_state=random_seed, n_jobs=-1)
    search.fit(X_train_scaled, y_train)
    mse = MSE(y_test, search.predict(X_test_scaled))
    
    # Output of the best parameters of each model and its result
    print(f'for {name} best params is {search.best_params_}, MSE = {mse}')
    
    # Finding and assigning the best MSE result, the best model and its name to the corresponding variables
    if mse < best_mse:
        best_mse = mse
        best_model = search.best_estimator_
        best_model_name = name
        
# Output results (report the best MSE found, not the MSE of the last model in the loop)
print(f'The best model is {best_model} with MSE result {round(best_mse, 2)} which is equal to RMSE of {round(np.sqrt(best_mse), 2)}')

for knr best params is {'n_neighbors': 4}, MSE = 2.748266624645446
for line_reg best params is {}, MSE = 2.9417238646975976
for ridge best params is {'alpha': 0.5}, MSE = 2.9417965311524554
for lasso best params is {'alpha': 0.001}, MSE = 2.9417134890202408
for dtr best params is {'max_depth': 26}, MSE = 2.1620462887289804
for rfr best params is {'n_estimators': 100, 'max_depth': 16}, MSE = 2.022496586366581
for gbr best params is {'n_estimators': 600, 'learning_rate': 0.5}, MSE = 1.8992597665407769
The best model is GradientBoostingRegressor(learning_rate=0.5, n_estimators=600, random_state=9) with MSE result 1.9 which is equal to RMSE of 1.38
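
One possible variation on the search above is to put the scaler and the model in a single `Pipeline`, so that the scaler is refitted on the training part of each cross-validation fold. The sketch below runs on synthetic data from `make_regression`, not on the rental data, and the names `pipe`, `search_demo`, `X_demo` and `y_demo` are introduced here for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (ASSUMPTION: not the rental dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Scaling is part of the estimator, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(random_state=9)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_distributions = {"model__alpha": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]}

search_demo = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=9,
)
search_demo.fit(X_tr, y_tr)

cv_mse = -search_demo.best_score_  # scoring is negated MSE, so flip the sign
test_mse = mean_squared_error(y_te, search_demo.predict(X_te))
print(search_demo.best_params_, cv_mse, test_mse)
```

With this layout the raw `X_train` and `X_test` can be passed directly to `fit` and `predict`, and `cv_mse` gives a model-selection score that does not touch the test set.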
