# Applying the model to a new dataset (Deployment)

**In the model development notebook, the best model was found to be Gradient Boosting with the following parameters:<br>
learning_rate = 0.1<br>
n_estimators = 700<br>
max_features = 300<br>
max_depth = 7**

Because hyperparameter optimization took a long time to run, the final model was not saved in the model development Jupyter notebook. Instead, it is refit and saved in this notebook, in the following cells:

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import datetime
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
# The final best Gradient Boosting model from the model development notebook
gb6 = GradientBoostingRegressor(learning_rate = 0.1, n_estimators = 700, max_features = 300, max_depth = 7)

In [3]:
# Import the dataset "IMDB_only" on which gb6 was trained
IMDB_only = pd.read_csv('pre-processed_dataset/IMDB_only.csv', index_col = 0)

# Split the IMDB_only dataset into features (X) and target (y)
y = IMDB_only['avg_vote']
X = IMDB_only.drop(columns = 'avg_vote')

# Split the dataset into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

start = datetime.datetime.now()
gb6.fit(X_train, y_train)
end = datetime.datetime.now()
print(start, end)

2021-09-10 13:46:34.891257 2021-09-10 13:59:20.262978


In [4]:
# Evaluate gb6 on the held-out test split of IMDB_only
y_test_predicted = gb6.predict(X_test)
r2_score(y_test, y_test_predicted)

0.5061722343580888

# Save the model 

In [5]:
# Attach version and feature-name metadata to the model before saving it
best_model = gb6
best_model.version = '1.0'
best_model.pandas_version = pd.__version__
best_model.numpy_version = np.__version__
best_model.X_columns = list(X_train.columns)

In [6]:
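# Use a context manager so the file is closed once the model is written
with open('Best_model.pkl', 'wb') as model_out:
    pickle.dump(best_model, model_out)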

# Import the model

In [7]:
# Use a context manager so the file is closed once the model is loaded
with open('Best_model.pkl', 'rb') as model_in:
    model = pickle.load(model_in)
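
The save and load steps above can be checked with a quick round trip. The following is a minimal sketch, not part of the original notebook: it assumes a small synthetic dataset and a deliberately tiny model (`X_demo`, `y_demo`, `demo_model` and the column names are all hypothetical) so that it runs in seconds, and it confirms that attributes attached before pickling are still present after loading and that predictions are unchanged.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (hypothetical, not the IMDB dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

# A deliberately small model so the example runs quickly
demo_model = GradientBoostingRegressor(n_estimators=10, max_depth=2, random_state=0)
demo_model.fit(X_demo, y_demo)

# Attach metadata in the same way as the notebook does for best_model
demo_model.version = '1.0'
demo_model.X_columns = ['f0', 'f1', 'f2', 'f3']

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    with open(path, 'wb') as f_out:
        pickle.dump(demo_model, f_out)
    with open(path, 'rb') as f_in:
        loaded_model = pickle.load(f_in)

# The metadata and the predictions should both survive the round trip
same_predictions = np.allclose(demo_model.predict(X_demo), loaded_model.predict(X_demo))
print(loaded_model.version, loaded_model.X_columns, same_predictions)
```

Note that a pickled scikit-learn model is generally only safe to load with the same library versions it was saved with, which is one reason to store version information alongside it.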

# Import the test dataset

In [8]:
data2 = pd.read_csv('pre-processed_dataset/IMDB_Kaggle_common.csv', index_col = 0)

In [9]:
y = data2['avg_vote']
X = data2.drop(columns = 'avg_vote')

In [10]:
X.shape, y.shape

((2585, 1088), (2585,))
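
Since the saved model carries the training column names in `X_columns`, it may be worth checking that a new dataset has the same columns in the same order before predicting. The helper below is a hypothetical sketch (`align_columns` and the toy column names are assumptions, not part of the original notebook) showing one way to do this with pandas.

```python
import pandas as pd

def align_columns(X_new, expected_columns):
    """Return X_new restricted to expected_columns, in that order,
    together with the list of extra columns that were dropped."""
    expected = list(expected_columns)
    expected_set = set(expected)
    present = set(X_new.columns)
    missing = [c for c in expected if c not in present]
    extra = [c for c in X_new.columns if c not in expected_set]
    if missing:
        # The model cannot predict without every training column
        raise ValueError(f"{len(missing)} expected column(s) missing, e.g. {missing[:5]}")
    return X_new[expected], extra

# Toy example with hypothetical column names
expected = ['duration', 'year', 'genre_Drama']
X_toy = pd.DataFrame({
    'year': [1999, 2005],
    'genre_Drama': [1, 0],
    'duration': [120, 95],
    'budget': [1.0, 2.0],   # not seen during training, so it is dropped
})
X_aligned, dropped = align_columns(X_toy, expected)
print(list(X_aligned.columns), dropped)
```

In this notebook, the equivalent call would be `align_columns(X, model.X_columns)`, assuming the new dataset was pre-processed in the same way as the training data.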

In [11]:
# Predict the target for the new dataset
y_predict = model.predict(X)

In [12]:
# Calculate r2_score
r2_score(y, y_predict)

0.4745713164626669
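
For reference, `r2_score` is the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below recomputes it by hand on a few made-up ratings (the numbers are illustrative only, not taken from the dataset).

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
    return 1.0 - ss_res / ss_tot

# Made-up ratings for illustration
y_true_demo = [6.0, 7.0, 8.0, 5.0]
y_pred_demo = [6.5, 6.5, 7.5, 5.5]
score = r2_manual(y_true_demo, y_pred_demo)

# Always predicting the mean rating gives R² = 0
baseline = r2_manual(y_true_demo, [6.5, 6.5, 6.5, 6.5])
print(score, baseline)
```

An R² of about 0.47 therefore means the model explains a little under half of the variance in the ratings of the new dataset, compared with always predicting the mean rating.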

# Summary:

The final Gradient Boosting model, which reached an r2_score of 0.51 on the held-out test split of IMDB_only, was applied to unseen movies to predict their ratings. Its performance on this new dataset was close to that, with an r2_score of 0.47. <br><br>

**Final Note:** 
In this project, several ML models (Simple Linear Regression, Lasso Regression, Ridge Regression, Random Forest Regressor, Gradient Boosting Regressor) were applied, with as much hyperparameter tuning as was feasible, to obtain the best model performance. However, because the dataset lacks some important features, such as quality of music, quality of picture, richness of the language used, and choreography, a better-performing model could not be built. <br><br>This project shows how a data science project is developed and how a data scientist needs to be vigilant at every step: acquiring the data, cleaning it, processing it with appropriate feature engineering, developing various models on it, choosing the best one, saving it, and applying it to unseen data.