# DATA 643 Project 1 | Global Baseline Predictors and RMSE

In this project I build a simple recommender system from global baseline predictors and evaluate it with Root Mean Square Error (RMSE). I have intentionally chosen a very small dataset so that every calculation can be followed by hand. For background, please refer to the following YouTube playlist: [Netflix_V2](https://www.youtube.com/playlist?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [245]:
import pandas as pd
import numpy as np
from math import sqrt

ratings_df = pd.read_csv("userratings.csv")
movies_df = pd.read_csv("movies.csv")
movies_df['MovieID'] = pd.to_numeric(movies_df['MovieID'])

### Load Movie Data

The dataset contains 5 movies, each identified by a unique MovieID.

In [246]:
movies_df.head()

,MovieID,Title,Genres
0,1,Titanic,Romance
1,2,Inception,Thriller
2,3,Toy Story,Animaton
3,4,Jumanji,Comedy
4,5,Pink Panther,Comedy


### Load User Ratings

The first 10 rows of the user ratings data. MovieID is the key that links the user ratings to the movie data.

In [247]:
ratings_df.head(10)

,UserID,MovieID,Rating
0,A,1,5
1,A,3,4
2,A,5,4
3,B,1,4
4,B,2,3
5,B,3,5
6,B,4,3
7,B,5,4
8,C,1,4
9,C,2,2


### User-Movie Rating Matrix

The User-Movie Rating Matrix below shows each user's rating for each movie. This is all the data we have; the training and test sets are both drawn from it.

YouTube: [Part J: Main Ideas](https://youtu.be/KbOcvEVNTp0?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [248]:
## Overall Raw Data in User-Movie Rating Matrix

R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df = R_df.replace(0, '?')
R_df.head(6)

UserID \ MovieID,1,2,3,4,5
A,5.0,?,4,?,4.0
B,4.0,3,5,3,4.0
C,4.0,2,?,?,3.0
D,2.0,2,3,1,2.0
E,4.0,?,5,4,5.0
F,4.0,2,5,4,4.0
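The pivot step can be tried without the CSV files. The sketch below rebuilds a small matrix from the ten rating rows displayed by `ratings_df.head(10)` and, as an assumption about presentation, leaves unrated cells as NaN instead of `'?'` so the columns stay numeric. The names `sample_ratings` and `sample_matrix` are hypothetical.

```python
import pandas as pd

# The ten rating rows displayed by ratings_df.head(10) above.
sample_ratings = pd.DataFrame({
    "UserID":  ["A", "A", "A", "B", "B", "B", "B", "B", "C", "C"],
    "MovieID": [1, 3, 5, 1, 2, 3, 4, 5, 1, 2],
    "Rating":  [5, 4, 4, 4, 3, 5, 3, 4, 4, 2],
})

# One row per user, one column per movie; unrated cells stay NaN.
sample_matrix = sample_ratings.pivot(index="UserID", columns="MovieID", values="Rating")

print(sample_matrix)
print("Ratings present:", int(sample_matrix.notna().sum().sum()))
```

Keeping NaN rather than a `'?'` string means later calls such as `mean()` skip the gaps without an extra conversion step; the `'?'` form is still handy for display.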


### Load Training Data

Let's load the training data that will be used to fit our prediction model. A question mark "?" marks a cell for which no rating is available.

In [249]:
## Load Training Data - User-Movie Rating Matrix

training_df = pd.read_csv("training_userratings.csv")
Tr_df = training_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
Tr_df = Tr_df.replace(0, '?')
Tr_df.head(6)

UserID \ MovieID,1,2,3,4,5
A,5,?,4,?,4
B,4,3,5,?,4
C,?,2,?,?,3
D,2,?,3,1,2
E,4,?,?,4,5
F,4,2,5,4,?


### Load Test Data

As with the training data, let's load the test data. It is shown in the User-Movie Rating Matrix below.

In [250]:
## Load Test Data - User-Movie Rating Matrix

test_df = pd.read_csv("test_userratings.csv")
Te_df = test_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
Te_df = Te_df.replace(0, '?')
Te_df.head(6)

UserID \ MovieID,1,2,3,4,5
B,?,?,?,3,?
C,4,?,?,?,?
D,?,2,?,?,?
E,?,?,5,?,?
F,?,?,?,?,4


### Raw average : Training

Below we calculate the raw average of the training data, that is, the mean of all available training ratings. We then build another User-Movie matrix in which every cell holds that raw average.

YouTube: [Part K: Raw Average](https://youtu.be/0-o9VgOxe9Y?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [251]:
## Calculate Training Data Raw Average

Tr_df = Tr_df.replace('?', np.nan).astype(float)  # np.NaN was removed in NumPy 2.0
tr_raw_avg = Tr_df.stack().mean()
print("Raw Average:",tr_raw_avg)

('Raw Average:', 3.5)
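As a cross-check on the value above, the sketch below recomputes the raw average directly from the training matrix shown earlier, using a plain NumPy array in place of the CSV file (the names `train`, `n_rated`, and `zero_filled_avg` are illustrative). It also shows why the missing cells must be excluded rather than counted as zeros.

```python
import numpy as np

# Training matrix from the table above; np.nan marks an unrated cell.
# Rows are users A-F, columns are MovieID 1-5.
train = np.array([
    [5,      np.nan, 4,      np.nan, 4],
    [4,      3,      5,      np.nan, 4],
    [np.nan, 2,      np.nan, np.nan, 3],
    [2,      np.nan, 3,      1,      2],
    [4,      np.nan, np.nan, 4,      5],
    [4,      2,      5,      4,      np.nan],
])

# Average over the rated cells only.
n_rated = int(np.count_nonzero(~np.isnan(train)))
raw_avg = np.nansum(train) / n_rated
print(n_rated, raw_avg)  # 20 3.5

# Counting the 10 missing cells as zeros would drag the mean down.
zero_filled_avg = np.nan_to_num(train).mean()
print(round(zero_filled_avg, 3))  # 2.333
```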


In [252]:
## Calculate Test Data Raw Average

Te_df = Te_df.replace('?', np.nan).astype(float)  # np.NaN was removed in NumPy 2.0
te_raw_avg = Te_df.stack().mean()


In [253]:
## Matrix of mean taken from Training Data

Mtr_df = pd.DataFrame(tr_raw_avg, index=Tr_df.index, columns=Tr_df.columns)  # every cell holds the raw average
Mtr_df.head(6)

UserID \ MovieID,1,2,3,4,5
A,3.5,3.5,3.5,3.5,3.5
B,3.5,3.5,3.5,3.5,3.5
C,3.5,3.5,3.5,3.5,3.5
D,3.5,3.5,3.5,3.5,3.5
E,3.5,3.5,3.5,3.5,3.5
F,3.5,3.5,3.5,3.5,3.5


### RMSE: Training Data

Let's calculate RMSE for the training data, using the raw average as the prediction for every rating. RMSE is a widely used general-purpose error metric for numerical predictions. Compared to the similar Mean Absolute Error, RMSE squares each error before averaging, so large errors are penalized more heavily. You may refer to the following URL to understand how RMSE is calculated: [RMSE](https://www.kaggle.com/wiki/RootMeanSquaredError)

YouTube: [Part L: RMSE](https://youtu.be/prVRuPezW3Q?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [254]:
## Root Mean Square Error for Training Data

tr_RMSE = sqrt(((Tr_df.stack() - tr_raw_avg) ** 2).mean())
print("Training Data:", tr_RMSE)

('Training Data:', 1.161895003862225)
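To make the contrast with Mean Absolute Error concrete, here is a minimal sketch on two made-up error lists. The lists and the helper names `mae` and `rmse` are illustrative and not part of the project data. Both lists have the same total absolute error, but one concentrates it in a single large miss.

```python
from math import sqrt

def mae(errors):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root Mean Square Error: square, average, then take the root."""
    return sqrt(sum(e * e for e in errors) / len(errors))

spread_out = [1, 1, 1, 1]    # four small misses
one_big_miss = [0, 0, 0, 4]  # same total error, one large miss

print(mae(spread_out), rmse(spread_out))      # 1.0 1.0
print(mae(one_big_miss), rmse(one_big_miss))  # 1.0 2.0
```

MAE cannot tell the two cases apart, while RMSE doubles for the list with the single large miss.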


### RMSE: Test Data

In the same way, let's calculate RMSE for the test data. The prediction is still the raw average of the training data.

YouTube: [Part L: RMSE](https://youtu.be/prVRuPezW3Q?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [255]:
## Root Mean Square Error for Test Data

te_RMSE = sqrt(((Te_df.stack() - tr_raw_avg) ** 2).mean())
print("Test Data:",te_RMSE)

('Test Data:', 1.02469507659596)


### User Bias

Below we calculate and display the user bias: each user's mean training rating minus the raw average. User bias tells us how harsh or lenient a user is when rating movies. In the list below, users C and D have negative bias, so they are relatively harsh raters.

YouTube: [Part M: User Bias](https://youtu.be/Fl7liZEJ4_U?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [256]:
## Calculate user bias: per-user mean of the training data minus the raw average

ub_tr = Tr_df.mean(axis=1) - tr_raw_avg
print(ub_tr)

UserID
A    0.833333
B    0.500000
C   -1.000000
D   -1.500000
E    0.833333
F    0.250000
dtype: float64


### Movie Bias

Below we calculate and display the movie bias: each movie's mean training rating minus the raw average. Movie bias tells us how much higher or lower than average a movie tends to be rated.

YouTube: [Part N: Bias Values](https://youtu.be/dGM4bNQcVKI?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [257]:
## Calculate movie bias: per-movie mean of the training data minus the raw average

mb_tr = Tr_df.mean() - tr_raw_avg
print(mb_tr)

MovieID
1    0.300000
2   -1.166667
3    0.750000
4   -0.500000
5    0.100000
dtype: float64
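Both bias vectors are just row and column means shifted by the raw average. The sketch below recomputes them from the training matrix shown earlier, held in a plain NumPy array instead of the CSV file (the names `train`, `user_bias`, and `movie_bias` are illustrative); `np.nanmean` skips the unrated cells.

```python
import numpy as np

# Training matrix from the earlier table; np.nan marks an unrated cell.
# Rows are users A-F, columns are MovieID 1-5.
train = np.array([
    [5,      np.nan, 4,      np.nan, 4],
    [4,      3,      5,      np.nan, 4],
    [np.nan, 2,      np.nan, np.nan, 3],
    [2,      np.nan, 3,      1,      2],
    [4,      np.nan, np.nan, 4,      5],
    [4,      2,      5,      4,      np.nan],
])

raw_avg = np.nanmean(train)                       # mean of all rated cells
user_bias = np.nanmean(train, axis=1) - raw_avg   # one value per row (user)
movie_bias = np.nanmean(train, axis=0) - raw_avg  # one value per column (movie)

print(np.round(user_bias, 6))
print(np.round(movie_bias, 6))
```

A negative entry means the user (or movie) sits below the overall average; for example, user D's mean rating of 2.0 gives a bias of -1.5.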


### Baseline Predictor

Below we calculate and display the baseline predictor matrix. For each user-movie pair, the baseline prediction is:

Raw Average + User Bias + Movie Bias

Predictions that fall outside the rating scale are clipped to the range 1 to 5.

YouTube: [Part O: Baseline Predictor](https://youtu.be/4RSigTais8o?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [258]:
## Create the predicted User-Movie Rating DataFrame from the training user bias and movie bias
## Clip predictions to the rating scale: values above 5 become 5, values below 1 become 1

pred_tr = [i + j + tr_raw_avg for i in ub_tr for j in mb_tr]
# Reuse the training index and columns so predictions align with Tr_df by label
pred_tr_Df = pd.DataFrame(np.array(pred_tr).reshape(len(ub_tr), len(mb_tr)),
                          index=Tr_df.index, columns=Tr_df.columns)
pred_tr_Df = pred_tr_Df.clip(lower=1, upper=5)
pred_tr_Df.head(6)

UserID \ MovieID,1,2,3,4,5
A,4.633333,3.166667,5.0,3.833333,4.433333
B,4.3,2.833333,4.75,3.5,4.1
C,2.8,1.333333,3.25,2.0,2.6
D,2.3,1.0,2.75,1.5,2.1
E,4.633333,3.166667,5.0,3.833333,4.433333
F,4.05,2.583333,4.5,3.25,3.85
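The same grid can also be produced without a nested comprehension. The sketch below is one possible alternative, not the notebook's own method: it takes the bias values printed above, adds them with NumPy broadcasting, and clips with `np.clip`. The array names are illustrative.

```python
import numpy as np

raw_avg = 3.5
# Bias values as printed above (users A-F, movies 1-5).
user_bias = np.array([0.833333, 0.5, -1.0, -1.5, 0.833333, 0.25])
movie_bias = np.array([0.3, -1.166667, 0.75, -0.5, 0.1])

# A (6, 1) column plus a (1, 5) row broadcasts to a 6 x 5 grid.
unclipped = raw_avg + user_bias[:, None] + movie_bias[None, :]

# Keep predictions on the 1-5 rating scale.
baseline = np.clip(unclipped, 1, 5)

print(np.round(baseline, 6))
print("Cells changed by clipping:", int((unclipped != baseline).sum()))
```

On this data only three cells are affected by clipping: users A and E on movie 3 (about 5.08 before clipping) and user D on movie 2 (about 0.83).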


### Baseline Predictor RMSE: Training

Here we calculate RMSE on the training data, using the baseline predictor matrix above as the predictions.

YouTube: [Part P: Baseline Predictor RMSE](https://youtu.be/lppNpLFelOc?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [259]:
## Calculate RMSE for Training Predicted Data

pred_tr_RMSE = sqrt(((Tr_df.stack() - pred_tr_Df.stack()) **2).mean())
print("Baseline Predictor RMSE: Training ",pred_tr_RMSE)

('Baseline Predictor RMSE: Training ', 0.47088863981955553)


### Baseline Predictor RMSE: Test

Here we calculate RMSE on the test data, again using the baseline predictor matrix above as the predictions.

YouTube: [Part P: Baseline Predictor RMSE](https://youtu.be/lppNpLFelOc?list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9), courtesy of Network20Q.

In [260]:
## Calculate RMSE for Test Predicted Data
## Te_df has no row for user A, so its shape differs from that of pred_tr_Df
## Drop the first row (user A) of pred_tr_Df so that it matches Te_df

pred_tr_Df2 = pred_tr_Df.iloc[1:]
pred_te_RMSE = sqrt(((Te_df.stack() - pred_tr_Df2.stack()) **2).mean())
print("Baseline Predictor RMSE: Test ",pred_te_RMSE)

('Baseline Predictor RMSE: Test ', 0.7365459931328119)
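An alternative to trimming rows by position is to let pandas align the two frames by label. The sketch below is an illustration under that assumption, with the prediction and test values copied from the tables above into hypothetical frames `pred` and `test`. Subtraction matches cells by UserID and MovieID, and cells missing on either side become NaN, which `np.nanmean` then ignores.

```python
import numpy as np
import pandas as pd

movies = [1, 2, 3, 4, 5]

# Clipped baseline predictions, copied from the matrix shown earlier.
pred = pd.DataFrame(
    [[4.633333, 3.166667, 5.00, 3.833333, 4.433333],
     [4.30,     2.833333, 4.75, 3.50,     4.10],
     [2.80,     1.333333, 3.25, 2.00,     2.60],
     [2.30,     1.00,     2.75, 1.50,     2.10],
     [4.633333, 3.166667, 5.00, 3.833333, 4.433333],
     [4.05,     2.583333, 4.50, 3.25,     3.85]],
    index=list("ABCDEF"), columns=movies)

# Held-out test ratings; user A has none, so A is absent from the index.
test = pd.DataFrame(np.nan, index=list("BCDEF"), columns=movies)
for user, movie, rating in [("B", 4, 3), ("C", 1, 4), ("D", 2, 2),
                            ("E", 3, 5), ("F", 5, 4)]:
    test.loc[user, movie] = rating

# Label-based alignment: only the five rated test cells yield a number.
diff = test - pred
test_rmse = float(np.sqrt(np.nanmean(diff.to_numpy() ** 2)))
print(round(test_rmse, 4))  # 0.7365
```

Aligning by label avoids depending on row order, which matters if the test set ever omits a user other than the first one.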


### Summary 1

Below is the percentage improvement in test RMSE when the baseline predictor is used instead of the raw average.


In [261]:
## Improvement in percentage in predicting Test Data
imp_pr_test = (1-pred_te_RMSE/te_RMSE)*100
print(imp_pr_test)

28.1204711572


### Summary 2

Below is the percentage improvement in training RMSE when the baseline predictor is used instead of the raw average.

In [262]:
## Improvement in percentage in predicting Training Data
imp_pr_training = (1-pred_tr_RMSE/tr_RMSE)*100
print(imp_pr_training)

59.472358668
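Both summary figures come from the same ratio of RMSE values. As a small worked check, the sketch below plugs in the four RMSE values printed earlier; the helper name `improvement_pct` is hypothetical.

```python
def improvement_pct(baseline_rmse, raw_avg_rmse):
    """Percentage reduction in RMSE relative to the raw-average predictor."""
    return (1 - baseline_rmse / raw_avg_rmse) * 100

# RMSE values as printed in the cells above.
test_gain = improvement_pct(0.7365459931328119, 1.02469507659596)
train_gain = improvement_pct(0.47088863981955553, 1.161895003862225)

print(round(test_gain, 2), round(train_gain, 2))  # 28.12 59.47
```

The gain is larger on the training data, which is expected because the user and movie biases were estimated from those same ratings.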
