<a href="https://colab.research.google.com/github/brianckau/Coding-Projects/blob/main/Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Movie Recommendation with Collaborative Filtering

In this exercise, we’ll use the **MovieLens movie rating dataset** to build a recommendation system. You’ll start with a linear regression model and then explore collaborative filtering techniques, comparing their performance and combining them into an ensemble model for improved recommendations.

The MovieLens dataset is one of the most commonly used datasets for building and evaluating recommender systems. For this exercise, we'll use a light version with 100,000 ratings.

In [None]:
# restart the session after running this code
!pip install "numpy<2.0"

Collecting numpy<2.0
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytensor 2.35.1 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but y

In [None]:
# install surprise library

# Type this command in terminal / anaconda prompt
#conda install -c conda-forge scikit-surprise

# Install scikit-surprise library
!pip install scikit-surprise

Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp312-cp312-linux_x86_64.whl size=2555144 sha256=c8f5c97d1d948b2c82c2117d5c3d283202ef54a239d9ab8089d38f43a9ce2f5a
  Stored in directory: /root/.cache/pip/wheels/75/fa/bc/739bc2cb1fbaab6061854e6cfbb81a0ae52c92a502a7fa454b
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


## Restart the session before moving on.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
warnings.filterwarnings('ignore')

from sklearn import linear_model
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# import functions from surprise
from surprise import KNNBasic
from surprise import accuracy
from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import train_test_split as surprise_train_test_split
from surprise.model_selection import cross_validate

## Task 1: Import Dataset and Explore the Data


In [None]:
# Import the provided movies ratings dataset
df = pd.read_csv('movies_ratings_100k.csv')

In [None]:
# Explore the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user_id   100000 non-null  int64 
 1   movie_id  100000 non-null  int64 
 2   ratings   100000 non-null  int64 
 3   title     98883 non-null   object
 4   genres    98883 non-null   object
 5   gender    100000 non-null  object
 6   zipcode   100000 non-null  object
 7   age_desc  100000 non-null  object
 8   occ_desc  100000 non-null  object
dtypes: int64(3), object(6)
memory usage: 6.9+ MB


In [None]:
df.describe()

Unnamed: 0,user_id,movie_id,ratings
count,100000.0,100000.0,100000.0
mean,362.87784,431.54505,3.52986
std,249.779236,336.405024,1.125674
min,0.0,0.0,1.0
25%,147.0,164.0,3.0
50%,325.0,350.5,4.0
75%,540.0,633.0,4.0
max,943.0,1679.0,5.0


In [None]:
df.head(5)

Unnamed: 0,user_id,movie_id,ratings,title,genres,gender,zipcode,age_desc,occ_desc
0,0,0,3,Toy Story (1995),Animation|Children's|Comedy,F,48067,Under 18,K-12 student
1,160,0,2,Toy Story (1995),Animation|Children's|Comedy,M,98107-2117,45-49,self-employed
2,537,0,5,Toy Story (1995),Animation|Children's|Comedy,M,95407,56+,self-employed
3,400,0,2,Toy Story (1995),Animation|Children's|Comedy,M,55129,18-24,other or not specified
4,84,0,2,Toy Story (1995),Animation|Children's|Comedy,M,94945,18-24,college/grad student


**Question 1:** Perform an initial analysis to uncover key insights from the dataset and prepare for subsequent modelling tasks. Specifically, report by answering the following questions:
#### User-related
* How many unique users are in the dataset?
* What is the user age distribution?
#### Movie-related
* How many unique movies are included?
* Which movie has the highest number of ratings, and how many ratings does it have?
#### Rating-related
* How many total ratings are recorded?
* How is the rating distribution?
* Which user has rated the most movies, and how many ratings have they given?
* How many missing values exist, and in which columns?

In [None]:
# User-related
# number of unique users in the dataset
unique_users = df['user_id'].nunique()
print("Number of unique users in the dataset:", unique_users, "\n")

# users' age distribution
age_distribution = df['age_desc'].value_counts()
age_distribution

Number of unique users in the dataset: 944 



age_desc,count
25-34,32545
18-24,22208
35-44,19369
45-49,8838
50-55,7804
56+,5979
Under 18,3257


In [None]:
# Movie-related
# number of unique movies included
unique_movies = df['movie_id'].nunique()
print("Number of unique movies included:", unique_movies)

# movie having highest number of ratings
top_movie = df['title'].value_counts().idxmax()
print("The movie that contains the highest number of ratings is", top_movie, ".")

# number of ratings it has
top_movie_count = df['title'].value_counts().max()
print("The number of ratings that " + str(top_movie) + " has is", top_movie_count, ".")

Number of unique movies included: 1663
The movie that contains the highest number of ratings is Dunston Checks In (1996) .
The number of ratings that Dunston Checks In (1996) has is 496 .


In [None]:
# Rating-related
# number of recorded total ratings
total_ratings = df['ratings'].count()
print("The number of recorded total ratings is", total_ratings, ".")

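# rating distribution
# (a bare expression in the middle of a cell is not displayed;
#  wrap it in print() or display() to show the counts)
rating_distribution = df['ratings'].value_counts()
rating_distribution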

# user that rated the most movies
most_active_user_id = df['user_id'].value_counts().idxmax()
# number of ratings they have given
most_active_user_count = df['user_id'].value_counts().max()
print(most_active_user_id, "has rated the most movies.")
print(most_active_user_id, "has given", most_active_user_count, "ratings.")

# missing values
missing_summary = df.isnull().sum()
print("\nSummary of Missing Values: ")
missing_summary

The number of recorded total ratings is 100000 .
66 has rated the most movies.
66 has given 591 ratings.

Summary of Missing Values: 


Unnamed: 0,0
user_id,0
movie_id,0
ratings,0
title,1117
genres,1117
gender,0
zipcode,0
age_desc,0
occ_desc,0


In [None]:
print("Missing values are in column \"title\" and \"genres\".")

Missing values are in column "title" and "genres".


## Task 2: Data Preprocessing

In [None]:
# Write your code below to perform the necessary preprocessing steps
# Make a copy of data
df_copy = df.copy()
df_copy

Unnamed: 0,user_id,movie_id,ratings,title,genres,gender,zipcode,age_desc,occ_desc
0,0,0,3,Toy Story (1995),Animation|Children's|Comedy,F,48067,Under 18,K-12 student
1,160,0,2,Toy Story (1995),Animation|Children's|Comedy,M,98107-2117,45-49,self-employed
2,537,0,5,Toy Story (1995),Animation|Children's|Comedy,M,95407,56+,self-employed
3,400,0,2,Toy Story (1995),Animation|Children's|Comedy,M,55129,18-24,other or not specified
4,84,0,2,Toy Story (1995),Animation|Children's|Comedy,M,94945,18-24,college/grad student
...,...,...,...,...,...,...,...,...,...
99995,881,77,2,"Crossing Guard, The (1995)",Drama,M,1720,18-24,artist
99996,291,97,4,Shopping (1994),Action|Thriller,M,19406,35-44,executive/managerial
99997,833,202,4,"To Wong Foo, Thanks for Everything! Julie Newm...",Comedy,F,78640,35-44,programmer
99998,392,197,5,Strange Days (1995),Action|Crime|Sci-Fi,M,55402,35-44,technician/engineer


In [None]:
# Handle missing values
df_copy = df_copy.dropna()

# Encode categorical features
df_copy = pd.get_dummies(df_copy)


# Create features and target lists
# movie_id carries the same information as title, so drop it along with user_id.
# get_dummies has already expanded title into title_* columns, which are kept.
features = [x for x in df_copy.columns
            if x not in ('ratings', 'movie_id', 'title', 'user_id')]
target = ['ratings']

X = df_copy[features]
Y = df_copy[target]

# Data splitting by 80/20, random state = 100
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=0.2,random_state=100)

# Standard scaling on features: fit on the training split only, then apply
# the same statistics to the test split so no test information leaks in
scaler = preprocessing.StandardScaler().fit(xtrain)
xtrain_scaled_df = pd.DataFrame(scaler.transform(xtrain),
                             columns=xtrain.columns,index=xtrain.index)
xtest_scaled_df = pd.DataFrame(scaler.transform(xtest),
                            columns=xtest.columns,index=xtest.index)

In [None]:
xtest_scaled_df

Unnamed: 0,title_'Til There Was You (1997),title_1-900 (1994),title_101 Dalmatians (1996),title_12 Angry Men (1957),title_187 (1997),title_2 Days in the Valley (1996),"title_20,000 Leagues Under the Sea (1954)",title_2001: A Space Odyssey (1968),"title_301, 302 (1995)","title_39 Steps, The (1935)",...,occ_desc_other or not specified,occ_desc_programmer,occ_desc_retired,occ_desc_sales/marketing,occ_desc_scientist,occ_desc_self-employed,occ_desc_technician/engineer,occ_desc_tradesman/craftsman,occ_desc_unemployed,occ_desc_writer
82837,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,3.209539,-0.148905,-0.112229,-0.239161
73897,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,3.312333,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161
16641,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,2.810790,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161
19479,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,4.181289
55209,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22514,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161
77653,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161
13676,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161
89132,-0.025647,-0.017421,-0.007111,-0.010057,0.0,-0.014223,-0.012317,-0.014223,-0.030182,-0.014223,...,-0.355772,-0.301902,-0.122417,-0.190703,-0.120919,-0.20344,-0.311571,-0.148905,-0.112229,-0.239161


**Question 2**: Write the corresponding code with the hints (as # comments) provided above. Feel free to add more if you think it helps for the modelling. Justify the preprocessing decisions made, explaining how each step contributes to the quality and reliability of the modelling process.
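
One preprocessing decision that is easy to get wrong is where the scaling statistics come from. The sketch below is a plain-Python illustration rather than the notebook's own code: the helper names `fit_standardizer` and `standardize` and the toy values are assumptions made for the example. It learns the mean and standard deviation from the training values only and reuses them for the test values, which is the behaviour one would expect from fitting `StandardScaler` on the training split.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn (mean, std) from the training values only (hypothetical helper)."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # population std (ddof=0)
    if std == 0:
        std = 1.0  # leave constant columns unscaled instead of dividing by zero
    return mean, std


def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training column
test = [3.0, 6.0]                  # toy test column

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training mean and standard deviation, a test value equal to the training mean maps to zero regardless of what else is in the test split.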

## Task 3.1: Model Building

In [None]:
# Build Linear Regression Model
model = LinearRegression()
model.fit(xtrain, ytrain)

y_train_pred = model.predict(xtrain)
y_test_pred = model.predict(xtest)

y_train_pred = np.clip(y_train_pred, 1, 5)
y_test_pred = np.clip(y_test_pred, 1, 5)

# Generate R sqr & mse score on training set
r2_train = r2_score(ytrain, y_train_pred)
mse_train = mean_squared_error(ytrain, y_train_pred)

# Generate MSE on test set
mse_test = mean_squared_error(ytest, y_test_pred)

print("Training R^2:", r2_train)
print("Training MSE:", mse_train)
print("Testing MSE:", mse_test)

Training R^2: 0.2563749112933775
Training MSE: 0.9404002377045603
Testing MSE: 1.0129259377628768


In [None]:
# What other evaluation metrics would you like to use? Why?

# Mean Absolute Error (MAE) could also be used
mae_train = mean_absolute_error(ytrain, y_train_pred)
mae_test = mean_absolute_error(ytest, y_test_pred)
print("Training MAE:", mae_train)
print("Testing MAE:", mae_test)
# MAE is simple and less sensitive to outliers than MSE
# R-squared shows proportion of explained variance which would be useful for regression
# It could also help to interpret model error for bounded ratings

Training MAE: 0.7709848581694659
Testing MAE: 0.7978064333103109
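
The claim that MAE is less sensitive to large errors than MSE can be checked on a toy example. The sketch below uses made-up ratings and two hypothetical prediction vectors; the helper names `mae` and `mse` are local to the example and are not the scikit-learn functions used above.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [4, 3, 5, 2, 1]
pred_even = [3, 4, 4, 3, 2]   # every prediction is off by exactly 1
pred_spiky = [4, 3, 5, 2, 5]  # four exact hits and one miss of 4

print("even :", mae(y_true, pred_even), mse(y_true, pred_even))
print("spiky:", mae(y_true, pred_spiky), mse(y_true, pred_spiky))
```

The second prediction vector has the lower MAE (0.8 versus 1.0) but the higher MSE (3.2 versus 1.0), because squaring makes the single large miss dominate. Reporting both metrics therefore shows whether the error is spread evenly or concentrated in a few bad predictions.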


## Task 3.2: Model Fine-tuning


In [None]:
# Use Ridge Regularization for fine-tuning
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(xtrain, ytrain)
y_test_pred_ridge = np.clip(ridge.predict(xtest), 1, 5)
ridge_mse = mean_squared_error(ytest, y_test_pred_ridge)
ridge_mae = mean_absolute_error(ytest, y_test_pred_ridge)

print("Ridge Test MSE:", ridge_mse)
print("Ridge Test MAE:", ridge_mae)

Ridge Test MSE: 1.0082323807642806
Ridge Test MAE: 0.79703207720225


**Question 3**: Build a simple linear regression and report the modelling approach, rationale for metric selection, fine-tuning process, and observations on the results.
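
To make the effect of the Ridge `alpha` parameter concrete, the sketch below works through the simplest possible case: one feature and no intercept, where the ridge solution has the closed form w = Σxy / (Σx² + alpha). The function name `ridge_weight` and the data are assumptions for illustration; the notebook's model has many features and an intercept, so this only shows the direction of the effect.

```python
def ridge_weight(x, y, alpha):
    """Closed-form ridge weight for one feature without an intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2x

for alpha in (0.0, 1.0, 14.0):
    print(alpha, ridge_weight(x, y, alpha))
```

With `alpha=0` the weight is the ordinary least-squares value of 2.0, and with `alpha=14` it is shrunk to 1.0. Larger values of `alpha` pull the coefficients towards zero, which trades a little bias for lower variance when there are many sparse one-hot columns.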

## Task 4: Build User-based and Item-based Collaborative Filtering Models

In [None]:
# Download sample movie, user and the rating data
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [None]:
# Split data using function in surprise library
trainset, testset = surprise_train_test_split(data, test_size = 0.2, random_state = 100)

In [None]:
# Explore trainset and testset
type(trainset)

In [None]:
# User ID, Movie ID, Rating
[*trainset.all_ratings()][:10]

[(0, 0, 3.0),
 (0, 72, 3.0),
 (0, 355, 3.0),
 (0, 32, 5.0),
 (0, 49, 5.0),
 (0, 469, 3.0),
 (0, 8, 3.0),
 (0, 604, 4.0),
 (0, 697, 2.0),
 (0, 557, 4.0)]

In [None]:
testset[:10]

[('543', '249', 2.0),
 ('402', '12', 4.0),
 ('49', '52', 2.0),
 ('425', '529', 4.0),
 ('321', '30', 4.0),
 ('474', '316', 5.0),
 ('458', '301', 1.0),
 ('551', '721', 5.0),
 ('233', '192', 5.0),
 ('532', '708', 4.0)]

In [None]:
train_df = pd.DataFrame([*trainset.all_ratings()], columns=['user_id', 'movie_id', 'ratings']).sort_values(by=['movie_id'], key=lambda col: pd.to_numeric(col, errors='coerce'))
train_df

Unnamed: 0,user_id,movie_id,ratings
0,0,0,3.0
23463,160,0,2.0
62448,537,0,5.0
51049,400,0,2.0
12473,84,0,2.0
...,...,...,...
13156,89,1652,3.0
34452,256,1653,4.0
10525,68,1653,2.0
11737,77,1654,1.0


In [None]:
test_df = pd.DataFrame(testset, columns=['user_id', 'movie_id', 'ratings'])
test_df

Unnamed: 0,user_id,movie_id,ratings
0,543,249,2.0
1,402,12,4.0
2,49,52,2.0
3,425,529,4.0
4,321,30,4.0
...,...,...,...
19995,881,77,2.0
19996,291,97,4.0
19997,833,202,4.0
19998,392,197,5.0


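**Question 4.1**: How is `train_test_split` in the Surprise package different from `train_test_split` in scikit-learn?

Answer:

1) In scikit-learn, `train_test_split` splits separate feature and target arrays into `xtrain`, `xtest`, `ytrain` and `ytest`. In Surprise, each observation is a single (user, item, rating) triple, so there is no separate feature matrix: `train_test_split` takes a `Dataset` and returns only a trainset and a testset. The rating is still the value the model predicts, but it travels inside the triple rather than as a separate `y`.

2) The two parts have different types. The trainset is a `Trainset` object that stores ratings under internal (inner) ids, which is why `trainset.all_ratings()` yields integer ids above. The testset is a plain list of (raw user id, raw item id, rating) tuples that keeps the original string ids.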
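
The difference between the two splitting styles can be sketched in plain Python. This is not how Surprise implements its split; the function `split_triples` and the toy ratings are hypothetical and only illustrate that whole (user, item, rating) triples are shuffled and divided, whereas a scikit-learn style workflow first separates the inputs from the target.

```python
import random

# Hypothetical (user, item, rating) triples
ratings = [
    ("u1", "m1", 4.0), ("u1", "m2", 3.0), ("u2", "m1", 5.0), ("u2", "m3", 2.0),
    ("u3", "m2", 1.0), ("u3", "m3", 4.0), ("u4", "m1", 3.0), ("u4", "m4", 5.0),
    ("u5", "m2", 2.0), ("u5", "m4", 4.0),
]


def split_triples(triples, test_size=0.2, seed=100):
    """Shuffle whole triples and cut them into a train part and a test part."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_triples, test_triples = split_triples(ratings)
print(len(train_triples), len(test_triples))

# scikit-learn style view of the same data: inputs and target kept apart
X = [(user, item) for user, item, _ in ratings]
y = [rating for _, _, rating in ratings]
```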

### User-based CF includes 2 steps:

- User-based CF calculates Pearson similarity between users.
- The predicted rating of a user for a movie is the average of the ratings that the most similar users gave to that movie, weighted by similarity.
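
The two steps can be sketched on a tiny ratings dictionary. This is a minimal plain-Python illustration under stated assumptions, not the Surprise implementation: the users, movies and ratings are invented, similarity is plain Pearson correlation over co-rated movies, and only neighbours with positive similarity contribute to the prediction.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 3, "m4": 4},
    "carol": {"m1": 2, "m2": 5, "m3": 1, "m4": 2},
    "dave":  {"m1": 5, "m2": 2, "m3": 5, "m4": 5},
}


def pearson(u, v):
    """Step 1: Pearson similarity between two users over co-rated movies."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][m] for m in common]
    ys = [ratings[v][m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def predict(user, movie):
    """Step 2: similarity-weighted average of the neighbours' ratings."""
    neighbours = [
        (pearson(user, other), their[movie])
        for other, their in ratings.items()
        if other != user and movie in their
    ]
    neighbours = [(s, r) for s, r in neighbours if s > 0]
    if not neighbours:
        return None
    return sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)


print(predict("alice", "m4"))
```

In this toy data, carol's tastes run opposite to alice's, so her rating is ignored, and the prediction for alice lands between bob's 4 and dave's 5, closer to bob because he is the more similar user.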

In [None]:
# Check out the documentation of Surprise library
# https://surprise.readthedocs.io/en/stable/knn_inspired.html

algo_user = KNNBasic(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo_user.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7c99163f18b0>

#### Evaluate User-based CF model on test set. Report MAE (mean absolute error)

In [None]:
# run the trained model against the testset
test_pred_user = algo_user.test(testset)

# measure MAE on the test set
print("User-based Model : Test Set")
accuracy.mae(test_pred_user, verbose=True)

User-based Model : Test Set
MAE:  0.7892


0.7891509486453208

### Item-based CF includes 2 steps:

- Item-based CF calculates Pearson similarity between items.
- The predicted rating of a user for a movie is the average of that user's ratings on the most similar movies, weighted by similarity.

In [None]:
# Change the argument input for sim_options to build an item-based CF model
algo_item = KNNBasic(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo_item.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7c98e4eb1220>

#### Evaluate Item-based CF model on test set. Report MAE (mean absolute error)

In [None]:
# run the trained model against the testset
test_pred_item = algo_item.test(testset)

# get MAE
print("Item-based Model : Test Set")
accuracy.mae(test_pred_item,verbose=True)

Item-based Model : Test Set
MAE:  0.7807


0.7807021805032013

**Question 4.2**: What is the meaning of `k`, and what is the meaning of `sim_options`?
Follow the examples to build both user-based and item-based CF models. **Fine-tune the hyperparameters and achieve the best possible results.**
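
As a rough illustration of what `k` controls, the sketch below estimates one rating from a fixed list of neighbours while varying how many of the most similar ones are allowed to vote. The function `knn_estimate` and the (similarity, rating) pairs are hypothetical. In the models above, `sim_options` plays the complementary role: its `name` entry selects how the similarities are computed and `user_based` selects whether the neighbours are users or items.

```python
def knn_estimate(neighbours, k):
    """Similarity-weighted average over the k most similar neighbours."""
    top_k = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    top_k = [(sim, rating) for sim, rating in top_k if sim > 0]
    if not top_k:
        return None
    return sum(sim * rating for sim, rating in top_k) / sum(sim for sim, _ in top_k)


# Hypothetical (similarity, rating) pairs for one user-movie prediction
neighbours = [(0.9, 5.0), (0.8, 4.0), (0.3, 2.0), (0.1, 1.0)]

for k in (1, 2, 4):
    print(k, round(knn_estimate(neighbours, k), 3))
```

A small `k` trusts only the closest neighbours and gives a noisier estimate, while a large `k` lets weakly similar neighbours pull the estimate towards their ratings. This is why `k` is worth tuning together with the similarity measure.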

4.2 1) Fine-tune hyperparameters for User-Based CF

In [None]:
#from surprise.model_selection import GridSearchCV

#param_grid_user = {'k':[x for x in range(1,62,10)],
#              'sim_options':{'name':['msd', 'cosine', 'pearson',
#                                    'pearson_baseline'],'user_based':[True]}}
#user_grid = GridSearchCV(KNNBasic,
#                         param_grid=param_grid_user,
#                         measures=['mae'],cv=5)
#
#user_grid.fit(data)

4.2 2) Fine-tune hyperparameters for Item-Based CF

In [None]:
#param_grid_item = {'k':[x for x in range(1,62,10)],
#             'sim_options':{'name':['msd', 'cosine', 'pearson',
#                                     'pearson_baseline'],'user_based':[False]}}
#item_grid = GridSearchCV(KNNBasic,
#                       param_grid=param_grid_item,
#                         measures=['mae'],cv=5)
#
#item_grid.fit(data)

4.2 3) Report optimal hyperparameters for User-Based CF

In [None]:
#print('Best hyperparameters for User-Based Collaborative Filtering using GridSearchCV')
#print('--------------------------------------------------------------------------------')
#print(f"The best k is: {user_grid.best_params['mae']['k']}")
#print(f"The best sim_option is: {user_grid.best_params['mae']['sim_options']['name']}")
#print(f"The best MAE Score is : {user_grid.best_score['mae']}")

4.2 4) Report optimal hyperparameters and performance for Item-Based CF

In [None]:
#print('Best hyperparameters for Item-Based Collaborative Filtering using GridSearchCV')
#print('--------------------------------------------------------------------------------')
#print(f"The best k is: {item_grid.best_params['mae']['k']}")
#print(f"The best sim_option is: {item_grid.best_params['mae']['sim_options']['name']}")
#print(f"The best MAE Score is : {item_grid.best_score['mae']}")

Optimal User-based KNN Model : Test Set
-----------------------------------------
Hyperparameter k used: 20
Hyperparameter sim_option used: msd
Mean Absolute Error: 0.765
Mean Squared Error: 0.938


Optimal Item-based KNN Model : Test Set
-----------------------------------------
Hyperparameter k used: 40
Hyperparameter sim_option used: msd
Mean Absolute Error: 0.77
Mean Squared Error: 0.953

4.2 5) Build models using best hyperparameters

In [None]:
#User-Based Model
#User_KNN_model = KNNBasic(k=user_grid.best_params['mae']['k'],
#                          sim_options={'name': user_grid.best_params['mae']['sim_options']['name']
#                                                                            , 'user_based': True})
#User_KNN_model.fit(trainset)
#
#Item-Based Model
#Item_KNN_model = KNNBasic(k=item_grid.best_params['mae']['k'],
#                          sim_options={'name': item_grid.best_params['mae']['sim_options']['name']
#                                                                            , 'user_based': False})
#Item_KNN_model.fit(trainset)

In [None]:
# Hard-code the hyperparameters instead of re-running the grid search above
# User-Based
User_KNN_model = KNNBasic(k=21,
                          sim_options={'name': 'msd', 'user_based': True})
User_KNN_model.fit(trainset)

#Item-Based
Item_KNN_model = KNNBasic(k=41,sim_options={'name': 'msd', 'user_based': False})
Item_KNN_model.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7c9916dac770>

4.3 Report Model Performance with corresponding hyperparameters

In [None]:
User_KNN_pred = User_KNN_model.test(testset)
Item_KNN_pred = Item_KNN_model.test(testset)

In [None]:
#print("Optimal User-based KNN Model : Test Set")
#print("-----------------------------------------")
#print(f"Hyperparameter k used: {user_grid.best_params['mae']['k']}")
#print(f"Hyperparameter sim_option used: {user_grid.best_params['mae']['sim_options']['name']}")
print(f'Mean Absolute Error: {round(accuracy.mae(User_KNN_pred,verbose=False),3)}')
print(f'Mean Squared Error: {round(accuracy.mse(User_KNN_pred,verbose=False),3)}')

#print('\n')
#
#print("Optimal Item-based KNN Model : Test Set")
#print("-----------------------------------------")
#print(f"Hyperparameter k used: {item_grid.best_params['mae']['k']}")
#print(f"Hyperparameter sim_option used: {item_grid.best_params['mae']['sim_options']['name']}")
print(f'Mean Absolute Error: {round(accuracy.mae(Item_KNN_pred,verbose=False),3)}')
print(f'Mean Squared Error: {round(accuracy.mse(Item_KNN_pred,verbose=False),3)}')


Mean Absolute Error: 0.765
Mean Squared Error: 0.938
Mean Absolute Error: 0.771
Mean Squared Error: 0.953


## Task 5: Combining Models: Explore Ensemble Techniques to Integrate Multiple Models

### Ensemble Method 1: Simple Averaging
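
Simple averaging gives every model an equal vote. The sketch below is a minimal plain-Python illustration with a hypothetical helper `average_ensemble`; the three input lists reuse the first two rows of user-based CF, item-based CF and ridge predictions reported further down in this section.

```python
def average_ensemble(*prediction_lists):
    """Element-wise mean of several equally long prediction lists."""
    return [sum(preds) / len(preds) for preds in zip(*prediction_lists)]


usr_based_cf = [4.451559, 3.601548]
itm_based_cf = [4.205679, 2.980157]
l2_pred = [4.271517, 3.911017]

ensemble = average_ensemble(usr_based_cf, itm_based_cf, l2_pred)
print([round(score, 6) for score in ensemble])
```

Averaging can only help when the models make different mistakes. If one model is clearly weaker than the others, an equal-weight average may end up worse than the best single model, which is the motivation for learning the weights instead.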

In [None]:
# Collect the CF predictions for every (user, movie) pair in the test set
optimal_user_knn_pred = []
optimal_item_knn_pred = []
for index, row in test_df.iterrows():
    usr_id = str(row['user_id'])
    mov_id = str(row['movie_id'])

    # r_ui is only stored on the Prediction object; it does not affect the estimate
    pred_ub = User_KNN_model.predict(usr_id, mov_id, r_ui=row['ratings'])   # prediction by user-based cf
    pred_ib = Item_KNN_model.predict(usr_id, mov_id, r_ui=row['ratings'])   # prediction by item-based cf

    # .est is the estimated rating (the same value as index 3 of the Prediction tuple)
    optimal_user_knn_pred.append(pred_ub.est)
    optimal_item_knn_pred.append(pred_ib.est)

In [None]:
optimal_movies_df_with_cf = test_df.copy()

# Below codes create and name new columns in the dataframe
optimal_movies_df_with_cf['usr_based_cf'] = optimal_user_knn_pred
optimal_movies_df_with_cf['itm_based_cf'] = optimal_item_knn_pred
optimal_movies_df_with_cf

Unnamed: 0,user_id,movie_id,ratings,usr_based_cf,itm_based_cf
0,543,249,2.0,3.302012,3.408572
1,402,12,4.0,4.451559,4.205679
2,49,52,2.0,3.601548,2.980157
3,425,529,4.0,4.069651,3.021450
4,321,30,4.0,3.924126,3.996532
...,...,...,...,...,...
19995,881,77,2.0,3.440217,3.375521
19996,291,97,4.0,3.867159,3.933309
19997,833,202,4.0,3.308080,2.945259
19998,392,197,5.0,4.648198,4.420601


In [None]:
#Make a df for L2 results with user and movie id to compare with CF
xtest_evaluate_df = xtest.copy()
xtest_evaluate_df['L2 Pred'] = y_test_pred_ridge
xtest_evaluate_df['user_id'] = xtest_evaluate_df.index.map(df_copy['user_id'])
xtest_evaluate_df['movie_id'] = xtest_evaluate_df.index.map(df_copy['movie_id'])
xtest_evaluate_df.drop([x for x in xtest_evaluate_df.columns if x != 'L2 Pred' and x!= 'user_id' and x!= 'movie_id'],axis=1,inplace=True)
xtest_evaluate_df

Unnamed: 0,L2 Pred,user_id,movie_id
82837,4.368197,923,181
73897,2.205555,546,997
16641,3.569834,493,121
19479,3.726365,600,154
55209,4.130559,605,551
...,...,...,...
22514,3.761154,686,181
77653,3.481035,534,1210
13676,3.527830,697,98
89132,3.790969,59,1116


In [None]:
optimal_movies_df_with_cf['user_id'] = optimal_movies_df_with_cf['user_id'].astype(int)
optimal_movies_df_with_cf['movie_id'] = optimal_movies_df_with_cf['movie_id'].astype(int)
xtest_evaluate_df['user_id'] = xtest_evaluate_df['user_id'].astype(int)
xtest_evaluate_df['movie_id'] = xtest_evaluate_df['movie_id'].astype(int)

# Keep only the (user, movie) pairs that exist in both the linear regression and the CF test sets
test_exist_both = optimal_movies_df_with_cf.merge(
    xtest_evaluate_df[['user_id', 'movie_id', 'L2 Pred']],
    on=['user_id', 'movie_id'],
    how='inner'
)
# .copy() so that adding columns later does not modify a view of test_exist_both
combined_df = test_exist_both[['user_id', 'movie_id', 'usr_based_cf', 'itm_based_cf', 'L2 Pred','ratings']].copy()
combined_df

Unnamed: 0,user_id,movie_id,usr_based_cf,itm_based_cf,L2 Pred,ratings
0,402,12,4.451559,4.205679,4.271517,4.0
1,49,52,3.601548,2.980157,3.911017,2.0
2,738,747,3.577887,3.372048,3.165503,4.0
3,94,644,3.990185,3.670345,4.379531,5.0
4,389,654,4.657013,4.274899,3.666525,5.0
...,...,...,...,...,...,...
4283,709,182,4.156885,3.950892,3.937390,4.0
4284,308,475,3.928004,3.875733,2.993230,4.0
4285,760,71,3.441790,3.178097,3.668110,4.0
4286,796,78,2.454093,3.116510,3.586284,3.0


In [None]:
# Simple average of the three model predictions, row by row
combined_df['Average Ensemble Score'] = combined_df[['usr_based_cf', 'itm_based_cf', 'L2 Pred']].mean(axis=1)
combined_df

Unnamed: 0,user_id,movie_id,usr_based_cf,itm_based_cf,L2 Pred,ratings,Average Ensemble Score
0,402,12,4.451559,4.205679,4.271517,4.0,4.309585
1,49,52,3.601548,2.980157,3.911017,2.0,3.497574
2,738,747,3.577887,3.372048,3.165503,4.0,3.371813
3,94,644,3.990185,3.670345,4.379531,5.0,4.013353
4,389,654,4.657013,4.274899,3.666525,5.0,4.199479
...,...,...,...,...,...,...,...
4283,709,182,4.156885,3.950892,3.937390,4.0,4.015056
4284,308,475,3.928004,3.875733,2.993230,4.0,3.598989
4285,760,71,3.441790,3.178097,3.668110,4.0,3.429333
4286,796,78,2.454093,3.116510,3.586284,3.0,3.052296


In [None]:
combined_df[combined_df['user_id'] == 49]

Unnamed: 0,user_id,movie_id,usr_based_cf,itm_based_cf,L2 Pred,ratings,Average Ensemble Score
1,49,52,3.601548,2.980157,3.911017,2.0,3.497574
112,49,813,3.938342,2.719776,3.67889,3.0,3.445669
666,49,300,3.194766,2.144199,3.67075,1.0,3.003239
1741,49,1074,2.80566,2.429258,3.355364,2.0,2.863428
2603,49,235,2.804214,2.346947,3.351778,2.0,2.834313
4190,49,312,2.624192,2.354843,3.559014,3.0,2.846016


### Ensemble Method 2: Linear Regression with CF Predictions as Features

First trial on ensemble

In [None]:
# Ensemble 2: Feed in CF results into linear regression

features_en = ['itm_based_cf', 'usr_based_cf']
target_en = 'ratings'

X_en = combined_df[features_en]
Y_en = combined_df[target_en]

xtrain_en,xtest_en,ytrain_en,ytest_en = train_test_split(X_en,Y_en,test_size=0.2,random_state=100)

# Fit the scaler on the training split only and reuse it for the test split
scaler_en = preprocessing.StandardScaler().fit(xtrain_en)
xtrain_en_scaled_df = pd.DataFrame(scaler_en.transform(xtrain_en),
                             columns=xtrain_en.columns,index=xtrain_en.index)
xtest_en_scaled_df = pd.DataFrame(scaler_en.transform(xtest_en),
                             columns=xtest_en.columns,index=xtest_en.index)

ensemble_linearmodel = LinearRegression()
ensemble_linearmodel.fit(xtrain_en_scaled_df,ytrain_en)

ensemble_linear_test_pred = np.clip(ensemble_linearmodel.predict(xtest_en_scaled_df),1,5)

print(f'MSE for ensemble linear regression is: {round(mean_squared_error(ytest_en,ensemble_linear_test_pred),4)}')
print(f'MAE for ensemble linear regression is: {round(mean_absolute_error(ytest_en,ensemble_linear_test_pred),4)}')

MSE for ensemble linear regression is: 0.8669
MAE for ensemble linear regression is: 0.7273


### Performance Comparison

In [None]:
print('Ensemble: Simple Average')
# Double-quoted f-strings so the nested ['...'] lookups also parse on Python < 3.12
print(f"MSE of Simple-Average Ensemble is: {round(mean_squared_error(combined_df['ratings'], combined_df['Average Ensemble Score']), 4)}")
print(f"MAE of Simple-Average Ensemble is: {round(mean_absolute_error(combined_df['ratings'], combined_df['Average Ensemble Score']), 4)}")
print('----------------------------------------------------')
print('Ensemble: CF fed into Linear Regression')
print(f'MSE for ensemble linear regression is: {round(mean_squared_error(ytest_en,ensemble_linear_test_pred),4)}')
print(f'MAE for ensemble linear regression is: {round(mean_absolute_error(ytest_en,ensemble_linear_test_pred),4)}')
print('----------------------------------------------------')
print('Linear Regression')
print(f'MSE of L2 Linear Regression is: {round(ridge_mse,4)}')
print(f'MAE of L2 Linear Regression is: {round(ridge_mae,4)}')
print('----------------------------------------------------')
print('Collaborative Filtering')
print(f'MSE of User-based CF is: {round(accuracy.mse(User_KNN_pred,verbose=False),4)}')
print(f'MAE of User-based CF is: {round(accuracy.mae(User_KNN_pred,verbose=False),4)}')
print(f'MSE of Item-Based CF is: {round(accuracy.mse(Item_KNN_pred,verbose=False),4)}')
print(f'MAE of Item-Based CF is: {round(accuracy.mae(Item_KNN_pred,verbose=False),4)}')

Ensemble: Simple Average
MSE of Simple-Average Ensemble is: 0.943
MAE of Simple-Average Ensemble is: 0.7754
----------------------------------------------------
Ensemble: CF fed into Linear Regression
MSE for ensemble linear regression is: 0.8669
MAE for ensemble linear regression is: 0.7273
----------------------------------------------------
Linear Regression
MSE of L2 Linear Regression is: 1.0082
MAE of L2 Linear Regression is: 0.797
----------------------------------------------------
Collaborative Filtering
MSE of User-based CF is: 0.9384
MAE of User-based CF is: 0.7651
MSE of Item-Based CF is: 0.9531
MAE of Item-Based CF is: 0.7705


## Task 6: Model Deployment Strategy

In [None]:
#Task 6
#Simple prediction
#Pass in prediction data
#predict('user_id','movie_id')

pred_userid = input('Enter the user id: ')
pred_movieid = input('Enter the movie id: ')

# Prediction.est holds the estimated rating
userknnpredictedrating = User_KNN_model.predict(pred_userid,pred_movieid).est
itemknnpredictedrating = Item_KNN_model.predict(pred_userid,pred_movieid).est

# Standardize with the training-split statistics of the ensemble features
# (ddof=0 matches StandardScaler, which uses the population standard deviation)
userknnmean = np.mean(xtrain_en['usr_based_cf'])
userknnsd = np.std(xtrain_en['usr_based_cf'], ddof=0)

itemknnmean = np.mean(xtrain_en['itm_based_cf'])
itemknnsd = np.std(xtrain_en['itm_based_cf'], ddof=0)

# Each prediction is scaled with its own feature's mean and standard deviation
userknnzscore = (float(userknnpredictedrating) - userknnmean) / userknnsd
itemknnzscore = (float(itemknnpredictedrating) - itemknnmean) / itemknnsd

#Run final prediction with linear regression ensemble
prediction = ensemble_linearmodel.predict(pd.DataFrame([[itemknnzscore, userknnzscore]],
             columns=['itm_based_cf', 'usr_based_cf']))
prediction = float(np.clip(prediction[0],1,5))
print(f'The expected rating is: {round(prediction,4)}')

Enter the user id: 435
Enter the movie id: 456
The expected rating is: 2.3984


In [None]:
#List production
pred_userid = input('Enter the user id: ')
number_of_suggestions = int(input("Enter the number of movies needed: "))


allmovies = df_copy['movie_id'].astype(str).unique()

watched_movies = (df_copy.loc[df_copy['user_id'].astype(str) == str(pred_userid), 'movie_id'].astype(str).unique())

# Remove already seen items (build the lookup set once, not once per movie)
watched_set = set(watched_movies)
not_watched_movies = [i for i in allmovies if i not in watched_set]

user_knn_pred = []
for movieid in not_watched_movies:
    pred = User_KNN_model.predict(pred_userid, movieid)
    user_knn_pred.append({'movie_id': movieid, 'usr_based_cf': pred.est})
userknnpredictedrating_df = pd.DataFrame(user_knn_pred)

item_knn_pred = []
for movieid in not_watched_movies:
    pred = Item_KNN_model.predict(pred_userid, movieid)
    item_knn_pred.append({'movie_id': movieid, 'itm_based_cf': pred.est})
itemknnpredictedrating_df = pd.DataFrame(item_knn_pred)

prediction_df = userknnpredictedrating_df.merge(itemknnpredictedrating_df, on='movie_id', how='outer')

prediction_df['userknnzscore'] = (prediction_df['usr_based_cf'] - userknnmean) / userknnsd
prediction_df['itemknnzscore'] = (prediction_df['itm_based_cf'] - itemknnmean) / itemknnsd

X = prediction_df[['itemknnzscore', 'userknnzscore']].copy()
X.rename(columns={'itemknnzscore': 'itm_based_cf', 'userknnzscore': 'usr_based_cf'}, inplace=True)

final_prediction = ensemble_linearmodel.predict(X)
prediction_df['Final Prediction'] = final_prediction

prediction_df = prediction_df.sort_values(by='Final Prediction', ascending=False)

movies_recommend_list = prediction_df['movie_id'].iloc[:number_of_suggestions]
movies_recommend_list

Enter the user id: 653
Enter the number of movies needed: 66


Unnamed: 0,movie_id
314,1293
682,1653
368,1342
205,1191
642,1613
...,...
427,1396
136,1125
481,1452
429,1398
