# MOVIE RECOMMENDATION
**Introduction**

Welcome to my movie recommendation project! In this notebook, I explore the MovieLens dataset and use the Microsoft Recommenders library to build a movie recommendation system based on historical user behavior. The algorithm of choice is SAR (Simple Algorithm for Recommendation), a fast neighborhood-based method that scores items using item-to-item similarity and each user's past interactions. By the end of the notebook, we will have a SAR model that recommends movies to a user based on their past ratings, along with ranking metrics that show how well it does. So, let's get started and dive into the exciting world of movie recommendations!

## Installing required packages

In [None]:
# installing microsoft recommenders library 
!pip install recommenders 

## Importing Required Packages

In [2]:
import sys
import itertools
import logging
import os
import numpy as np
import pandas as pd
from recommenders.datasets import movielens
from recommenders.utils.timer import Timer
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.models.sar import SAR
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set()

## Loading Data

In [3]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

In [4]:
# downloading the dataset
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['UserId', 'MovieId', 'Rating', 'Timestamp'],
    title_col='Title'
)

100%|██████████| 4.81k/4.81k [00:00<00:00, 16.0kKB/s]


## Exploratory Data Analysis

In [5]:
# Display top 5 rows
data.head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp,Title
0,196,242,3.0,881250949,Kolya (1996)
1,63,242,3.0,875747190,Kolya (1996)
2,226,242,5.0,883888671,Kolya (1996)
3,154,242,3.0,879138235,Kolya (1996)
4,306,242,5.0,876503793,Kolya (1996)


In [6]:
# information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   UserId     100000 non-null  int64  
 1   MovieId    100000 non-null  int64  
 2   Rating     100000 non-null  float64
 3   Timestamp  100000 non-null  int64  
 4   Title      100000 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 4.6+ MB


In [7]:
# display descriptive statistics
data.describe(include = 'all')

Unnamed: 0,UserId,MovieId,Rating,Timestamp,Title
count,100000.0,100000.0,100000.0,100000.0,100000
unique,,,,,1664
top,,,,,Star Wars (1977)
freq,,,,,583
mean,462.48475,425.53013,3.52986,883528900.0,
std,266.61442,330.798356,1.125674,5343856.0,
min,1.0,1.0,1.0,874724700.0,
25%,254.0,175.0,3.0,879448700.0,
50%,447.0,322.0,4.0,882826900.0,
75%,682.0,631.0,4.0,888260000.0,


In [8]:
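# Group the data by movie title and compute the average rating
# (only the Rating column is averaged; a mean of UserId, MovieId or Timestamp is not meaningful)
grouped_data = data.groupby('Title')['Rating'].mean().reset_index()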

# Use plotly to create an interactive version of the scatter plot
fig = px.scatter(grouped_data, x='Rating', y='Title', hover_data=['Rating'], color='Rating')

# Show the plot
fig.show()

## Feature Engineering

In [9]:
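# Convert the float precision to 32-bit in order to reduce memory consumption
# (plain column assignment is used because `.loc[:, col] = ...` can keep the old dtype in recent pandas versions)
data['Rating'] = data['Rating'].astype(np.float32)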

data.head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp,Title
0,196,242,3.0,881250949,Kolya (1996)
1,63,242,3.0,875747190,Kolya (1996)
2,226,242,5.0,883888671,Kolya (1996)
3,154,242,3.0,879138235,Kolya (1996)
4,306,242,5.0,876503793,Kolya (1996)
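
To show why the 32-bit conversion helps, here is a small sketch on synthetic data. The series below is made up for illustration and is not the MovieLens `Rating` column; it only shows that a `float32` column needs half the bytes of a `float64` column of the same length.

```python
import numpy as np
import pandas as pd

# Synthetic ratings column with 100,000 values (illustrative data only)
ratings64 = pd.Series(np.full(100_000, 3.5), dtype=np.float64)
ratings32 = ratings64.astype(np.float32)

# Memory used by the values alone, in bytes (index excluded)
bytes64 = ratings64.memory_usage(index=False)
bytes32 = ratings32.memory_usage(index=False)
print(bytes64, bytes32)
```

For a single numeric column the saving is small, but it adds up on the larger MovieLens variants.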


In [10]:
header = {
    "col_user": "UserId",
    "col_item": "MovieId",
    "col_rating": "Rating",
    "col_timestamp": "Timestamp",
    "col_prediction": "Prediction",
}
# Stratified splitting is used so that every user appears in both the train and test datasets
train, test = python_stratified_split(data, ratio=0.75, col_user=header["col_user"], col_item=header["col_item"], seed=42)

# Print summary counts for the split data
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['UserId'].unique()),
    train_items=len(train['MovieId'].unique()),
    test_total=len(test),
    test_users=len(test['UserId'].unique()),
    test_items=len(test['MovieId'].unique()),
))


Train:
Total Ratings: 74992
Unique Users: 943
Unique Items: 1601

Test:
Total Ratings: 25008
Unique Users: 943
Unique Items: 1532
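
To make the idea of a per-user split concrete, here is a minimal pandas sketch on a toy frame. It is not how `python_stratified_split` is implemented; the helper `split_per_user` and the toy data are hypothetical and only illustrate why every user ends up in both sets.

```python
import numpy as np
import pandas as pd

def split_per_user(df, ratio=0.75, col_user="UserId", seed=42):
    """Hypothetical helper: sample `ratio` of each user's rows into train, the rest into test."""
    train_part = df.groupby(col_user).sample(frac=ratio, random_state=seed)
    test_part = df.drop(train_part.index)
    return train_part, test_part

# Toy data: 3 users with 8 ratings each (illustrative only)
toy = pd.DataFrame({
    "UserId": np.repeat([1, 2, 3], 8),
    "MovieId": np.tile(np.arange(8), 3),
    "Rating": 3.0,
})

toy_train, toy_test = split_per_user(toy)
print(len(toy_train), len(toy_test))
print(sorted(toy_train["UserId"].unique()), sorted(toy_test["UserId"].unique()))
```

Because the sampling happens inside each user's group, no user is missing from either set, which matches the 943 unique users reported for both train and test above.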



## Model Training & Prediction

In [11]:
# set log level to DEBUG
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

# Create SAR model with specified parameters
model = SAR(
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    time_now=None, 
    timedecay_formula=True, 
    **header
)

# Train the model and time the training process
with Timer() as train_time:
    model.fit(train)

# Print the time taken for training
print("Took {} seconds for training.".format(train_time.interval))


Took 0.4072623800002475 seconds for training.
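
The two main ingredients configured above can be sketched with NumPy on toy data. This is a simplified illustration, not the library's implementation: the interaction matrix, timestamps and ratings are made up, and the decay formula (a half-life of `time_decay_coefficient` days, measured from the latest timestamp when `time_now=None`) reflects my reading of the SAR documentation.

```python
import numpy as np

# Toy user-item interaction matrix (rows: users, columns: items); illustrative data only
interactions = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
])

# Co-occurrence: C[i, j] = number of users who interacted with both item i and item j
cooccurrence = interactions.T @ interactions

# Jaccard similarity: C[i, j] / (C[i, i] + C[j, j] - C[i, j])
item_counts = np.diag(cooccurrence)
jaccard = cooccurrence / (item_counts[:, None] + item_counts[None, :] - cooccurrence)
print(np.round(jaccard, 3))

# Time-decayed rating: rating * 2 ** (-(t_ref - t) / half_life)
SECONDS_PER_DAY = 24 * 60 * 60
half_life = 30 * SECONDS_PER_DAY                      # mirrors time_decay_coefficient=30
timestamps = np.array([0, 30, 60]) * SECONDS_PER_DAY  # three ratings, 30 days apart
ratings = np.array([4.0, 4.0, 4.0])
t_ref = timestamps.max()                              # assumed reference time when time_now=None
decayed = ratings * 2.0 ** (-(t_ref - timestamps) / half_life)
print(decayed)
```

With a 30-day half-life, a rating given 30 days before the reference time counts half as much as one given at the reference time, and a rating from 60 days before counts a quarter as much.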


In [12]:
# Time the prediction process using the testing set
with Timer() as test_time:
    # Generate top-k recommendations for each user in the testing set
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)
    
# Print the time taken for prediction
print("Took {} seconds for prediction.".format(test_time.interval))


Took 0.1103548929995668 seconds for prediction.
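
Conceptually, the recommendation step multiplies each user's affinity vector by the item similarity matrix, hides items the user has already seen, and keeps the highest scores. The sketch below shows that idea with NumPy; the matrices are hypothetical and the code is a simplification rather than what `recommend_k_items` does internally.

```python
import numpy as np

# Hypothetical inputs: affinity is users x items, similarity is items x items
affinity = np.array([
    [5.0, 0.0, 0.0],
    [0.0, 3.0, 4.0],
])
similarity = np.array([
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.6],
    [0.25, 0.6, 1.0],
])

# Score every item for every user
scores = affinity @ similarity

# Hide items the user has already interacted with (the idea behind remove_seen=True)
scores[affinity > 0] = -np.inf

# Keep the k best remaining items per user (k=1 for this tiny example)
k = 1
top_items = np.argsort(-scores, axis=1)[:, :k]
print(top_items)
```

The first toy user has only interacted with item 0, so the best unseen item is the one most similar to it (item 1). The second user has seen items 1 and 2, leaving item 0 as the only candidate.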


In [13]:
# Join the top-k recommendations with the movie titles using the MovieId column
# from the original dataset, and sort the resulting dataframe by UserId and Prediction
top_k_with_titles = (top_k.join(data[['MovieId', 'Title']].drop_duplicates().set_index('MovieId'), 
                                on='MovieId', 
                                how='inner')
                     .sort_values(by=['UserId', 'Prediction'], ascending=False))

# Display the top-k recommendations with titles
display(top_k_with_titles.head(10))

Unnamed: 0,UserId,MovieId,Prediction,Title
9420,943,82,21.313228,Jurassic Park (1993)
9421,943,403,21.15884,Batman (1989)
9422,943,568,20.962922,Speed (1994)
9423,943,423,20.16217,E.T. the Extra-Terrestrial (1982)
9424,943,89,19.890512,Blade Runner (1982)
9425,943,393,19.832944,Mrs. Doubtfire (1993)
9426,943,11,19.570244,Seven (Se7en) (1995)
9427,943,71,19.553877,"Lion King, The (1994)"
9428,943,202,19.422129,Groundhog Day (1993)
9429,943,238,19.115604,Raising Arizona (1987)


## Model Evaluation

In [14]:
# Define the arguments that will be used for all ranking metrics
args = [test, top_k]
kwargs = dict(
    col_user='UserId', 
    col_item='MovieId', 
    col_rating='Rating', 
    col_prediction='Prediction', 
    relevancy_method='top_k', 
    k=TOP_K
)

# Compute the Mean Average Precision (MAP) metric
eval_map = map_at_k(*args, **kwargs)

# Compute the Normalized Discounted Cumulative Gain (NDCG) metric
eval_ndcg = ndcg_at_k(*args, **kwargs)

# Compute the Precision at K (Precision@K) metric
eval_precision = precision_at_k(*args, **kwargs)

# Compute the Recall at K (Recall@K) metric
eval_recall = recall_at_k(*args, **kwargs)


In [15]:
# Print the evaluation results
print("Model:",
      f"Top K:\t\t {TOP_K}",
      f"MAP:\t\t {eval_map:f}",
      f"NDCG:\t\t {eval_ndcg:f}",
      f"Precision@K:\t {eval_precision:f}",
      f"Recall@K:\t {eval_recall:f}", sep='\n')

Model:
Top K:		 10
MAP:		 0.095544
NDCG:		 0.350232
Precision@K:	 0.305726
Recall@K:	 0.164690
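
To help read these numbers, here is a sketch of how Precision@K and Recall@K work for a single user. The helper `precision_recall_at_k` and the toy item lists are hypothetical; the library functions used above work on whole DataFrames and average over all users, so details may differ.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Hypothetical helper: precision@k and recall@k for a single user."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

recommended = [82, 403, 568, 423, 89]  # ranked recommendations (toy example)
relevant = [403, 89, 11, 71]           # items the user rated in the test set (toy example)

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Here 2 of the 5 recommendations are relevant (precision 0.4), and they cover 2 of the user's 4 relevant items (recall 0.5).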


## Conclusion
In this notebook, I explored the MovieLens 100k dataset and used the Microsoft Recommenders library to build a movie recommendation system with the SAR algorithm. On the held-out 25% test split with Top K = 10, the model reached a MAP of 0.095544, NDCG of 0.350232, Precision@K of 0.305726, and Recall@K of 0.164690. In other words, roughly 3 of every 10 recommended movies appear among the movies the user actually rated in the test set, which is a reasonable baseline for a model that trains in under a second.

Overall, this notebook shows that SAR is a practical starting point for movie recommendations and provides a foundation for further exploration and optimization, for example by trying other similarity types or time-decay settings. The system can suggest movies to users based on their past ratings, which could lead to improved user engagement and satisfaction. With further refinement and testing, it could potentially be used in real-world scenarios, benefiting both users and businesses in the movie industry.