# **4. Neighborhood Collaborative Filtering**

---
## Outline:

1. Background
2. Simplified Workflows.
3. Importing Data
4. Data Preparations
5. Data Preprocessing
6. Modeling
7. Hyperparameter Tuning
8. Evaluation
9. Decision Process (Recommendation Process)

# **Background**
---

## Problem Description
---

- A streaming platform **nonton-yuk.com** are having a problem with its user retention.
- In 3 months, the user retention rate dropped almost 15% which really affects **nonton-yuk.com** revenues.
- After doing an urgent user research, **nonton-yuk.com** teams found that **users find it difficult** to browse movie in **nonton-yuk.com** which has nearly ~7,000 movies.

## Business Objective
---

Our business objective would be **increasing user retention** to **15%** (assumed ofcourse) in 3 months.

## Solution
---


- We can create a **movie recommendation** to help **users browse** the movie **easily** --> remove the users difficulty in using **nonton-yuk.com** platform.

The goal, of our recommendation is to recommend movies that user might like, however we can't directly measure how like user to a movies, thus we need to define what's called as **proxy** label.

To approach those, some appropriate proxy labels are :    
- Scale of rating (star) user  given to a movie
- User click the movie
- etc

Considering the data we have, we have only records of **rating** data given from user to certain movies, thus we will choose **ratings given** as proxy label from item liked

We can move further into machine learning task.

**Our task** is to predict number of stars given from user to a movie.

With  stars itself is in continous value, hence we can conclude it as **regression task**

We now have a clearer picture what we should do, However we need more precise solution in recommender system context.

Some recommendations approach:
1. **Non-personalized**: recommendation by popularity
2. **Personalized**: collaborative filtering

Approach in Personalized Recommender System can be divided based on the presence of interaction data (explicit / implicit) data:     

1. When the interaction data is not exists, the solution that can be implemented is using content feature, **Content Based** Filtering

2. When the interaction data is exists, we can use **Collaborative** Filtering

<center>
<img src ="../assets/Content-based-filtering-and-Collaborative-filtering-recommendation.png" >

<a href=https://www.researchgate.net/publication/331063850/figure/fig3/AS:729493727621125@1550936266704/Content-based-filtering-and-Collaborative-filtering-recommendation.ppm>Source</a> </center>

Due to presence of interaction, in this case rating data, we will not using **Content Based** filtering, instead we will use collaborative filtering

## Model Metrics
---
We have already established some points :
- Our task is to predict stars that will be given by users to certain movies
- We will use Collaborative Filtering approach

Regarding those, we need to measure the success of our model ( metrics), based on the points mentioned, our goal is to predict as close as possible the predicted rating to user true rating,



We want to minimize $(\text{True Rating - Predicted Rating})$, some choices of appropriate metrics are :     
- Mean Absolute Error
- Mean Square Error
- Root Mean Squared Error

Due to its `differentiable` property , we will choose **MSE/RMSE** as our model metrics

## Data Description
---

- The data is obtained from [Movielens dataset](https://grouplens.org/datasets/movielens/).
- It contains ~100K ratings from 1,000 users and 1,700 movies.

There are two files that we use:

**The movie rating data** : `rating.csv`

<center>

|Features|Descriptions|Data Type|
|:--|:--|:--:|
|`userId`|The user ID|`int`|
|`movieId`|The movie ID|`int`|
|`rating`|Rating given from user to movie. Ranging from `0` to `5`|`float`|



**The movie metadata** : `movies.csv`

<center>

|Features|Descriptions|Data Type|
|:--|:--|:--:|
|`movieId`|The movie ID|`int`|
|`title`|The movie ID title|`str`|
|`genres`|The movie ID genres|`str`|

# **Recommender System Workflow** (Simplified)
---

## 1. Importing Data

1. Load the data.
2. Check the shape & type of data.
3. Handle the duplicates data to maintain data validity.

## 2.Modelling : Collaborative Filtering Approach

1. Creating Utility Matrix
2. Training + Model Selection  :     
    - Baseline Approach
    - User to User Collaborative Filtering
    - Item to Item Collaborative Filtering
4. Evaluating Model

## 3. Generating Recommendation / Predictions

1. Predict recommendation of user-i to unrated item-j
2. Predict recommendation of user-i to all their unrated items

# **1. Importing Data**
---

What do we do?
1. Load the data.
2. Check the shape & type of data.
3. Handle the duplicates data to maintain data validity.

## Load the data

In [1]:
# Load this library
import numpy as np
import pandas as pd

Load the data from given data path

In [2]:
rating_path = '../data/ratings.csv'

rating_data = pd.read_csv(rating_path,
                          delimiter = ',')

rating_data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Check data shapes & types

In [3]:
# Check data shapes
rating_data.shape

(100836, 4)

The data has 4 feature with ~100,000 user rating recorded.

In [4]:
# Check data types
rating_data.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

Correct features type: `userId`, `movieId`, and `rating`

Wait, we do not need `timestamp` features for now, so we drop it.

In [5]:
# Drop timestamp
rating_data_dropped = rating_data.drop(columns=['timestamp'], axis=1)

# Validate
assert len(rating_data_dropped.columns) == 3
assert rating_data_dropped.columns.tolist() == ['userId', 'movieId', 'rating']
assert len(rating_data_dropped) == len(rating_data)

In [6]:
rating_data_dropped.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [7]:
rating_data_dropped.shape

(100836, 3)

## Handling duplicates data


We need to check that there is no user ID that rates similar movie ID more than one.

In [8]:
# Check duplicate data
rating_data_dropped.duplicated(subset=['userId', 'movieId']).sum()

0

Great! Our data is free from duplicate rates.

**Note**
- If you have a user ID rates similar movie ID more than one, you can keep the most up to date ones & drop the rest.

## Create load function

Finally, we can create load data function

In [9]:
def load_rating_data(rating_path,sampling_frac= 0.01):
    """
    Function to load data & remove from duplicates

    Parameters
    ----------
    rating_path : str
        The path of rating data

    Returns
    -------
    rating_data : pandas DataFrame
        The sample of rating data
    """
    # Load data
    rating_data_raw = pd.read_csv(rating_path, delimiter=',')
    print('Original data shape :', rating_data_raw.shape)

    # Drop timestamp
    rating_data = rating_data_raw.drop(columns=['timestamp'], axis=1)
    print('Dropped data shape  :', rating_data.shape)

    # sample movie
    # collect unique  movieid to sample
    movie_id_take = rating_data['movieId'].sample(frac=0.01)
    sampled_data = rating_data.loc[rating_data['movieId'].isin(movie_id_take)]

    return sampled_data


In [10]:
# Load rating data
rating_data = load_rating_data(rating_path = rating_path)

Original data shape : (100836, 4)
Dropped data shape  : (100836, 3)


In [11]:
rating_data.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
3,1,47,5.0
4,1,50,5.0
5,1,70,3.0
6,1,101,5.0


# **2. Modelling**: Collaborative Filtering Approach
---

## Background
---

<center>
<img src="../assets/collaborative_user.jpg" width=600>


In this section we will focus on Neighborhood Collaborative Filtering Approach.

Neighborhood Collaborative Filtering Approach work by finding similarities either in users or items and calculate predicted rating by averaging rating from its neighbor.

Based on the object to search similarity, Neighborhood CF can be divided as :    
1. User to User Collaborative Filtering
2. Item to Item Collaborative Filtering

### User to User  / Item to Item Collaborative Filtering
---

<center>
<img src="../assets/collaborative_full_flow.jpg" width=600>

### Item  to Item  / Item to Item Collaborative Filtering
---

<center>
<img src="../assets/collaborative_full_flow.jpg" width=600>

## Workflow
---

To create a personalized RecSys, we can follow these steps:

1. Data Preparation --> Create utility matrix & Split Train-Test
2. Train recommendation model --> Baseline, User to User CF (KNN) & Item to Item CF (KNN)
3. Choosing Best Model
5. Evaluate Final Model

## Implementing Model From Scratch
---

Now, we are going to demonstrate  how  neighborhood collaborative filtering works in terms of modelling and giving recommendation.


During training process we only calculate similarity between users / items , however if the computation is not feasible we can calculate similarity later during prediction process.

During prediction / recommendation we will find nearest neighbor to predict rating that will be given from user

Before preceeding to the next step , we need to convert our rating_data into format that can be used to calculate similarity

One suitable format to calculate similarity is by converting / fixing our dataframe into **utility matrix** like dataframe


| User/Item | Item A | .. | Item Nth |
|:---------:|--------|----|----------|
| User A    |        |    |          |
| ..        |        |    |          |
| User Nth  |        |    |          |



We can achieve those by using `pd.pivot`

In [12]:
rating_data

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
3,1,47,5.0
4,1,50,5.0
5,1,70,3.0
6,1,101,5.0
...,...,...,...
100780,610,139385,4.5
100792,610,142488,3.5
100799,610,147657,4.0
100814,610,158238,5.0


In [13]:
rating_data_pivot = rating_data.pivot(index= 'userId', columns= 'movieId', values= 'rating')

In [14]:
#take a look after pivoted
rating_data_pivot.head()

movieId,1,2,7,10,16,25,26,27,29,31,...,161354,164909,165639,166643,167790,168252,168254,171751,176371,183897
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,0.5,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [15]:
#check shape
rating_data_pivot.shape

(610, 807)

Now, our data shape become (610, 9724)

In [16]:
rating_data_pivot.isnull().sum().sum()

456220

we can see that after pivoting data, there are lot of missing values, however we can't measure similarity using missing data, we need imputation later

Now, our data is free of missing values

For easier preparing data we will create a function `prepare_utility_dataframe`

In [17]:
def prepare_utility_dataframe(rating_path) :

    """
    Function to prepare rating data into pivoted rating_data (utility matrix form)

    Parameters
    ----------
    rating_path : str
        The path of rating data

    Returns
    -------
    rating_data_pivot : pandas DataFrame
        rating data in pivoted format





    """

    # load data
    rating_data = load_rating_data(rating_path)

    # perform pivot
    rating_data_pivot = rating_data.pivot(index= 'userId', columns= 'movieId', values= 'rating')

    # print pivoted data shape
    print('Data Shaped After Pivot', rating_data_pivot.shape)

    # checking missing values
    print('Number of missing values after pivot',rating_data_pivot.isnull().sum().sum() )


    # return data
    return rating_data_pivot


In [18]:
# check function
rating_data_pivot = prepare_utility_dataframe(rating_path = rating_path)

Original data shape : (100836, 4)
Dropped data shape  : (100836, 3)
Data Shaped After Pivot (610, 800)
Number of missing values after pivot 452143


In [19]:
rating_data_pivot.head()

movieId,1,2,4,5,6,11,17,21,23,30,...,162478,165549,165639,166643,171695,177615,178615,179817,189381,193587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,,,4.0,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,3.0,,,...,,,,,,,,,,
5,4.0,,,,,,,4.0,,,...,,,,,,,,,,


**Great!**, our function works properly

Now, we are ready to move to the next step

### User to User  / Item to Item Collaborative Filtering
---

To illustrate on how create prediction on neighborhood collaborative filtering, we will only higlight user to user collaborative filtering only, in item to item collaborative filtering the principle is the same,

During the training process we only calculate similarity between all users, however this process is expensive, we can move this process to be perform only on recommendation / prediction phase

Thus, we will directly demonstrate how to generate recommendation by using user to user collaborative filtering

For  example, we will use recommendation that will be given to `userId=1`

To generate prediction / recommendation to a user, we need to predict **rating** that will be given by **userId=1** to an items

Recall again the prediction function


$$
\begin{align*}
\hat{r_{ui}} = \text{baseline}_{ui} + \frac{\sum_{j \in N(u)} (\text{Similarity}(u,j) \cdot (r_{ji}-\text{baseline}_{ji}) }{\sum_{j \in N(u)} \text{Similarity}(u,j)}
& \\ \\
\text{baseline}_{ui} = \mu + \text{userbias}_{u} + \text{itembias}_{i}
\end{align*}
$$


with :    

- $\text{baseline}_{ui}$ : baseline ratings from user **u** on item **i**
- $\hat{r_{ui}}$ : predicted ratings from user **u** on item **i**
- $N(u)$ : Neighbors from user **u**


$$
\begin{align*}
\text{userbias}~{u} = \mu - \text{user-average}~{u}
& \\ \\
\text{itembias}~{i} = \mu - \text{item-average}~{i}
\end{align*}
$$

with :    

- $\mu$ : global mean
- $\text{user-average}~{u}$ : average of rating from user **u**
- $\text{item-average}~{i}$ : average of rating from item **i**


For easier example, we will try to predict rating from **userId=1** on **itemId=1**

We calculate the baseline first

#### Calculate Baseline

In [20]:
# calculate baseline rating on user 1 and item 2
userid = 1
movieid = 2

# calculate global mean
global_mean = rating_data['rating'].mean()

# calculate user mean
user_mean = rating_data_pivot.loc[userid,:].mean()

# calculate item mean
item_mean = rating_data_pivot.loc[:, movieid].mean()

# print all
print(f'userId 1 mean {user_mean}, movieId 1 mean {item_mean} , global mean {global_mean}')

userId 1 mean 4.401960784313726, movieId 1 mean 3.4318181818181817 , global mean 3.665353675450763


In [21]:
# calculate user bias
user_bias = global_mean - user_mean

# calculate item bias
item_bias = global_mean - item_mean

# print all
print(f'userId 1 bias {user_bias}, movieId 2 bias {item_bias} ')

userId 1 bias -0.7366071088629629, movieId 2 bias 0.2335354936325813 


In [22]:
# calculate baseline
baseline_ui = global_mean + user_bias + item_bias

# print
print('Baseline rating prediction for user 1 and movies 2 ',baseline_ui)

Baseline rating prediction for user 1 and movies 2  3.1622820602203814


We have seen that calculating baseline require some steps, we will create function to help the process

In [23]:
def baseline_prediction(rating_data_pivot,userid,movieid,
                        rating_data=rating_data) :
    """Function to calculate baseline prediction from user and movie """

    # calculate global mean
    global_mean = rating_data['rating'].mean()

    # calculate user mean
    user_mean = rating_data_pivot.loc[userid,:].mean()

    # calculate item mean
    item_mean = rating_data_pivot.loc[:,movieid].mean()

    # calculate user bias
    user_bias = global_mean - user_mean

    # calculate item bias
    item_bias = global_mean - item_mean

    # calculate baseline
    baseline_ui = global_mean + user_bias + item_bias

    return baseline_ui


Now, it's time to validate our function, if it works properly

In [24]:
baseline_ui = baseline_prediction(rating_data_pivot= rating_data_pivot,
                                  userid = 1, movieid= 2)

Our result on baseline prediction on user 1 and item 2 is 2.217361407947919, same as our manual steps, Great!

#### Find Closest Neighbor

Now we will find closest neighbour to predict the value of the rating

To find closest neighbor, we need to calculate similarities between users,
- We will use cosine similarity as function
- Before calculating similarity we need to remove user mean



- Since our data contain, missing value we need to impute those, to be able to calculate similarities
- we will replace missing value with 0


- Number of closest neighbor we choose is **5** , later we will experiment this

In [25]:
user_mean = rating_data_pivot.mean(axis=0 )
user_mean

movieId
1         3.920930
2         3.431818
4         2.357143
5         3.071429
6         3.946078
            ...   
177615    3.333333
178615    3.500000
179817    3.833333
189381    2.500000
193587    3.500000
Length: 800, dtype: float64

In [26]:
user_removed_mean_rating = (rating_data_pivot - user_mean).fillna(0)
user_removed_mean_rating.head()

movieId,1,2,4,5,6,11,17,21,23,30,...,162478,165549,165639,166643,171695,177615,178615,179817,189381,193587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.07907,0.0,0.0,0.0,0.053922,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.494382,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.07907,0.0,0.0,0.0,0.0,0.0,0.0,0.505618,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


loop all user to calculate all similarity


In [27]:
# we will use sklearn to measure cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# add a progress bar, it need some times to finish.
from tqdm import tqdm

In [28]:
# Generate the similarity score
n_users = len(user_removed_mean_rating.index)
similarity_score = np.zeros(n_users)

# get user 1 rating vector
user_target = user_removed_mean_rating.loc[userid].values.reshape(1,-1)

# Iterate all users
for i, neighbor in enumerate(tqdm(user_removed_mean_rating.index)):
    # Extract neighbor user vector
    user_neighbor = user_removed_mean_rating.loc[neighbor].values.reshape(1,-1)

    # 2. Calculate the similarity (we use Cosine Similarity)
    sim_i = cosine_similarity(user_target, user_neighbor)

    # Append
    similarity_score[i] = sim_i

  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similari

In [29]:
# Sort in descending orders of similarity_score
sorted_idx = np.argsort(similarity_score)[::-1]

# Return the n top similar track_id
n = 5

# Get user closest neighbor
closest_neighbor = user_removed_mean_rating.index[sorted_idx[1:n+1]]
closest_neighbor

Index([597, 380, 452, 171, 414], dtype='int64', name='userId')

Now, we already have user 1 closest neigbor,
- User 555
- User 597
- User 171
- User 452
- User 414



We will create function to find closest neighbor from given user id

In [30]:
def find_neighbor(user_removed_mean_rating,userid,k=5) :
    # Generate the similarity score
    n_users = len(user_removed_mean_rating.index)
    similarity_score = np.zeros(n_users)

    # get user 1 rating vector
    user_target = user_removed_mean_rating.loc[userid].values.reshape(1,-1)

    # Iterate all users
    for i, neighbor in enumerate(tqdm(user_removed_mean_rating.index)):
        # Extract neighbor user vector
        user_neighbor = user_removed_mean_rating.loc[neighbor].values.reshape(1,-1)

        # Calculate the similarity (we use Cosine Similarity)
        sim_i = cosine_similarity(user_target, user_neighbor)

        # Append
        similarity_score[i] = sim_i

    # Sort in descending orders of similarity_score
    sorted_idx = np.argsort(similarity_score)[::-1]

    # sort similarity score , descending
    similarity_score = np.sort(similarity_score)[::-1]

    # get user closest neighbor
    closest_neighbor = user_removed_mean_rating.index[sorted_idx[1:k+1]].tolist()

    # slice neighbour similarity
    neighbor_similarity = list(similarity_score[1:k+1])

    # return closest_neighbor
    return {
        'closest_neighbor' : closest_neighbor,
        'closest_neighbor_similarity' :neighbor_similarity
    }

In [31]:
find_neighbor(user_removed_mean_rating= user_removed_mean_rating, userid= 1, k=5)

  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similari

{'closest_neighbor': [597, 380, 452, 171, 414],
 'closest_neighbor_similarity': [0.22278337731102887,
  0.21308726565455455,
  0.20536297009014445,
  0.2047715075047761,
  0.1901201938510903]}

We already have the closest neighbor, and its similarity score

#### Predict Rating

It's time to calculate rating

In [32]:
rating_data_pivot

movieId,1,2,4,5,6,11,17,21,23,30,...,162478,165549,165639,166643,171695,177615,178615,179817,189381,193587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,,,4.0,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,3.0,,,...,,,,,,,,,,
5,4.0,,,,,,,4.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,2.5,4.0,,,,...,,,,,,,,,,
607,4.0,,,,,3.0,,,,,...,,,,,,,,,,
608,2.5,2.0,,,,,,3.5,,,...,,,,,,,,,,
609,3.0,,,,,,,,,,...,,,,,,,,,,


$$
\begin{align*}
\hat{r_{ui}} = \text{baseline}_{ui} + \frac{\sum_{j \in N(u)} (\text{Similarity}(u,j) \cdot (r_{ji}-\text{baseline}_{ji}) }{\sum_{j \in N(u)} \text{Similarity}(u,j)}
& \\ \\
\text{baseline}_{ui} = \mu + \text{userbias}_{u} + \text{itembias}_{i}
\end{align*}
$$

In [33]:

n_neighbors = 5
neighbor_data = find_neighbor(user_removed_mean_rating= user_removed_mean_rating, userid= 1, k=n_neighbors)

# calculate baseline (u,j)

# for sum
sim_rating_total = 0
similarity_sum = 0
# loop all over neighbor
for i in range(n_neighbors) :
    # retrieve rating from neighbor
    neighbour_rating  = rating_data_pivot.loc[neighbor_data['closest_neighbor'][i],2]
    print(neighbour_rating)
    # skip if nan
    if np.isnan(neighbour_rating) :
        continue

    # calculate baseline (ji)
    baseline = baseline_prediction(rating_data_pivot= rating_data_pivot,
                                  userid = neighbor_data['closest_neighbor'][i], movieid= 2)

    # substract baseline from rating
    adjusted_rating = neighbour_rating - baseline

    # multiply by similarity
    sim_rating = neighbor_data['closest_neighbor_similarity'][i]*adjusted_rating

    # sum similarity * rating
    sim_rating_total+= sim_rating

    # sum similarity
    similarity_sum += neighbor_data['closest_neighbor_similarity'][i]

user_item_predicted_rating = baseline + (sim_rating_total / similarity_sum)

print('Predicted rating for user 1, and item 1' ,user_item_predicted_rating)

  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similari

nan
5.0
nan
nan
3.0
Predicted rating for user 1, and item 1 4.15113489834491





Now, we can predict the rating of single item from user, however to generate prediction we need to predict rating for more than 1 items, hence we need to create another function

In [34]:
def predict_item_rating(userid,movieid,rating_data_pivot,neighbor_data,k,
                        max_rating= 5,min_rating= 1) :
    """Function to predict rating on userid and movieid"""

    # remove user mean from rating
    user_mean = rating_data_pivot.mean(axis=0 )
    user_removed_mean_rating = (rating_data_pivot - user_mean).fillna(0)


    # calculate baseline (u,i)
    baseline = baseline_prediction(rating_data_pivot= rating_data_pivot,
                                      userid = userid, movieid= movieid)
    # for sum
    sim_rating_total = 0
    similarity_sum = 0
    # loop all over neighbor
    for i in range(k) :
        # retrieve rating from neighbor
        neighbour_rating  = rating_data_pivot.loc[neighbor_data['closest_neighbor'][i],movieid]

        # skip if nan
        if np.isnan(neighbour_rating) :
            continue

        # calculate baseline (ji)
        baseline = baseline_prediction(rating_data_pivot= rating_data_pivot,
                                      userid = neighbor_data['closest_neighbor'][i], movieid= 2)

        # substract baseline from rating
        adjusted_rating = neighbour_rating - baseline

        # multiply by similarity
        sim_rating = neighbor_data['closest_neighbor_similarity'][i]*adjusted_rating

        # sum similarity * rating
        sim_rating_total+= sim_rating

        #
        similarity_sum += neighbor_data['closest_neighbor_similarity'][i]

    # avoiding ZeroDivisionError
    try :
        user_item_predicted_rating = baseline + (sim_rating_total / similarity_sum)

    except ZeroDivisionError :
        user_item_predicted_rating = baseline

    # checking the boundaries of rating,
    if user_item_predicted_rating > max_rating :
        user_item_predicted_rating = max_rating

    elif user_item_predicted_rating <  min_rating :
        user_item_predicted_rating = min_rating

    return user_item_predicted_rating

we will test the function

In [35]:
predict_item_rating(userid= 1, movieid= 2,
                    rating_data_pivot = rating_data_pivot,
                    neighbor_data= neighbor_data, k= 5)

4.15113489834491

The result is the same like the previous one, great !

#### Generate Recommendation

Now, we will generate recommendation, to generate recomendation we will iterate all over movieId and predict the rating

In [36]:
user_id = 1
# create empty dataframe to store prediction result
prediction_df = pd.DataFrame()
# create list to store prediction result
predicted_ratings = []
# loop all over unrated_movies
mask = np.isnan(rating_data_pivot.loc[user_id])
rating_data_pivot
for movie in rating_data_pivot.columns[mask] :
    # predict rating
    preds = predict_item_rating(userid= user_id, movieid= movie,
                    rating_data_pivot = rating_data_pivot,
                    neighbor_data= neighbor_data, k= 5)

    # append
    predicted_ratings.append(preds)

# assign movieId
prediction_df['movieId'] = rating_data_pivot.columns[mask]

# assign prediction result
prediction_df['predicted_ratings'] = predicted_ratings


We have predicted ratings given from user id, however the the ratings is not ordered yet

The users may have not enough time to watch all movies, usually the recommendation is in Top N recommendation, such as Top 5 items.

In [37]:
# sort values of rating descending
n_items = 5
prediction_df = (prediction_df
                 .sort_values('predicted_ratings',ascending=False)
                 .head(n_items))
prediction_df

Unnamed: 0,movieId,predicted_ratings
44,353,5.0
406,5952,5.0
113,1079,5.0
55,431,5.0
633,94777,5.0


now we will create function to generate recommendation

In [38]:
def recommend_items(rating_data_pivot, userid, n_neighbor, n_items,
                    recommend_seen = False ) :
    """ Function to generate recommendation on given user_id """

    # find neighbor
    neighbor_data = find_neighbor(user_removed_mean_rating= user_removed_mean_rating,
                                  userid= userid, k=n_neighbor)


    # create empty dataframe to store prediction result
    prediction_df = pd.DataFrame()
    # create list to store prediction result
    predicted_ratings = []

    # mask seen item
    mask = np.isnan(rating_data_pivot.loc[user_id])
    item_to_predict = rating_data_pivot.columns[mask]

    if recommend_seen :
      item_to_predict = rating_data_pivot.columns

    # loop all over movie
    for movie in tqdm(item_to_predict) :
        # predict rating
        preds = predict_item_rating(userid= user_id, movieid= movie,
                        rating_data_pivot = rating_data_pivot,
                        neighbor_data= neighbor_data, k= 5)

        # append
        predicted_ratings.append(preds)

    # assign movieId
    prediction_df['movieId'] = rating_data_pivot.columns[mask]

    # assign prediction result
    prediction_df['predicted_ratings'] = predicted_ratings

    #
    prediction_df = (prediction_df
                 .sort_values('predicted_ratings',ascending=False)
                 .head(n_items))

    return prediction_df


In [39]:
recommend_items(rating_data_pivot=rating_data_pivot, userid=1, n_neighbor=5, n_items=5,
                    recommend_seen = False )

  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similari

Unnamed: 0,movieId,predicted_ratings
44,353,5.0
406,5952,5.0
113,1079,5.0
55,431,5.0
633,94777,5.0


It's nice, but we do not know what movie on each `movieId`. We need to extract the movie title for each `movieId`

Let's load the movie metadata files

In [40]:
def load_movie_data(movie_path):
    """
    Load movie data from the given path

    Parameters
    ----------
    movie_path : str
        The movie data path

    Returns
    -------
    movie_data : pandas DataFrame
        The movie metadata
    """
    # Load data
    movie_data = pd.read_csv(movie_path,
                             index_col='movieId',
                             delimiter=',')

    print('Movie data shape :', movie_data.shape)
    return movie_data


In [41]:
# Define the movie path
movie_path = 'https://raw.githubusercontent.com/fakhrirobi/recsys_dataset/main/ml-latest-small/movies.csv'

In [42]:
# Load movie data
movie_data = load_movie_data(movie_path = movie_path)

movie_data.head()

Movie data shape : (9742, 2)


Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [43]:
user_1_recommendation = recommend_items(rating_data_pivot=rating_data_pivot, userid=1, n_neighbor=5, n_items=5,
                    recommend_seen = False )

  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similarity_score[i] = sim_i
  similari

In [44]:
user_1_recommendation['title'] = movie_data.loc[user_1_recommendation['movieId'], 'title'].values
user_1_recommendation['genres'] = movie_data.loc[user_1_recommendation['movieId'], 'genres'].values

user_1_recommendation

Unnamed: 0,movieId,predicted_ratings,title,genres
44,353,5.0,"Crow, The (1994)",Action|Crime|Fantasy|Thriller
406,5952,5.0,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
113,1079,5.0,"Fish Called Wanda, A (1988)",Comedy|Crime
55,431,5.0,Carlito's Way (1993),Crime|Drama
633,94777,5.0,Men in Black III (M.III.B.) (M.I.B.³) (2012),Action|Comedy|Sci-Fi|IMAX


Great!, now we can see its movie title with its genres

## Train Recommender System Models
---

**Note on Surprise Library**

To model the personalized RecSys, we will use a well-defined library called with `surprise`. See the [Surprise Docs.](https://surprise.readthedocs.io/en/stable/index.html)

Then load the library

In [45]:
import surprise

Now let's start modeling.

### Load the Data
---

Why we do this **again**? Because we works on a specific library that need specific input format.

So, let's do this

In [46]:
# Import some library
from surprise import Dataset, Reader

Initiate the rating scale

In [47]:
reader = Reader(rating_scale = (1, 5))
reader

<surprise.reader.Reader at 0x7f8c583473d0>

Initiate the data. It must be on format `userId`, `itemId`, and `ratings`, respectively.

### Data Preparation
---

In [48]:
utility_data = Dataset.load_from_df(
                    df = rating_data[['userId', 'movieId', 'rating']].copy(),
                    reader = reader
                )

utility_data

<surprise.dataset.DatasetAutoFolds at 0x7f8c58ed88b0>

Let's print the data

In [49]:
utility_data.df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
3,1,47,5.0
4,1,50,5.0
5,1,70,3.0
6,1,101,5.0


In [50]:
rating_data.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
3,1,47,5.0
4,1,50,5.0
5,1,70,3.0
6,1,101,5.0


Nice! we have similar dataset

### Split Train-Test
---

We then split the train-test data. We use the similar logic from previous section

In [51]:
# Load library for deep copy
import copy

In [52]:
# Create a function
def train_test_split(utility_data, test_size, random_state):
    """
    Train test split the data
    ref: https://surprise.readthedocs.io/en/stable/FAQ.html#split-data-for-unbiased-estimation-py

    Parameters
    ----------
    utility_data : Surprise utility data
        The sample of whole data set

    test_size : float, default=0.2
        The test size

    random_state : int, default=42
        For reproducibility

    Returns
    -------
    full_data : Surprise utility data
        The new utility data

    train_data : Surprise format
        The train data

    test_data : Surprise format
        The test data
    """
    # Deep copy the utility_data
    full_data = copy.deepcopy(utility_data)

    # Generate random seed
    np.random.seed(random_state)

    # Shuffle the raw_ratings for reproducibility
    raw_ratings = full_data.raw_ratings
    np.random.shuffle(raw_ratings)

    # Define the threshold
    threshold = int((1-test_size) * len(raw_ratings))

    # Split the data
    train_raw_ratings = raw_ratings[:threshold]
    test_raw_ratings = raw_ratings[threshold:]

    # Get the data
    full_data.raw_ratings = train_raw_ratings
    train_data = full_data.build_full_trainset()
    test_data = full_data.construct_testset(test_raw_ratings)

    return full_data, train_data, test_data


In [53]:
# Split the data
full_data, train_data, test_data = train_test_split(utility_data,
                                                    test_size = 0.2,
                                                    random_state = 42)

In [54]:
# Validate the splitting
train_data.n_ratings, len(test_data)

(28840, 7210)

Great! The test size is around 20% of all dataset

Now we are ready to create the model.

### Create the Model
---

### Experiment
---

We want to train all model candidate with its hyperparameter so that we can compare which model + settings yield good result

<center>
<img src="../assets/model_config_parameter.png">

Since *Hyperparameter* Is not yielded through learning process, we have to find it / set it to yield optimal model performance

Some methods  Hyperparameter Tuning :    

- GridSearchCV

  Fitting model through all combinations of hyperparameter values and compare each fit → which combinations yield the best objective

- RandomizedSearchCV

  Fitting Model only through sampled hyperparameter candidates. Much more efficient than GridSearchCV


Hyperparameter Tuning require **Cross Validation** , making sure the during hyperparameter selection the performance measured is unbias

<center>Cross Validation</center>
<br>
<center>
<img src="../assets/cross_validation.png" width=600>
<center><a href="https://scikit-learn.org/stable/modules/cross_validation.html">Source</a></center>


**Hyperparameter** in best models
to identify the hyperparameters in models, we need to read its documentation/ paper first,for **surprise** model the model documentation is [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline)
Hyperparameters :    

- *k* (Number of Neigbors)
- similarity function

**Cross Validation** Method

for this we will choose the **K-Fold** Cross Validation

**Hyperparameter Method** : *RandomizedSearchCV*

due to high computational cost if we use *GridSearchCV* we we will use *RandomizedSearchCV* .

To run hyperparameter tuning, **surprise** package already have builtin function to run hyperparameter tuning

**Model Candidate**

1. Baseline Model ( Mean Prediction)  (does not have hyperparameter)
2. User to User CF
3. Item to Item CF

In [55]:
from surprise.model_selection.search import RandomizedSearchCV


In [56]:
# Load the model library
# i.e. Baseline, KNN
from surprise import AlgoBase, KNNBaseline

Our Baseline Model just simply predict ratings using mean from all training data provided to model .

Since `surprise` library does not provide mean prediction model , we have to create custom algorithm first.

Guide `https://surprise.readthedocs.io/en/stable/building_custom_algo.html`

In [57]:
class MeanPrediction(AlgoBase):
    '''Baseline prediction. Return global mean as prediction'''
    def __init__(self):
        AlgoBase.__init__(self)

    def fit(self, trainset):
        '''Fit the train data'''
        AlgoBase.fit(self, trainset)

    def estimate(self, u, i):
        '''Perform the estimation/prediction.'''
        est = self.trainset.global_mean
        return est

**Train Baseline Model**

Since our baseline model does not have hyperparameter, we will only cross validate the model

In [58]:
# Creating baseline model instance
model_baseline = MeanPrediction()
model_baseline

<__main__.MeanPrediction at 0x7f8c58ed9360>

In [59]:
# Import the cross validation module
from surprise.model_selection import cross_validate

To perform cross validate
`cross_validate(algo, data, cv, measures)`

1. `algo` = Surprise model
2. `data` = Surprise format data
3. `cv` = number of fold of cross validation
4. `measures` = metric to measure model performance



In [60]:
# Use full_data for cross validation
# Your results could be different because
# there is no random seed stated within this functions
cv_baseline = cross_validate(algo = model_baseline,
                             data = full_data,
                             cv = 5,
                             measures = ['rmse'])

In [61]:
# Extract CV results
cv_baseline_rmse = cv_baseline['test_rmse'].mean()
cv_baseline_rmse

1.0086984029220138

to perform RandomizedSearchCV, `RandomizedSearchCV(algo_class, param_distributions,cv)`

- **algo_class** : surprise model class, in our case, we are using **KNNBaseline** class, so algo_class=KNNBaseline

- **param_distributions** :  dictionary containing param grid, the values should be in `list`

- **cv** : number of fold to split, commonly 5


KNN model object cannot be allowed to be directly used as **algo_class** parameter, we also need to add previous model setting such as :    
- sim_options to param grid

**Hyperpamater Candidate**

we have two hyperparameters
- k
- similarity functions
- approach : user cf or item cf

candidate values
- k = [5,10,15,..,40]
- similarity function = ['cosine','pearson_baseline']
- user_based : [True, False]


if user_based = True, User to User Collaborative Filtering

if user_based = False, Item to Item Collaborative Filtering

**Train KNN Model**

In [62]:
#create dictionary of parameter
params = {'k':list(np.arange(start=5, stop=40, step=5)),
          'sim_options':{'name':['cosine','pearson_baseline'],'user_based':[True,False]}}

In [63]:
tuning = RandomizedSearchCV(algo_class=KNNBaseline, param_distributions = params,
                   cv=5
                   )

In [64]:
tuning.fit(data=full_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline si

**Performance comparison**

We can summarize the performance

In [65]:
summary_df = pd.DataFrame({'Model': ['Baseline', 'Neighborhood Collaborative Filtering'],
                           'CV Performance - RMSE': [cv_baseline_rmse,tuning.best_score['rmse'] ],
                           'Model Condiguration':['N/A',f'{tuning.best_params["rmse"]}']})

summary_df

Unnamed: 0,Model,CV Performance - RMSE,Model Condiguration
0,Baseline,1.008698,
1,Neighborhood Collaborative Filtering,0.857737,"{'k': 35, 'sim_options': {'name': 'pearson_bas..."


**Best Hyperparameter Combination**

Our best params :     
- k = 35
- similarity function : pearson_baseline
- user_based : False

Finally, we retrain the best model with tuned parameters

In [66]:
best_params = tuning.best_params['rmse']

In [67]:
# Create object
model_best = KNNBaseline(**best_params)

# Retrain on whole train dataset
model_best.fit(train_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f8c58239bd0>

### Evaluate the Best Model
---

After finding the best model, we can sanity check the performance on the test dataset

In [68]:
# import performance library
from surprise import accuracy

Next, we predict the test set using our best model

In [69]:
test_pred = model_best.test(test_data)
test_rmse = accuracy.rmse(test_pred)
test_rmse

RMSE: 0.8365


0.8364711429421693

To summarize

In [70]:
summary_test_df = pd.DataFrame({'Model' : ['User to User CF'],
                                'RMSE-Tuning': [tuning.best_score['rmse']],
                                'RMSE-Test': [test_rmse]})

summary_test_df

Unnamed: 0,Model,RMSE-Tuning,RMSE-Test
0,User to User CF,0.857737,0.836471


Great predictions!

# **3.Predictions / Generating Recommendation**
---

Decision Process is to recommend items to user
from our trained model

<image src="../assets/applsci-10-05510-g001.webp" image>




1. User to User Collaborative Filtering

  We already trained our best model on full dataset, now its time to utilize our model to generate recommendation.



How do we generate recommendation ?


To generate recommendation :
- predict all movies or only unseen movies rating from given users
- followed by ordering the movies by its predicted rat

## Predict with Best Model : User to User Collaborative Filtering

In [71]:
# Recommendation based on Best Models
# We will try to recommend on sample userid ,userId 1 & 99

# We can use model_best.predict method
help(model_best.predict)

Help on method predict in module surprise.prediction_algorithms.algo_base:

predict(uid, iid, r_ui=None, clip=True, verbose=False) method of surprise.prediction_algorithms.knns.KNNBaseline instance
    Compute the rating prediction for given user and item.
    
    The ``predict`` method converts raw ids to inner ids and then calls the
    ``estimate`` method which is defined in every derived class. If the
    prediction is impossible (e.g. because the user and/or the item is
    unknown), the prediction is set according to
    :meth:`default_prediction()
    <surprise.prediction_algorithms.algo_base.AlgoBase.default_prediction>`.
    
    Args:
        uid: (Raw) id of the user. See :ref:`this note<raw_inner_note>`.
        iid: (Raw) id of the item. See :ref:`this note<raw_inner_note>`.
        r_ui(float): The true rating :math:`r_{ui}`. Optional, default is
            ``None``.
        clip(bool): Whether to clip the estimation into the rating scale.
            For example, if :m

`model_best.predict` has argument
- `uid` (i.e., the `userId`) and
- `iid` (i.e., the item ID or `movieId`)

### Let's predict what is the rating of user 9 to movie 10

In [72]:
sample_prediction = model_best.predict(uid = 9,
                                      iid = 10)

In [73]:
sample_prediction

Prediction(uid=9, iid=10, r_ui=None, est=3.9856376442383743, details={'actual_k': 5, 'was_impossible': False})

The results tell us
- `r_ui` : actual rating --> `None`, means user 9 have yet rated movie 10
- `est` : the estimated rating from our model
- `details` : whether prediction is impossible or not. So it's possible to predict.

### Let's predict all the unseen/unrated movie by userId 9

**First, we find the unrated movie id from user id 9**

In [74]:
# Get unique movieId
unique_movie_id = set(rating_data['movieId'])
print(unique_movie_id)

{1, 2, 2052, 2054, 7, 45062, 10, 6157, 16, 122898, 2067, 4121, 25, 26, 27, 30749, 2078, 31, 32, 2081, 34, 2082, 36, 29, 122918, 39, 77866, 4138, 44, 2094, 47, 2096, 4144, 50, 45106, 2104, 116797, 88125, 4161, 71745, 6213, 70, 26694, 4168, 30793, 2122, 2119, 88140, 2124, 2131, 2134, 89, 2139, 95, 2144, 6239, 6242, 88163, 92259, 101, 6241, 4191, 92264, 4205, 110, 116, 49272, 139385, 122, 2174, 4223, 4226, 135, 2186, 4235, 6283, 5297, 144, 2193, 145, 2194, 150, 153, 4254, 4256, 161, 4262, 8360, 170, 172, 30894, 96432, 8371, 57528, 61628, 4289, 6339, 198, 147657, 204, 207, 208, 4306, 4310, 216, 6365, 80094, 223, 224, 225, 59615, 4321, 6370, 231, 4327, 6377, 6378, 141544, 237, 2288, 2291, 6387, 176371, 246, 8190, 133365, 249, 253, 2302, 2301, 2311, 88327, 266, 2314, 2316, 135436, 8459, 272, 2321, 2324, 276, 84246, 2329, 282, 86298, 2331, 106782, 4383, 2336, 288, 290, 293, 294, 125221, 296, 300, 76077, 303, 308, 8507, 316, 168252, 318, 168254, 57669, 86345, 96588, 59725, 102735, 2384, 337, 8

In [75]:
# Get movieId that is rated by user id 9
rated_movie_id = set(rating_data.loc[rating_data['userId']==9, 'movieId'])
print(rated_movie_id)

{3328, 4993, 5952, 5481, 4558, 5872, 923, 2012, 223}


In [76]:
# Find unrated movieId
# Use set operation
# unrateddId = wholeId - ratedId
unrated_movie_id = unique_movie_id.difference(rated_movie_id)
print(unrated_movie_id)

{1, 2, 2052, 2054, 7, 45062, 10, 6157, 16, 122898, 2067, 4121, 25, 26, 27, 30749, 2078, 31, 32, 2081, 34, 2082, 36, 29, 122918, 39, 77866, 4138, 44, 2094, 47, 2096, 4144, 50, 45106, 2104, 116797, 88125, 4161, 71745, 6213, 70, 26694, 4168, 30793, 2122, 2119, 88140, 2124, 2131, 2134, 89, 2139, 95, 2144, 6239, 6242, 88163, 92259, 101, 6241, 4191, 92264, 4205, 110, 116, 49272, 139385, 122, 2174, 4223, 4226, 135, 2186, 4235, 6283, 5297, 144, 2193, 145, 2194, 150, 153, 4254, 4256, 161, 4262, 8360, 170, 172, 30894, 96432, 8371, 57528, 61628, 4289, 6339, 198, 147657, 204, 207, 208, 4306, 4310, 216, 6365, 80094, 224, 225, 59615, 4321, 6370, 231, 4327, 6377, 6378, 141544, 237, 2288, 2291, 6387, 176371, 246, 8190, 133365, 249, 253, 2302, 2301, 2311, 88327, 266, 2314, 2316, 135436, 8459, 272, 2321, 2324, 276, 84246, 2329, 282, 86298, 2331, 106782, 4383, 2336, 288, 290, 293, 294, 125221, 296, 300, 76077, 303, 308, 8507, 316, 168252, 318, 168254, 57669, 86345, 96588, 59725, 102735, 2384, 337, 8529, 

In [77]:
# Let's create a function
def get_unrated_item(userid, rating_data):
    """
    Get unrated item id from a user id

    Parameters
    ----------
    userid : int
        The user id

    rating_data : pandas DataFrame
        The rating data

    Returns
    -------
    unrated_item_id : set
        The unrated item id
    """
    # Find the whole item id
    unique_item_id = set(rating_data['movieId'])

    # Find the item id that was rated by user id
    rated_item_id = set(rating_data.loc[rating_data['userId']==userid, 'movieId'])

    # Find the unrated item id
    unrated_item_id = unique_item_id.difference(rated_item_id)

    return unrated_item_id


In [78]:
unrated_movie_id = get_unrated_item(userid=9, rating_data=rating_data)
print(unrated_movie_id)

{1, 2, 2052, 2054, 7, 45062, 10, 6157, 16, 122898, 2067, 4121, 25, 26, 27, 30749, 2078, 31, 32, 2081, 34, 2082, 36, 29, 122918, 39, 77866, 4138, 44, 2094, 47, 2096, 4144, 50, 45106, 2104, 116797, 88125, 4161, 71745, 6213, 70, 26694, 4168, 30793, 2122, 2119, 88140, 2124, 2131, 2134, 89, 2139, 95, 2144, 6239, 6242, 88163, 92259, 101, 6241, 4191, 92264, 4205, 110, 116, 49272, 139385, 122, 2174, 4223, 4226, 135, 2186, 4235, 6283, 5297, 144, 2193, 145, 2194, 150, 153, 4254, 4256, 161, 4262, 8360, 170, 172, 30894, 96432, 8371, 57528, 61628, 4289, 6339, 198, 147657, 204, 207, 208, 4306, 4310, 216, 6365, 80094, 224, 225, 59615, 4321, 6370, 231, 4327, 6377, 6378, 141544, 237, 2288, 2291, 6387, 176371, 246, 8190, 133365, 249, 253, 2302, 2301, 2311, 88327, 266, 2314, 2316, 135436, 8459, 272, 2321, 2324, 276, 84246, 2329, 282, 86298, 2331, 106782, 4383, 2336, 288, 290, 293, 294, 125221, 296, 300, 76077, 303, 308, 8507, 316, 168252, 318, 168254, 57669, 86345, 96588, 59725, 102735, 2384, 337, 8529, 

**Then, we create the prediction from the unrated movie**

In [79]:
# Initialize dict
predicted_unrated_movie = {
    'userId': 9,
    'movieId': [],
    'predicted_rating': []
}

predicted_unrated_movie

{'userId': 9, 'movieId': [], 'predicted_rating': []}

In [80]:
# Loop for over all unrated movie Id
for id in unrated_movie_id:
    # Create a prediction
    pred_id = model_best.predict(uid = predicted_unrated_movie['userId'],
                                 iid = id)

    # Append
    predicted_unrated_movie['movieId'].append(id)
    predicted_unrated_movie['predicted_rating'].append(pred_id.est)

In [81]:
# Convert to dataframe
predicted_unrated_movie = pd.DataFrame(predicted_unrated_movie)
predicted_unrated_movie

Unnamed: 0,userId,movieId,predicted_rating
0,9,1,4.322471
1,9,2,3.857446
2,9,2052,3.367633
3,9,2054,3.550524
4,9,7,3.912928
...,...,...,...
793,9,2033,4.760632
794,9,30707,5.000000
795,9,2040,4.119967
796,9,47099,3.571716


Nice! Let's sort the values

In [82]:
# Sort the predicted rating values
predicted_unrated_movie = predicted_unrated_movie.sort_values('predicted_rating',
                                                              ascending = False)

predicted_unrated_movie

Unnamed: 0,userId,movieId,predicted_rating
14,9,27,5.000000
758,9,40819,5.000000
32,9,4144,5.000000
33,9,50,5.000000
712,9,1704,5.000000
...,...,...,...
275,9,2605,1.988313
446,9,3035,1.930859
615,9,3438,1.881759
187,9,2409,1.871528


In [83]:
# Let's create this into a function
def get_pred_unrated_item(userid, estimator, unrated_item_id):
    """
    Get the predicted unrated item id from user id

    Parameters
    ----------
    userid : int
        The user id

    estimator : Surprise object
        The estimator

    unrated_item_id : set
        The unrated item id

    Returns
    -------
    pred_data : pandas Dataframe
        The predicted rating of unrated item of user id
    """
    # Initialize dict
    pred_dict = {
        'userId': userid,
        'movieId': [],
        'predicted_rating': []
    }

    # Loop for over all unrated movie Id
    for id in unrated_item_id:
        # Create a prediction
        pred_id = estimator.predict(uid = pred_dict['userId'],
                                    iid = id)

        # Append
        pred_dict['movieId'].append(id)
        pred_dict['predicted_rating'].append(pred_id.est)

    # Create a dataframe
    pred_data = pd.DataFrame(pred_dict).sort_values('predicted_rating',
                                                     ascending = False)

    return pred_data

In [84]:
predicted_unrated_movie = get_pred_unrated_item(userid = 9,
                                                estimator = model_best,
                                                unrated_item_id = unrated_movie_id)

predicted_unrated_movie

Unnamed: 0,userId,movieId,predicted_rating
14,9,27,5.000000
758,9,40819,5.000000
32,9,4144,5.000000
33,9,50,5.000000
712,9,1704,5.000000
...,...,...,...
275,9,2605,1.988313
446,9,3035,1.930859
615,9,3438,1.881759
187,9,2409,1.871528


And then we create the top movie predictions

It's nice, but we do not know what movie on each `movieId`. We need to extract the movie title for each `movieId`

In [85]:
# Load movie data using upper function
movie_data = load_movie_data(movie_path = movie_path)

movie_data.head()

Movie data shape : (9742, 2)


Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [86]:
# Pick top k biggest rating
k = 5
top_movies = predicted_unrated_movie.head(k).copy()
top_movies

Unnamed: 0,userId,movieId,predicted_rating
14,9,27,5.0
758,9,40819,5.0
32,9,4144,5.0
33,9,50,5.0
712,9,1704,5.0


Finally, we can add the Movie Title

In [87]:
# Add the movie title
top_movies['title'] = movie_data.loc[top_movies['movieId'], 'title'].values
top_movies['genres'] = movie_data.loc[top_movies['movieId'], 'genres'].values

top_movies

Unnamed: 0,userId,movieId,predicted_rating,title,genres
14,9,27,5.0,Now and Then (1995),Children|Drama
758,9,40819,5.0,Walk the Line (2005),Drama|Musical|Romance
32,9,4144,5.0,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance
33,9,50,5.0,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
712,9,1704,5.0,Good Will Hunting (1997),Drama|Romance


Great!
Lets wrap all to a function

In [88]:
def get_top_highest_unrated(estimator, k, userid, rating_data, movie_data):
    """
    Get top k highest of unrated movie from a Surprise estimator RecSys

    Parameters
    ----------
    estimator : Surprise model
        The RecSys model

    k : int
        The number of Recommendations

    userid : int
        The user Id to recommend

    rating_data : pandas Data Frame
        The rating data

    movie_data : pandas DataFrame
        The movie meta data

    Returns
    -------
    top_item_pred : pandas DataFrame
        The top items recommendations
    """
    # 1. Get the unrated item id of a user id
    unrated_item_id = get_unrated_item(userid=userid, rating_data=rating_data)

    # 2. Create prediction from estimator to all unrated item id
    predicted_unrated_item = get_pred_unrated_item(userid = userid,
                                                   estimator = estimator,
                                                   unrated_item_id = unrated_item_id)

    # 3. Sort & add meta data
    top_item_pred = predicted_unrated_item.head(k).copy()
    top_item_pred['title'] = movie_data.loc[top_item_pred['movieId'], 'title'].values
    top_item_pred['genres'] = movie_data.loc[top_item_pred['movieId'], 'genres'].values

    return top_item_pred


In [89]:
# Generate 10 recommendation for user 100
get_top_highest_unrated(estimator=model_best,
                        k=10,
                        userid=100,
                        rating_data=rating_data,
                        movie_data=movie_data)

Unnamed: 0,userId,movieId,predicted_rating,title,genres
126,100,135436,5.0,The Secret Life of Pets (2016),Animation|Comedy
222,100,4572,5.0,Black Rain (1989),Action|Crime|Drama
151,100,318,4.639857,"Shawshank Redemption, The (1994)",Crime|Drama
499,100,58559,4.601764,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
381,100,904,4.568348,Rear Window (1954),Mystery|Thriller
323,100,750,4.548584,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War
241,100,527,4.541843,Schindler's List (1993),Drama|War
533,100,60684,4.53925,Watchmen (2009),Action|Drama|Mystery|Sci-Fi|Thriller|IMAX
712,100,40819,4.535367,Walk the Line (2005),Drama|Musical|Romance
386,100,912,4.514804,Casablanca (1942),Drama|Romance


In [90]:
# Generate 10 recommendation for user 500
get_top_highest_unrated(estimator=model_best,
                        k=10,
                        userid=500,
                        rating_data=rating_data,
                        movie_data=movie_data)

Unnamed: 0,userId,movieId,predicted_rating,title,genres
658,500,183897,4.906477,Isle of Dogs (2018),Animation|Comedy
232,500,4572,4.781762,Black Rain (1989),Action|Crime|Drama
53,500,92259,4.679185,Intouchables (2011),Comedy|Drama
411,500,7084,4.605374,"Play It Again, Sam (1972)",Comedy|Romance
689,500,50872,4.549718,Ratatouille (2007),Animation|Children|Drama
605,500,7566,4.505847,28 Up (1985),Documentary
573,500,1354,4.485286,Breaking the Waves (1996),Drama|Mystery
498,500,1193,4.483065,One Flew Over the Cuckoo's Nest (1975),Drama
499,500,1196,4.470256,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi
216,500,104879,4.469839,Prisoners (2013),Drama|Mystery|Thriller
