# **Model Based Collaborative Filtering**

Modeling for Structured Data

---
## Outline:

1. Background
2. Simplified Workflows.
3. Importing Data
4. Data Preparations
5. Data Preprocessing
6. Modeling
7. Hyperparameter Tuning
8. Evaluation
9. Decision Process (Recommendation Process)

# **Background**
---

## Problem Description


- A streaming platform, **nonton-yuk.com**, is having a problem with user retention.
- In 3 months, the user retention rate dropped by almost 15%, which significantly hurts **nonton-yuk.com**'s revenue.
- After urgent user research, the **nonton-yuk.com** team found that **users find it difficult** to browse movies on **nonton-yuk.com**, which hosts nearly 7,000 movies.

## Business Objective

- Our business objective is to **increase user retention** to **15%** (an assumed target, of course) within 3 months.

## Solution

- We can create a **movie recommendation** system to help **users browse** movies **easily**, removing the difficulty users face on the **nonton-yuk.com** platform.

The goal of our recommendation is to recommend movies that a user might like. However, we cannot directly measure how much a user likes a movie, so we need to define a **proxy** label.

Some appropriate proxy labels are:
- The rating (stars) a user gives to a movie
- Whether the user clicks on the movie
- etc.

The only data we have are records of **ratings** given by users to movies, so we will choose the **rating given** as the proxy label for how much an item is liked.

We can now frame this as a machine learning task.

**Our task** is to predict the number of stars a user gives to a movie.

Since we treat the star rating as a continuous value, we can frame this as a **regression task**.
We now have a clearer picture of what to do. However, we need a more precise solution in the recommender system context.

Some recommendation approaches:
1. **Non-personalized**: recommendation by popularity
2. **Personalized**: collaborative filtering

Approaches to personalized recommender systems can be divided based on the presence of interaction data (explicit / implicit):
1. When interaction data does not exist, we can rely on content features: **Content Based** Filtering
2. When interaction data exists, we can use **Collaborative** Filtering

<img src =https://www.researchgate.net/publication/331063850/figure/fig3/AS:729493727621125@1550936266704/Content-based-filtering-and-Collaborative-filtering-recommendation.ppm >

<center><a href=https://www.researchgate.net/publication/331063850/figure/fig3/AS:729493727621125@1550936266704/Content-based-filtering-and-Collaborative-filtering-recommendation.ppm>Source</a> </center>

Because we have interaction data, in this case rating data, we will not use **Content Based** filtering. Instead, we will use collaborative filtering.

<img src="https://drive.google.com/uc?export=view&id=1x16ea0zXDGsefpj6aLkre83wquIWedhI" width=600>


Previously we used neighborhood collaborative filtering. However, it has some caveats:
- The similarity measure is chosen arbitrarily (it is computed, not learned) and cannot be optimized
- The similarity measure only captures pairwise relationships between items / users

What is the solution?

We can use **Model Based Collaborative Filtering**, which learns the weights that previously could not be optimized.

## Model Metrics

We have already established some points :
- Our task is to predict stars that will be given by users to certain movies
- We will use Collaborative Filtering approach

Given these points, we need a metric to measure the success of our model. Our goal is for the predicted rating to be as close as possible to the user's true rating.

We want to minimize the error $(\text{True Rating} - \text{Predicted Rating})$. Some appropriate metrics are:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)

Because the squared error is differentiable everywhere, we will choose **MSE/RMSE** as our model metric.
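To make the three metrics concrete, here is a minimal sketch on made-up numbers. The arrays `true_ratings` and `predicted_ratings` are hypothetical values chosen for illustration; they do not come from the dataset.

```python
import numpy as np

# Hypothetical true and predicted ratings, for illustration only
true_ratings = np.array([4.0, 3.0, 5.0, 2.0])
predicted_ratings = np.array([3.5, 3.0, 4.0, 3.0])

# Error per rating: true rating minus predicted rating
errors = true_ratings - predicted_ratings

mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error

print(f"MAE={mae:.4f}, MSE={mse:.4f}, RMSE={rmse:.4f}")
```

RMSE is expressed in the same unit as the ratings (stars), which usually makes it easier to interpret than MSE.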

## Data Description

- The data is obtained from [Movielens dataset](https://grouplens.org/datasets/movielens/).
- It contains ~100K ratings from 1,000 users and 1,700 movies.

There are two files that we use:

**The movie rating data** : `rating.csv`

<center>

|Features|Descriptions|Data Type|
|:--|:--|:--:|
|`userId`|The user ID|`int`|
|`movieId`|The movie ID|`int`|
|`rating`|Rating given from user to movie. Ranging from `0` to `5`|`float`|

</center>



**The movie metadata** : `movies.csv`

<center>

|Features|Descriptions|Data Type|
|:--|:--|:--:|
|`movieId`|The movie ID|`int`|
|`title`|The movie title|`str`|
|`genres`|The movie genres|`str`|

</center>

_______________

# **Recommender System Workflow** (Simplified)

## <font color='blue'>1. Importing Data</font>

```
1. Load the data.
2. Check the shape & type of data.
3. Handle duplicate data to maintain data validity.
```

## <font color='blue'>2. Modelling : Model Based Collaborative Filtering</font>

```
1. Creating Utility Matrix
2. Training + Model Selection  :     
    - Baseline Approach
    - SVD

3. Evaluating Model
  - Rating Prediction Task

```


## <font color='blue'>3. Generating Recommendation / Predictions</font>

```
1. Predict recommendation of user-i to unrated item-j
2. Predict recommendation of user-i to all their unrated items
```

# **1. Importing Data**

What do we do?
1. Load the data.
2. Check the shape & type of data.
3. Handle duplicate data to maintain data validity.

## Load the data

In [1]:
# Load this library
import numpy as np
import pandas as pd

In [2]:
rating_path = r"D:\Daniel\PACMANN\RecSys\rating_sample.csv"

rating_data = pd.read_csv(rating_path,
                          delimiter = ',')

rating_data.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,70,3.0


In [3]:
# Check data shapes
rating_data.shape

(20760, 3)

In [4]:
# Check data types
rating_data.dtypes

userId       int64
movieId      int64
rating     float64
dtype: object

In [5]:
# Check duplicate data
rating_data.duplicated(subset=['userId', 'movieId']).sum()

0

**Note**
- If a user has rated the same movie more than once, you can keep the most recent rating & drop the rest.
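The note above can be sketched as follows. This is only an illustration under an assumption: the `timestamp` column is hypothetical, since the sample file loaded here contains only `userId`, `movieId` and `rating`.

```python
import pandas as pd

# Toy data: user 1 rated movie 10 twice.
# The `timestamp` column is hypothetical; the sample file does not have it.
toy_ratings = pd.DataFrame({
    "userId":    [1, 1, 2],
    "movieId":   [10, 10, 10],
    "rating":    [3.0, 4.5, 5.0],
    "timestamp": [100, 200, 150],
})

deduplicated = (
    toy_ratings
    .sort_values("timestamp")                                    # oldest first
    .drop_duplicates(subset=["userId", "movieId"], keep="last")  # keep most recent
    .sort_values(["userId", "movieId"])
    .reset_index(drop=True)
)
print(deduplicated)
```

Without a timestamp there is no way to know which rating is the most recent, so one would have to fall back on another rule, such as keeping the last row in file order.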

## Create load function

Finally, we can wrap these steps in a data-loading function.

In [6]:
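def load_rating_data(rating_path):
    """
    Function to load the rating data & remove duplicates

    Parameters
    ----------
    rating_path : str
        The path of rating data

    Returns
    -------
    rating_data : pandas DataFrame
        The sample of rating data
    """
    # Load data
    rating_data_raw = pd.read_csv(rating_path, delimiter=',')
    print('Original data shape :', rating_data_raw.shape)

    # Drop duplicated (userId, movieId) pairs.
    # Assumption: later rows are more recent, so we keep the last one.
    rating_data = rating_data_raw.drop_duplicates(subset=['userId', 'movieId'],
                                                  keep='last')

    return rating_data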


In [7]:
# Load rating data
rating_data = load_rating_data(rating_path = rating_path)

Original data shape : (20760, 3)


In [8]:
rating_data.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,70,3.0


________________

# **2. Modelling**: Model Based Collaborative Filtering

## Background

Previously we tried similarity-based / neighborhood collaborative filtering models, and we encountered a problem.
We depend heavily on **similarity**, even though we cannot optimize it.
That means that if we fail to craft a good similarity measure, the predictions may fail too.
Another problem is that for a new user we cannot measure similarity, because not enough rating data is available.
We need another solution!

**Latent Models**

Recall that our utility matrix has a user component and an item component. Some user-item pairs have interacted and yield a utility value, e.g. after watching a movie, a user gives a rating to express how much they liked it. The idea is to infer two main factors, a user factor and an item factor, that drive how the utility value / rating is produced.



<img src="https://drive.google.com/uc?export=view&id=1f7MoLGctHdM9SV1nGT8qZ0zFSzXg7YKo" width=600>


We need **Matrix Decomposition**

Some approaches:
1. Eigendecomposition
2. Cholesky Decomposition
3. Singular Value Decomposition

Our constraint:

A utility matrix rarely has n_users = n_items, which means it is rarely a `square matrix`.

Solution:
We can use Singular Value Decomposition, which does not require a square matrix.
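As a quick check of this claim, the sketch below runs `scipy.linalg.svd` on a small non-square matrix. The matrix `toy_matrix` (4 users x 3 items) is made up for illustration. SVD factorizes a matrix $R$ of shape $m \times n$ as $R = U \cdot S \cdot V^T$, whereas eigendecomposition is defined for square matrices and Cholesky decomposition additionally requires a symmetric positive-definite matrix.

```python
import numpy as np
from scipy.linalg import svd

# Toy utility matrix: 4 users x 3 items (not square), made up for illustration
toy_matrix = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 2.0, 4.0],
])

# SVD works directly on the rectangular matrix
U_toy, S_toy, Vt_toy = svd(toy_matrix)

print("U shape :", U_toy.shape)    # (4, 4)
print("S shape :", S_toy.shape)    # (3,)
print("Vt shape:", Vt_toy.shape)   # (3, 3)

# Singular values are non-negative and sorted in descending order
print(S_toy)
```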

## Workflow
---

To create a personalized RecSys, we can follow these steps:

```
1. Data Preparation --> Create utility matrix & Split Train-Test
2. Train recommendation model --> Baseline & SVD
3. Choosing Best Model
4. Evaluate Final Model
```

## Implementing Model From Scratch
---

### Pure Singular Value Decomposition
---

<img src="https://drive.google.com/uc?export=view&id=1TPbt_6viDrGzBzNQMQxVnsq380FfPTbV" width=600>


#### Data Preparation

Why do we need to prepare the data?
Because our data is currently a dataframe with the columns
 `userId`,`movieId`,`rating`

| userId | movieId | rating |
|:------:|---------|--------|
| 1      | 1       | 4      |
| ..     |         |        |
| 600    | 1       | 5      |

However, we want our dataframe to have the same shape as the utility matrix

| userId | movieId1     | ..           | movieIdNth   |
|:------:|--------------|--------------|--------------|
| 1      | rating value | rating value | rating value |
| ..     | rating value | rating value | rating value |
| 600    | rating value | rating value | rating value |

We can achieve this by using `DataFrame.pivot`

In [9]:
# pivot data
rating_data_pivot = rating_data.pivot(index= 'userId', columns= 'movieId', values= 'rating')

# take a look after pivoted
rating_data_pivot.head()

movieId,1,3,6,32,34,39,47,48,69,70,...,76077,91529,91630,103253,106696,108190,109487,111362,152081,160438
userId
1,4.0,4.0,4.0,,,,5.0,,,3.0,...,,,,,,,,,,
2,,,,,,,,,,,...,,3.5,,,,,3.0,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,2.0,,,2.0,,,,...,,,,,,,,,,
5,4.0,,,,4.0,3.0,,,,,...,,,,,,,,,,


In [10]:
#check shape
rating_data_pivot.shape

(602, 248)

In [11]:
rating_data_pivot.isnull().sum().sum()

128536

We can see that after pivoting the data there are a lot of missing values. SVD cannot be computed on a matrix that contains missing values, so we will need to impute them later.
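To put the number of missing values in perspective, we can express it as a share of all cells. For the pivoted data above this is 128536 / (602 × 248) = 128536 / 149296, roughly 86.1% missing. The sketch below shows the same calculation on a tiny made-up pivot table; `toy_pivot` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

# Tiny made-up utility matrix: 2 users x 3 movies, NaN means "not rated"
toy_pivot = pd.DataFrame(
    [[4.0, np.nan, 5.0],
     [np.nan, np.nan, 3.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

n_missing = int(toy_pivot.isnull().sum().sum())      # count of NaN cells
n_cells = toy_pivot.shape[0] * toy_pivot.shape[1]    # users x movies
sparsity = n_missing / n_cells

print(f"{n_missing} of {n_cells} cells are missing ({sparsity:.1%})")
```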

To make data preparation easier, we will create a function `prepare_utility_dataframe`

In [12]:
def prepare_utility_dataframe(rating_path) :

    """
    Function to prepare rating data into pivoted rating_data (utility matrix form)

    Parameters
    ----------
    rating_path : str
        The path of rating data

    Returns
    -------
    rating_data_pivot : pandas DataFrame
        rating data in pivoted format
    """

    # load data
    rating_data = load_rating_data(rating_path)

    # perform pivot
    rating_data_pivot = rating_data.pivot(index= 'userId', columns= 'movieId', values= 'rating')

    # print pivoted data shape
    print('Data Shaped After Pivot', rating_data_pivot.shape)

    # checking missing values
    print('Number of missing values after pivot',rating_data_pivot.isnull().sum().sum() )


    # return data
    return rating_data_pivot


In [13]:
# check function
rating_data_pivot = prepare_utility_dataframe(rating_path = rating_path)

Original data shape : (20760, 3)
Data Shaped After Pivot (602, 248)
Number of missing values after pivot 128536


In [14]:
rating_data_pivot.head()

movieId,1,3,6,32,34,39,47,48,69,70,...,76077,91529,91630,103253,106696,108190,109487,111362,152081,160438
userId
1,4.0,4.0,4.0,,,,5.0,,,3.0,...,,,,,,,,,,
2,,,,,,,,,,,...,,3.5,,,,,3.0,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,2.0,,,2.0,,,,...,,,,,,,,,,
5,4.0,,,,4.0,3.0,,,,,...,,,,,,,,,,


#### Fill Missing Values

To perform Singular Value Decomposition we can use `scipy.linalg.svd`

In [15]:
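# import package
from scipy.linalg import svd

# The call below is left commented out: on the pivoted data it fails with
# the error shown underneath, because the matrix still contains NaNs.
# u,s,v = svd(rating_data_pivot)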

ValueError: array must not contain infs or NaNs

Oops! Remember that we cannot run `SVD` when the matrix has missing values.

To demonstrate how Singular Value Decomposition works, we will first impute the missing values with **0**.

In [16]:
# imputing nan
imputed_rating = rating_data_pivot.fillna(0)

# convert our imputed ratings into matrix
ratings_matrix = imputed_rating.to_numpy()

In [17]:
# checking new imputed matrix
ratings_matrix

array([[4. , 4. , 4. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [2.5, 2. , 0. , ..., 0. , 0. , 0. ],
       [3. , 0. , 0. , ..., 0. , 0. , 0. ],
       [5. , 0. , 5. , ..., 4. , 4. , 0. ]])

#### Performing SVD

In [18]:
# perform svd, extracting 3 components, u,s,v

U,S,Vt = svd(ratings_matrix)

# check shape
print(U.shape)

(602, 602)


In [19]:
# check shape
print(S.shape)

(248,)


In [20]:
# check shape
print(Vt.shape)

(248, 248)
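The shapes above come from the default `full_matrices=True`, which returns a square `U`. As a side note, `scipy.linalg.svd` also accepts `full_matrices=False`, which returns a reduced `U` with only as many columns as there are singular values. The sketch below compares both on a small random matrix; `toy_matrix` is made up for illustration.

```python
import numpy as np
from scipy.linalg import svd

# Small random matrix standing in for a (users x items) utility matrix
rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 4))

# Default: full SVD, U is square (6 x 6)
U_full, S_full, Vt_full = svd(toy_matrix)

# Reduced SVD: U keeps only the first 4 columns (6 x 4)
U_thin, S_thin, Vt_thin = svd(toy_matrix, full_matrices=False)

print("full   :", U_full.shape, S_full.shape, Vt_full.shape)
print("reduced:", U_thin.shape, S_thin.shape, Vt_thin.shape)
```

With the reduced form, the singular values can be multiplied in directly without padding a diagonal matrix with zero rows.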


Next, we will take a look at each resulting component

In [21]:
U

array([[-6.10986995e-02, -1.40406053e-02,  8.87529304e-02, ...,
        -5.05838395e-03,  4.42649553e-04, -5.08079222e-02],
       [-6.58970610e-03,  9.00114922e-03, -2.52556453e-02, ...,
        -6.28828187e-02,  1.30559298e-02,  6.86276468e-02],
       [-5.00299782e-04,  2.28932072e-04,  2.72181322e-04, ...,
        -1.15192235e-02,  7.63933037e-02, -3.47478786e-02],
       ...,
       [-9.97047068e-02,  2.28888773e-02, -4.09372077e-03, ...,
         2.77367170e-01,  5.92809407e-03,  1.31309046e-02],
       [-1.54748796e-02, -5.56241558e-02, -2.90079969e-02, ...,
        -2.93211927e-02,  8.13520054e-01,  2.92182716e-03],
       [-1.03603767e-01,  9.34901281e-02, -3.60039556e-02, ...,
         2.36520822e-02,  6.07308755e-03,  1.83231528e-01]])

In [22]:
U.shape

(602, 602)

In [23]:
S

array([339.21215777, 131.75155703, 102.97079724,  94.39134529,
        85.7511904 ,  71.6504792 ,  66.19318185,  58.39552719,
        56.39213009,  55.67866648,  53.04557521,  51.87756309,
        50.28978355,  49.99515341,  49.27438074,  47.31526156,
        46.62374948,  46.14490778,  45.40407384,  44.51548227,
        43.45653386,  43.19464136,  42.87798748,  41.92636037,
        41.75426494,  40.99269952,  40.70671585,  40.47013616,
        40.18082373,  39.77956871,  39.52039779,  38.83204996,
        38.22805471,  37.79959828,  37.62818421,  37.11107868,
        36.7841416 ,  36.41688409,  36.19755515,  36.11005467,
        35.5681784 ,  35.24195139,  35.11694815,  34.45418756,
        34.37938099,  34.06700782,  33.76365576,  33.48189199,
        33.38668802,  33.04295229,  32.72714273,  32.48093131,
        32.34447498,  32.08010303,  31.92667814,  31.38214849,
        31.14570529,  31.03587733,  30.61056076,  30.59024625,
        30.21910478,  30.06780253,  29.91175913,  29.55

In [24]:
S.shape

(248,)

In [25]:
Vt

array([[-1.19195485e-01, -2.46238616e-02, -6.52963961e-02, ...,
        -2.18047475e-02, -1.80222824e-02, -5.90937469e-03],
       [-5.93354056e-02, -4.00549777e-02, -4.64539907e-02, ...,
         3.20133848e-02,  2.89744119e-02,  6.09319920e-03],
       [ 3.77628925e-02,  4.22598465e-02,  3.19443941e-02, ...,
        -1.46226831e-02, -3.12695342e-02,  2.31874390e-03],
       ...,
       [ 5.47752358e-03, -3.32635456e-02,  1.18720559e-02, ...,
        -6.43801810e-02, -1.79636938e-02,  4.54976618e-01],
       [-1.25425430e-03, -3.72494507e-02, -2.09651833e-05, ...,
         2.21493611e-02,  2.68393904e-02, -6.70822868e-02],
       [ 2.83418327e-04,  8.96494023e-03, -7.21521652e-03, ...,
        -4.05817490e-02,  2.39229876e-03,  1.87728507e-01]])

In [26]:
Vt.shape

(248, 248)

#### Predicting a rating from SVD Components

To make what happens under the hood easier to understand, we are going to predict the rating from user 1 on item 1.

Now that we have the **U, S, V** components, we can predict the rating a user will give to an item. What we need to do is:
1. Slice the user factor (a row of U)
2. Slice the item factor (a column of $V^T$)
3. Perform the dot product $(u \cdot S \cdot v^T)$
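The three steps above can be sketched on a toy matrix as follows. This is a minimal illustration, not the notebook's exact code: `toy_matrix` is made up, and the reduced SVD (`full_matrices=False`) is used so that no zero-padding of the singular values is needed.

```python
import numpy as np
from scipy.linalg import svd

# Toy utility matrix (4 users x 3 items), made up for illustration
toy_matrix = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 2.0, 4.0],
])

U_toy, S_toy, Vt_toy = svd(toy_matrix, full_matrices=False)

user_idx, item_idx = 0, 0

# 1. Slice the user factor: one row of U
user_factor = U_toy[user_idx, :]

# 2. Slice the item factor: one column of V^T
item_factor = Vt_toy[:, item_idx]

# 3. Dot product u . S . v^T, weighting each latent dimension by its singular value
predicted = np.sum(user_factor * S_toy * item_factor)

print("Predicted:", predicted)
print("Actual   :", toy_matrix[user_idx, item_idx])
```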

In [27]:
# slice user factor (row index 0 corresponds to userId 1)
user_id = 0
u_1 = U[user_id,:]

# show output
u_1

array([-6.10986995e-02, -1.40406053e-02,  8.87529304e-02,  5.47528141e-02,
        5.18094713e-02,  8.87199623e-03, -1.07273282e-02, -5.92583702e-02,
       -4.10514311e-02,  7.82964028e-02,  5.45425141e-04, -7.62034712e-02,
       -6.61876397e-02,  8.14280249e-02, -4.99956387e-02, -1.51804803e-01,
       -5.26391778e-02,  3.82218527e-02,  8.79724333e-02, -7.76514655e-03,
       -7.07754099e-03, -6.22517258e-02, -1.18292936e-01, -2.96878183e-02,
        4.17751305e-02,  5.87610062e-02, -2.91692662e-02,  3.19693674e-02,
       -8.81960422e-03, -9.70923355e-02,  3.72105920e-02,  2.10032502e-02,
       -6.40964141e-02,  1.51631925e-01, -1.07941024e-02,  5.11747239e-03,
        4.22785728e-02, -4.49994020e-02, -1.84601374e-02, -1.12255757e-01,
        1.37224475e-01,  1.89959380e-02,  6.56149742e-03, -5.61689817e-02,
        1.03701522e-01, -9.39044184e-02,  4.72992799e-02,  2.41742957e-02,
        1.07183931e-01, -1.34182253e-02, -1.06624887e-01,  7.05486764e-02,
       -4.73448611e-02,  

In [28]:
u_1.shape

(602,)

In [29]:
# slice item factor (column index 0 of Vt corresponds to movieId 1)
item_id = 0
v_1 = Vt[:,item_id]

v_1.shape

(248,)

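We do not need to slice the singular values; we use all of them. We only need to arrange them into a diagonal matrix so that we can perform the dot product. Because U has shape 602x602, the diagonal matrix also has to be padded with zero rows to shape 602x248.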

In [30]:
s_diag = np.diag(S)

In [31]:
s_diag.shape

(248, 248)

In [32]:
print(U.shape)
print(S.shape)
print(Vt.shape)

(602, 602)
(248,)
(248, 248)


In [33]:
s_new = np.vstack((np.diag(S),np.zeros(shape=(U.shape[0]-Vt.shape[0],Vt.shape[0]))))
s_new.shape

(602, 248)
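As an alternative to stacking the zero rows by hand, SciPy provides `scipy.linalg.diagsvd(s, M, N)`, which builds the padded M x N singular value matrix directly. The sketch below compares both approaches on a made-up 4 x 3 matrix.

```python
import numpy as np
from scipy.linalg import diagsvd, svd

# Made-up 4 x 3 matrix, for illustration only
toy_matrix = np.arange(12, dtype=float).reshape(4, 3)
U_toy, S_toy, Vt_toy = svd(toy_matrix)

n_rows, n_cols = U_toy.shape[0], Vt_toy.shape[0]

# Manual construction: diagonal block on top, zero rows underneath
manual = np.vstack((np.diag(S_toy), np.zeros((n_rows - n_cols, n_cols))))

# Same matrix built by the SciPy helper
helper = diagsvd(S_toy, n_rows, n_cols)

print(helper.shape)
print(np.allclose(manual, helper))
```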

In [34]:
print(U.shape)
print(S.shape)
print(Vt.shape)

(602, 602)
(248,)
(248, 248)


In [35]:
# perform dot product U.S
us = u_1.dot(s_new)
# check us shape
print('Us shape',us.shape)

Us shape (248,)


The shapes are consistent: `<1x602>` x `<602x248>` results in `<1x248>`.

In [36]:
# perform dot product with v_1
predicted_rating = us.dot(v_1)

In [37]:
print('Predicted rating from user 1 to item 1',predicted_rating)

Predicted rating from user 1 to item 1 4.000000000000006


In [38]:
imputed_rating.loc[1,1]

4.0

We can see that our prediction matches the value stored in the (imputed) utility matrix.

#### Construct Whole Utility Matrix

To construct the whole utility matrix we can multiply the components, $U \cdot S \cdot V^T$.

In [39]:
utility_matrix = np.dot(U,s_new).dot(Vt)

In [40]:
utility_matrix

array([[ 4.00000000e+00,  4.00000000e+00,  4.00000000e+00, ...,
         2.49800181e-15,  1.08246745e-15,  2.16493490e-15],
       [-8.60422844e-16,  5.38458167e-15,  4.81559237e-15, ...,
        -9.99200722e-16, -1.44328993e-15,  1.04083409e-15],
       [ 1.52655666e-16,  1.11369247e-15, -1.38777878e-16, ...,
         1.20389809e-15,  1.88109077e-16,  4.43655529e-16],
       ...,
       [ 2.50000000e+00,  2.00000000e+00, -3.05311332e-15, ...,
        -2.82412982e-15, -3.19189120e-15,  2.68882139e-16],
       [ 3.00000000e+00,  5.85642645e-15,  3.76088050e-15, ...,
         3.88578059e-16,  8.88178420e-16, -4.37150316e-16],
       [ 5.00000000e+00, -8.47932835e-15,  5.00000000e+00, ...,
         4.00000000e+00,  4.00000000e+00,  1.11022302e-15]])
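The tiny values such as `2.49800181e-15` in the output are floating-point noise: with all singular values, the product simply reproduces the imputed matrix, zeros included. A small sketch on made-up data shows how this can be verified; `toy` and the other names below are hypothetical and used only for this illustration.

```python
import numpy as np
from scipy.linalg import svd

# Made-up imputed utility matrix (zeros stand for unrated items)
toy = np.array([
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 4.0],
])

U_toy, S_toy, Vt_toy = svd(toy, full_matrices=False)

# Multiply the three components back together
reconstructed = (U_toy * S_toy) @ Vt_toy

# Largest deviation from the original matrix (expected to be near machine precision)
max_abs_error = np.max(np.abs(reconstructed - toy))
print("Max absolute error:", max_abs_error)
print("Same up to rounding:", np.allclose(reconstructed, toy))

# Rounding hides the floating-point noise for display
print(np.round(reconstructed, 2))
```

This also means the full reconstruction predicts approximately 0 for every unrated item, so by itself it is not useful for recommending anything new.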

#### Experimenting

We want to see how the prediction changes when we vary the number of singular values used. As a benchmark, we will predict the rating from user 1 on item 1.

The numbers of singular values we are going to try are:
- all
- 10
- 100

The true rating from user 1 on item 1 is **4.0**.

We will create a function named `singular_values`.

In [41]:
# create function
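def singular_values(n_singular,sing_vector) :
    """Keep the first `n_singular` singular values and zero out the rest.

    Returns a zero-padded diagonal matrix of shape (U.shape[0], Vt.shape[0]),
    relying on the global `U` and `Vt` computed above.
    """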

    # copy to avoid overwriting
    sing_vector = sing_vector.copy()

    # replace after n singular values with 0
    sing_vector[n_singular:] = 0
    s_new = np.vstack((np.diag(sing_vector),np.zeros(shape=(U.shape[0]-Vt.shape[0],Vt.shape[0]))))


    return s_new


**10 singular Values**

In [42]:
s_10 = singular_values(n_singular=10,sing_vector=S)

predicted_ratigs_10 = np.dot(u_1,s_10).dot(v_1)
predicted_ratigs_10

2.5879168396164935

**100 Singular Values**

In [43]:
s_100 = singular_values(n_singular=100,sing_vector=S)

predicted_ratigs_100 = np.dot(u_1,s_100).dot(v_1)
predicted_ratigs_100

4.267890021309445

**All Singular Values**

In [45]:
s_all = singular_values(n_singular=len(S),sing_vector=S)

predicted_ratigs_all = np.dot(u_1,s_all).dot(v_1)
predicted_ratigs_all

4.000000000000006

**Wrap Up**


In [46]:
summary_singular_value = pd.DataFrame(
    data={'N singular Value': [10,100,'all'],
          'Predicted_ratings': [predicted_ratigs_10,predicted_ratigs_100,predicted_ratigs_all]
          }
)
summary_singular_value

Unnamed: 0,N singular Value,Predicted_ratings
0,10,2.587917
1,100,4.26789
2,all,4.0


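In this example, the more singular values we keep, the closer the prediction gets to the true value: with 10 singular values the prediction (2.59) is far from the true rating of 4.0, with 100 it is much closer (4.27), and with all singular values it reproduces the value in the imputed matrix (4.0).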
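A single entry is only one data point, so it can help to look at the error over the whole matrix. The sketch below is an illustration on a made-up random matrix, not on the rating data: it keeps the first `k` singular values and measures the RMSE between the rank-`k` reconstruction and the original matrix. The helper `truncated_reconstruction` is a hypothetical name introduced here.

```python
import numpy as np
from scipy.linalg import svd

# Made-up "ratings" between 0 and 5 for 8 users and 5 items
rng = np.random.default_rng(42)
toy_matrix = rng.integers(0, 6, size=(8, 5)).astype(float)

U_toy, S_toy, Vt_toy = svd(toy_matrix, full_matrices=False)


def truncated_reconstruction(k):
    """Rebuild the matrix using only the first k singular values."""
    return (U_toy[:, :k] * S_toy[:k]) @ Vt_toy[:k, :]


rmse_by_k = {}
for k in range(1, len(S_toy) + 1):
    diff = toy_matrix - truncated_reconstruction(k)
    rmse_by_k[k] = float(np.sqrt(np.mean(diff ** 2)))
    print(f"k={k}: RMSE={rmse_by_k[k]:.4f}")
```

The overall reconstruction error never increases as `k` grows and reaches roughly zero when all singular values are used. An individual entry, however, does not have to approach its true value monotonically, which is consistent with the prediction of 4.27 at 100 singular values above.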