# **Collaborative Filtering based Recommender System using the `Surprise` library**

## Objectives

* Perform KNN and NMF-based collaborative filtering on the user-item interaction matrix

----

Collaborative filtering is one of the most commonly used recommendation techniques. There are two main types of methods:
 - **User-based** collaborative filtering is based on the user similarity or neighborhood
 - **Item-based** collaborative filtering is based on similarity among items

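Both work in a similar way, so let's briefly explain how user-based collaborative filtering works.

User-based collaborative filtering looks for users who are similar to the target user and uses their known ratings to predict the ratings the target user has not given yet.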

#### User-item interaction matrix

For most collaborative filtering-based recommender systems, the main dataset format is a 2-D matrix called the user-item interaction matrix. Each row corresponds to a user id/index, each column corresponds to an item id/index, and the element `(i, j)` holds the rating that user `i` gave to item `j`.

We have already generated this matrix. We load it here as `profile_df`.

In [2]:
import pandas as pd
profile_df = pd.read_csv('user_profile.csv')
profile_df.head()

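|  | user | Database | Python | CloudComputing | DataAnalysis | Containers | MachineLearning | ComputerVision | DataScience | BigData | Chatbot | R | BackendDev | FrontendDev | Blockchain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 83.0 | 21.0 | 8.0 | 68.0 | 4.0 | 49.0 | 0.0 | 47.0 | 66.0 | 3.0 | 27.0 | 41.0 | 9.0 | 8.0 |
| 1 | 4 | 78.0 | 5.0 | 6.0 | 48.0 | 0.0 | 30.0 | 0.0 | 45.0 | 46.0 | 0.0 | 12.0 | 9.0 | 0.0 | 4.0 |
| 2 | 5 | 47.0 | 18.0 | 36.0 | 46.0 | 0.0 | 59.0 | 0.0 | 47.0 | 29.0 | 4.0 | 27.0 | 49.0 | 7.0 | 13.0 |
| 3 | 7 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 8 | 13.0 | 0.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 |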
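
To make the matrix layout described above concrete, here is a minimal sketch of how a long-format ratings table could be pivoted into a user-item interaction matrix with pandas. The tiny `ratings` dataframe and its values are hypothetical and are not taken from the course data.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating) triple.
ratings = pd.DataFrame({
    "user": [1, 1, 2, 3, 3],
    "item": ["A", "B", "A", "B", "C"],
    "rating": [5, 3, 4, 2, 5],
})

# Rows become users, columns become items; unrated pairs are left as NaN.
interaction_matrix = ratings.pivot(index="user", columns="item", values="rating")
print(interaction_matrix)

# Fraction of user-item pairs without a rating (the sparsity of the matrix).
sparsity = interaction_matrix.isna().to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```

Even in this toy example, 4 of the 9 user-item pairs have no rating, which hints at how sparse a real interaction matrix can be.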


#### KNN-based collaborative filtering

As mentioned before, each row vector represents the rating history of a user and each column vector represents the users who rated the item. A user-item interaction matrix is usually very sparse, because a user typically interacts with only a small subset of items and an item is typically interacted with by only a small subset of users.

Now, to determine whether two users are similar, we can simply calculate the similarity between their row vectors in the interaction matrix. Then, based on the similarity measurements, we can take the `k` nearest neighbors as the similar users.
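
As a rough illustration of this idea, the sketch below computes cosine similarities between the rows of a small, hypothetical interaction matrix and picks the `k` most similar users. The helper names `cosine_similarity_matrix` and `k_nearest_users` are made up for this example. It also treats unrated items as zeros for simplicity, whereas Surprise computes its similarities only over co-rated items, so its numbers would differ.

```python
import numpy as np

# Hypothetical interaction matrix: 4 users (rows) x 5 items (columns), 0 = not rated.
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Avoid division by zero for users without any rating.
    normalized = matrix / np.where(norms == 0, 1.0, norms)
    return normalized @ normalized.T

def k_nearest_users(similarity, user, k):
    """Indices of the k users most similar to `user`, excluding the user itself."""
    order = np.argsort(-similarity[user])
    return [int(v) for v in order if v != user][:k]

sim = cosine_similarity_matrix(R)
print(np.round(sim, 3))
print(k_nearest_users(sim, user=0, k=2))
```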

Item-based collaborative filtering works similarly; we just need to look at the user-item matrix vertically. Instead of finding similar users, we try to find similar items (courses). If two courses are enrolled in by similar groups of users, we can consider the two items similar and use the known ratings of one item to predict the unknown ratings of the other.

If we formulate KNN-based collaborative filtering, the predicted rating of user $u$ for item $i$, $\hat{r}_{ui}$, is given by:

**User-based** collaborative filtering:

$$\hat{r}_{ui} = \frac{
\sum\limits_{v \in N^k_i(u)} \text{similarity}(u, v) \cdot r_{vi}}
{\sum\limits_{v \in N^k_i(u)} \text{similarity}(u, v)}$$

**Item-based** collaborative filtering:

$$\hat{r}_{ui} = \frac{
\sum\limits_{j \in N^k_u(i)} \text{similarity}(i, j) \cdot r_{uj}}
{\sum\limits_{j \in N^k_u(i)} \text{similarity}(i, j)}$$

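Here $N^k_i(u)$ denotes the $k$ nearest neighbors of user $u$ that have rated item $i$, and $N^k_u(i)$ denotes the $k$ nearest neighbors of item $i$ that have been rated by user $u$.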
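
The user-based formula can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `predict_user_based`, the rating matrix `R` (with 0 meaning "not rated"), and the hand-written similarity matrix `sim` are all hypothetical, and this is not how Surprise implements it internally.

```python
import numpy as np

def predict_user_based(R, sim, u, i, k):
    """Similarity-weighted average of the ratings given to item i
    by the k users most similar to u who have rated i."""
    # Candidates: other users who have actually rated item i (0 means "not rated").
    candidates = [v for v in range(R.shape[0]) if v != u and R[v, i] > 0]
    # Keep the k candidates with the highest similarity to u.
    neighbors = sorted(candidates, key=lambda v: sim[u, v], reverse=True)[:k]
    weight_sum = sum(sim[u, v] for v in neighbors)
    if weight_sum == 0:
        return None  # no usable neighbor, so no prediction
    return sum(sim[u, v] * R[v, i] for v in neighbors) / weight_sum

# Hypothetical ratings (rows = users, columns = items) and similarities.
R = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 2.0, 3.0],
    [1.0, 5.0, 0.0],
])
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])

# User 0 has not rated item 1; users 1 and 2 have.
prediction = predict_user_based(R, sim, u=0, i=1, k=2)
print(prediction)
```

For user 0 and item 1, the neighbors are users 1 and 2, so the prediction works out to $(0.8 \cdot 2 + 0.2 \cdot 5) / (0.8 + 0.2) = 2.6$. The item-based variant would follow the same pattern with the roles of rows and columns swapped.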

----

*Surprise* is a Python scikit for recommender systems. It makes it simple to build and test different recommendation algorithms.

In [5]:
from surprise import KNNBasic, NMF
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import cross_validate

In [13]:
# Set a random state for reproducibility
rs = 42

Let's look at a code example that shows how easy it is to perform KNN collaborative filtering on a sample movie review dataset, which contains about 100k movie ratings from users.


In [6]:
# Load the movielens-100k dataset (download it if needed)
data = Dataset.load_builtin('ml-100k', prompt=False)

# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the KNNBasic algorithm with its default settings.
algo = KNNBasic()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9785


0.9785231714925331

The main evaluation metric used was `Root Mean Square Error (RMSE)`, a popular rating-estimation error metric in recommender systems as well as in the evaluation of many regression models.
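
RMSE is the square root of the mean squared difference between the true and the predicted ratings. As a small worked example, it can be computed by hand as follows; the function name `rmse` and the two rating lists are hypothetical and only serve as an illustration.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root Mean Square Error between two equally long sequences of ratings."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true and predicted ratings for four (user, item) pairs.
true_ratings = [5, 3, 4, 1]
predicted_ratings = [4, 3, 5, 3]

# Squared errors are 1, 0, 1 and 4, so the mean squared error is 1.5.
print(rmse(true_ratings, predicted_ratings))
```

Because the errors are squared before averaging, a single large miss (here the error of 2) weighs more heavily than several small ones.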

In [7]:
# Load the course ratings CSV file into a dataframe
ratings_df = pd.read_csv("ratings.csv")
ratings_df.head()

|  | user | item | rating |
| --- | --- | --- | --- |
| 0 | 1889878 | CC0101EN | 5 |
| 1 | 1342067 | CL0101EN | 3 |
| 2 | 1990814 | ML0120ENv3 | 5 |
| 3 | 380098 | BD0211EN | 5 |
| 4 | 779563 | DS0101EN | 3 |


In [8]:
# Create a Reader for the course rating dataset: each line holds user, item, rating; the header line is skipped
reader = Reader(
    line_format='user item rating', sep=',', skip_lines=1, rating_scale=(1, 5))

# Load the dataset from the CSV file
course_dataset = Dataset.load_from_file("ratings.csv", reader=reader)

In [9]:
type(course_dataset)

surprise.dataset.DatasetAutoFolds

We split it into a training set and a test set:

In [10]:
trainset, testset = train_test_split(course_dataset, test_size=.2)

In [11]:
# check how many users and items we can use to fit a KNN model:
print(f"Total {trainset.n_users} users and {trainset.n_items} items in the training set")

Total 32167 users and 125 items in the training set


## Perform KNN-based collaborative filtering on the user-item interaction matrix

In [14]:
sim_options = {'name': 'pearson', 'user_based': True}
# KNNBasic has no random component, so no random state is passed to it
model_KNN = KNNBasic(k=40, min_k=1, sim_options=sim_options)

# Run 5-fold cross-validation and print results
cv_knn = cross_validate(model_KNN, course_dataset, measures=["RMSE", "MAE"], cv=5, verbose=True)
cv_knn

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8271  0.8239  0.8264  0.8265  0.8258  0.8259  0.0011  
MAE (testset)     0.7001  0.6959  0.6992  0.6993  0.6980  0.6985  0.0015  
Fit time          122.82  124.23  122.52  124.99  126.49  124.21  1.45    
Test time         54.34   56.84   54.99   60.28   54.74   56.24   2.19    


{'test_rmse': array([0.82709264, 0.82392115, 0.82642214, 0.82652314, 0.82577362]),
 'test_mae': array([0.70006109, 0.69588436, 0.69917227, 0.69927677, 0.69797642]),
 'fit_time': (122.82270002365112,
  124.2308931350708,
  122.5240170955658,
  124.9873058795929,
  126.48764395713806),
 'test_time': (54.33773708343506,
  56.84211993217468,
  54.989935874938965,
  60.27819013595581,
  54.743783950805664)}

In [15]:
# Train the model
model_KNN.fit(trainset)
# Make predictions with KNN on the test set
predictions_knn = model_KNN.test(testset)
# Evaluate the model
accuracy.rmse(predictions_knn)

Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.8284


0.8284336859477366

## Perform NMF-based collaborative filtering 

Non-negative Matrix Factorization (NMF) decomposes the **user-item interaction matrix** into a **user matrix** and an **item matrix**, which contain the **latent features** of users and items. The estimated rating of a user for an item is simply the dot product of the corresponding user and item feature vectors.
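
The decomposition idea can be sketched with plain NumPy. This is a simplified illustration, not Surprise's algorithm: it assumes a small, fully observed, hypothetical rating matrix and uses the classic multiplicative update rules, whereas Surprise's `NMF` fits only on the observed ratings with its own optimization procedure. The function name `nmf_sketch` and its parameters are made up for this example.

```python
import numpy as np

def nmf_sketch(R, n_factors=2, n_iters=500, seed=0, eps=1e-9):
    """Factorize a non-negative matrix R (users x items) as P @ Q.T with P, Q >= 0."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.random((n_users, n_factors)) + 0.1  # user latent features
    Q = rng.random((n_items, n_factors)) + 0.1  # item latent features
    for _ in range(n_iters):
        # Multiplicative updates: non-negative factors stay non-negative.
        P *= (R @ Q) / (P @ (Q.T @ Q) + eps)
        Q *= (R.T @ P) / (Q @ (P.T @ P) + eps)
    return P, Q

# Hypothetical, fully observed rating matrix (4 users x 3 items).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

P, Q = nmf_sketch(R)

# The estimated rating of user u for item i is the dot product of their latent vectors.
u, i = 0, 2
estimated_rating = P[u] @ Q[i]
print(f"true: {R[u, i]:.1f}, estimated: {estimated_rating:.2f}")
print("reconstruction error:", np.linalg.norm(R - P @ Q.T))
```

With only two latent factors, the product `P @ Q.T` cannot reproduce every rating exactly, but it should capture the two broad taste groups in this toy matrix.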

In [16]:
# Build the NMF (Non-negative Matrix Factorization) model
model_NMF = NMF(random_state=rs)

# Train the model
model_NMF.fit(trainset)

# Make predictions with NMF on the test set
predictions_nmf = model_NMF.test(testset)

# Evaluate the model
accuracy.rmse(predictions_nmf)

RMSE: 0.9541


0.9540958479887904

----

In the next and final notebook of this series, we will use a neural network for course rating prediction.