# **Collaborative Filtering based Recommender System using Non-negative Matrix Factorization**


Estimated time needed: **60** minutes


The KNN algorithm is memory-based which means we need to keep all instances for prediction and maintain a big similarity matrix. These can be infeasible if our user/item scale is large, for example, 1 million users will require a 1 million by 1 million similarity matrix, which is very hard to load into RAM for most computation environments.


#### Non-negative matrix factorization


In the machine learning course, you have learned a dimensionality reduction algorithm called Non-negative matrix factorization (NMF), which decomposes a big sparse matrix into two smaller and dense matrices.

Non-negative matrix factorization can be one solution to big matrix issues. The main idea is to decompose the big and sparse user-interaction into two smaller dense matrices, one represents the transformed user features and another represents the transformed item features.


An example is shown below, suppose we have a user-item interaction matrix $A$ with 10000 users and 100 items (10000 x 100), and its element `(j, k)` represents the rating of item `k` from user `j`. Then we could decompose $A$ into two smaller and dense matrices $U$ (10000 x 16) and $I$ (16 x 100). for user matrix $U$, each row vector is a transformed latent feature vector of a user, and for the item matrix $I$, each column is a transformed latent feature vector of an item. 

Here the dimension 16 is a hyperparameter defines the size of the hidden user and item features, which means now the shape of transposed user feature vector and item feature vector is now 16 x 1.


The magic here is when we multiply the row `j` of $U$ and column `k` of matrix $I$, we can get an estimation to the original rating $\hat{r}_{jk}$. 

For example, if we preform the dot product user ones  row vector in $U$ and item ones  column vector in $I$, we can get the rating estimation of user one to item one, which is the element (1, 1) in the original interaction matrix $I$. This r


Note $I$ is short for Items, and it is not an identity matrix.


Then how do we figure out the values in $U$ and $I$ exactly? Like many other machine learning processes, we could start by initializing the values of $U$ and $I$, then define the following distance or cost function to be minimized:


$$\sum_{r_{jk} \in {train}} \left(r_{jk} - \hat{r}_{jk} \right)^2,$$


where $\hat{r}_{ij}$ is the dot product of $u_j^T$ and $i_k$:


$$\hat{r}_{jk} = u_j^Ti_k$$


The cost function can be optimized using stochastic gradient descent (SGD) or other optimization algorithms, just like in training the weights in a logistic regression model (there are several additional steps so the matrices have no negative elements) . 


----


### Load and exploring dataset


Let's first load our dataset, i.e., the user-item (learn-course) interaction matrix


In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [2]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
rating_df = pd.read_csv(rating_url)

In [3]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


The dataset contains three columns, `user id`, `item id`, and `the rating`. Note that this matrix is presented as the dense or vertical form, you may convert it using `pivot` to the original sparse matrix:


In [4]:
rating_sparse_df = rating_df.pivot(index='user', columns='item', values='rating').fillna(0).reset_index().rename_axis(index=None, columns=None)
rating_sparse_df.head()

Unnamed: 0,user,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
0,2,0.0,3.0,0.0,0.0,3.0,2.0,0.0,2.0,2.0,...,0.0,2.0,0.0,3.0,0.0,2.0,2.0,0.0,3.0,0.0
1,4,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,...,0.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0
2,5,2.0,2.0,2.0,0.0,2.0,0.0,0.0,0.0,2.0,...,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0
3,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, you need to implement NMF-based collaborative filtering, and you may choose one of the two following implementation options: 
- The first one is to use `Surprise` which is a popular and easy-to-use Python recommendation system library. 
- The second way is to implement it with `numpy`, `pandas`, and `sklearn`. You may need to write a lot of low-level implementation code along the way.


## Implementation Option 1: Use **Surprise** library (recommended)


*Surprise* is a Python scikit library for recommender systems. It is simple and comprehensive to build and test different recommendation algorithms. First let's install it:


In [5]:
# if you haven't done it before, uncomment the below line:
#!pip install scikit-surprise==1.1.1

We import required classes and methods


In [6]:
from surprise import NMF
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

In [7]:
rating_df.to_csv("course_ratings.csv", index=False)
# Read the course rating dataset with columns user item rating
reader = Reader(
        line_format='user item rating', sep=',', skip_lines=1, rating_scale=(2, 3))

coruse_dataset = Dataset.load_from_file("course_ratings.csv", reader=reader)

Now  we split the data into a train-set and test-set:


In [8]:
trainset, testset = train_test_split(coruse_dataset, test_size=.3)

Then check how many users and items we can use to fit the KNN model:


In [9]:
print(f"Total {trainset.n_users} users and {trainset.n_items} items in the trainingset")

Total 31454 users and 125 items in the trainingset


### TASK: Perform NMF-based collaborative filtering on the course-interaction matrix


_TODO: Fit a NMF model using the trainset and evaluate the results using the testset_ The code will be very similar to the KNN-based collaborative filtering, you just need to use the `NMF()` model.


In [10]:
## WRITE YOUR CODE HERE:

# - Define a NMF model NMF(verbose=True, random_state=123)
mymodel = NMF(verbose=True, random_state=123, init_low=0.5, init_high=5.0, n_factors=32)

# - Train the NMF on the trainset, and predict ratings for the testset
mymodel.fit(trainset)
predictions = mymodel.test(testset)

# - Then compute RMSE
accuracy.rmse(predictions)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
RMSE: 0.18

0.18857259466192708

## Implementation Option 2: Use `numpy`, `pandas`, and `sklearn`.


If you do not prefer the one-stop Suprise solution, you may implement the KNN model using `numpy`, `pandas`, and possibly `sklearn`:


In [11]:
## WRITE YOUR CODE HERE:

############ First try ############

import pandas as pd
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from math import sqrt

# Load the CSV file into a Pandas DataFrame
rating_df = pd.read_csv('course_ratings.csv', dtype={'user': 'int64', 'item': 'object', 'rating': 'float64'})

# Create a pivot table to get the user-item interaction matrix
interaction_matrix = rating_df.pivot_table(index='user', columns='item', values='rating')

# Fill missing values with 0
interaction_matrix.fillna(0, inplace=True)

# Use NMF to decompose the interaction matrix
model = NMF(n_components=2, init='random', random_state=0)
U = model.fit_transform(interaction_matrix)
I = model.components_

# Calculate the estimated ratings
test_df = pd.DataFrame({'user': [1889878], 'item': ['CC0101EN'], 'rating': [3.0]})
test_matrix = test_df.pivot_table(index='user', columns='item', values='rating')
estimated_rating = U.dot(I)[0][interaction_matrix.columns.get_loc('CC0101EN')]

# Calculate the RMSE for the entire test dataset
predicted_ratings = U.dot(I)
rmse = sqrt(mean_squared_error(interaction_matrix.values, predicted_ratings))
print('RMSE', rmse)

RMSE 0.5478892660533529


In [13]:
############ Second try ############

import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt

# Load the CSV file into a Pandas DataFrame
rating_df = pd.read_csv('course_ratings.csv', dtype={'user': 'int64', 'item': 'object', 'rating': 'float64'})

# Split the data into training and test sets
train_df, test_df = train_test_split(rating_df, test_size=0.3)

# Create a pivot table to get the user-item interaction matrix for the training set
train_matrix = train_df.pivot_table(index='user', columns='item', values='rating')

# Fill missing values with 0
train_matrix.fillna(0, inplace=True)

# Use NMF to decompose the interaction matrix
model = NMF(n_components=2, init='random', random_state=123)
U = model.fit_transform(train_matrix)
I = model.components_

# Find all unique users in the original rating_df
unique_users = rating_df['user'].unique()

# Filter test_df to only include users that exist in the original rating_df
test_df = test_df[test_df['user'].isin(unique_users)]

# Calculate the estimated ratings for each row in test_df
test_df['estimated_rating'] = test_df.apply(lambda row: np.dot(U[int(row['user'])-1], I[:,train_matrix.columns.get_loc(row['item'])]) if row['item'] in train_matrix.columns and int(row['user']) <= len(U) else np.nan, axis=1)


# Remove rows with missing estimated ratings
test_df.dropna(subset=['estimated_rating'], inplace=True)

# Calculate the RMSE
rmse = sqrt(mean_squared_error(test_df['rating'], test_df['estimated_rating']))
print('RMSE:', rmse)

RMSE: 2.3398480765193614
