## Introduction to Collaborative Filtering

reference: https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html

In [1]:
import os
os.chdir('c:\\users\\bangda\\desktop')

In [2]:
import numpy as np
import pandas as pd

In [3]:
columns = ['user_id', 'item_id', 'rating', 'timestamp']
data = pd.read_csv('ml-100k/u.data', sep = '\t', names = columns)
data.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [4]:
print('Shape of raw data: {}'.format(data.shape))
print('Number of users: {}'.format(data.user_id.unique().shape[0]))
print('Number of items: {}'.format(data.item_id.unique().shape[0]))

Shape of raw data: (100000, 4)
Number of users: 943
Number of items: 1682


In [5]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size = 0.25)
train.shape, test.shape

((75000, 4), (25000, 4))

### 1. Memory-Based Collaborative Filtering

There are two implementations: user-item filtering, which finds users similar to a given user, and item-item filtering, which finds items similar to the ones a user has already rated.

Create a user-item matrix ($943 \times 1682$ here) and fill in the ratings. Then compute the pairwise similarities and store them in a similarity matrix. For user-item CF, the similarity between two users is measured over the items that both users have rated; for item-item CF, the similarity between two items is measured over the users who have rated both items.
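
The similarity step described above can be sketched on a toy matrix. This is a minimal illustration, not the notebook's method: the 3 x 4 `ratings` array and the helper `cosine_similarity_rows` are hypothetical, and a `0` is assumed to mean "not rated". Because unrated cells are zero, the dot product only accumulates over co-rated items, while the norms still use every rating a user gave.

```python
import numpy as np

# Toy user-item matrix (hypothetical data): 3 users x 4 items, 0 = not rated.
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

def cosine_similarity_rows(matrix, eps=1e-9):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / np.maximum(norms, eps)  # guard against all-zero rows
    return normalized @ normalized.T

user_sim = cosine_similarity_rows(ratings)    # users x users
item_sim = cosine_similarity_rows(ratings.T)  # items x items

print(user_sim.round(3))
print(item_sim.round(3))
```

In this toy data, users 0 and 1 come out more similar to each other than users 0 and 2, and the item that nobody rated has zero similarity to everything.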

Drawbacks: memory-based CF doesn't scale well to real-world scenarios and cannot address the cold-start problem, which arises when new users or new items without any ratings enter the system. Model-based CF also suffers from the cold-start problem.
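
A small sketch of why cold start is a problem for similarity-based methods, assuming a hypothetical toy matrix in which the last user has not rated anything yet: every dot product with that user is zero and the user's norm is zero, so cosine similarity gives no usable neighbours.

```python
import numpy as np

# Hypothetical toy data: the last row is a brand-new user with no ratings.
ratings = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 2.0, 1.0],
    [0.0, 0.0, 0.0],
])

new_user = ratings[2]
norms = np.linalg.norm(ratings, axis=1)

# Numerators of the cosine similarity between the new user and the others.
dots = ratings[:2] @ new_user

print(dots)      # every overlap with existing users is zero
print(norms[2])  # the new user's norm is zero, so the ratio is 0 / 0
```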

In [6]:
num_users = data.user_id.unique().shape[0]
num_items = data.item_id.unique().shape[0]
# Ids are 1-based, so subtract 1 to get 0-based matrix indices; 0 marks "no rating".
# In itertuples(), row[0] is the DataFrame index, so row[1:4] is (user_id, item_id, rating).
train_user_item_matrix = np.zeros((num_users, num_items))
for row in train.itertuples():
    train_user_item_matrix[row[1] - 1, row[2] - 1] = row[3]

test_user_item_matrix = np.zeros((num_users, num_items))
for row in test.itertuples():
    test_user_item_matrix[row[1] - 1, row[2] - 1] = row[3]

In [7]:
from sklearn.metrics.pairwise import pairwise_distances
# Note: metric = 'cosine' returns the cosine distance (1 - cosine similarity), not the similarity itself.
user_similarity = pairwise_distances(train_user_item_matrix, metric = 'cosine')
item_similarity = pairwise_distances(train_user_item_matrix.T, metric = 'cosine')
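
As a quick check of what `pairwise_distances` returns with `metric = 'cosine'`, the sketch below compares it with `cosine_similarity` from the same module on a hypothetical toy matrix. It only illustrates the relationship between the two functions; it does not change the cell above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical toy ratings: 3 users x 3 items.
toy = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 2.0, 1.0],
    [1.0, 1.0, 5.0],
])

cos_dist = pairwise_distances(toy, metric='cosine')  # distance: 0 on the diagonal
cos_sim = cosine_similarity(toy)                     # similarity: 1 on the diagonal

print(cos_dist.round(3))
print(cos_sim.round(3))
```

The two matrices satisfy `cos_dist = 1 - cos_sim`, so a true similarity matrix could be obtained as `1 - pairwise_distances(..., metric = 'cosine')`.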

In [8]:
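def predict(ratings, similarity, type = 'user'):
    if type == 'user':
        # Center each user's ratings on that user's mean (taken over all items, unrated zeros included)
        user_rating_mean = np.mean(ratings, axis = 1)
        ratings_diff = ratings - user_rating_mean[:, np.newaxis]
        # Weighted average of the other users' deviations, added back to the user's own mean
        prediction = user_rating_mean[:, np.newaxis] + np.dot(similarity, ratings_diff) / np.array([np.abs(similarity).sum(axis = 1)]).T
    elif type == 'item':
        # Weighted average of the user's own ratings on the other items
        prediction = np.dot(ratings, similarity) / np.array([np.abs(similarity).sum(axis = 1)])
    else:
        raise ValueError("type must be 'user' or 'item'")

    return prediction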
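
Written out, the two branches of `predict` appear to compute the following. The notation is introduced here for illustration only: $x_{ui}$ is the entry of `ratings` for user $u$ and item $i$, $\bar{x}_u$ is the mean of row $u$ (over all items, unrated zeros included), and $s$ is whatever matrix is passed in as `similarity`. The item-based form assumes $s$ is symmetric.

$$\hat{x}_{ui} = \bar{x}_u + \frac{\sum_{v} s_{uv}\,(x_{vi} - \bar{x}_v)}{\sum_{v} |s_{uv}|} \qquad \text{(user-based)}$$

$$\hat{x}_{ui} = \frac{\sum_{j} s_{ij}\,x_{uj}}{\sum_{j} |s_{ij}|} \qquad \text{(item-based)}$$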

In [9]:
user_pred = predict(train_user_item_matrix, user_similarity, type = 'user')
item_pred = predict(train_user_item_matrix, item_similarity, type = 'item')

In [10]:
from sklearn.metrics import mean_squared_error
def rmse(prediction, actual):
    # Score only the cells that hold a rating in `actual` (non-zero entries)
    prediction = prediction[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return np.sqrt(mean_squared_error(actual, prediction))
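
To make the masking explicit, here is a NumPy-only sketch of the same idea on hypothetical 2 x 2 data. The name `masked_rmse` is an illustrative stand-in, not part of the notebook; only the cells where `actual` is non-zero contribute to the error.

```python
import numpy as np

def masked_rmse(prediction, actual):
    """RMSE over the entries where `actual` is non-zero (i.e. rated)."""
    mask = actual != 0
    return np.sqrt(np.mean((prediction[mask] - actual[mask]) ** 2))

# Hypothetical data: two rated cells (4 and 2), two unrated cells (0).
actual = np.array([[4.0, 0.0],
                   [0.0, 2.0]])
prediction = np.array([[3.0, 5.0],
                       [1.0, 4.0]])

# Errors on the rated cells are -1 and +2; the other two predictions are ignored.
score = masked_rmse(prediction, actual)
print(score)
```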

In [11]:
print('User-based CF RMSE: {}'.format(rmse(user_pred, test_user_item_matrix)))
print('Item-based CF RMSE: {}'.format(rmse(item_pred, test_user_item_matrix)))

User-based CF RMSE: 3.1273906568
Item-based CF RMSE: 3.4538528082


### 2. Model-Based Collaborative Filtering

Model-based CF is based on matrix factorization, which is mainly used as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF. The goal is to learn the latent preferences of users and the latent attributes of items from the known ratings, and then to predict the unknown ratings through the dot product of the users' and items' latent features.

When you factorize the user-item matrix, you represent it as the product of two low-rank matrices whose rows contain the latent vectors of the users and of the items (a latent feature might loosely correspond to something like age, location, or gender). This product is fitted to approximate the original matrix as closely as possible.
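
The shape of such a factorization can be sketched as follows. The factor matrices here are random placeholders rather than learned values, and the names `user_factors` and `item_factors` are hypothetical; the point is only that two thin matrices produce a full users x items matrix of rank at most `k`, and that each predicted cell is a dot product of one user vector and one item vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 2  # k latent features per user and per item

# Placeholder latent vectors: one row per user and one row per item.
user_factors = rng.normal(size=(n_users, k))
item_factors = rng.normal(size=(n_items, k))

# Full users x items matrix implied by the two low-rank factors.
approx = user_factors @ item_factors.T

# A single predicted rating is the dot product of two latent vectors.
single = user_factors[1] @ item_factors[3]

print(approx.shape)
print(np.linalg.matrix_rank(approx))
```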

Models that use both ratings and content features are called hybrid recommender systems; they combine CF with content-based models.

One of the well-known matrix factorization methods is singular value decomposition (SVD). CF can be formulated as approximating the rating matrix $X$ by its SVD, $X=USV^\top$, where $X$ is the $m\times n$ user-item matrix; $U$ is an $m\times r$ matrix with orthonormal columns; $S$ is an $r\times r$ diagonal matrix with non-negative diagonal elements, known as the singular values; and $V^\top$ is an $r\times n$ matrix with orthonormal rows.

$U$ represents the feature vectors of the users in the hidden feature space and $V$ represents the feature vectors of the items in the hidden feature space. Keeping only the $k$ largest singular values and the corresponding columns of $U$ and $V$, you can make a prediction by: $\hat{X} = USV^\top$
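
A minimal sketch of the decomposition and its truncation, using `numpy.linalg.svd` on a hypothetical 4 x 5 toy matrix (the notebook itself uses `scipy.sparse.linalg.svds` on the real data). The helper `truncate` is illustrative: it rebuilds the matrix from the `k` largest singular values only.

```python
import numpy as np

# Hypothetical toy user-item matrix: 4 users x 5 items.
X = np.array([
    [5.0, 3.0, 0.0, 1.0, 4.0],
    [4.0, 0.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 5.0, 2.0],
    [1.0, 0.0, 0.0, 4.0, 1.0],
])

# Thin SVD: U is 4 x 4, s holds 4 singular values (descending), Vt is 4 x 5.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def truncate(k):
    """Rank-k reconstruction from the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

full = truncate(len(s))  # keeps everything, so it reproduces X
rank2 = truncate(2)      # low-rank approximation used as the prediction
err = np.linalg.norm(X - rank2)  # Frobenius norm of the approximation error

print(s.round(3))
print(rank2.round(2))
print(err)
```

The approximation error of the rank-2 reconstruction equals the root of the sum of the squared singular values that were dropped, which is why keeping the largest ones preserves most of the matrix.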

In [12]:
sparsity = 1. - data.shape[0] / float(num_users * num_items)
print('The sparsity of the matrix is {}'.format(sparsity))

The sparsity of the matrix is 0.936953306358


In [13]:
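from scipy.sparse.linalg import svds
# Truncated SVD: keep only the k = 20 largest singular values and their singular vectors
u, s, vT = svds(train_user_item_matrix, k = 20)
s_diag_matrix = np.diag(s)
# Rank-20 reconstruction of the training matrix, used as the predicted ratings
X_pred = np.dot(np.dot(u, s_diag_matrix), vT)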

In [14]:
print('SVD-based CF RMSE {}'.format(rmse(X_pred, test_user_item_matrix)))

SVD-based CF RMSE 2.71155698102
