# Model-Based Collaborative Filtering

Model-based collaborative filtering (CF) is a recommendation technique that predicts user preferences from patterns in user-item interactions. Memory-based methods compute predictions directly from the stored ratings (e.g., through user-user or item-item similarity), whereas model-based approaches first fit a predictive model to the available data and then use that model to make predictions.

## Key Features of Model-Based CF

- **Matrix Factorization**:  
  A common approach is matrix factorization, for example Singular Value Decomposition (SVD) or a factorization fitted with Alternating Least Squares (ALS). These methods decompose the user-item interaction matrix into lower-dimensional matrices whose latent factors explain user preferences.

- **Statistical Models**:  
  Other model-based methods may use statistical models, such as Bayesian approaches or clustering techniques, to identify patterns in the data.

- **Scalability**:  
  Model-based methods can be more scalable than memory-based methods, especially when dealing with large datasets, because they summarize the data into a model rather than storing all user-item interactions.

- **Cold Start Problem**:  
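  Model-based approaches can sometimes mitigate the cold start problem (new users or items) when the model also incorporates side information such as user or item attributes. A model trained purely on interactions still has nothing to infer from for a user or item with no recorded interactions.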
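
To make the idea of latent factors concrete, here is a minimal sketch of matrix factorization on a toy rating matrix. It is fitted with plain stochastic gradient descent, which is a different fitting procedure from the SVD and ALS methods named above and is used here only because it is short. The function name `factorize`, the toy data, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def factorize(ratings, n_factors=2, n_epochs=500, lr=0.01, reg=0.02, seed=0):
    """Fit ratings ~= P @ Q.T on the observed (non-zero) entries with SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors
    users, items = ratings.nonzero()
    for _ in range(n_epochs):
        for u, i in zip(users, items):
            err = ratings[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()
            # Gradient step with L2 regularization on both factor vectors.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

# Hypothetical 5 x 4 rating matrix; 0 means "not rated".
toy = np.array([[5, 3, 0, 1],
                [4, 0, 0, 1],
                [1, 1, 0, 5],
                [1, 0, 0, 4],
                [0, 1, 5, 4]], dtype=float)

P, Q = factorize(toy)
pred = P @ Q.T  # dense matrix of predicted ratings, including unrated cells
observed = toy.nonzero()
train_rmse = np.sqrt(np.mean((toy[observed] - pred[observed]) ** 2))
print('Training RMSE on observed entries:', train_rmse)
```

The zero entries are skipped during fitting, so the model fills them in from the learned factors rather than treating them as ratings of 0.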

## Applications

- **E-commerce Recommendations**:  
  Product recommendations based on user behavior.

- **Streaming Services**:  
  Suggesting movies or music based on user preferences.

- **Social Networks**:  
  Friend or content recommendations.

Overall, model-based collaborative filtering is a powerful approach for building personalized recommendation systems, leveraging data to create a predictive model that helps users discover relevant items.


In [40]:
import numpy as np
import pandas as pd
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)
df.head()

|   | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |


In [41]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

|   | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [44]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

|   | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 | Star Wars (1977) |
| 1 | 0 | 172 | 5 | 881250949 | Empire Strikes Back, The (1980) |
| 2 | 0 | 133 | 1 | 881250949 | Gone with the Wind (1939) |
| 3 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 4 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |


In [45]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 944
Num of Movies: 1682


In [48]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

In [49]:
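# Create two dense user-item matrices, one for training and one for testing.
# IDs are shifted by -1 to get 0-based indices; user_id 0 (present in this data)
# maps to index -1, which NumPy resolves to the last row.
train_data_matrix = np.zeros((n_users, n_items))
for row in train_data.itertuples():
    train_data_matrix[row.user_id - 1, row.item_id - 1] = row.rating

test_data_matrix = np.zeros((n_users, n_items))
for row in test_data.itertuples():
    test_data_matrix[row.user_id - 1, row.item_id - 1] = row.rating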
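
The loop above fills the matrix one rating at a time. As a rough sketch, the same matrix can be built with a single NumPy fancy-indexing assignment. The toy frame and the `*_toy` names below are made up for illustration; only the column names match the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature ratings table with the same columns as df.
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 3, 4, 1],
})
n_users_toy, n_items_toy = 3, 3

# Loop version, mirroring the cell above.
loop_matrix = np.zeros((n_users_toy, n_items_toy))
for row in toy_ratings.itertuples():
    loop_matrix[row.user_id - 1, row.item_id - 1] = row.rating

# Vectorized version: index rows and columns with whole arrays at once.
fast_matrix = np.zeros((n_users_toy, n_items_toy))
fast_matrix[toy_ratings['user_id'].to_numpy() - 1,
            toy_ratings['item_id'].to_numpy() - 1] = toy_ratings['rating'].to_numpy()

print(np.array_equal(loop_matrix, fast_matrix))
```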

In [51]:
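# Note: pairwise_distances with metric='cosine' returns cosine distance (1 - cosine similarity),
# so the two matrices below hold distances despite their names.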
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')
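
Because `pairwise_distances` returns a distance, it is worth checking how it relates to cosine similarity before using the result as a weight. The sketch below does this on a small made-up matrix (`toy_matrix` is an assumption, not data from the notebook) and compares against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Hypothetical 3 users x 3 items rating matrix.
toy_matrix = np.array([[5.0, 3.0, 0.0],
                       [4.0, 0.0, 1.0],
                       [0.0, 1.0, 5.0]])

distance = pairwise_distances(toy_matrix, metric='cosine')
similarity = 1.0 - distance  # convert cosine distance to cosine similarity

# A row compared with itself has distance 0, i.e. similarity 1.
print(distance.diagonal())
print(np.allclose(similarity, cosine_similarity(toy_matrix)))
```

With distances used as weights, the most dissimilar neighbours contribute the most to a prediction, so subtracting from 1 is usually what a similarity-weighted scheme intends.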

In [52]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns the per-user means into a column vector so they broadcast against ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])     
    return pred
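
Written out as formulas, the two branches of `predict` compute the following, where $r_{u,i}$ is the entry of `ratings` for user $u$ and item $i$, $s$ is the `similarity` matrix passed in, and $\bar{r}_u$ is the mean of row $u$ taken over all items (including the zeros that stand for unrated items). The symbols are introduced here for illustration and do not appear in the notebook.

User-based:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v} |s_{u,v}|}$$

Item-based:

$$\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} |s_{i,j}|}$$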

In [53]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

In [55]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
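
The key detail in `rmse` is that only positions with a non-zero ground-truth rating are scored. The following sketch reimplements the same masking with NumPy alone on a two-by-two example; `masked_rmse` and the toy arrays are hypothetical names chosen for this illustration.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the positions where ground_truth holds a rating (non-zero)."""
    mask = ground_truth.nonzero()
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[4.0, 0.0],
                  [0.0, 2.0]])
pred = np.array([[3.0, 9.0],
                 [9.0, 4.0]])

# Only (0, 0) and (1, 1) are scored: errors -1 and 2, so RMSE = sqrt((1 + 4) / 2).
score = masked_rmse(pred, truth)
print(score)
```

The two predictions of 9.0 sit on unrated cells and therefore do not affect the score.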

In [58]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.1212883629771326
Item-based CF RMSE: 3.4490848574621245


In [62]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


## SVD

Singular Value Decomposition (SVD) is a matrix factorization from linear algebra. It decomposes a matrix into a product of three matrices whose structure exposes useful properties of the original, such as its rank and its dominant directions of variation. Here's how it works and why it's useful:

### Mathematical Representation  
For a given matrix A of dimensions m x n, SVD states that:  

A = U Σ V^T

- **U**: An m x m orthogonal matrix whose columns are the left singular vectors.  
- **Σ**: An m x n rectangular diagonal matrix whose diagonal entries are the singular values of A, all non-negative.  
- **V^T**: The transpose of an n x n orthogonal matrix V whose columns are the right singular vectors.  
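
The shapes and properties listed above can be checked numerically. This is a small sketch using `np.linalg.svd` on a made-up 3 x 4 matrix; the matrix values and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical 3 x 4 matrix (m = 3, n = 4).
A = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
m, n = A.shape

# full_matrices=True returns U as m x m and Vt as n x n.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=True)

# np.linalg.svd returns the singular values as a 1-D array,
# so place them on the diagonal of an m x n matrix to form Sigma.
Sigma = np.zeros((m, n))
Sigma[:len(singular_values), :len(singular_values)] = np.diag(singular_values)

A_rebuilt = U @ Sigma @ Vt

print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(A, A_rebuilt))            # the product recovers A
print(np.allclose(U.T @ U, np.eye(m)))      # U is orthogonal
```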

### Key Properties  
- **Dimensionality Reduction**: SVD can reduce the dimensionality of data by keeping only the top k singular values and their corresponding singular vectors, which helps in noise reduction and extracting significant features.  

- **Latent Factors**: In recommendation systems, SVD helps identify latent factors that explain user preferences and item characteristics, making it useful for collaborative filtering.  

- **Matrix Approximation**: By truncating the matrices U, Σ, and V to only include the top k singular values, SVD can approximate the original matrix. This is particularly useful when dealing with sparse data.
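
To illustrate the truncation described above, the sketch below keeps only the top k singular values of a small random matrix and measures how far the rank-k approximation is from the original. It uses `np.linalg.svd` for simplicity rather than the sparse routine used later in the notebook, and the random matrix and the helper `rank_k_approximation` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 6 x 5 matrix of integer "ratings" between 0 and 5.
A = rng.integers(0, 6, size=(6, 5)).astype(float)

# Compact SVD: U is 6 x 5, s has 5 values in descending order, Vt is 5 x 5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(k):
    """Rebuild A from its top-k singular values and vectors."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm error for each k; it equals the root of the
# sum of squares of the discarded singular values.
errors = [np.linalg.norm(A - rank_k_approximation(k)) for k in range(1, len(s) + 1)]
for k, err in enumerate(errors, start=1):
    print(k, err, np.sqrt(np.sum(s[k:] ** 2)))
```

The error shrinks as k grows and reaches zero (up to rounding) once all singular values are kept, which is the trade-off behind choosing k.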


In [64]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Compute a rank-k truncated SVD of the training matrix; k is the number of latent factors kept.
u, s, vt = svds(train_data_matrix, k = 20)
s_diag_matrix=np.diag(s)
# Rebuild the rank-k approximation and score it on the held-out ratings.
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.716478636984842
