# Content-Based RS

In this step, we will implement the Rocchio model, a simple content-based recommender system (RS).
To do so, you need to:

- Read the train file extracted from the dataset
- Read the 0-1 (binary) file describing the movies' features
- Create a sparse matrix for each of them
- Implement the Rocchio model and save the recommended items

In [None]:
# import libs
import operator
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.sparse import csr_matrix
from collections import OrderedDict
from sklearn.metrics.pairwise import cosine_similarity

# display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

plt.rcParams.update({'font.size': 14})

## Reading train and items' features files

You can read these files however you prefer. I suggest reading them with pandas and creating the sparse matrices afterwards.
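
As a minimal sketch of this step, assuming comma-separated files with a header row: the in-memory sample data, the variable names, and the columns `movieId`, `rating`, `Action`, and `Comedy` are hypothetical stand-ins, so adapt the paths, separator, and column names to the actual files.

```python
import io
import pandas as pd

# Hypothetical stand-in for the train file; the real path, separator,
# and column names depend on how the dataset was extracted.
train_csv = io.StringIO(
    "userId,movieId,rating\n"
    "1,10,4\n"
    "1,12,5\n"
    "2,10,3\n"
)

# Hypothetical stand-in for the 0-1 features file: one row per movie,
# one binary column per feature (the feature names here are made up).
features_csv = io.StringIO(
    "MovieId,Action,Comedy\n"
    "10,1,0\n"
    "12,0,1\n"
)

sample_train = pd.read_csv(train_csv)
sample_features = pd.read_csv(features_csv)

print(sample_train.head())
print(sample_features.head())
```

With real files, the `io.StringIO` objects would be replaced by file paths passed to `pd.read_csv`.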

In [None]:
df_train = 
df_test = 
df_features = 

df_train.head()
df_features.head()

## Creating Sparse Matrix

I suggest using csr_matrix from scipy.sparse.
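
The construction used in the cells of this section can be tried on a small scale first. The sketch below builds a `csr_matrix` from (value, (row, column)) triplets; the `toy_*` arrays are hypothetical data, not part of the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical rating triplets: user index, item index, rating value.
toy_users = np.array([0, 0, 1, 2])
toy_items = np.array([1, 3, 0, 3])
toy_ratings = np.array([4, 5, 3, 2])

# The shape is (max index + 1) on each axis because indices start at 0.
toy_shape = (int(toy_users.max()) + 1, int(toy_items.max()) + 1)

# Each rating is placed at position (user, item); all other cells stay empty.
toy_matrix = csr_matrix((toy_ratings, (toy_users, toy_items)), shape=toy_shape)

print(toy_matrix.shape)
print(toy_matrix.toarray())
```

Only the four listed ratings are stored, which is what makes this representation suitable for large, mostly empty rating matrices.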

In [None]:
# Select the user, item, and rating columns (i.e., every value in each column)
users = 
items = 
ratings = 

In [None]:
# Define the matrix dimensions based on the max index related to users and items
nb_users = max(users)
nb_items = max(items)

In [None]:
# Creating matrix of ratings
ratings_matrix = csr_matrix((ratings, (users, items)), shape=(nb_users+1, nb_items+1))

ratings_matrix.shape

In [None]:
# Select items and features
nb_items = max(df_features['MovieId'])
nb_features = len(df_features.columns) - 1

In [None]:
# Create an all-zero matrix of item features
features_matrix = csr_matrix((nb_items+1, nb_features+1))

features_matrix.shape

In [None]:
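# Fill the features matrix from the 0-1 file (one row per item)
with open('../Dataset/ML-1M/features-items.txt', 'r') as f:
    # ignore header
    f.readline()

    for row in f:
        values = row.rstrip().split(',')
        # the first column is the item id, the remaining ones are its features
        itemId = int(values.pop(0))
        for featureId, v in enumerate(values):
            # store only the non-zero entries to keep the matrix sparse
            if int(v) != 0:
                features_matrix[itemId, featureId] = int(v)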

In [None]:
features_matrix.shape

## A useful function

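This function saves the recommendations to a file, writing one line per target user: the user id, a tab, and the bracketed list of recommended items, each followed by a placeholder score of 0.0.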

In [None]:
def dumpRecommendation(recommendation, users_targets, file_name):

    with open(file_name, 'w') as file_out:
        # for each target user
        for userId in users_targets:
            # items in recommendation order, each with a placeholder score of 0.0
            issuedItems = ",".join(str(itemId) + ":" + str(0.0) for itemId in recommendation[userId])
            # write the line in the expected format; an empty list yields "[]"
            file_out.write(str(userId) + "\t" + "[" + issuedItems + "]" + "\n")

## Rocchio Recommendation

In the Rocchio model, the prediction is based on the similarity (e.g., cosine) between items and users:

- Each item is a vector of features _(similar to features-matrix)_
- Each user is represented by a combination of the items he/she has consumed:
![image.png](attachment:image.png)
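
A minimal sketch of this user representation, assuming (from the comments in the cells of this section, not from the figure) that a user vector is the rating-weighted sum of the feature vectors of the items the user rated, divided by the number of rated items. The matrices `R` and `F` are hypothetical toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings: 2 users x 3 items (0 means "not rated").
R = csr_matrix(np.array([[5, 0, 3],
                         [0, 4, 0]], dtype=float))

# Hypothetical binary item features: 3 items x 2 features.
F = csr_matrix(np.array([[1, 0],
                         [0, 1],
                         [1, 1]], dtype=float))

# Rating-weighted sum of item features for each user (users x features).
weighted_sum = (R @ F).toarray()

# Number of rated items per user; clamp to 1 to avoid dividing by zero
# for users with an empty history.
history_size = np.maximum(R.getnnz(axis=1), 1)

# Assumed normalization: divide each user row by the size of the history.
profiles = weighted_sum / history_size[:, None]

print(profiles)
```

The first user rated two items, so the summed features are halved; the second user rated one item, so the profile is that item's features scaled by the rating.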

### Representing users by features

In [None]:
# Create an all-zero matrix of user features
users_matrix = 

users_matrix.shape

In [None]:
# Matrix multiplication of ratings and features
aux = 
aux.shape

In [None]:
# Normalize each user vector by the size of the user's history
for u in range(ratings_matrix.shape[0]):
    # count the user's non-zero (rated) items
    nb_nonzero = 
    # scale the user's row accordingly
    if (nb_nonzero != 0):
        users_matrix[u,:] = 

users_matrix.shape

### Recommending items

Recommendations are ranked by the cosine similarity between the user vectors and the item vectors.
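
To make the scoring concrete, here is a small NumPy-only sketch of a users-by-items cosine similarity matrix. The function `cosine_scores` and the toy arrays are hypothetical; in this notebook, `cosine_similarity` from scikit-learn (imported at the top) is expected to give the same values for non-zero vectors.

```python
import numpy as np

def cosine_scores(user_vectors, item_vectors):
    """Return a (users x items) matrix of cosine similarities."""
    user_norms = np.linalg.norm(user_vectors, axis=1, keepdims=True)
    item_norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    # Replace zero norms by 1 so that all-zero vectors get similarity 0
    # instead of triggering a division by zero.
    user_norms[user_norms == 0] = 1.0
    item_norms[item_norms == 0] = 1.0
    return (user_vectors / user_norms) @ (item_vectors / item_norms).T

# Hypothetical user profiles (3 users x 2 features); the last user has no history.
toy_profiles = np.array([[1.0, 0.0],
                         [1.0, 1.0],
                         [0.0, 0.0]])

# Hypothetical item features (3 items x 2 features).
toy_item_features = np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [1.0, 1.0]])

scores = cosine_scores(toy_profiles, toy_item_features)
print(scores.round(3))
```

Row `u` of the result holds the score of every item for user `u`, which is the layout the ranking step relies on when it reads `prediction_matrix[u,:]`.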

In [None]:
features_matrix.shape

In [None]:
# cosine similarity between each user and each item
prediction_matrix = 
prediction_matrix.shape

In [None]:
# Size of each recommendation
top_k = 10

In [None]:
# Build the recommendations from items that have not been rated by the user
recommendation = {}

for u in range(ratings_matrix.shape[0]):
    recommendation[u] = []
    cont = 0
    # sorting items by relevance
    order = np.argsort(prediction_matrix[u,:])[::-1]
    # recommending the best items that the user has never seen
    for i in order:
        # recommending the top-k items 
        if (cont < top_k):
            if ( ):
                recommendation[u].append(i)
                cont += 1
        else:
            break
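
As an alternative to the explicit loop, the same selection can be sketched with a mask. This is only one possible approach, assuming a dense score matrix and a boolean matrix marking already-rated items; `top_k_unseen` and the toy arrays are hypothetical names and data.

```python
import numpy as np

def top_k_unseen(scores, seen_mask, k):
    """Return, per user, up to k unseen item indices sorted by descending score."""
    # Push already-seen items to the end of the ranking.
    masked = np.where(seen_mask, -np.inf, scores)
    # Sort by descending score; a stable sort keeps ties in index order.
    order = np.argsort(-masked, axis=1, kind="stable")
    result = []
    for u in range(scores.shape[0]):
        # Never return more items than the user has left to discover.
        n_unseen = int((~seen_mask[u]).sum())
        result.append(order[u, :min(k, n_unseen)].tolist())
    return result

# Hypothetical scores (2 users x 4 items) and rating history.
toy_scores = np.array([[0.9, 0.1, 0.5, 0.7],
                       [0.2, 0.8, 0.6, 0.4]])
toy_seen = np.array([[True, False, False, False],
                     [False, True, True, False]])

toy_recs = top_k_unseen(toy_scores, toy_seen, k=2)
print(toy_recs)
```

In the notebook, the mask could plausibly be derived from the non-zero entries of `ratings_matrix`, at the cost of materializing a dense array.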

In [None]:
# Save in a file
users_targets = df_test['userId'].unique()
dumpRecommendation(recommendation, users_targets, "recList_Rocchio.txt")

In [None]:
recommendation[300]
recommendation[3000]
recommendation[6010]

In [None]:
## Toy example
A = csr_matrix([[1,0,4,0,5,0,2,0,0,5], 
                [0,2,0,3,0,3,5,0,1,0], 
                [4,0,0,0,4,0,3,0,0,2], 
                [0,2,1,4,0,0,1,3,0,3], 
                [3,0,4,0,3,5,0,0,4,0]])

B = csr_matrix([[1,1,0,0], [0,1,0,1], [1,0,1,0], [0,1,1,1], [1,0,0,1], 
                [1,1,0,0], [0,1,1,1], [1,0,0,0], [1,0,1,1], [1,0,1,0]])

nb_users = 5
nb_items = 10
nb_features = 4