**Basic ItemKNN recommender**

In [1]:
%load_ext Cython

In [2]:
import os
from typing import Tuple,Callable,Dict,Optional,List

import numpy as np
import pandas as pd
import scipy.sparse as sp
import time
import random
#from scipy.sparse import csr_matrix

#----Recommenders----
from SLIM.SLIM_BPR_Cython import SLIM_BPR_Cython
from SLIM.SlimElasticNet import SLIMElasticNetRecommender
from cf.item_cf import ItemBasedCollaborativeFiltering
from cf.user_cf import UserBasedCollaborativeFiltering
#---------------

from sklearn.model_selection import train_test_split

**Dataset loading with pandas**

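The function read_csv from pandas provides a fast interface to load tabular data like this. The file data_train.csv is comma-separated and has a header row, so the default arguments are enough here. For files without a header you would instead pass the separator (for example ::), the column names ["user_id", "item_id", "ratings", "timestamp"], and the type of each attribute through the dtype parameter. Right after loading, the columns row, col, and data are renamed to user_id, item_id, and ratings.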
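As a point of comparison, the sketch below shows what a read_csv call with an explicit separator, column names, and dtypes might look like. The in-memory data is made up for illustration and is not part of this notebook's dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for a "::"-separated ratings file without a header row.
raw = io.StringIO(
    "1::10::5.0::978300760\n"
    "1::20::3.0::978302109\n"
    "2::10::4.0::978301968\n"
)

toy_ratings = pd.read_csv(
    raw,
    sep="::",
    names=["user_id", "item_id", "ratings", "timestamp"],
    dtype={"user_id": np.int64, "item_id": np.int64,
           "ratings": np.float64, "timestamp": np.int64},
    engine="python",  # multi-character separators need the Python parser
)

print(toy_ratings.dtypes)
```

Declaring the dtypes up front avoids a later type-inference pass and makes wrong column types fail early.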

In [3]:
def load_data():
  return pd.read_csv("./data_train.csv")

In [4]:
ratings=load_data()
d ={'user_id': ratings['row'],'item_id':ratings['col'],'ratings':ratings['data']}
ratings=pd.DataFrame(data=d)

In [5]:
ratings.dtypes

user_id      int64
item_id      int64
ratings    float64
dtype: object

In [6]:
userList=list(d['user_id'])
itemList=list(d['item_id'])
ratingList=list(d['ratings'])

In [7]:
URM_all = sp.coo_matrix((ratingList,(userList,itemList)))
URM_all = URM_all.tocsr()

In [8]:
URM_all

<7947x25975 sparse matrix of type '<class 'numpy.float64'>'
	with 113268 stored elements in Compressed Sparse Row format>

In [9]:
def load_data_ICM():
  return pd.read_csv("./data_ICM_title_abstract.csv")

In [10]:
features=load_data_ICM()
d ={'item_id': features['row'],'feature_id':features['col'],'value':features['data']}
features=pd.DataFrame(data=d)
itemList=list(d['item_id'])

In [11]:
featureList=list(d['feature_id'])
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(featureList)

featureList = le.transform(featureList)

In [12]:
valueList=list(d['value'])
ICM_all = sp.coo_matrix((valueList,(itemList,featureList)))
ICM_all = ICM_all.tocsr()

In [13]:
ICM_all

<25975x19998 sparse matrix of type '<class 'numpy.float64'>'
	with 490691 stored elements in Compressed Sparse Row format>

This section works with the previously loaded ratings dataset and extracts the number of users, the number of items, and the min/max user/item identifiers. Exploring and understanding the data is an essential step prior to fitting any recommender or algorithm.

In this specific case, we discover that item identifiers range from 0 to 25974, yet there are only 24896 distinct items. To ease further calculations, we create new contiguous user/item identifiers and assign each user/item exactly one of them. To keep track of these mappings, we add them to the original dataframe using the pd.merge function.
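The remapping idea can be checked on a toy frame. The sketch below uses pd.factorize, an alternative to the merge-based approach used in this notebook, and the toy identifiers are chosen only for illustration.

```python
import pandas as pd

# Toy interactions with non-contiguous identifiers.
toy = pd.DataFrame({
    "user_id": [0, 4, 4, 9],
    "item_id": [10080, 19467, 10080, 2665],
})

# pd.factorize assigns codes 0..n-1 in order of first appearance and
# also returns the original values, indexed by their new code.
toy["mapped_user_id"], original_users = pd.factorize(toy["user_id"])
toy["mapped_item_id"], original_items = pd.factorize(toy["item_id"])

print(toy)
```

Here original_items[code] recovers the original identifier for a mapped code, which is the lookup needed again when preparing a submission.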

In [14]:
def preprocess_data(ratings: pd.DataFrame):
    unique_users = ratings.user_id.unique()
    unique_items = ratings.item_id.unique()
    
    num_users, min_user_id, max_user_id = unique_users.size, unique_users.min(), unique_users.max()
    num_items, min_item_id, max_item_id = unique_items.size, unique_items.min(), unique_items.max()
    
    print(num_users, min_user_id, max_user_id)
    print(num_items, min_item_id, max_item_id)
    
    mapping_user_id = pd.DataFrame({"mapped_user_id": np.arange(num_users), "user_id": unique_users})
    mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(num_items), "item_id": unique_items})
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_user_id,
                       how="inner",
                       on="user_id")
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_item_id,
                       how="inner",
                       on="item_id")
    
    return ratings



In [15]:
ratings=preprocess_data(ratings)

7947 0 7946
24896 0 25974


In [16]:
ratings

Unnamed: 0,user_id,item_id,ratings,mapped_user_id,mapped_item_id
0,0,10080,1.0,0,0
1,4342,10080,1.0,4342,0
2,5526,10080,1.0,5526,0
3,5923,10080,1.0,5923,0
4,0,19467,1.0,0,1
5,149,19467,1.0,149,1
6,4072,19467,1.0,4072,1
7,6193,19467,1.0,6193,1
8,7105,19467,1.0,7105,1
9,1,2665,1.0,1,2


In [17]:
unique_users = ratings.mapped_user_id.unique()
unique_items = ratings.mapped_item_id.unique()
num_users=unique_users.size
num_items=unique_items.size

In [18]:
num_users

7947

In [19]:
num_items

24896

**Dataset splitting into train,validation and test**

This is the last step before creating the recommender. It is an important one, as it is the basis for training, hyperparameter optimization, and evaluation of the recommender(s).

Here we take the ratings (loaded and preprocessed above) and create the train, validation, and test User-Rating Matrices (URM). These must be disjoint to avoid information leakage from the train set into the validation/test sets. In our case it is safe to use the train_test_split function from scikit-learn, as the dataset contains only one data point for every (user, item) pair. We first create the test set and then create the validation set by splitting the train set again.

train_test_split takes an array (or several arrays) and divides it into train and test according to a given size (in our case testing_percentage and validation_percentage, which need to be a float between 0 and 1).

After we have our different splits, we create the sparse URMs by using the csr_matrix function from scipy.
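The disjointness claim can be verified on a toy example. This is a minimal sketch with made-up interactions, assuming each (user, item) pair appears only once, as in the dataset above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import train_test_split

# Eight toy interactions, each (user, item) pair appears exactly once.
users = np.array([0, 0, 1, 1, 2, 2, 3, 3])
items = np.array([0, 1, 1, 2, 0, 3, 2, 3])
values = np.ones(8)

# Passing several arrays keeps them aligned: the same rows go to train/test.
u_tr, u_te, i_tr, i_te, v_tr, v_te = train_test_split(
    users, items, values, test_size=0.25, shuffle=True, random_state=1234
)

shape = (4, 4)
toy_train = sp.csr_matrix((v_tr, (u_tr, i_tr)), shape=shape)
toy_test = sp.csr_matrix((v_te, (u_te, i_te)), shape=shape)

# Element-wise product is non-zero only where both matrices hold an entry.
overlap = toy_train.multiply(toy_test).nnz
print(toy_train.nnz, toy_test.nnz, overlap)
```

An overlap of zero confirms that no interaction leaks from the train matrix into the test matrix.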




In [20]:
def dataset_splits(ratings, num_users, num_items, validation_percentage: float, testing_percentage: float):
    seed = 1234
    
    (user_ids_training, user_ids_test,
     item_ids_training, item_ids_test,
     ratings_training, ratings_test) = train_test_split(ratings.mapped_user_id,
                                                        ratings.mapped_item_id,
                                                        ratings.ratings,
                                                        test_size=testing_percentage,
                                                        shuffle=True,
                                                        random_state=seed)
    
    (user_ids_training, user_ids_validation,
     item_ids_training, item_ids_validation,
     ratings_training, ratings_validation) = train_test_split(user_ids_training,
                                                              item_ids_training,
                                                              ratings_training,
                                                              test_size=validation_percentage,
                                                              shuffle=True,
                                                              # same seed as above so the validation split is reproducible
                                                              random_state=seed)
    
    urm_train = sp.csr_matrix((ratings_training, (user_ids_training, item_ids_training)), 
                              shape=(num_users, num_items))
    
    
    urm_validation = sp.csr_matrix((ratings_validation, (user_ids_validation, item_ids_validation)), 
                              shape=(num_users, num_items))
    
    urm_test = sp.csr_matrix((ratings_test, (user_ids_test, item_ids_test)), 
                              shape=(num_users, num_items))
    

    
    return urm_train, urm_validation, urm_test



In [21]:
urm_train, urm_validation, urm_test = dataset_splits(ratings, 
                                                     num_users, 
                                                     num_items, 
                                                     validation_percentage=0.10, 
                                                     testing_percentage=0.15)

In [22]:
urm_train_validation = urm_train + urm_validation

**Evaluation Metrics**

In this practice session we will be using the same evaluation metrics defined in the Practice session 2, i.e., precision, recall and mean average precision (MAP).

In [23]:
def recall(recommendations: np.array, relevant_items: np.array) -> float:
    is_relevant = np.in1d(recommendations, relevant_items, assume_unique=True)
    
    recall_score = np.sum(is_relevant) / relevant_items.shape[0]
    
    return recall_score
    
    
def precision(recommendations: np.array, relevant_items: np.array) -> float:
    is_relevant = np.in1d(recommendations, relevant_items, assume_unique=True)
    
    precision_score = np.sum(is_relevant) / recommendations.shape[0]

    return precision_score

def mean_average_precision(recommendations: np.array, relevant_items: np.array) -> float:
    is_relevant = np.in1d(recommendations, relevant_items, assume_unique=True)
    
    precision_at_k = is_relevant * np.cumsum(is_relevant, dtype=np.float32) / (1 + np.arange(is_relevant.shape[0]))

    map_score = np.sum(precision_at_k) / np.min([relevant_items.shape[0], is_relevant.shape[0]])

    return map_score
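A small hand-checkable example helps confirm what the three metrics compute. The sketch below re-implements the same formulas inline on made-up arrays, using np.isin in place of np.in1d.

```python
import numpy as np

recommended = np.array([3, 7, 1, 9])   # ranked list, best first
relevant = np.array([7, 9, 5])         # items the user interacted with in the test set

hits = np.isin(recommended, relevant)  # [False, True, False, True]

toy_precision = hits.sum() / recommended.size   # 2 hits out of 4 recommendations
toy_recall = hits.sum() / relevant.size         # 2 hits out of 3 relevant items

# Precision at each rank, counted only at the ranks that are hits: [0, 1/2, 0, 2/4]
precision_at_k = hits * np.cumsum(hits) / (1 + np.arange(hits.size))
toy_ap = precision_at_k.sum() / min(relevant.size, hits.size)

print(toy_precision, toy_recall, toy_ap)
```

Dividing by the smaller of the two sizes means a list shorter than the number of relevant items can still reach an average precision of 1.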

**Evaluation Procedure**

The evaluation procedure returns the averaged accuracy scores (in terms of precision, recall and MAP) for all users (that have at least 1 rating in the test set). It also calculates the number of evaluated and skipped users. It receives a recommender instance, and the train and test URMs.
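The evaluator reads each user's relevant items directly from the CSR arrays of the test URM. The sketch below shows that lookup on a made-up matrix; the helper name relevant_items_of is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

# Toy test URM: user 0 has items 1 and 3, user 1 has none, user 2 has item 0.
toy_test = sp.csr_matrix(np.array([[0, 1, 0, 1],
                                   [0, 0, 0, 0],
                                   [1, 0, 0, 0]], dtype=np.float64))

def relevant_items_of(urm: sp.csr_matrix, user_id: int) -> np.ndarray:
    # indptr[u]:indptr[u + 1] delimits the slice of `indices` that belongs to row u.
    start, end = urm.indptr[user_id], urm.indptr[user_id + 1]
    return urm.indices[start:end]

for user_id in range(toy_test.shape[0]):
    print(user_id, relevant_items_of(toy_test, user_id))
```

Users whose slice is empty, like user 1 here, are the ones the evaluator counts as skipped.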

In [24]:
def evaluator(recommender: object, urm_train: sp.csr_matrix, urm_test: sp.csr_matrix):
    recommendation_length = 10
    accum_precision = 0
    accum_recall = 0
    accum_map = 0
    
    num_users = urm_train.shape[0]
    
    num_users_evaluated = 0
    num_users_skipped = 0
    for user_id in range(num_users):
        user_profile_start = urm_test.indptr[user_id]
        user_profile_end = urm_test.indptr[user_id+1]
        
        relevant_items = urm_test.indices[user_profile_start:user_profile_end]
        
        if relevant_items.size == 0:
            num_users_skipped += 1
            continue
            
        recommendations = recommender.recommend(user_id=user_id,
                                               urm_train = urm_train,
                                               at=recommendation_length
                                               #remove_seen=True)
                                               )
        
        accum_precision += precision(recommendations, relevant_items)
        accum_recall += recall(recommendations, relevant_items)
        accum_map += mean_average_precision(recommendations, relevant_items)
        
        num_users_evaluated += 1
        
    
    accum_precision /= max(num_users_evaluated, 1)
    accum_recall /= max(num_users_evaluated, 1)
    accum_map /=  max(num_users_evaluated, 1)
    
    return accum_precision, accum_recall, accum_map, num_users_evaluated, num_users_skipped

**Hybrid**

In [25]:
class HybridRecommender(object):
    def __init__(self, w, user_cf_param, item_cf_param, slim_param):

        self.w=w
        ###### USER CF #####
        self.userCF = UserBasedCollaborativeFiltering(knn=user_cf_param["knn"], shrink=user_cf_param["shrink"])
    
        ###### ITEM_CF #####
        self.itemCF = ItemBasedCollaborativeFiltering(knn=item_cf_param["knn"], shrink=item_cf_param["shrink"])
        
        self.slim_random = SLIM_BPR_Cython(epochs=slim_param["epochs"], topK=slim_param["topK"])
        
        #self.slim_elastic = SLIMElasticNetRecommender()
    

    def fit(self, URM):
        self.URM = URM

        ### SUB-FITTING ###
        print("Fitting user cf...")
        self.userCF.fit(URM.copy())

        print("Fitting item cf...")
        self.itemCF.fit(URM.copy())
        
        print("Fitting slim bpr...")
        self.slim_random.fit(URM.copy())
        
        # SLIM ElasticNet is currently disabled
        #print("Fitting slim elastic...")
        #self.slim_elastic.fit(URM.copy())


    def recommend(self,user_id,urm_train: sp.csr_matrix,at=10):


        ### DUE TO TIME CONSTRAINTS THE CODE STRUCTURE HERE IS REDUNDANT
        ### TODO: exploit inheritance to reduce code duplication: extract the ratings and combine them by iterating over a list of recommenders
        self.userCF_ratings = self.userCF.get_expected_ratings(user_id)
        self.itemCF_ratings = self.itemCF.get_expected_ratings(user_id)
        self.slim_ratings = self.slim_random.get_expected_ratings(user_id)
        #self.slim_elastic_ratings = self.slim_elastic.get_expected_ratings(user_id)

        ### COMMON CODE ###
        self.hybrid_ratings = None #BE CAREFUL, MAGIC INSIDE :)

        # Use the weights stored on the instance instead of the global variable w
        self.hybrid_ratings = self.userCF_ratings * self.w["user_cf"]
        self.hybrid_ratings += self.itemCF_ratings * self.w["item_cf"]
        #self.hybrid_ratings += self.cbf_ratings * w_right["cbf"]
        self.hybrid_ratings += self.slim_ratings * self.w["slim"]
        #self.hybrid_ratings += self.ALS_ratings * w_right["als"]
        #self.hybrid_ratings += self.slim_elastic_ratings * self.w["elastic"]

        recommended_items = np.flip(np.argsort(self.hybrid_ratings), 0)

        # REMOVING SEEN
        unseen_items_mask = np.in1d(recommended_items,urm_train[user_id].indices,
                                    assume_unique=True, invert=True)
        recommended_items = recommended_items[unseen_items_mask]

        return recommended_items[0:at]



################################ PARAMETERS #################################
#################### Values for best submissions, MAP@10 on Kaggle = 0.1 ###########################

w = {
    "user_cf": 0.03,
    "item_cf": 0.35,
    "cbf": 0.15,
    "icm_svd": 0,
    "als": 0.3,
    "slim": 0.6,
    "elastic": 1.5
}

user_cf_param = {
    "knn": 140,
    "shrink": 0
}

item_cf_param = {
    "knn": 310,
    "shrink": 0
}

slim_param = {
    "epochs": 40,
    "topK": 200
}
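The weighted combination performed in recommend can be traced on a toy example. The score vectors below are made up, and the weights reuse the user_cf, item_cf, and slim values from the parameters above.

```python
import numpy as np

# Hypothetical per-item scores from three recommenders for one user (4 items).
scores = {
    "user_cf": np.array([0.9, 0.1, 0.4, 0.0]),
    "item_cf": np.array([0.2, 0.8, 0.5, 0.1]),
    "slim":    np.array([0.1, 0.3, 0.9, 0.2]),
}
toy_w = {"user_cf": 0.03, "item_cf": 0.35, "slim": 0.6}

# Weighted sum of the score vectors, one weight per recommender.
hybrid = sum(toy_w[name] * s for name, s in scores.items())

# Rank items by descending hybrid score, then drop the ones already seen.
ranking = np.argsort(-hybrid)
seen = np.array([2])
ranking = ranking[~np.isin(ranking, seen)]

top2 = ranking[:2]
print(hybrid, top2)
```

Item 2 has the highest hybrid score but is filtered out because the user has already interacted with it. Note that the weighted sum only makes sense if the recommenders' scores are on comparable scales.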



In [26]:
recommender = HybridRecommender(w=w,user_cf_param=user_cf_param,item_cf_param=item_cf_param,slim_param=slim_param)

In [27]:
recommender.fit(urm_train)

Fitting user cf...
Similarity column 7947 ( 100 % ), 3723.71 column/sec, elapsed time 0.04 min
Fitting item cf...
Similarity column 24896 ( 100 % ), 2834.04 column/sec, elapsed time 0.15 min
Fitting slim bpr...
SLIM_BPR_Cython: Estimated memory required for similarity matrix of 24896 items is 2479.24 MB


ValueError: Buffer dtype mismatch, expected 'long' but got 'long long'

In [None]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped = evaluator(recommender, 
                                                                                            urm_train_validation, 
                                                                                            urm_test)

In [None]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped

**Submission to competition**

This step mirrors what you will do when preparing a submission to the competition, especially after you have chosen and trained your recommender.

For this step, the best suggestion is to select the best-performing configuration obtained in the hyperparameter tuning step and to train the recommender using both the train and validation sets. Remember that in the competition you do not have access to the test set.

Another consideration is that, for easier and faster calculations, we replaced the user/item identifiers with new ones in the preprocessing step. For the competition, you are required to generate recommendations using the dataset's original identifiers. For this reason, this step also maps the new identifiers back to the ones originally found in the dataset.

Last, this step creates a function that writes the recommendations for each user to a single CSV file, one row per user, in the following format:

<user_id>,<item_id_1> <item_id_2> <item_id_3> <item_id_4> <item_id_5> <item_id_6> <item_id_7> <item_id_8> <item_id_9> <item_id_10>

Always verify the competition's submission file format, as it might differ from the one presented here.
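The two ideas above, mapping identifiers back and writing one row per user, can be sketched as follows. The mapping and the recommendations are made up, and the output goes to an in-memory buffer instead of a file.

```python
import io

# Hypothetical mapping from contiguous item codes back to original identifiers.
mapped_to_original = {0: 10080, 1: 19467, 2: 2665}

# One (original user_id, recommended mapped item ids) pair.
toy_submission = [(42, [2, 0, 1])]

buffer = io.StringIO()
buffer.write("user_id,item_list\n")
for user_id, mapped_items in toy_submission:
    # Translate each recommended code back to the dataset's original identifier.
    original_items = [mapped_to_original[i] for i in mapped_items]
    buffer.write(f"{user_id},{' '.join(str(i) for i in original_items)}\n")

output = buffer.getvalue()
print(output)
```

The user identifier and the item list are separated by a comma, while the items themselves are separated by single spaces.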

In [None]:
def load_goodguys():
  return pd.read_csv("./data_target_users_test.csv")
goodguys=load_goodguys()

In [None]:
goodguys

In [None]:
users_to_recommend = np.random.choice(goodguys.user_id,size=goodguys.size, replace=False)
users_to_recommend

In [None]:
mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))

In [None]:
mapping_to_item_id

In [None]:
def prepare_submission(ratings: pd.DataFrame, users_to_recommend: np.array, urm_train: sp.csr_matrix, recommender: object):
    users_ids_and_mappings = ratings[ratings.user_id.isin(users_to_recommend)][["user_id", "mapped_user_id"]].drop_duplicates()
    items_ids_and_mappings = ratings[["item_id", "mapped_item_id"]].drop_duplicates()
    
    mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))
    
    
    recommendation_length = 10
    submission = []
    for idx, row in users_ids_and_mappings.iterrows():
        user_id = row.user_id
        mapped_user_id = row.mapped_user_id
        
        recommendations = recommender.recommend(user_id=mapped_user_id,
                                                urm_train=urm_train,
                                                at=recommendation_length,
                                                #remove_seen=True)
                                               )
        
        submission.append((user_id, [mapping_to_item_id[item_id] for item_id in recommendations]))
        
    return submission

In [None]:
submission = prepare_submission(ratings, users_to_recommend, urm_train_validation, recommender)

In [None]:
submission

In [None]:
import os
from datetime import datetime

csv_fname = './submission'
csv_fname += datetime.now().strftime('%b%d_%H-%M-%S') + '.csv'

def write_submission(submissions):
    with open(csv_fname, "w") as f:
        f.write(f"user_id,item_list\n")
        for user_id, items in submissions:
            f.write(f"{user_id},{' '.join([str(item) for item in items])}\n")


In [None]:
write_submission(submission)

In this lecture we saw the simplest version of Cosine Similarity, which only includes a shrink factor. There are several optimizations that can be applied to it.

Implement TopK Neighbors
When calculating the cosine similarity we used urm.T.dot(urm) to calculate the numerator. However, depending on the dataset and the number of items, this matrix might not fit in memory. Implement a block version that is faster than our vector version but does not compute the full urm.T.dot(urm) beforehand.

Implement Adjusted Cosine 

Implement Dice Similarity

Implement an implicit CF ItemKNN.

Implement a CF UserKNN model
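As a starting point for the TopK Neighbors exercise above, one possible shape of a block-wise computation is sketched below. It is not the implementation used by the recommenders in this notebook: the function name, the parameters, and the small constant added to the denominator are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp

def blockwise_topk_cosine(urm, top_k=2, shrink=0.0, block_size=2):
    """Item-item cosine similarity, computed one block of columns at a time."""
    urm = sp.csc_matrix(urm, dtype=np.float64)
    n_items = urm.shape[1]
    # L2 norm of every item column.
    norms = np.sqrt(np.asarray(urm.multiply(urm).sum(axis=0)).ravel())

    rows, cols, vals = [], [], []
    for start in range(0, n_items, block_size):
        end = min(start + block_size, n_items)
        # Numerator for this block only: an (n_items x block) dense slab.
        numerator = urm.T.dot(urm[:, start:end]).toarray()
        denominator = np.outer(norms, norms[start:end]) + shrink + 1e-6
        sim = numerator / denominator
        # An item should not be its own neighbor.
        sim[np.arange(start, end), np.arange(end - start)] = 0.0

        k = min(top_k, n_items)
        for j in range(end - start):
            column = sim[:, j]
            # Indices of the k largest similarities (unordered).
            top = np.argpartition(-column, k - 1)[:k]
            top = top[column[top] > 0]  # keep only positive similarities
            rows.extend(top)
            cols.extend([start + j] * len(top))
            vals.extend(column[top])

    return sp.csr_matrix((vals, (rows, cols)), shape=(n_items, n_items))

# Toy URM: items 0 and 1 are rated by the same users, item 2 by a different one.
toy_urm = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 1]])
W = blockwise_topk_cosine(toy_urm, top_k=1)
print(W.toarray())
```

Only one dense slab of size n_items x block_size is held in memory at a time, and only the top_k entries per column are kept, so the full similarity matrix is never materialized.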