**Basic ItemKNN recommender**

In [2]:
%load_ext Cython

The Cython extension is already loaded. To reload it, use:
  %reload_ext Cython


In [3]:
import os
from typing import Tuple,Callable,Dict,Optional,List

import numpy as np
import pandas as pd
import scipy.sparse as sp
import time
import random
from scipy.sparse import csr_matrix


from sklearn.model_selection import train_test_split

**Dataset loading with pandas**

The function read_csv from pandas provides a fast interface to load tabular data like this. The file is a comma-separated CSV with a header row and the columns row, col, and data, so the default arguments are enough. After loading, we rename those columns to user_id, item_id, and ratings.

In [4]:
def load_data():
  return pd.read_csv("./data_train.csv")

In [5]:
ratings=load_data()
d ={'user_id': ratings['row'],'item_id':ratings['col'],'ratings':ratings['data']}
ratings=pd.DataFrame(data=d)

In [6]:
ratings.dtypes

user_id      int64
item_id      int64
ratings    float64
dtype: object

In [7]:
userList=list(d['user_id'])
itemList=list(d['item_id'])
ratingList=list(d['ratings'])

In [8]:
URM_all = sp.coo_matrix((ratingList,(userList,itemList)))
URM_all = URM_all.tocsr()

In [9]:
URM_all

<7947x25975 sparse matrix of type '<class 'numpy.float64'>'
	with 113268 stored elements in Compressed Sparse Row format>
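
As a small illustration of the construction above, here is a sketch that builds a CSR matrix from (user, item, rating) triplets. The toy arrays are made up for the example. It also shows why URM_all has 25975 columns: when no shape is given, the matrix is sized from the largest identifier plus one.

```python
import numpy as np
import scipy.sparse as sp

# Toy interactions as (user, item, rating) triplets; the values are illustrative only.
users = np.array([0, 0, 1, 2])
items = np.array([1, 3, 0, 3])
values = np.array([1.0, 1.0, 1.0, 1.0])

# Without an explicit shape, the matrix is sized as (max row + 1, max col + 1).
urm_toy = sp.coo_matrix((values, (users, items))).tocsr()

print(urm_toy.shape)        # (3, 4)
print(urm_toy.nnz)          # 4
print(urm_toy[0].indices)   # column indices of the items user 0 interacted with
```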

In [10]:
def load_data_ICM():
  return pd.read_csv("./data_ICM_title_abstract.csv")

In [11]:
features=load_data_ICM()
d ={'item_id': features['row'],'feature_id':features['col'],'value':features['data']}
features=pd.DataFrame(data=d)
itemList=list(d['item_id'])

In [12]:
featureList=list(d['feature_id'])
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(featureList)

featureList = le.transform(featureList)

In [13]:
valueList=list(d['value'])
ICM_all = sp.coo_matrix((valueList,(itemList,featureList)))
ICM_all = ICM_all.tocsr()

In [14]:
ICM_all

<25975x19998 sparse matrix of type '<class 'numpy.float64'>'
	with 490691 stored elements in Compressed Sparse Row format>

This section works with the previously loaded ratings dataset and extracts the number of users, the number of items, and the min/max user/item identifiers. Exploring and understanding the data is an essential step prior to fitting any recommender/algorithm.

In this specific case, we discover that item identifiers range from 0 to 25974, but there are only 24896 different items. To ease further calculations, we create new contiguous user/item identifiers and assign each user/item exactly one of them. To keep track of these new mappings, we add them to the original dataframe using the pd.merge function.

In [9]:
def preprocess_data(ratings: pd.DataFrame):
    unique_users = ratings.user_id.unique()
    unique_items = ratings.item_id.unique()
    
    num_users, min_user_id, max_user_id = unique_users.size, unique_users.min(), unique_users.max()
    num_items, min_item_id, max_item_id = unique_items.size, unique_items.min(), unique_items.max()
    
    print(num_users, min_user_id, max_user_id)
    print(num_items, min_item_id, max_item_id)
    
    mapping_user_id = pd.DataFrame({"mapped_user_id": np.arange(num_users), "user_id": unique_users})
    mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(num_items), "item_id": unique_items})
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_user_id,
                       how="inner",
                       on="user_id")
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_item_id,
                       how="inner",
                       on="item_id")
    
    return ratings



In [10]:
ratings=preprocess_data(ratings)

7947 0 7946
24896 0 25974
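
The remapping can be seen on a tiny example. The sketch below uses a made-up dataframe and follows the same unique + merge pattern as preprocess_data. As a cross-check it also computes the codes with pd.factorize, which assigns contiguous identifiers in order of first appearance.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with gaps in the item identifiers.
toy = pd.DataFrame({"user_id": [0, 4, 4, 9], "item_id": [10, 10, 42, 7]})

# Cross-check: pd.factorize assigns codes 0..n-1 in order of first appearance.
codes, uniques = pd.factorize(toy.item_id)
factorize_mapping = dict(zip(toy.item_id, codes))

# Same pattern as preprocess_data: one new contiguous id per distinct item.
unique_items = toy.item_id.unique()
mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(unique_items.size),
                                "item_id": unique_items})
toy = pd.merge(left=toy, right=mapping_item_id, how="inner", on="item_id")

merge_mapping = dict(zip(toy.item_id, toy.mapped_item_id))
print(merge_mapping)
```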


In [11]:
ratings

Unnamed: 0,user_id,item_id,ratings,mapped_user_id,mapped_item_id
0,0,10080,1.0,0,0
1,4342,10080,1.0,4342,0
2,5526,10080,1.0,5526,0
3,5923,10080,1.0,5923,0
4,0,19467,1.0,0,1
5,149,19467,1.0,149,1
6,4072,19467,1.0,4072,1
7,6193,19467,1.0,6193,1
8,7105,19467,1.0,7105,1
9,1,2665,1.0,1,2


In [12]:
unique_users = ratings.mapped_user_id.unique()
unique_items = ratings.mapped_item_id.unique()
num_users=unique_users.size
num_items=unique_items.size

In [13]:
num_users

7947

In [14]:
num_items

24896

**Dataset splitting into train,validation and test**

This is the last step before creating the recommender, and an important one: it is the basis for training, hyperparameter optimization, and evaluation of the recommender(s).

Here we take the ratings (which we loaded and preprocessed before) and create the train, validation, and test User-Rating Matrices (URM). These must be disjoint to avoid information leakage from the train set into the validation/test sets. In our case it is safe to use the train_test_split function from scikit-learn, as the dataset contains only one data point for every (user, item) pair. We first create the test set and then create the validation set by splitting the remaining train set again.

train_test_split takes an array (or several arrays) and divides it into train and test according to a given size (in our case testing_percentage and validation_percentage, which need to be a float between 0 and 1).

After we have our different splits, we create the sparse URMs by using the csr_matrix function from scipy.
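
The two-stage split can be sketched on toy arrays (made up for this example). Note that the second split is applied to what remains after the first, so with the values used in this notebook the validation set is 10% of the remaining 85%, i.e. about 8.5% of all interactions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy interactions: every (user, item) pair appears once.
users = np.arange(20) % 5
items = np.arange(20)
vals = np.ones(20)

# Stage 1: hold out the test set.
(u_tr, u_te, i_tr, i_te, r_tr, r_te) = train_test_split(
    users, items, vals, test_size=0.15, shuffle=True, random_state=1234)

# Stage 2: split the remaining training data again to obtain the validation set.
(u_tr, u_va, i_tr, i_va, r_tr, r_va) = train_test_split(
    u_tr, i_tr, r_tr, test_size=0.10, shuffle=True, random_state=1234)

# The three splits are disjoint and together cover all interactions.
train_pairs = set(zip(u_tr, i_tr))
valid_pairs = set(zip(u_va, i_va))
test_pairs = set(zip(u_te, i_te))
print(len(train_pairs), len(valid_pairs), len(test_pairs))
```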




In [15]:
def dataset_splits(ratings, num_users, num_items, validation_percentage: float, testing_percentage: float):
    seed = 1234
    
    (user_ids_training, user_ids_test,
     item_ids_training, item_ids_test,
     ratings_training, ratings_test) = train_test_split(ratings.mapped_user_id,
                                                        ratings.mapped_item_id,
                                                        ratings.ratings,
                                                        test_size=testing_percentage,
                                                        shuffle=True,
                                                        random_state=seed)
    
    (user_ids_training, user_ids_validation,
     item_ids_training, item_ids_validation,
     ratings_training, ratings_validation) = train_test_split(user_ids_training,
                                                              item_ids_training,
                                                              ratings_training,
                                                              test_size=validation_percentage,
                                                              shuffle=True,
                                                              random_state=seed)  # seeded so the validation split is reproducible too
    
    urm_train = sp.csr_matrix((ratings_training, (user_ids_training, item_ids_training)), 
                              shape=(num_users, num_items))
    
    
    urm_validation = sp.csr_matrix((ratings_validation, (user_ids_validation, item_ids_validation)), 
                              shape=(num_users, num_items))
    
    urm_test = sp.csr_matrix((ratings_test, (user_ids_test, item_ids_test)), 
                              shape=(num_users, num_items))
    

    
    return urm_train, urm_validation, urm_test



In [16]:
urm_train, urm_validation, urm_test = dataset_splits(ratings, 
                                                     num_users, 
                                                     num_items, 
                                                     validation_percentage=0.10, 
                                                     testing_percentage=0.15)

In [25]:
urm_train_validation = urm_train + urm_validation

**Evaluation Metrics**

In this practice session we will be using the same evaluation metrics defined in the Practice session 2, i.e., precision, recall and mean average precision (MAP).

In [17]:
def recall(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    # np.isin is the replacement for np.in1d, which is deprecated in recent NumPy releases
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # fraction of the user's relevant items that appear in the recommendations
    recall_score = np.sum(is_relevant) / relevant_items.shape[0]
    
    return recall_score
    
    
def precision(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # fraction of the recommended items that are relevant
    precision_score = np.sum(is_relevant) / recommendations.shape[0]

    return precision_score

def mean_average_precision(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    # average precision for a single user; the evaluator averages it over users to obtain MAP
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # precision at each rank k, counted only at the ranks where a relevant item appears
    precision_at_k = is_relevant * np.cumsum(is_relevant, dtype=np.float32) / (1 + np.arange(is_relevant.shape[0]))

    map_score = np.sum(precision_at_k) / np.min([relevant_items.shape[0], is_relevant.shape[0]])

    return map_score
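
To make the three formulas concrete, here is a small worked example with made-up item identifiers. It repeats the same arithmetic inline rather than calling the functions above, so it can be run on its own.

```python
import numpy as np

recommendations = np.array([3, 7, 1, 9, 5])   # ranked list, best first (toy values)
relevant_items = np.array([7, 5, 2])          # items the user has in the test set

is_relevant = np.isin(recommendations, relevant_items)   # [False, True, False, False, True]

precision_score = is_relevant.sum() / recommendations.shape[0]   # 2 hits / 5 recommended = 0.4
recall_score = is_relevant.sum() / relevant_items.shape[0]       # 2 hits / 3 relevant

# Precision at the ranks of the hits: 1/2 at rank 2 and 2/5 at rank 5.
precision_at_k = is_relevant * np.cumsum(is_relevant) / (1 + np.arange(is_relevant.shape[0]))
average_precision = precision_at_k.sum() / min(relevant_items.shape[0], is_relevant.shape[0])

print(precision_score, recall_score, average_precision)   # average precision is (0.5 + 0.4) / 3 = 0.3
```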

**Evaluation Procedure**

The evaluation procedure returns the averaged accuracy scores (in terms of precision, recall and MAP) for all users (that have at least 1 rating in the test set). It also calculates the number of evaluated and skipped users. It receives a recommender instance, and the train and test URMs.

In [18]:
def evaluator(recommender: object, urm_train: sp.csr_matrix, urm_test: sp.csr_matrix):
    recommendation_length = 10
    accum_precision = 0
    accum_recall = 0
    accum_map = 0
    
    num_users = urm_train.shape[0]
    
    num_users_evaluated = 0
    num_users_skipped = 0
    for user_id in range(num_users):
        user_profile_start = urm_test.indptr[user_id]
        user_profile_end = urm_test.indptr[user_id+1]
        
        relevant_items = urm_test.indices[user_profile_start:user_profile_end]
        
        if relevant_items.size == 0:
            num_users_skipped += 1
            continue
            
        recommendations = recommender.recommend(user_id=user_id,
                                               urm_train = urm_train,
                                               at=recommendation_length, 
                                               remove_seen=True)
        
        accum_precision += precision(recommendations, relevant_items)
        accum_recall += recall(recommendations, relevant_items)
        accum_map += mean_average_precision(recommendations, relevant_items)
        
        num_users_evaluated += 1
        
    
    accum_precision /= max(num_users_evaluated, 1)
    accum_recall /= max(num_users_evaluated, 1)
    accum_map /=  max(num_users_evaluated, 1)
    
    return accum_precision, accum_recall, accum_map, num_users_evaluated, num_users_skipped
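
The evaluator reads each user's relevant items directly from the CSR arrays. The sketch below, on a made-up 3x4 test matrix, shows how indptr delimits each user's slice of indices and why a user with no test interactions gets an empty slice and is skipped.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical test URM: user 1 has no test interactions.
urm_test_toy = sp.csr_matrix(np.array([[0, 1, 0, 1],
                                       [0, 0, 0, 0],
                                       [1, 0, 0, 0]], dtype=np.float64))

relevant_per_user = {}
for user_id in range(urm_test_toy.shape[0]):
    # indptr[user_id]:indptr[user_id + 1] is the slice of `indices` owned by this row.
    start = urm_test_toy.indptr[user_id]
    end = urm_test_toy.indptr[user_id + 1]
    relevant_per_user[user_id] = urm_test_toy.indices[start:end]

for user_id, relevant in relevant_per_user.items():
    status = "skipped" if relevant.size == 0 else "evaluated"
    print(user_id, relevant, status)
```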

**Collaborative Filtering ItemKNN Recommender**

This step creates a CFItemKNN class that represents a Collaborative Filtering ItemKNN Recommender. As mentioned in previous practice sessions, our recommenders have two main functions: fit and recommend. The similarity itself is computed by the separate matrix_similarity function, which is passed to fit. (The SLIM class defined later additionally has filter_seen and train_multiple_epochs_SLIM_MSE.)

In [19]:
def matrix_similarity(urm: sp.csc_matrix, shrink: int):
    item_weights = np.sqrt(
        np.sum(urm.power(2), axis=0)
    ).A
    
    numerator = urm.T.dot(urm)
    denominator = item_weights.T.dot(item_weights) + shrink + 1e-6
    weights = numerator / denominator
    np.fill_diagonal(weights, 0.0)
    
    return weights
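
The effect of the shrink term is easiest to see on a tiny dense matrix. The following sketch re-implements the same formula with plain NumPy arrays (the function name shrunk_cosine_dense and the toy matrix are assumptions made for this example) and compares shrink=0 with shrink=2.

```python
import numpy as np

def shrunk_cosine_dense(urm: np.ndarray, shrink: float) -> np.ndarray:
    """Dense toy version of the shrunk cosine similarity between item columns."""
    norms = np.sqrt((urm ** 2).sum(axis=0))              # one norm per item
    numerator = urm.T @ urm                               # item-item dot products
    denominator = np.outer(norms, norms) + shrink + 1e-6  # same terms as matrix_similarity
    weights = numerator / denominator
    np.fill_diagonal(weights, 0.0)                        # an item is not its own neighbour
    return weights

# Three users, three items; every pair of items shares exactly one user.
urm_toy = np.array([[1.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0]])

w_no_shrink = shrunk_cosine_dense(urm_toy, shrink=0)
w_shrink = shrunk_cosine_dense(urm_toy, shrink=2)

print(w_no_shrink[0, 1])   # about 0.5  = 1 / (sqrt(2) * sqrt(2))
print(w_shrink[0, 1])      # about 0.25 = 1 / (2 + 2)
```

A larger shrink lowers similarities that are supported by few co-occurrences, which is the reason it is treated as a hyperparameter.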

In [20]:
class CFItemKNN(object):
    def __init__(self, shrink: int):
        self.shrink = shrink
        self.weights = None
    
    
    def fit(self, urm_train: sp.csc_matrix, similarity_function):
        if not sp.isspmatrix_csc(urm_train):
            raise TypeError(f"We expected a CSC matrix, we got {type(urm_train)}")
        
        self.weights = similarity_function(urm_train, self.shrink)
        
    def recommend(self, user_id: int, urm_train: sp.csr_matrix, at: Optional[int] = None, remove_seen: bool = True):
        user_profile = urm_train[user_id]
        
        ranking = user_profile.dot(self.weights).A.flatten()
        
        if remove_seen:
            user_profile_start = urm_train.indptr[user_id]
            user_profile_end = urm_train.indptr[user_id+1]
            
            seen_items = urm_train.indices[user_profile_start:user_profile_end]
            
            ranking[seen_items] = -np.inf
            
        ranking = np.flip(np.argsort(ranking))
        return ranking[:at]

In [21]:
itemknn_recommender = CFItemKNN(shrink=50)

In [22]:
itemknn_recommender.fit(urm_train.tocsc(), matrix_similarity)

In [26]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped = evaluator(itemknn_recommender, 
                                                                                            urm_train_validation, 
                                                                                            urm_test)

In [27]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped 

(0.024340648278638712, 0.09539457672581987, 0.044321570963724626, 4967, 2980)

**SLIM MSE**

Our objective is to minimize the MSE (mean squared error) of the predicted ratings by learning the item-item similarity matrix.
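
Written out, the training loop in train_multiple_epochs_SLIM_MSE corresponds to the following prediction, error, and update. The notation is introduced here for illustration and is not part of the notebook: I_u is the set of items in the profile of user u, eta is the learning rate, and N is the number of sampled interactions in an epoch.

```latex
\hat{r}_{ui} = \sum_{j \in I_u} r_{uj}\, S_{ji},
\qquad
e_{ui} = r_{ui} - \hat{r}_{ui},
\qquad
\mathrm{MSE} = \frac{1}{N} \sum_{(u,i)} e_{ui}^{2}

S_{ji} \leftarrow S_{ji} + \eta \, e_{ui} \, r_{uj}
\quad \text{for every } j \in I_u
```

The update is a stochastic gradient step on the squared error of one sampled interaction, since the derivative of one half of the squared error with respect to S_ji is the negative of e_ui times r_uj.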

In [28]:
class SLIM(object):
    
    def __init__(self):

        self.similarity_matrix = None
        
    def fit(self,urm_train,learning_rate,epochs):
        
        #if not sp.isspmatrix_csc(urm_train):
        #    raise TypeError(f"We expected a CSC matrix, we got {type(urm_train)}")
        
        self.similarity_matrix=self.train_multiple_epochs_SLIM_MSE(urm_train,learning_rate,epochs)
        
        
    def recommend(self,user_id,urm_train: sp.csr_matrix,at=None,remove_seen=True):
        # compute the scores using the dot product
        user_profile = urm_train[user_id]
        scores = user_profile.dot(self.similarity_matrix).ravel() #ravel transform in 1D array

        if remove_seen:
            scores = self.filter_seen(urm_train,user_id,scores)

        # rank items
        ranking = scores.argsort()[::-1]
            
        return ranking[:at]
    
    
    def filter_seen(self,urm_train,user_id,scores):

        start_pos = urm_train.indptr[user_id]
        end_pos = urm_train.indptr[user_id+1]

        user_profile = urm_train.indices[start_pos:end_pos]
        
        scores[user_profile] = -np.inf

        return scores  
    
    
    def train_multiple_epochs_SLIM_MSE(self,URM_train, learning_rate_input, n_epochs):
    
        URM_train_coo = URM_train.tocoo() #Transform matrix in coordinate form
        n_items = URM_train.shape[1]
        n_interactions = URM_train.nnz #nnz gives number of non zeros values


        URM_train_coo_row = URM_train_coo.row
        URM_train_coo_col = URM_train_coo.col
        URM_train_coo_data = URM_train_coo.data
        URM_train_indices = URM_train.indices
        URM_train_indptr = URM_train.indptr
        URM_train_data = URM_train.data

        # We create a dense similarity matrix, initialized as zero
        item_item_S = np.zeros((n_items, n_items), dtype = np.float64)  # np.float was removed in NumPy 1.24

        learning_rate = learning_rate_input
        loss = 0.0

        for n_epoch in range(n_epochs):

            loss = 0.0
            start_time = time.time()

            for sample_num in range(n_interactions):

                # Randomly pick sample
                index = random.randrange(0,n_interactions)  #randrange(x,y) x included y not included
                user_id = URM_train_coo_row[index]
                item_id = URM_train_coo_col[index]
                true_rating = URM_train_coo_data[index] #Rij

                # Compute prediction
                start_profile = URM_train_indptr[user_id]
                end_profile = URM_train_indptr[user_id+1]
                predicted_rating = 0.0

                for index in range(start_profile, end_profile):
                    profile_item_id = URM_train_indices[index]
                    profile_rating = URM_train_data[index]
                    predicted_rating += item_item_S[profile_item_id,item_id] * profile_rating

                # Compute prediction error, or gradient.
                prediction_error = true_rating - predicted_rating
                loss += prediction_error**2

                # Update model, in this case the similarity
                for index in range(start_profile, end_profile):
                    profile_item_id = URM_train_indices[index]
                    profile_rating = URM_train_data[index]
                    item_item_S[profile_item_id,item_id] += learning_rate * prediction_error * profile_rating

    #             # Print some stats
    #             if (sample_num +1)% 1000000 == 0:
    #                 elapsed_time = time.time() - start_time
    #                 samples_per_second = (sample_num+1)/elapsed_time
    #                 print("Iteration {} in {:.2f} seconds, loss is {:.2f}. Samples per second {:.2f}".format(sample_num+1, elapsed_time, loss/(sample_num+1), samples_per_second))


            elapsed_time = time.time() - start_time
            samples_per_second = (sample_num+1)/elapsed_time

            print("Epoch {} complete in in {:.2f} seconds, loss is {:.3E}. Samples per second {:.2f}".format(n_epoch+1, time.time() - start_time, loss/(sample_num+1), samples_per_second))
            
        return item_item_S#np.array(item_item_S) 
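
The two inner loops (prediction and update) can also be written with NumPy indexing. The following is a minimal sketch of a single SGD step under that assumption. The helper name slim_sgd_step and the toy matrix are hypothetical, and the arithmetic is meant to match the loops above, not to replace them.

```python
import numpy as np
import scipy.sparse as sp

def slim_sgd_step(urm: sp.csr_matrix, S: np.ndarray, user_id: int, item_id: int,
                  true_rating: float, learning_rate: float) -> float:
    """One update for the sampled (user_id, item_id) interaction; S is modified in place."""
    start, end = urm.indptr[user_id], urm.indptr[user_id + 1]
    profile_items = urm.indices[start:end]     # items in the user's profile
    profile_ratings = urm.data[start:end]      # the corresponding ratings

    # Prediction: sum over the profile of S[j, item_id] * r_uj
    predicted_rating = S[profile_items, item_id] @ profile_ratings
    prediction_error = true_rating - predicted_rating

    # Update only the column of the sampled item. A CSR row holds each column index
    # at most once here, so the in-place fancy-indexed addition is safe.
    S[profile_items, item_id] += learning_rate * prediction_error * profile_ratings
    return prediction_error

urm_toy = sp.csr_matrix(np.array([[1.0, 1.0, 0.0],
                                  [0.0, 1.0, 1.0]]))
S = np.zeros((3, 3))

first_error = slim_sgd_step(urm_toy, S, user_id=0, item_id=0, true_rating=1.0, learning_rate=0.1)
second_error = slim_sgd_step(urm_toy, S.copy(), user_id=0, item_id=0, true_rating=1.0, learning_rate=0.1)
print(first_error, second_error)   # the error shrinks after the first update
```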

In [29]:
slim_recommender = SLIM()
slim_recommender.fit(urm_train,1e-4,10)

Epoch 1 complete in in 20.19 seconds, loss is 9.871E-01. Samples per second 4291.14
Epoch 2 complete in in 19.82 seconds, loss is 9.631E-01. Samples per second 4371.50
Epoch 3 complete in in 19.68 seconds, loss is 9.405E-01. Samples per second 4403.66
Epoch 4 complete in in 20.88 seconds, loss is 9.196E-01. Samples per second 4150.74
Epoch 5 complete in in 20.42 seconds, loss is 9.025E-01. Samples per second 4243.43
Epoch 6 complete in in 19.46 seconds, loss is 8.848E-01. Samples per second 4453.71
Epoch 7 complete in in 19.71 seconds, loss is 8.691E-01. Samples per second 4396.76
Epoch 8 complete in in 20.01 seconds, loss is 8.542E-01. Samples per second 4329.64
Epoch 9 complete in in 20.69 seconds, loss is 8.393E-01. Samples per second 4187.65
Epoch 10 complete in in 20.53 seconds, loss is 8.260E-01. Samples per second 4220.90


In [30]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped = evaluator(slim_recommender, 
                                                                                            urm_train_validation, 
                                                                                            urm_test)

In [31]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped 

(0.023434668814173276, 0.09140916486082758, 0.04137470285745324, 4967, 2980)

**P3Alpha**

In [32]:
from GraphBased.P3alphaRecommender import P3alphaRecommender

P3alpha = P3alphaRecommender(urm_train)
P3alpha.fit()

P3alphaRecommender: URM Detected 86 (1.08 %) cold users.
P3alphaRecommender: URM Detected 1647 (6.62 %) cold items.


**PureSVD**

In [33]:
from MatrixFactorization.PureSVDRecommender import PureSVDRecommender

pureSVD = PureSVDRecommender(urm_train)
pureSVD.fit()

PureSVDRecommender: URM Detected 86 (1.08 %) cold users.
PureSVDRecommender: URM Detected 1647 (6.62 %) cold items.
PureSVDRecommender: Computing SVD decomposition...
PureSVDRecommender: Computing SVD decomposition... Done!


**ItemKNNCF**

In [34]:
### Step 1: Import the evaluator objects

from Base.Evaluation.Evaluator import EvaluatorHoldout

evaluator_validation = EvaluatorHoldout(urm_validation, cutoff_list=[5])
evaluator_test = EvaluatorHoldout(urm_test, cutoff_list=[5, 10])

In [35]:
### Step 2: Create BayesianSearch object
import skopt
import scipy.sparse as sps
#from skopt import gp_minimize
from KNN.ItemKNNCFRecommender import ItemKNNCFRecommender
from ParameterTuning.SearchBayesianSkopt import SearchBayesianSkopt


recommender_class = ItemKNNCFRecommender

parameterSearch = SearchBayesianSkopt(recommender_class,
                                 evaluator_validation=evaluator_validation,
                                 evaluator_test=evaluator_test)

In [36]:
### Step 3: Define parameters range

from ParameterTuning.SearchAbstractClass import SearchInputRecommenderArgs
from skopt.space import Real, Integer, Categorical

hyperparameters_range_dictionary = {}
hyperparameters_range_dictionary["topK"] = Integer(5, 1000)
hyperparameters_range_dictionary["shrink"] = Integer(0, 1000)
hyperparameters_range_dictionary["similarity"] = Categorical(["cosine"])
hyperparameters_range_dictionary["normalize"] = Categorical([True, False])
    
    
recommender_input_args = SearchInputRecommenderArgs(
    CONSTRUCTOR_POSITIONAL_ARGS = [urm_train],
    CONSTRUCTOR_KEYWORD_ARGS = {},
    FIT_POSITIONAL_ARGS = [],
    FIT_KEYWORD_ARGS = {}
)


output_folder_path = "result_experiments/"

import os

# If directory does not exist, create
if not os.path.exists(output_folder_path):
    os.makedirs(output_folder_path)

In [37]:

### Step 4: Run!    --- error

n_cases = 2
metric_to_optimize = "MAP"

parameterSearch.search(recommender_input_args,
                       parameter_search_space = hyperparameters_range_dictionary,
                       n_cases = n_cases,
                       n_random_starts = 1,
                       save_model = "no",
                       output_folder_path = output_folder_path,
                       output_file_name_root = recommender_class.RECOMMENDER_NAME,
                       metric_to_optimize = metric_to_optimize
                      )

Iteration No: 1 started. Evaluating function at random point.
ItemKNNCFRecommender: URM Detected 86 (1.08 %) cold users.
ItemKNNCFRecommender: URM Detected 1647 (6.62 %) cold items.
SearchBayesianSkopt: Testing config: {'topK': 320, 'shrink': 983, 'similarity': 'cosine', 'normalize': False}
Similarity column 24896 ( 100 % ), 6953.67 column/sec, elapsed time 0.06 min
EvaluatorHoldout: Processed 3737 ( 100.00% ) in 2.71 sec. Users per second: 1377
SearchBayesianSkopt: New best config found. Config 0: {'topK': 320, 'shrink': 983, 'similarity': 'cosine', 'normalize': False} - results: ROC_AUC: 0.0747703, PRECISION: 0.0239765, PRECISION_RECALL_MIN_DEN: 0.0660111, RECALL: 0.0637568, MAP: 0.0390622, MRR: 0.0680715, NDCG: 0.0506684, F1: 0.0348480, HIT_RATE: 0.1198823, ARHR: 0.0695790, NOVELTY: 0.0023463, AVERAGE_POPULARITY: 0.1725842, DIVERSITY_MEAN_INTER_LIST: 0.9728492, DIVERSITY_HERFINDAHL: 0.9945178, COVERAGE_ITEM: 0.1715537, COVERAGE_ITEM_CORRECT: 0.0094794, COVERAGE_USER: 0.4702403, COVE

In [38]:
### Step 5: Load the best hyperparameters found by the search
from Base.DataIO import DataIO

data_loader = DataIO(folder_path = output_folder_path)
search_metadata = data_loader.load_data(recommender_class.RECOMMENDER_NAME + "_metadata.zip")

search_metadata

best_parameters = search_metadata["hyperparameters_best"]
best_parameters

# Linear combination of item-based models

#### Let's use an ItemKNNCF with the parameters we just learned and a graph based model

itemKNNCF = ItemKNNCFRecommender(urm_train)
itemKNNCF.fit(**best_parameters)

ItemKNNCFRecommender: URM Detected 86 (1.08 %) cold users.
ItemKNNCFRecommender: URM Detected 1647 (6.62 %) cold items.
Similarity column 24896 ( 100 % ), 6949.81 column/sec, elapsed time 0.06 min


**Now, let's put it all together!**

In [44]:
from Base.Recommender_utils import check_matrix, similarityMatrixTopK
from Base.BaseSimilarityMatrixRecommender import BaseItemSimilarityMatrixRecommender


class ItemKNNSimilarityHybridRecommender(BaseItemSimilarityMatrixRecommender):
    """ ItemKNNSimilarityHybridRecommender
    Hybrid of N similarities: S = w_1*S_1 + w_2*S_2 + ... + w_N*S_N, pruned with similarityMatrixTopK

    """

    RECOMMENDER_NAME = "ItemKNNSimilarityHybridRecommender"


    def __init__(self, urm_train,similarity_vector,sparse_weights=True):
        super(ItemKNNSimilarityHybridRecommender, self).__init__(urm_train)
        
        # Keep a CSR copy of every similarity matrix so that fit() can combine them
        self.similarity_vector = [check_matrix(s.copy(), 'csr') for s in similarity_vector]

    def fit(self, topK, weight_vector):

        self.topK = topK
        
        W=self.similarity_vector[0]*weight_vector[0]
        for i in range(1,len(self.similarity_vector)):
            W+=self.similarity_vector[i]*weight_vector[i]

        
        self.W_sparse = similarityMatrixTopK(W, k=self.topK).tocsr()
        
        
    def recommend(self,user_id,urm_train: sp.csr_matrix,at=None,remove_seen=True):
        # compute the scores using the dot product
        user_profile = urm_train[user_id]
        scores = user_profile.dot(self.W_sparse).A.ravel() #ravel transform in 1D array
        
        if remove_seen:
            scores = self.filter_seen(urm_train,user_id,scores)

        # rank items
        ranking = scores.argsort()[::-1]

        return ranking[:at]

    
    def filter_seen(self,urm_train,user_id,scores):

        start_pos = urm_train.indptr[user_id]
        end_pos = urm_train.indptr[user_id+1]

        user_profile = urm_train.indices[start_pos:end_pos]
        
        scores[user_profile] = -np.inf

        return scores  
    

In [59]:
similarity_vector = []
weight_vector = []
similarity_vector.append(csr_matrix(slim_recommender.similarity_matrix))
weight_vector.append(0.1)
similarity_vector.append(csr_matrix(itemknn_recommender.weights))
weight_vector.append(0.1)
similarity_vector.append(csr_matrix(P3alpha.W_sparse))
weight_vector.append(0.1)
#similarity_vector.append(csr_matrix(pureSVD.W_sparse))
#weight_vector.append(0.2)
similarity_vector.append(csr_matrix(itemKNNCF.W_sparse))
weight_vector.append(0.7)
hybridrecommender = ItemKNNSimilarityHybridRecommender(urm_train,similarity_vector)
hybridrecommender.fit(100,weight_vector)

ItemKNNSimilarityHybridRecommender: URM Detected 86 (1.08 %) cold users.
ItemKNNSimilarityHybridRecommender: URM Detected 1647 (6.62 %) cold items.
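
The combination performed in fit can be illustrated without the course framework. The sketch below is an assumption-laden stand-in: combine_similarities and keep_top_k_per_row are hypothetical helpers written for this example, and the pruning keeps the k largest entries per row, which may differ from what similarityMatrixTopK does internally.

```python
import numpy as np
import scipy.sparse as sp

def combine_similarities(similarities, weights):
    """Weighted sum of sparse similarity matrices of identical shape."""
    combined = similarities[0] * weights[0]
    for s, w in zip(similarities[1:], weights[1:]):
        combined = combined + s * w
    return sp.csr_matrix(combined)

def keep_top_k_per_row(matrix, k):
    """Hypothetical pruning helper: keep the k largest stored values of each row."""
    matrix = sp.csr_matrix(matrix)
    rows, cols, vals = [], [], []
    for row in range(matrix.shape[0]):
        start, end = matrix.indptr[row], matrix.indptr[row + 1]
        row_cols = matrix.indices[start:end]
        row_vals = matrix.data[start:end]
        top = np.argsort(-row_vals)[:k]            # positions of the k largest values
        rows.extend([row] * len(top))
        cols.extend(row_cols[top].tolist())
        vals.extend(row_vals[top].tolist())
    return sp.csr_matrix((vals, (rows, cols)), shape=matrix.shape)

# Two made-up 3x3 item-item similarities.
s1 = sp.csr_matrix(np.array([[0.0, 0.8, 0.2], [0.8, 0.0, 0.4], [0.2, 0.4, 0.0]]))
s2 = sp.csr_matrix(np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.5], [1.0, 0.5, 0.0]]))

W = combine_similarities([s1, s2], [0.7, 0.3])
W_top1 = keep_top_k_per_row(W, k=1)
print(W.toarray())
print(W_top1.toarray())
```

Because the matrices are summed entry by entry, they should be on comparable scales. Otherwise the weights mostly compensate for scale differences instead of expressing how much each model is trusted.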


**Submission to competition**

This step mirrors what you will do when preparing a submission to the competition, once you have chosen and trained your recommender.

The best approach is to select the best-performing configuration obtained in the hyperparameter tuning step and to train the recommender on both the train and validation sets. Remember that in the competition you do not have access to the test set.

Another consideration is that, to make calculations easier and faster, we replaced the user/item identifiers with new ones in the preprocessing step. For the competition, you are required to generate recommendations using the dataset's original identifiers, so this step also maps the new identifiers back to the ones originally found in the dataset.

Finally, this step creates a function that writes the recommendations for all users to a single CSV file, one row per user, in the following format:

<user_id>,<item_id_1> <item_id_2> <item_id_3> <item_id_4> <item_id_5> <item_id_6> <item_id_7> <item_id_8> <item_id_9> <item_id_10>

Always verify the competition's submission file format, as it might differ from the one presented here.

In [60]:
best_recommender = hybridrecommender

In [61]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped = evaluator(best_recommender, 
                                                                                            urm_train_validation, 
                                                                                            urm_test)

In [62]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped  #trust MAP (indicatively)

(0.028689349708072836, 0.11068004213871831, 0.051344156327199075, 4967, 2980)

In [49]:
def load_goodguys():
  return pd.read_csv("./data_target_users_test.csv")
goodguys=load_goodguys()

In [50]:
goodguys

Unnamed: 0,user_id
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9


In [51]:
users_to_recommend = np.random.choice(goodguys.user_id,size=goodguys.size, replace=False)
users_to_recommend

array([7910, 7722, 5627, ..., 4567, 1768, 4456], dtype=int64)

In [52]:
mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))

In [53]:
mapping_to_item_id

{0: 10080,
 1: 19467,
 2: 2665,
 3: 7494,
 4: 17068,
 5: 17723,
 6: 18131,
 7: 20146,
 8: 19337,
 9: 21181,
 10: 18736,
 11: 23037,
 12: 477,
 13: 6927,
 14: 10204,
 15: 13707,
 16: 18999,
 17: 19838,
 18: 19851,
 19: 814,
 20: 2754,
 21: 3907,
 22: 4481,
 23: 5581,
 24: 6549,
 25: 7583,
 26: 8849,
 27: 9014,
 28: 9658,
 29: 12914,
 30: 13439,
 31: 16587,
 32: 20127,
 33: 20345,
 34: 21663,
 35: 23895,
 36: 24008,
 37: 24577,
 38: 4649,
 39: 11189,
 40: 13703,
 41: 22592,
 42: 1957,
 43: 3606,
 44: 4102,
 45: 6770,
 46: 7059,
 47: 10716,
 48: 12304,
 49: 14405,
 50: 14846,
 51: 16685,
 52: 21319,
 53: 21950,
 54: 22181,
 55: 24783,
 56: 9427,
 57: 10908,
 58: 19750,
 59: 1199,
 60: 7589,
 61: 8059,
 62: 21260,
 63: 21267,
 64: 24865,
 65: 1920,
 66: 6380,
 67: 12816,
 68: 14100,
 69: 17509,
 70: 21184,
 71: 2154,
 72: 3025,
 73: 3041,
 74: 4532,
 75: 5200,
 76: 5208,
 77: 5998,
 78: 6141,
 79: 6817,
 80: 7409,
 81: 7753,
 82: 8589,
 83: 9163,
 84: 11152,
 85: 11405,
 86: 13274,
 87: 14

In [54]:
def prepare_submission(ratings: pd.DataFrame, users_to_recommend: np.array, urm_train: sp.csr_matrix, recommender: object):
    users_ids_and_mappings = ratings[ratings.user_id.isin(users_to_recommend)][["user_id", "mapped_user_id"]].drop_duplicates()
    items_ids_and_mappings = ratings[["item_id", "mapped_item_id"]].drop_duplicates()
    
    mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))
    
    
    recommendation_length = 10
    submission = []
    for idx, row in users_ids_and_mappings.iterrows():
        user_id = row.user_id
        mapped_user_id = row.mapped_user_id
        
        recommendations = recommender.recommend(user_id=mapped_user_id,
                                                urm_train=urm_train,
                                                at=recommendation_length,
                                                remove_seen=True)
        
        submission.append((user_id, [mapping_to_item_id[item_id] for item_id in recommendations]))
        
    return submission

In [55]:
submission = prepare_submission(ratings, users_to_recommend, urm_train_validation, best_recommender)

In [56]:
submission

[(0, [8486, 5085, 19652, 1730, 9742, 9055, 10444, 445, 6340, 20761]),
 (4342, [13363, 5119, 12061, 23351, 5906, 13074, 18274, 5494, 17933, 773]),
 (5526, [8568, 4927, 19909, 23351, 19910, 12734, 3995, 9007, 13324, 3534]),
 (5923, [11792, 18528, 21854, 5119, 19114, 2333, 17858, 12341, 23127, 13837]),
 (149, [14012, 21653, 9007, 12056, 10786, 12734, 23666, 21552, 364, 5119]),
 (4072, [10965, 6926, 19255, 22359, 21442, 10367, 10080, 9007, 23351, 3951]),
 (6193, [18190, 15145, 13541, 11792, 7274, 23711, 23536, 12341, 2839, 22152]),
 (7105, [21479, 11741, 7229, 562, 5489, 18146, 13181, 9182, 10203, 22523]),
 (1, [19089, 23600, 12409, 17257, 8431, 3165, 19480, 20095, 16630, 19709]),
 (150, [7494, 17723, 23600, 18168, 8431, 17257, 24128, 23244, 24209, 23524]),
 (183, [23481, 14895, 19709, 7494, 23600, 3165, 17760, 20146, 17723, 12409]),
 (249, [7494, 16630, 12409, 18984, 11658, 22196, 19089, 23481, 3570, 9664]),
 (296, [11295, 15691, 16630, 8894, 7494, 22445, 19089, 20095, 25407, 19709]),
 (4

In [57]:
import os
from datetime import datetime

csv_fname = './submission'
csv_fname += datetime.now().strftime('%b%d_%H-%M-%S') + '.csv'

def write_submission(submissions):
    with open(csv_fname, "w") as f:
        f.write(f"user_id,item_list\n")
        for user_id, items in submissions:
            f.write(f"{user_id},{' '.join([str(item) for item in items])}\n")


In [58]:
write_submission(submission)
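
Before uploading, it can be worth checking the submission list for simple mistakes. The sketch below is one possible sanity check. The helper names format_submission_row and check_submission are hypothetical, and the rules it enforces (one row per user, a fixed number of distinct items) are assumptions that should be compared against the competition's actual requirements.

```python
def format_submission_row(user_id, items):
    """Format one row as '<user_id>,<item_1> <item_2> ...'."""
    return f"{user_id},{' '.join(str(item) for item in items)}"

def check_submission(submission, expected_length=10):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    seen_users = set()
    for user_id, items in submission:
        if user_id in seen_users:
            problems.append(f"user {user_id} appears more than once")
        seen_users.add(user_id)
        if len(items) != expected_length:
            problems.append(f"user {user_id} has {len(items)} items instead of {expected_length}")
        if len(set(items)) != len(items):
            problems.append(f"user {user_id} has duplicated items")
    return problems

# Made-up rows: the second one is too short and repeats an item.
toy_submission = [(0, [1, 2, 3]), (7, [4, 4])]

print(format_submission_row(*toy_submission[0]))          # 0,1 2 3
print(check_submission(toy_submission, expected_length=3))
```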

In this lecture we saw the simplest version of cosine similarity, which only includes a shrink factor. There are several optimizations that we can make to it.

Implement TopK Neighbors
When calculating the cosine similarity we used urm.T.dot(urm) to calculate the numerator. However, depending on the dataset and the number of items, this matrix might not fit in memory. Implement a block version, faster than our vector version, that does not compute the full urm.T.dot(urm) product at once.

Implement Adjusted Cosine 

Implement Dice Similarity

Implement an implicit CF ItemKNN.

Implement a CF UserKNN model
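
As a possible starting point for the block-wise TopK exercise, here is a minimal sketch. It is only one way to approach it: the function name blockwise_topk_cosine and its parameters are assumptions made for this example. It computes the numerator for a block of items at a time and keeps only the top_k largest similarities of each row, so the full dense item-item matrix is never materialized.

```python
import numpy as np
import scipy.sparse as sp

def blockwise_topk_cosine(urm, shrink=0.0, top_k=50, block_size=100):
    """Shrunk cosine between item columns, computed block by block, top_k per row."""
    urm = sp.csc_matrix(urm, dtype=np.float64)
    n_items = urm.shape[1]
    norms = np.sqrt(np.asarray(urm.power(2).sum(axis=0)).ravel())
    urm_t = urm.T.tocsr()                       # items x users, cheap row slicing
    k = min(top_k, n_items)

    rows, cols, vals = [], [], []
    for start in range(0, n_items, block_size):
        end = min(start + block_size, n_items)
        # Dense (block x n_items) slice of the numerator, never the full matrix.
        block = (urm_t[start:end] @ urm).toarray()
        block /= np.outer(norms[start:end], norms) + shrink + 1e-6
        block[np.arange(end - start), np.arange(start, end)] = 0.0   # no self-similarity

        # Column positions of the k largest values in each row (unordered).
        top = np.argpartition(-block, k - 1, axis=1)[:, :k]
        for local_row in range(end - start):
            keep = top[local_row]
            keep = keep[block[local_row, keep] != 0.0]               # drop explicit zeros
            rows.extend([start + local_row] * keep.size)
            cols.extend(keep.tolist())
            vals.extend(block[local_row, keep].tolist())

    return sp.csr_matrix((vals, (rows, cols)), shape=(n_items, n_items))

# Toy check: three items, every pair shares one user, so all similarities are about 0.5.
urm_toy = np.array([[1.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0]])
full = blockwise_topk_cosine(urm_toy, shrink=0, top_k=3, block_size=2)
pruned = blockwise_topk_cosine(urm_toy, shrink=0, top_k=1, block_size=2)
print(full.toarray())
print(pruned.nnz)
```

The sketch prunes per row. Since the similarity is symmetric before pruning, transposing the result gives the per-column variant. Peak memory is governed by block_size times the number of items.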