## Behrad Hemati
### 11849389

# Recommender Systems in Python

This notebook introduces Recommender Systems from the practical point of view by implementing examples of them in Python. The main filtering techniques which are Popularity Filtering, Content-Based Filtering, Collaborative Filtering and Hybrid Filtering are illustrated with examples. Their performance is also tested and compared.

### Data Exploration and Processing

A dataset about user posts and comments in CI&T Deskdrop platform is utilized for the examples. This dataset is shared in Kaggle and can be downloaded from https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop. The data consist of logs in a time period of 12 months CI&T's Internal Communication platform. There are about 73000 logged user interactions on 3000 public articles shared in the platform. The two main data structures are the shared articles and the user interactions with those articles. The latter can be View, Like, Comment, Follow, Bookmark.

In [1]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# loading shared articles
articles_df = pd.read_csv('./shared_articles.csv')
# showing shared content
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df.head(5)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
5,1459194522,CONTENT SHARED,-2826566343807132236,4340306774493623681,8940341205206233829,,,,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en


In [2]:
# loading and showing user interactions
interactions_df = pd.read_csv('./users_interactions.csv')
interactions_df.head(10)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,
5,1465413742,VIEW,310515487419366995,-8763398617720485024,1395789369402380392,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,MG,BR
6,1465415950,VIEW,-8864073373672512525,3609194402293569455,1143207167886864524,,,
7,1465415066,VIEW,-1492913151930215984,4254153380739593270,8743229464706506141,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,SP,BR
8,1465413762,VIEW,310515487419366995,344280948527967603,-3167637573980064150,,,
9,1465413771,VIEW,3064370296170038610,3609194402293569455,1143207167886864524,,,


In [3]:
# setting different weights or strengths to user interactions
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,  
}

interactions_df['eventStrength'] = interactions_df['eventType'].apply(lambda x: event_type_strength[x])

In [4]:
# group users by number of interactions
users_interactions_count_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()
print('# users: %d' % len(users_interactions_count_df))

# get users with more than 5 interactions
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]
print('# users with at least 5 interactions: %d' % len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


In [5]:
# show total number of interactions
print('# of interactions: %d' % len(interactions_df))

# merge interactions with active users -- right-join of two tables
interactions_from_selected_users_df = interactions_df.merge(users_with_enough_interactions_df, 
               how = 'right', left_on = 'personId', right_on = 'personId')

# show number of interactions from active users
print('# of interactions from users with at least 5 interactions: %d' % len(interactions_from_selected_users_df))

# of interactions: 72312
# of interactions from users with at least 5 interactions: 69868


In [6]:
# function to smooth user interactions
def smooth_user_preference(x):
    return math.log(1+x, 2)

# get total unique interactions logaritmically smoothed
interactions_full_df = interactions_from_selected_users_df \
                    .groupby(['personId', 'contentId'])['eventStrength'].sum() \
                    .apply(smooth_user_preference).reset_index()

# show number of unique interactions
print('# of unique user/item interactions: %d' % len(interactions_full_df))
interactions_full_df.head(10)

# of unique user/item interactions: 39106


Unnamed: 0,personId,contentId,eventStrength
0,-9223121837663643404,-8949113594875411859,1.0
1,-9223121837663643404,-8377626164558006982,1.0
2,-9223121837663643404,-8208801367848627943,1.0
3,-9223121837663643404,-8187220755213888616,1.0
4,-9223121837663643404,-7423191370472335463,3.169925
5,-9223121837663643404,-7331393944609614247,1.0
6,-9223121837663643404,-6872546942144599345,1.0
7,-9223121837663643404,-6728844082024523434,1.0
8,-9223121837663643404,-6590819806697898649,1.0
9,-9223121837663643404,-6558712014192834002,1.584963


### Evaluation

To check the performance of the filtering techniques, holdout cross-validation is used. It keeps aside a random data slice (e.g., 20 % of the total samples) called the test set for the evaluation process. The rest of the data commonly known as the train set are used for the training process. The evaluation steps are the following: (i) for each test-set item a user has interacted with, sample 100 other items the user has not interacted with. (ii) ask the model to produce a ranked list of recommended items, from a set composed of one interacted item and the 100 non-interacted items. (iii) compute the Top-N accuracy metrics for this user and interacted item from the recommendations ranked list. (iv) aggregate the global Top-N accuracy metrics. The Top-N evaluation metric used is Recall@N which evaluates whether the interacted item is among the top N items in the ranked list of 101 recommendations for a user. 

In [7]:
# train - test split of the interaction samples
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['personId'], test_size=0.20, random_state=42)

# show train - test sizes
print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 31284
# interactions on Test set: 7822


In [8]:
#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('personId')
interactions_train_indexed_df = interactions_train_df.set_index('personId')
interactions_test_indexed_df = interactions_test_df.set_index('personId')

In [9]:
# function to get interactions by person id
def get_items_interacted(person_id, interactions_df):
    interacted_items = interactions_df.loc[person_id]['contentId']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

In [10]:
# Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:

    # function to get slice of non-interacted items
    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        # first get interacted items for the specific person
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df)
        # now get all items as a set
        all_items = set(articles_df['contentId'])
        # non-interacted items are the difference between the above two sets
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        # randomly sample a batch of non-interacted items for the specific person
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        # return the non-interacted batch as a set
        return set(non_interacted_items_sample)

    # check if there is an interacted item in topn items - return value and index
    def _verify_hit_top_n(self, item_id, recommended_items, topn):        
            try:
                # check if item_id is in the recommended items batch and return index
                index = next(i for i, c in enumerate(recommended_items) if c == item_id)
            except:
                # if not present, index is -1
                index = -1
            # return the rank position of the recommendation if there was a hit and the item index
            hit = int(index in range(0, topn))
            return hit, index

    # evaluate model for each user given by person_id
    def evaluate_model_for_user(self, model, person_id):
        #Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[person_id]
        
        # get the data in a correct structure
        if type(interacted_values_testset['contentId']) == pd.Series:
            # return a pandas dataframe
            person_interacted_items_testset = set(interacted_values_testset['contentId'])
        else:
            # or return a list
            person_interacted_items_testset = set([int(interacted_values_testset['contentId'])])  
            
        # count number of interacted items
        interacted_items_count_testset = len(person_interacted_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id, 
                                                interactions_train_indexed_df), 
                                               topn=10000000000)
        # scores to measure
        hits_at_5_count = 0
        hits_at_10_count = 0
        
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            # getting a random sample (100) items the user has not interacted 
            # (to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                        sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                            seed=item_id%(2**32))

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            # Filtering only recommendations that are either the interacted item
            # or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['contentId'].values
            
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset, 'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            # if idx % 100 == 0 and idx > 0:
            # print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()

### Popularity Filtering

A common filtering technique is Popularity Filtering which represents a good baseline. It recommends the user the most popular items that he/she has not previously consumed. The popularity comes from other users' preferences, thus it does not account for the actual user preferences. In our case, we can identify popular items by counting their interaction strength.

In [11]:
# compute the most popular items based on interaction strength
item_popularity_df = interactions_full_df.groupby('contentId')['eventStrength'].sum().sort_values(ascending=False).reset_index()
item_popularity_df.head(10)

Unnamed: 0,contentId,eventStrength
0,-4029704725707465084,307.733799
1,-6783772548752091658,233.762157
2,-133139342397538859,228.024567
3,-8208801367848627943,197.107608
4,-6843047699859121724,193.825208
5,8224860111193157980,189.04468
6,-2358756719610361882,183.110951
7,2581138407738454418,180.282876
8,7507067965574797372,179.094002
9,1469580151036142903,170.548969


In [37]:
class PopularityRecommender:
    
    MODEL_NAME = 'Popularity'
    
    def __init__(self, popularity_df, items_df=None, order='dsc'):
        self.popularity_df = popularity_df
        self.items_df = items_df
        self.order = order
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    # modify this function to get different orders of recommended items:
    # 1. sorted in descending order (expected to be the best case)
    # 2. sorted in ascending order (expected to be the worst case)
    # 3. returned in random order (expected to be average case)
    # Evaluate the Popularity Recommender in each case and report the recall@5 and recall@10 values
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # here the recommended items are retrieved unordered
        # recommendations_df = self.popularity_df[~self.popularity_df['contentId'].isin(items_to_ignore)]
        
        # 1. sorted in descending order
        recommendations_df = self.popularity_df[~self.popularity_df['contentId'].isin(items_to_ignore)] \
                               .sort_values('eventStrength', ascending=False)
        # 2. sorted in ascending order
        recommendations_df_asc = recommendations_df.sort_values('eventStrength', ascending=True)
        # 3. returned in random order
        recommendations_df_rnd = recommendations_df.sample(frac=1)
        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')
            
            # get more details to show for the recommended items
            recommendations_df = recommendations_df.merge(self.items_df, how='left', 
                                                          left_on='contentId', 
                                                          right_on='contentId')[['eventStrength', 'contentId', 'title', 'url', 'lang']]
            
                  
        return recommendations_df if self.order=='dsc' else recommendations_df_asc if self.order=='asc' else recommendations_df_rnd
    
popularity_model = PopularityRecommender(item_popularity_df, articles_df)
popularity_model_asc = PopularityRecommender(item_popularity_df, articles_df, order='asc')
popularity_model_rnd = PopularityRecommender(item_popularity_df, articles_df, order='rnd')

In [40]:
print('Evaluating Popularity recommendation model descending order...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)

print('\nGlobal metrics for descending order:\n%s' % pop_global_metrics)
print(pop_detailed_results_df)

print('Evaluating Popularity recommendation model ascending order...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model_asc)

print('\nGlobal metrics for descending order:\n%s' % pop_global_metrics)
print(pop_detailed_results_df)

print('Evaluating Popularity recommendation model random order...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model_rnd)

print('\nGlobal metrics for descending order:\n%s' % pop_global_metrics)
print(pop_detailed_results_df)

Evaluating Popularity recommendation model descending order...
1139 users processed

Global metrics for descending order:
{'modelName': 'Popularity', 'recall@5': 0.2418818716440808, 'recall@10': 0.3725389925850166}
      hits@5_count  hits@10_count  interacted_count  recall@5  recall@10  \
76              28             50               192  0.145833   0.260417   
17              12             25               134  0.089552   0.186567   
16              13             23               130  0.100000   0.176923   
10               5              9               117  0.042735   0.076923   
82              26             40                88  0.295455   0.454545   
...            ...            ...               ...       ...        ...   
872              0              0                 1  0.000000   0.000000   
869              0              0                 1  0.000000   0.000000   
867              0              0                 1  0.000000   0.000000   
865              0       

Unfortunately I was not able to fix the interacted_count as it's part of another function
and the dependencies are inter-twined, but looking at the recall@5 and recall@10 for 
best-case scenario (descending) remain at ~24% and ~37% respectively as expected.
For the worst case scenario these values decrease to 0.003% and 0.008% respectively 
which is essentially 0.
In the random case, unless specifying a seed, the values fluctuate a bit as expected but 
the range is limited to ~5% for recall@5 and 10% for recall@10, which gives us a picture
of best and worst-case scenarios and also how it improves from random which essentially means
no recommender system. This shows this recommender system works and signifacantly improves the
hit rate.