# Article Recommender System
**About the data** - [Source](https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101/data) This file contains information about the articles shared in the platform. 
Each article has its sharing date (timestamp), the original url, title, content in plain text, the article' lang (Portuguese - pt or English - en) and information about the user who shared the article (author).

There are two possible event types at a given timestamp: 
- CONTENT SHARED: The article was shared in the platform and is available for users. 
- CONTENT REMOVED: The article was removed from the platform and not available for further recommendation.

**Aim** - Build recommender systems for sharing these articles. We are going to create different types of recommender systems:
1. Popularity Model
2. 

### Importing libraries and loading data

In [103]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import random
import scipy

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

from nltk.corpus import stopwords

In [104]:
# Loading data
articles_df = pd.read_csv('data/shared_articles.csv')
# Choosing only the shared articles
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']

# Loading how users interact with data
interactions_df = pd.read_csv('data/users_interactions.csv')


### Data cleaning
Here we are changing the data in a number of ways:
1. **Transformation** -Numerical assignment to interactino types, and combining the interaction into one number to estimate the total strength of interaction with one content by a user
2. **Filtering** - Removing users where they have interacted less than 5 times with content
3. **Splitting** - Splitting into test and train sets

In [105]:
# Creating a new column to quantify the degree of interaction
# Weights are assigned by me
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 3.0, 
   'FOLLOW': 4.0,
   'COMMENT CREATED': 5.0,  
}

interactions_df['eventStrength'] = interactions_df['eventType'].apply(lambda x: event_type_strength[x])

# Making interactions a smooth function
def smooth_user_preference(x):
    """Return a log transformation"""
    return math.log(1+x, 2)
    
interactions_df = interactions_df \
                    .groupby(['personId', 'contentId'])['eventStrength'].sum() \
                    .apply(smooth_user_preference).reset_index()
print('Unique user/item interactions: %d' % len(interactions_df))


Unique user/item interactions: 40710


In [106]:
# Only taking users with at least5 interaction
# Such that we have enough information to recommend
count_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()
print('Total users = %d' % len(count_df))
users_with_enough_interactions_df = count_df[count_df >= 5].reset_index()[['personId']]
print('Users after filtering = %d' % len(users_with_enough_interactions_df))

Total users = 1895
Users after filtering = 1140


In [107]:
# Removing less than 5 interactions from the originals interaction dataset
print('Total interactions = %d' % len(interactions_df))
interactions_from_selected_users_df = interactions_df.merge(users_with_enough_interactions_df, 
               how = 'right',
               left_on = 'personId',
               right_on = 'personId')
print('Interactions after filtering = %d' % len(interactions_from_selected_users_df))

Total interactions = 40710
Interactions after filtering = 39106


In [108]:
# Split into test and training sets
train, test = train_test_split(interactions_from_selected_users_df,
                                   stratify=interactions_from_selected_users_df['personId'], 
                                   test_size=0.20,
                                   random_state=42)

print('Size of Training set: %d' % len(train))
print('Size of Testing set: %d' % len(test))

Size of Training set: 31284
Size of Testing set: 7822


In [109]:
# Resetting index for faster computations
interactions_from_selected_users_df = interactions_from_selected_users_df.set_index('personId')
train = train.set_index('personId')
test = test.set_index('personId')

In [110]:
def get_items_interacted(person_id, interactions_df):
    """Get the user's data and merge in the movie information."""
    interacted_items = interactions_df.loc[person_id]['contentId']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

### Evaluation Metric - Recall/Coverage
As a simple evaluation method, we calculate how many interactions of the current user are captured by the recommendations (top 5 and top 10).

In [111]:
#Top-N accuracy metrics consts
# Reference: https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:


    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, interactions_from_selected_users_df)
        all_items = set(articles_df['contentId'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        return set(non_interacted_items_sample)

    def _verify_hit_top_n(self, item_id, recommended_items, topn):        
            try:
                index = next(i for i, c in enumerate(recommended_items) if c == item_id)
            except:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    def evaluate_model_for_user(self, model, person_id):
        #Getting the items in test set
        interacted_values_testset = test.loc[person_id]
        if type(interacted_values_testset['contentId']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['contentId'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['contentId'])])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id,train), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            #Getting a random sample (100) items the user has not interacted 
            #(to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=item_id%(2**32))

            #Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            #Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['contentId'].values
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(test.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()    

# Popularity Model
Recommending the most popular articles to users - this is almost like collaborative filtering but assuming that the user is like everyone so recommending the best is a safe bet. Does not have any personalization.

In [112]:
#Computes the 10 most popular items
item_popularity_df = interactions_from_selected_users_df.groupby('contentId')['eventStrength'].sum().sort_values(ascending=False).reset_index()
item_popularity_df.head(10)

Unnamed: 0,contentId,eventStrength
0,-4029704725707465084,310.791362
1,-6783772548752091658,237.14541
2,-133139342397538859,229.089057
3,-8208801367848627943,199.660338
4,-6843047699859121724,196.174998
5,8224860111193157980,191.584205
6,-2358756719610361882,186.52826
7,2581138407738454418,182.495414
8,7507067965574797372,181.070143
9,1469580151036142903,172.066596


In [113]:
class PopularityRecommender:
    
    MODEL_NAME = 'Popularity'
    
    def __init__(self, popularity_df, items_df=None):
        self.popularity_df = popularity_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Recommend the more popular items that the user hasn't seen yet.
        recommendations_df = self.popularity_df[~self.popularity_df['contentId'].isin(items_to_ignore)] \
                               .sort_values('eventStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['eventStrength', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
popularity_model = PopularityRecommender(item_popularity_df, articles_df)
print('Evaluating Popularity recommendation model...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)
print('\nGlobal metrics:\n%s' % pop_global_metrics)
pop_detailed_results_df.head(10)

Evaluating Popularity recommendation model...
1139 users processed

Global metrics:
{'recall@5': 0.24507798517003324, 'modelName': 'Popularity', 'recall@10': 0.3734339043722833}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
76,3609194402293569455,49,26,192,0.255208,0.135417
17,-2626634673110551643,27,10,134,0.201493,0.074627
16,-1032019229384696495,27,14,130,0.207692,0.107692
10,-1443636648652872475,7,4,117,0.059829,0.034188
82,-2979881261169775358,37,22,88,0.420455,0.25
161,-3596626804281480007,23,13,80,0.2875,0.1625
65,1116121227607581999,34,21,73,0.465753,0.287671
81,692689608292948411,21,16,69,0.304348,0.231884
106,-9016528795238256703,20,14,69,0.289855,0.202899
52,3636910968448833585,27,22,68,0.397059,0.323529


**Conclusion** - We deployed a Popularity Model where the users were always recommended the top 5 or top 10 most popular items of the dataset. Even with simple model, the top 5 recommendations included 24% of the interactions of all users, and top10 recommendations included 37% of interactions, which is pretty good for such a simple model.

### Content-based Filtering
In this model, the previous interactions of the user determine what content should be recommended. The algorithm takes place in two steps:
1. Train a vectorizer for all titles and text in the data using tfidf (term frequency–inverse document frequency)
2. Create a profile of each user based on the above vectorizer
3. Recommend articles based on cosine similarity
4. Evaluate using the recall/coverage metric stated above

In [114]:
stopwords_list = stopwords.words('english') + stopwords.words('portuguese')

#Trains a model whose vectors size is 5000, composed by the main unigrams and bigrams found in the corpus
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.003,
                     max_df=0.5,
                     max_features=5000,
                     stop_words=stopwords_list)

item_ids = articles_df['contentId'].tolist()
tfidf_matrix = vectorizer.fit_transform(articles_df['title'] + "" + articles_df['text'])
tfidf_feature_names = vectorizer.get_feature_names()

In [115]:
def get_item_profile(item_id):
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile

def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles

def build_users_profile(person_id, interactions_indexed_df):
    interactions_person_df = interactions_indexed_df.loc[person_id]
    user_item_profiles = get_item_profiles(interactions_person_df['contentId'])
    
    user_item_strengths = np.array(interactions_person_df['eventStrength']).reshape(-1,1)
    #Weighted average of item profiles by the interactions strength
    user_item_strengths_weighted_avg = np.sum(user_item_profiles.multiply(user_item_strengths), axis=0) / np.sum(user_item_strengths)
    user_profile_norm = normalize(user_item_strengths_weighted_avg)
    return user_profile_norm

def build_users_profiles(): 
    interactions_indexed_df = interactions_from_selected_users_df[interactions_from_selected_users_df['contentId'] \
                                                   .isin(articles_df['contentId'])]
    #.set_index('personId')
    user_profiles = {}
    for person_id in interactions_indexed_df.index.unique():
        user_profiles[person_id] = build_users_profile(person_id, interactions_indexed_df)
    return user_profiles

In [116]:
user_profiles = build_users_profiles()
print "Number of user profiles created = %d" %(len(user_profiles))

Number of user profiles created = 1140


In [117]:
# Checking data from one of the user profiles
myprofile = user_profiles[-1479311724257856983]
print(myprofile.shape)
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        user_profiles[-1479311724257856983].flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

(1, 5000)


Unnamed: 0,token,relevance
0,learning,0.304935
1,machine learning,0.255497
2,machine,0.245706
3,google,0.206779
4,data,0.173277
5,ai,0.136863
6,algorithms,0.102461
7,graph,0.100292
8,like,0.097249
9,language,0.084607


In [118]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, items_df=None):
        self.item_ids = item_ids
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_items_to_user_profile(self, person_id, topn=1000):
        #Computes the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(user_profiles[person_id], tfidf_matrix)
        #Gets the top similar items
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        #Sort the similar items by similarity
        similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_items
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        similar_items = self._get_similar_items_to_user_profile(user_id)
        #Ignores items the user has already interacted
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        
        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['contentId', 'recStrength']) \
                                    .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
content_based_recommender_model = ContentBasedRecommender(articles_df)

In [119]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model...
1139 users processed

Global metrics:
{'recall@5': 0.4115315776016364, 'modelName': 'Content-Based', 'recall@10': 0.5251853745845052}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
76,3609194402293569455,23,17,192,0.119792,0.088542
17,-2626634673110551643,33,21,134,0.246269,0.156716
16,-1032019229384696495,34,18,130,0.261538,0.138462
10,-1443636648652872475,54,35,117,0.461538,0.299145
82,-2979881261169775358,15,6,88,0.170455,0.068182
161,-3596626804281480007,26,13,80,0.325,0.1625
65,1116121227607581999,15,9,73,0.205479,0.123288
81,692689608292948411,20,9,69,0.289855,0.130435
106,-9016528795238256703,13,6,69,0.188406,0.086957
52,3636910968448833585,15,9,68,0.220588,0.132353


**Conclusion** - By implementing a content-based filtering system, the recommendations for the top 5 and top 10 articles are much more improved such that they show 41% and 52% coverage/recall, respectively.

### Collaborative Filtering
These methods can either be memory-based(item interactions that are common between users) or model-based (clustering or SVD). Here we are going to use a model-based approach in anumber of steps:
1. 

In [120]:
#Creating a sparse pivot table with users in rows and items in columns
users_items_pivot_matrix_df = train.pivot(columns='contentId', 
                                          values='eventStrength').fillna(0)

users_items_pivot_matrix_df.head(10)

contentId,-9222795471790223670,-9216926795620865886,-9194572880052200111,-9192549002213406534,-9190737901804729417,-9189659052158407108,-9176143510534135851,-9172673334835262304,-9171475473795142532,-9166778629773133902,...,9191014301634017491,9207286802575546269,9208127165664287660,9209629151177723638,9209886322932807692,9213260650272029784,9215261273565326920,9217155070834564627,9220445660318725468,9222265156747237864
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-9223121837663643404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9212075797126931087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9207251133131336884,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9199575329909162940,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9196668942822132778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9188188261933657343,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9172914609055320039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9156344805277471150,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9120685872592674274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9109785559521267180,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [121]:
users_items_pivot_matrix = users_items_pivot_matrix_df.as_matrix()
users_items_pivot_matrix[:10]

  """Entry point for launching an IPython kernel.


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 2., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [122]:
# Getting all the user ids in one array
users_ids = list(users_items_pivot_matrix_df.index)
users_ids[:10]

[-9223121837663643404,
 -9212075797126931087,
 -9207251133131336884,
 -9199575329909162940,
 -9196668942822132778,
 -9188188261933657343,
 -9172914609055320039,
 -9156344805277471150,
 -9120685872592674274,
 -9109785559521267180]

In [123]:
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = scipy.sparse.linalg.svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)
sigma = np.diag(sigma)

In [124]:
# Reconstruct the original matrix by multiplying factors
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_user_predicted_ratings

array([[ 8.04135163e-03,  7.44256578e-04, -1.67901617e-02, ...,
         3.40275507e-03,  1.34887873e-02,  2.31844013e-03],
       [-2.33862909e-04, -3.47837746e-04, -2.80582928e-03, ...,
         2.33132436e-03, -1.37418969e-04, -1.80409702e-03],
       [-1.18793907e-02,  6.63819793e-03, -6.65330770e-03, ...,
         7.84203483e-03, -1.01827597e-02,  1.11758831e-02],
       ...,
       [-2.79335799e-02,  8.15628913e-03, -2.05967208e-02, ...,
        -1.02084874e-02,  1.38245272e-03,  9.02732771e-03],
       [-2.05942521e-02,  4.60906707e-03,  1.37341561e-02, ...,
         4.91322768e-03,  2.14419951e-03, -7.45483522e-03],
       [-1.13252995e-02,  3.73209610e-03,  1.43613853e-01, ...,
        -1.18795237e-02,  6.24218958e-02,  1.41113074e-02]])

In [125]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = users_items_pivot_matrix_df.columns, index=users_ids).transpose()

In [126]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={user_id: 'recStrength'})

        # Recommend the highest predicted rating movies that the user hasn't seen yet.
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['contentId'].isin(items_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
cf_recommender_model = CFRecommender(cf_preds_df, articles_df)

In [127]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...
1139 users processed

Global metrics:
{'recall@5': 0.3309895167476349, 'modelName': 'Collaborative Filtering', 'recall@10': 0.4667604193300946}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
76,3609194402293569455,52,23,192,0.270833,0.119792
17,-2626634673110551643,51,32,134,0.380597,0.238806
16,-1032019229384696495,30,17,130,0.230769,0.130769
10,-1443636648652872475,50,39,117,0.42735,0.333333
82,-2979881261169775358,48,38,88,0.545455,0.431818
161,-3596626804281480007,32,21,80,0.4,0.2625
65,1116121227607581999,32,21,73,0.438356,0.287671
81,692689608292948411,26,17,69,0.376812,0.246377
106,-9016528795238256703,29,18,69,0.42029,0.26087
52,3636910968448833585,31,21,68,0.455882,0.308824


**Conclusion** - This model did not do as well as the content-based filtering approaches. Improvements to this model may include increasing the number of features in the SVD or using Hybrid models

### Hybrid models
In this hybrid model we are creating a rank for the system in two ways:
1. Multiplying the content-based and collaborative-filter score


### Multiplicative model

In [128]:
class HybridRecommender:
    
    MODEL_NAME = 'Hybrid'
    
    def __init__(self, cb_rec_model, cf_rec_model, items_df):
        self.cb_rec_model = cb_rec_model
        self.cf_rec_model = cf_rec_model
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        #Getting the top-1000 Content-based filtering recommendations
        cb_recs_df = self.cb_rec_model.recommend_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose,
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
        
        #Getting the top-1000 Collaborative filtering recommendations
        cf_recs_df = self.cf_rec_model.recommend_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose, 
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
        
        #Combining the results by contentId
        recs_df = cb_recs_df.merge(cf_recs_df,
                                   how = 'inner', 
                                   left_on = 'contentId', 
                                   right_on = 'contentId')
        
        #Computing a hybrid recommendation score based on CF and CB scores
        recs_df['recStrengthHybrid'] = recs_df['recStrengthCB'] * recs_df['recStrengthCF']
        
        #Sorting recommendations by hybrid score
        recommendations_df = recs_df.sort_values('recStrengthHybrid', ascending=False).head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrengthHybrid', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
hybrid_recommender_model = HybridRecommender(content_based_recommender_model, cf_recommender_model, articles_df)

In [129]:
print('Evaluating Hybrid model...')
hybrid_global_metrics, hybrid_detailed_results_df = model_evaluator.evaluate_model(hybrid_recommender_model)
print('\nGlobal metrics:\n%s' % hybrid_global_metrics)
hybrid_detailed_results_df.head(10)

Evaluating Hybrid model...
1139 users processed

Global metrics:
{'recall@5': 0.4285349015597034, 'modelName': 'Hybrid', 'recall@10': 0.540654564050115}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
76,3609194402293569455,39,30,192,0.203125,0.15625
17,-2626634673110551643,52,34,134,0.38806,0.253731
16,-1032019229384696495,35,27,130,0.269231,0.207692
10,-1443636648652872475,55,39,117,0.470085,0.333333
82,-2979881261169775358,34,26,88,0.386364,0.295455
161,-3596626804281480007,28,19,80,0.35,0.2375
65,1116121227607581999,23,16,73,0.315068,0.219178
81,692689608292948411,21,15,69,0.304348,0.217391
106,-9016528795238256703,20,11,69,0.289855,0.15942
52,3636910968448833585,23,17,68,0.338235,0.25


### Additive model

In [130]:
class HybridRecommender:
    
    MODEL_NAME = 'Hybrid'
    
    def __init__(self, cb_rec_model, cf_rec_model, items_df):
        self.cb_rec_model = cb_rec_model
        self.cf_rec_model = cf_rec_model
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        #Getting the top-1000 Content-based filtering recommendations
        cb_recs_df = self.cb_rec_model.recommend_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose,
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
        
        #Getting the top-1000 Collaborative filtering recommendations
        cf_recs_df = self.cf_rec_model.recommend_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose, 
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
        
        #Combining the results by contentId
        recs_df = cb_recs_df.merge(cf_recs_df,
                                   how = 'inner', 
                                   left_on = 'contentId', 
                                   right_on = 'contentId')
        
        #Computing a hybrid recommendation score based on CF and CB scores
        recs_df['recStrengthHybrid'] = recs_df['recStrengthCB'] + recs_df['recStrengthCF']
        
        #Sorting recommendations by hybrid score
        recommendations_df = recs_df.sort_values('recStrengthHybrid', ascending=False).head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrengthHybrid', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
hybrid_recommender_model = HybridRecommender(content_based_recommender_model, cf_recommender_model, articles_df)

In [131]:
print('Evaluating Hybrid model...')
hybrid_global_metrics, hybrid_detailed_results_df = model_evaluator.evaluate_model(hybrid_recommender_model)
print('\nGlobal metrics:\n%s' % hybrid_global_metrics)
hybrid_detailed_results_df.head(10)

Evaluating Hybrid model...
1139 users processed

Global metrics:
{'recall@5': 0.45960112503196116, 'modelName': 'Hybrid', 'recall@10': 0.5475581692661723}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
76,3609194402293569455,41,33,192,0.213542,0.171875
17,-2626634673110551643,53,35,134,0.395522,0.261194
16,-1032019229384696495,32,22,130,0.246154,0.169231
10,-1443636648652872475,59,42,117,0.504274,0.358974
82,-2979881261169775358,34,25,88,0.386364,0.284091
161,-3596626804281480007,27,19,80,0.3375,0.2375
65,1116121227607581999,23,16,73,0.315068,0.219178
81,692689608292948411,23,16,69,0.333333,0.231884
106,-9016528795238256703,20,11,69,0.289855,0.15942
52,3636910968448833585,23,17,68,0.338235,0.25


**Conclusion** - When using a Hybrid model, adding both scores performed better than multiplying the scores. And overall, the Hybrid model performed better than just using a simple content-based or collaborative filtering method.