# News Recommender System. Collaborative Filtering (User-Based)

This the next part of the project for the AI Course at UCU, 2021.    

In this section, we will implement the collaborative filtering recommender based on user similarity. Additionally, the model's performance will be evaluated and leter on compared to other recommendation approaches.

**Authors**: Dmytro Lopushanskyy, Volodymyr Savchuk.

## Imports

In [1]:
import pandas as pd
import random
import time
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

## Load The Data

We are using MIND data set for our recommendation system. It contains two main files: behaviors and English news articles data.

In [2]:
filtered_behaviors = pd.read_csv('files/filtered_behaviours.csv', sep='\t')
filtered_articles = pd.read_csv('files/filtered_articles.csv', sep='\t')

del filtered_behaviors['Unnamed: 0']
del filtered_articles['Unnamed: 0']

train_filtered_behaviours = pd.read_csv('files/train_filtered_behaviours.csv', sep='\t').set_index('UserID')
test_filtered_behaviours = pd.read_csv('files/test_filtered_behaviours.csv', sep='\t').set_index('UserID')
full_filtered_behaviours = train_filtered_behaviours.append(test_filtered_behaviours)

In [3]:
# group by userID back to aggregated values
train_filtered_behaviours = train_filtered_behaviours.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
# train_filtered_behaviours.rename(columns={'NewsID': 'All_History'}, inplace=True)

test_filtered_behaviours = test_filtered_behaviours.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
# test_filtered_behaviours.rename(columns={'NewsID': 'All_History'}, inplace=True)

In [4]:
# implement filtering
train_filtered_behaviours = train_filtered_behaviours[train_filtered_behaviours.index.isin(test_filtered_behaviours.index.values.tolist())]
test_filtered_behaviours = test_filtered_behaviours[test_filtered_behaviours.index.isin(train_filtered_behaviours.index.values.tolist())]

## Collaborative Filtering

We need to take all of the news articles available to us and the train behaviours dataset.

Since CF is taking quite a lot of memory, we will start by using 100 users and all articles.

In [5]:
LIMIT = 100
limited_users = train_filtered_behaviours.index[:LIMIT]

ratings_df = pd.DataFrame(data=0, columns=filtered_articles.NewsID, index=limited_users)

for i in range(LIMIT):
    user_history = train_filtered_behaviours.iloc[i].tolist()[0]
    for news_id in user_history:
        ratings_df.iloc[i][news_id] = 1

In [6]:
ratings_df

NewsID,N55528,N61837,N53526,N38324,N2073,N11429,N49186,N2131,N59295,N24510,...,N16016,N25854,N7618,N16804,N19926,N42491,N13097,N63550,N30345,N30135
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U10227,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10228,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U1023,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10233,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(ratings_df.values)
distances, indices = knn.kneighbors(ratings_df.values, n_neighbors=10)

In [8]:
# get the index for user
index_for_user = ratings_df.index.tolist().index('U1')

# find the indices for the similar users
sim_users = indices[index_for_user].tolist()

# distances between user and the similar users
user_distances = distances[index_for_user].tolist()

# the position of user in the list sim_users
id_user = sim_users.index(index_for_user)

# remove user from the list sim_users
sim_users.remove(index_for_user)

# remove user from the list user_distances
user_distances.pop(id_user)

print('The Nearest Users to U1:', sim_users)
print('The Distance from U1:', user_distances)

The Nearest Users to U1: [5, 78, 54, 42, 97, 73, 91, 70, 2]
The Distance from U1: [0.8585786437626906, 0.8585786437626906, 0.8881966011250105, 0.894590744661054, 0.9154845745271484, 0.9183503419072274, 0.9309934440657646, 0.9354502775632098, 0.9379826327053957]


In [9]:
def augment_rating_df(ratings):
    augmented_ratings = ratings.copy()

    # find the nearest neighbors using NearestNeighbors(n_neighbors=10)
    number_neighbors = 10
    knn = NearestNeighbors(metric='cosine', algorithm='brute')
    knn.fit(ratings.values)
    distances, indices = knn.kneighbors(ratings.values, n_neighbors=number_neighbors)
    
    start = time.time()

    for news_idx, news_id in enumerate(filtered_articles.NewsID):
        if news_idx % 1000 == 0:
            print(f'Number of articles processed: {news_idx} / {len(filtered_articles.NewsID)}. Minutes passed: {int((time.time() - start) / 60)}')
        news_index = filtered_articles.NewsID.tolist().index(news_id)

        for user_loc, user_id in list(enumerate(ratings.index)):
            # find news without ratings by user
            if ratings.iloc[user_loc, news_index] == 0:
                sim_users = indices[user_loc].tolist()
                users_distances = distances[user_loc].tolist()
                
                # Generally, this is the case. The user itself is in the first place.
                if user_id in sim_users:
                    user_idx = sim_users.index(user_id)
                    sim_users.remove(user_id)
                    users_distances.pop(user_idx) 

                # However, sometimes even the movie itself cannot be included in the indices.
                # In that case, we take off the farthest movie in the list.
                else:
                    sim_users = sim_users[:number_neighbors - 1]
                    users_distances = users_distances[:number_neighbors - 1]
                    
                # user_similarity = 1 - users_distances
                user_similarity = [1 - x for x in users_distances]
                user_similarity_copy = user_similarity.copy()
                nominator = 0

                # for each similar user
                for i in range(len(sim_users)):
                    # check if the rating of a similar news is zero
                    if ratings.iloc[sim_users[i], news_index] == 0:
                        # if the rating is zero, ignore the rating and the similarity in calculating the predicted rating
                        if len(user_similarity_copy) == (number_neighbors - 1):
                            user_similarity_copy.pop(i)
                        else:
                            user_similarity_copy.pop(i - (len(user_similarity) - len(user_similarity_copy)))

                    # if the rating is not zero, use the rating and similarity in the calculation
                    else:
                        nominator += user_similarity[i] * ratings.iloc[sim_users[i], news_index]

                # check if the number of the ratings with non-zero is positive
                if len(user_similarity_copy) > 0:
                    # check if the sum of the ratings of the similar movies is positive.
                    if sum(user_similarity_copy) > 0:
                        predicted_r = nominator / sum(user_similarity_copy)

                    # Even if there are some news for which the ratings are positive, some movies have zero similarity even though they are selected as similar movies.
                    # in this case, the predicted rating becomes zero as well  
                    else:
                        predicted_r = 0

                # if all the ratings of the similar news are zero, then predicted rating should be zero
                else:
                    predicted_r = 0

                # place the predicted rating into the augmented original dataset
                augmented_ratings.iloc[user_loc, news_index] = predicted_r
    end = time.time()
    print(f"Processing finished. Total time: {int((end - start) / 60)}")
    return augmented_ratings
            

In [10]:
augmented_ratings_user_based = augment_rating_df(ratings_df)
augmented_ratings_user_based

Number of articles processed: 0 / 39726. Minutes passed: 0
Number of articles processed: 1000 / 39726. Minutes passed: 0
Number of articles processed: 2000 / 39726. Minutes passed: 1
Number of articles processed: 3000 / 39726. Minutes passed: 2
Number of articles processed: 4000 / 39726. Minutes passed: 2
Number of articles processed: 5000 / 39726. Minutes passed: 3
Number of articles processed: 6000 / 39726. Minutes passed: 4
Number of articles processed: 7000 / 39726. Minutes passed: 4
Number of articles processed: 8000 / 39726. Minutes passed: 5
Number of articles processed: 9000 / 39726. Minutes passed: 6
Number of articles processed: 10000 / 39726. Minutes passed: 7
Number of articles processed: 11000 / 39726. Minutes passed: 7
Number of articles processed: 12000 / 39726. Minutes passed: 8
Number of articles processed: 13000 / 39726. Minutes passed: 9
Number of articles processed: 14000 / 39726. Minutes passed: 10
Number of articles processed: 15000 / 39726. Minutes passed: 10
Num

NewsID,N55528,N61837,N53526,N38324,N2073,N11429,N49186,N2131,N59295,N24510,...,N16016,N25854,N7618,N16804,N19926,N42491,N13097,N63550,N30345,N30135
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U1,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10000,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10002,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10004,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U10227,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10228,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U1023,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10233,0,0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
def recommend_news(user, items_to_ignore, num_recommended_news, ignore_interacted=True, verbose=False):
    if verbose:
        print('The list of the News {} Has Clicked on \n'.format(user))

        for news_id in ratings_df.loc[user, :][ratings_df.loc[user, :] > 0].index.tolist():
            print(news_id)

        print('\n')

    recommended_news = []

    for news_id in ratings_df.loc[user, :][ratings_df.loc[user, :] == 0].index.tolist():
        predicted_rating = augmented_ratings_user_based.loc[user, news_id]
        recommended_news.append((news_id, predicted_rating))

    sorted_rm = sorted(recommended_news, key=lambda x: x[1], reverse=True)
    if not ignore_interacted:
        # filter from items to ignore
        sorted_rm = list(filter(lambda x: x[0] not in items_to_ignore, sorted_rm)) 
        
    # filter from non-clickable news
    # sorted_rm = list(filter(lambda x: x[1] != 0, sorted_rm))  
    
    if verbose:
        print('The list of the Recommended News \n')
        rank = 1
        for recommended_news in sorted_rm[:num_recommended_news]:
            print('{}: {} - predicted rating: {}'.format(rank, recommended_news[0], recommended_news[1]))
            rank = rank + 1
        
    return [news[0] for news in sorted_rm[:num_recommended_news]]

In [12]:
recommend_news('U1', [], 5, verbose=True)

The list of the News U1 Has Clicked on 

N596
N52301
N13374
N24356
N32607
N57737
N40207
N62058
N10646
N25682


The list of the Recommended News 

1: N732 - predicted rating: 1.0
2: N27608 - predicted rating: 1.0
3: N32755 - predicted rating: 1.0
4: N43265 - predicted rating: 1.0
5: N8179 - predicted rating: 1.0


['N732', 'N27608', 'N32755', 'N43265', 'N8179']

In [21]:
test_filtered_behaviours

Unnamed: 0_level_0,NewsID
UserID,Unnamed: 1_level_1
U1,"[N58267, N23571]"
U10,[N9120]
U10000,"[N10059, N63324, N47173, N50049, N63709, N7422..."
U10002,"[N10865, N24356, N42136, N4607, N38701, N32004..."
U10004,"[N43482, N55805]"
...,...
U9980,"[N6623, N31598, N47008, N6785, N7208, N28139, ..."
U9982,"[N62396, N45544, N33358]"
U9986,"[N20476, N56967, N55468, N23732, N35083]"
U9998,"[N47289, N19152, N49146, N16233, N47993, N54536]"


In [24]:
train_filtered_behaviours

Unnamed: 0_level_0,NewsID
UserID,Unnamed: 1_level_1
U1,"[N10646, N52301, N596, N57737, N24356, N40207,..."
U10,"[N9803, N64777, N2945, N36699, N57967, N27612]"
U10000,"[N11037, N19434, N18094, N3345, N2479, N42620,..."
U10002,"[N12215, N54225, N52307, N54300, N64777, N4574..."
U10004,"[N38118, N27251, N15402, N15627, N33859, N5266..."
...,...
U9980,"[N51614, N44007, N47426, N58988, N22816, N871,..."
U9982,"[N47765, N56742, N60050, N46513, N16715, N4603..."
U9986,"[N62535, N64863, N12892, N49362, N38014, N3770..."
U9998,"[N30698, N20483, N24593, N60209, N57083, N3264..."


In [25]:
# Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluatorCF:
    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = self.get_items_interacted(person_id, full_filtered_behaviours)
        all_items = set(filtered_articles['NewsID'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        return set(non_interacted_items_sample)
    
    def get_items_interacted(self, person_id, interactions_df):
        # Get the user's data and merge in the news information.
        interacted_items = interactions_df.loc[person_id]['NewsID']
        return set(interacted_items if type(interacted_items) == pd.Series else interacted_items)

    def _verify_hit_top_n(self, item_id, recommended_items, topn): 
        try:
            item_idx = recommended_items.index(item_id)
        except:
            item_idx = -1
        hit = int(item_idx in range(0, topn))
        return hit, item_idx

    def evaluate_model_for_user(self, person_id):
        # Getting the items in test set
        interacted_values_testset = test_filtered_behaviours.loc[person_id]
        if type(interacted_values_testset['NewsID']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['NewsID'])
        else:
            person_interacted_items_testset = set(interacted_values_testset['NewsID'])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        # Getting a ranked recommendation list from a model for a given user
        person_recs = recommend_news(
            person_id, 
            items_to_ignore=self.get_items_interacted(person_id, train_filtered_behaviours), 
            num_recommended_news=100000, ignore_interacted=False)
        
        hits_at_5_count = 0
        hits_at_10_count = 0
        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            # Getting a random sample (100) items the user has not interacted 
            # (to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=random.randint(0, 2**32))

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs = list(filter(lambda x : x in items_to_filter_recs, person_recs))
            # Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        # Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        # when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count': hits_at_5_count, 
                          'hits@10_count': hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self):
        print('Running evaluation for users')
        people_metrics = []
        filtered_users = list(filter(lambda user_id : user_id in limited_users, list(test_filtered_behaviours.index.unique().values[:])))
        for idx, person_id in enumerate(filtered_users):
            if idx % 10 == 0 and idx > 0:
                print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % len(filtered_users))

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': 'User-Based CF',
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluatorCF()    


In [26]:
print('Evaluating Collaborative User-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model()

Evaluating Collaborative User-Based Filtering model...
Running evaluation for users
10 users processed
20 users processed
30 users processed
40 users processed
50 users processed
60 users processed
70 users processed
80 users processed
90 users processed
100 users processed


In [27]:
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.sort_values('recall@10', ascending=False).head(20)


Global metrics:
{'modelName': 'User-Based CF', 'recall@5': 0.17591125198098256, 'recall@10': 0.21711568938193343}


Unnamed: 0,hits@5_count,hits@10_count,interacted_count,recall@5,recall@10,_person_id
14,2,2,2,1.0,1.0,U10029
23,0,2,2,0.0,1.0,U10050
1,1,1,1,1.0,1.0,U10
46,1,1,1,1.0,1.0,U1010
94,1,1,1,1.0,1.0,U10225
34,2,2,2,1.0,1.0,U10075
32,1,2,2,0.5,1.0,U10073
80,2,3,4,0.5,0.75,U10192
37,0,2,3,0.0,0.666667,U10084
67,3,3,5,0.6,0.6,U10158


### Save to files

In [28]:
ratings_df.to_csv('files/user-based-cf-pred.csv', sep='\t')