# News Recommender System

This a Google Colab for our project for the AI Course at UCU, 2021.

**Authors**: Dmytro Lopushanskyy, Volodymyr Savchuk.

The report for this project will be attached separately on CMS.

Here is a list of materials that helped us create this project:

* [MIND Data set](https://msnews.github.io/)
* [Build Recommendation Engine](https://realpython.com/build-recommendation-engine-collaborative-filtering/)
* [Recommender Systems in Python](https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101#Recommender-Systems-in-Python-101)
* [MIND Recommendation Notebook](https://www.kaggle.com/accountstatus/mind-microsoft-news-recommendation-v2/notebook#Text-Preprocessing)
* [Evaluating Recommender Systems](http://fastml.com/evaluating-recommender-systems/)

## Imports

In [111]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [112]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/vozak16/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vozak16/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/vozak16/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Loading Data

In [113]:
filtered_behaviors = pd.read_csv('files/filtered_behaviours.csv', sep='\t')

filtered_articles = pd.read_csv('files/filtered_articles.csv', sep='\t')

behaviours_train_indexed_df = pd.read_csv('files/train_filtered_behaviours.csv', sep='\t')
behaviours_test_indexed_df = pd.read_csv('files/test_filtered_behaviours.csv', sep='\t')

In [114]:
filtered_articles.head()

Unnamed: 0.1,Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract
0,0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the..."
1,2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...
2,3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi..."
3,4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re..."
4,5,N2073,sports,football_nfl,Should NFL be able to fine players for critici...,Several fines came down against NFL players fo...


In [115]:
filtered_behaviors.set_index('UserID')
filtered_behaviors['All_History'] = filtered_behaviors.groupby(['UserID']).History.transform(lambda x: ' '.join(x)).transform(lambda x: list(set(x.split())))

In [116]:
all_history = filtered_behaviors.drop_duplicates(subset=['UserID'])
all_history = all_history.filter(['UserID', 'All_History'])
all_history = all_history.set_index('UserID')
all_history

Unnamed: 0_level_0,All_History
UserID,Unnamed: 1_level_1
U80234,"[N28088, N46039, N47686, N264, N6616, N63573, ..."
U60458,"[N33742, N6778, N51180, N58715, N34775, N50020..."
U44190,"[N15634, N16233, N1150, N53033, N3259, N51706,..."
U87380,"[N2597, N28926, N44402, N49153, N7649, N23232,..."
U69606,"[N34140, N879, N54088, N53033, N21503, N4607, ..."
...,...
U11,"[N5905, N18870, N49023, N33271, N31820, N4647]"
U77536,"[N25258, N25633, N37120, N37159, N7884, N58434..."
U56193,"[N28088, N4705, N46492, N26026, N31099, N58782..."
U16799,"[N15295, N52294, N46845, N40826, N64536, N1567..."


In [117]:
expanded_behaviors = all_history.explode('All_History').reset_index() 
expanded_behaviors.rename(columns={'All_History': 'NewsID'}, inplace=True)

In [118]:
behaviours_train_df, behaviours_test_df = train_test_split(expanded_behaviors,
                                   stratify=expanded_behaviors['UserID'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(behaviours_train_df))
print('# interactions on Test set: %d' % len(behaviours_test_df))

# interactions on Train set: 983294
# interactions on Test set: 245824


In [119]:
# Indexing by UserID to speed up the searches during evaluation
behaviours_full_indexed_df = expanded_behaviors.set_index('UserID')
behaviours_train_indexed_df = behaviours_train_df.set_index('UserID')
behaviours_test_indexed_df = behaviours_test_df.set_index('UserID')

In [147]:
history_train_indexed_df

Unnamed: 0_level_0,All_History
UserID,Unnamed: 1_level_1
U1,"[N25682, N40207, N23571, N62058, N57737, N1064..."
U10,"[N2945, N9803, N57967, N64777, N36699, N9120]"
U10000,"[N19434, N56753, N47348, N22719, N3345, N8572,..."
U10002,"[N15521, N64955, N7171, N55743, N60412, N39235..."
U10004,"[N55805, N52665, N38118, N33859, N43482, N1887..."
...,...
U9980,"[N30765, N31225, N46731, N47008, N871, N56460,..."
U9982,"[N44482, N16304, N60050, N47765, N56742, N2668..."
U9986,"[N37706, N28001, N56967, N49362, N11855, N6486..."
U9998,"[N60340, N63906, N24593, N5102, N22519, N11512..."


In [120]:
# group by userID back to aggregated values
history_train_indexed_df = behaviours_train_indexed_df.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
history_train_indexed_df.rename(columns={'NewsID': 'All_History'}, inplace=True)

history_test_indexed_df = behaviours_test_indexed_df.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
history_test_indexed_df.rename(columns={'NewsID': 'All_History'}, inplace=True)

In [121]:
history_train_indexed_df.index.values

array(['U1', 'U10', 'U10000', ..., 'U9986', 'U9998', 'U9999'],
      dtype=object)

In [122]:
# implement filtering
history_test_indexed_df = history_test_indexed_df[history_test_indexed_df.index.isin(history_train_indexed_df.index.values.tolist())]
behaviours_test_indexed_df = behaviours_test_indexed_df[behaviours_test_indexed_df.index.isin(history_train_indexed_df.index.values.tolist())]

In [152]:
LIMIT = 200
limited_users = history_train_indexed_df.index[:LIMIT]

ratings_df = pd.DataFrame(data=0, columns=filtered_articles.NewsID, index=limited_users.unique())

for i in range(LIMIT):
    user_history = history_train_indexed_df.iloc[i].tolist()[0]
    for news_id in user_history:
        ratings_df.iloc[i][news_id] = 1

In [153]:
ratings_df.shape

(200, 39726)

In [154]:
ratings_df.head()

NewsID,N55528,N61837,N53526,N38324,N2073,N11429,N49186,N2131,N59295,N24510,...,N16016,N25854,N7618,N16804,N19926,N42491,N13097,N63550,N30345,N30135
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U10004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [155]:
users_ids = list(ratings_df.index)
users_ids[:10]

['U1',
 'U10',
 'U10000',
 'U10002',
 'U10004',
 'U10006',
 'U10008',
 'U10009',
 'U10012',
 'U10013']

In [156]:
users_items_pivot_sparse_matrix = csr_matrix(ratings_df)
users_items_pivot_sparse_matrix

<200x39726 sparse matrix of type '<class 'numpy.int64'>'
	with 4810 stored elements in Compressed Sparse Row format>

In [157]:
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 2
#Performs matrix factorization of the original user item matrix
#U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)
users_items_pivot_sparse_matrix = users_items_pivot_sparse_matrix.asfptype()
U, sigma, Vt = svds(users_items_pivot_sparse_matrix, k = NUMBER_OF_FACTORS_MF)

In [158]:
U.shape

(200, 2)

In [159]:
Vt.shape

(2, 39726)

In [160]:
sigma = np.diag(sigma)
sigma.shape

(2, 2)

In [161]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_user_predicted_ratings

array([[0.        , 0.        , 0.00069405, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.00065986, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.00224556, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.00138636, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.0009795 , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.0027997 , ..., 0.        , 0.        ,
        0.        ]])

In [162]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = ratings_df.columns, index=users_ids).transpose()
cf_preds_df.head(10)

Unnamed: 0_level_0,U1,U10,U10000,U10002,U10004,U10006,U10008,U10009,U10012,U10013,...,U10462,U10463,U10464,U10465,U10468,U10469,U10470,U10471,U10474,U10479
NewsID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
N55528,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N61837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N53526,0.000694,0.00066,0.002246,0.012799,0.000168,0.000983,0.000858,0.004345,0.00091,0.001446,...,0.001725,0.001727,0.002349,0.000626,0.000489,0.001309,0.000744,0.001386,0.00098,0.0028
N38324,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N2073,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N11429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N49186,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N2131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N59295,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
N24510,5e-06,4e-06,1.4e-05,8.4e-05,1e-06,7e-06,5e-06,2.6e-05,5e-06,1e-05,...,1.1e-05,1.1e-05,1.5e-05,4e-06,3e-06,8e-06,5e-06,9e-06,7e-06,1.7e-05


In [163]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={user_id: 'Click'})

        # Recommend the highest predicted rating movies that the user hasn't seen yet.
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['NewsID'].isin(items_to_ignore)] \
                               .sort_values('Click', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'NewsID', 
                                                          right_on = 'NewsID')[['Click', 'NewsID', 'Title']]


        return recommendations_df
    
cf_recommender_model = CFRecommender(cf_preds_df, filtered_articles)

In [164]:
cf_recommender_model.recommend_items('U10006')

Unnamed: 0,NewsID,Click
0,N4607,0.043668
1,N871,0.040251
2,N55846,0.034324
3,N5978,0.034198
4,N61388,0.033539
5,N55743,0.030476
6,N4593,0.029243
7,N10897,0.028706
8,N32004,0.027305
9,N32852,0.026722


In [178]:
behaviours_test_indexed_df

Unnamed: 0_level_0,NewsID
UserID,Unnamed: 1_level_1
U81837,N46597
U10057,N21506
U15329,N45535
U85850,N306
U82226,N48850
...,...
U20689,N64435
U28431,N30114
U88752,N38585
U67693,N33201


In [181]:
# Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluatorCF:
    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = self.get_items_interacted(person_id, behaviours_full_indexed_df)
        all_items = set(filtered_articles['NewsID'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        return set(non_interacted_items_sample)
    
    def get_items_interacted(self, person_id, interactions_df):
        # Get the user's data and merge in the news information.
        interacted_items = interactions_df.loc[person_id]['NewsID']
        return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])
 
    def _verify_hit_top_n(self, item_id, recommended_items, topn): 
        try:
            item_idx = recommended_items.index(item_id)
        except:
            item_idx = -1
        hit = int(item_idx in range(0, topn))
        return hit, item_idx

    def evaluate_model_for_user(self, person_id):
        # Getting the items in test set
        
        interacted_values_testset = behaviours_test_indexed_df.loc[person_id]
        
        if type(interacted_values_testset['NewsID']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['NewsID'])
        else:
            person_interacted_items_testset = set([interacted_values_testset['NewsID']])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        # Getting a ranked recommendation list from a model for a given user
        person_recs = cf_recommender_model.recommend_items(
            person_id, 
            items_to_ignore=self.get_items_interacted(person_id, behaviours_train_indexed_df),topn=100)
        
        hits_at_5_count = 0
        hits_at_10_count = 0
        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            # Getting a random sample (100) items the user has not interacted 
            # (to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=random.randint(0, 2**32))

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))           
            # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs = list(filter(lambda x : x in items_to_filter_recs, person_recs))
            # Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        # Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        # when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count': hits_at_5_count, 
                          'hits@10_count': hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self):
        print('Running evaluation for users')
        people_metrics = []
        filtered_users = list(filter(lambda user_id : user_id in limited_users, list(behaviours_test_indexed_df.index.unique().values[:])))
        for idx, person_id in enumerate(filtered_users):
            if idx % 10 == 0 and idx > 0:
                print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % len(filtered_users))

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': 'User-Based CF',
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluatorCF() 

In [None]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model()

Evaluating Content-Based Filtering model...
Running evaluation for users
10 users processed


In [177]:
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.sort_values('recall@10', ascending=False).head(20)


Global metrics:
{'modelName': 'User-Based CF', 'recall@5': 0.0, 'recall@10': 0.0}


Unnamed: 0,hits@5_count,hits@10_count,interacted_count,recall@5,recall@10,_person_id
48,0,0,33,0.0,0.0,U10282
170,0,0,2,0.0,0.0,U10073
1,0,0,3,0.0,0.0,U1017
15,0,0,3,0.0,0.0,U10312
111,0,0,3,0.0,0.0,U10359
87,0,0,3,0.0,0.0,U10430
189,0,0,3,0.0,0.0,U10383
29,0,0,3,0.0,0.0,U10290
171,0,0,2,0.0,0.0,U101
182,0,0,2,0.0,0.0,U10028
