# News Recommender System

This a Google Colab for our project for the AI Course at UCU, 2021.

**Authors**: Dmytro Lopushanskyy, Volodymyr Savchuk.

The report for this project will be attached separately on CMS.

Here is a list of materials that helped us create this project:

* [MIND Data set](https://msnews.github.io/)
* [Build Recommendation Engine](https://realpython.com/build-recommendation-engine-collaborative-filtering/)
* [Recommender Systems in Python](https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101#Recommender-Systems-in-Python-101)
* [MIND Recommendation Notebook](https://www.kaggle.com/accountstatus/mind-microsoft-news-recommendation-v2/notebook#Text-Preprocessing)
* [Evaluating Recommender Systems](http://fastml.com/evaluating-recommender-systems/)

## Imports

In [26]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl

from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
import seaborn as sns
from tqdm.notebook import tqdm

import warnings
warnings.filterwarnings('ignore')

In [13]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/vozak16/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vozak16/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/vozak16/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Loading Data

In [14]:
filtered_behaviors = pd.read_csv('files/filtered_behaviours.csv', sep='\t')

filtered_articles = pd.read_csv('files/filtered_articles.csv', sep='\t')

behaviours_train_indexed_df = pd.read_csv('files/train_filtered_behaviours.csv', sep='\t')
behaviours_test_indexed_df = pd.read_csv('files/test_filtered_behaviours.csv', sep='\t')

In [15]:
filtered_articles.head()

Unnamed: 0.1,Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract
0,0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the..."
1,2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...
2,3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi..."
3,4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re..."
4,5,N2073,sports,football_nfl,Should NFL be able to fine players for critici...,Several fines came down against NFL players fo...


In [5]:
filtered_behaviors.set_index('UserID')
filtered_behaviors['All_History'] = filtered_behaviors.groupby(['UserID']).History.transform(lambda x: ' '.join(x)).transform(lambda x: list(set(x.split())))

In [6]:
all_history = filtered_behaviors.drop_duplicates(subset=['UserID'])
all_history = all_history.filter(['UserID', 'All_History'])
all_history = all_history.set_index('UserID')
all_history

Unnamed: 0_level_0,All_History
UserID,Unnamed: 1_level_1
U80234,"[N53234, N35671, N43955, N28088, N6616, N38895..."
U60458,"[N61186, N54827, N33438, N32109, N33742, N5002..."
U44190,"[N3259, N1150, N61704, N16233, N55189, N53033,..."
U87380,"[N28926, N53531, N29361, N7649, N49153, N44402..."
U69606,"[N54088, N879, N19591, N63054, N14952, N21503,..."
...,...
U11,"[N49023, N4647, N5905, N31820, N33271, N18870]"
U77536,"[N9056, N63370, N58434, N55556, N28691, N20078..."
U56193,"[N58782, N53531, N31099, N28257, N4705, N46492..."
U16799,"[N52294, N64536, N46845, N42078, N40826, N1529..."


In [7]:
expanded_behaviors = all_history.explode('All_History').reset_index() 
expanded_behaviors.rename(columns={'All_History': 'NewsID'}, inplace=True)

In [8]:
behaviours_train_df, behaviours_test_df = train_test_split(expanded_behaviors,
                                   stratify=expanded_behaviors['UserID'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(behaviours_train_df))
print('# interactions on Test set: %d' % len(behaviours_test_df))

# interactions on Train set: 983294
# interactions on Test set: 245824


In [9]:
# Indexing by UserID to speed up the searches during evaluation
behaviours_full_indexed_df = expanded_behaviors.set_index('UserID')
behaviours_train_indexed_df = behaviours_train_df.set_index('UserID')
behaviours_test_indexed_df = behaviours_test_df.set_index('UserID')

In [10]:
history_train_indexed_df

NameError: name 'history_train_indexed_df' is not defined

In [None]:
# group by userID back to aggregated values
history_train_indexed_df = behaviours_train_indexed_df.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
history_train_indexed_df.rename(columns={'NewsID': 'All_History'}, inplace=True)

history_test_indexed_df = behaviours_test_indexed_df.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
history_test_indexed_df.rename(columns={'NewsID': 'All_History'}, inplace=True)

In [None]:
history_train_indexed_df.index.values

In [None]:
# implement filtering
history_test_indexed_df = history_test_indexed_df[history_test_indexed_df.index.isin(history_train_indexed_df.index.values.tolist())]
behaviours_test_indexed_df = behaviours_test_indexed_df[behaviours_test_indexed_df.index.isin(history_train_indexed_df.index.values.tolist())]

In [11]:
behaviours_train_indexed_df

Unnamed: 0_level_0,NewsID
UserID,Unnamed: 1_level_1
U90143,N57113
U47251,N56220
U54899,N13057
U83499,N3500
U77389,N26026
...,...
U49856,N17161
U48394,N62285
U40730,N48904
U84258,N59254


In [56]:
behaviours_train_indexed_df["rating"] = 1
behaviours_train_indexed_df["UserID"] = behaviours_train_indexed_df["UserID"].apply((lambda x: int(x[1:])))
behaviours_train_indexed_df["NewsID"] = behaviours_train_indexed_df["NewsID"].apply((lambda x: int(x[1:])))
behaviours_train_indexed_df

Unnamed: 0,UserID,NewsID,rating
0,90143,54131,1
1,47251,20214,1
2,54899,20109,1
3,83499,28538,1
4,77389,54842,1
...,...,...,...
983289,49856,9248,1
983290,48394,33683,1
983291,40730,35657,1
983292,84258,13137,1


In [63]:
behaviours_train_indexed_df.dtypes

UserID    int64
NewsID    int64
rating    int64
dtype: object

In [67]:
# Get a list of all movie IDs
all_articleIds = behaviours_train_indexed_df['UserID'].unique()

# Placeholders that will hold the training data
users, items, labels = [], [], []

# This is the set of items that each user has interaction with
user_item_set = set(zip(behaviours_train_indexed_df['UserID'], behaviours_train_indexed_df['NewsID']))

# 4:1 ratio of negative to positive samples
num_negatives = 4

for (u, i) in tqdm(user_item_set):
    users.append(u)
    items.append(i)
    labels.append(1) # items that the user has interacted with are positive
    for _ in range(num_negatives):
        # randomly select an item
        negative_item = np.random.choice(all_articleIds) 
        # check that the user has not interacted with this item
        while (u, negative_item) in user_item_set:
            negative_item = np.random.choice(all_articleIds)
        users.append(np.int64(u))
        items.append(negative_item)
        labels.append(0) # items not interacted with are negative

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=983294.0), HTML(value='')))




In [68]:
class MINDTrainDataset(Dataset):
    """MovieLens PyTorch Dataset for Training
    
    Args:
        ratings (pd.DataFrame): Dataframe containing the movie ratings
        all_movieIds (list): List containing all movieIds
    
    """

    def __init__(self, ratings, all_artcileIds):
        self.users, self.items, self.labels = self.get_dataset(behaviours_train_indexed_df, all_artcileIds)

    def __len__(self):
        return len(self.users)
  
    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]

    def get_dataset(self, ratings, all_artcileIds):
        users, items, labels = [], [], []
        user_item_set = set(zip(ratings['UserID'], ratings['NewsID']))

        num_negatives = 4
        for u, i in user_item_set:
            users.append(u)
            items.append(i)
            labels.append(1)
            for _ in range(num_negatives):
                negative_item = np.random.choice(all_artcileIds)
                while (u, negative_item) in user_item_set:
                    negative_item = np.random.choice(all_artcileIds)
                users.append(u)
                items.append(negative_item)
                labels.append(0)

        return torch.tensor(users), torch.tensor(items), torch.tensor(labels)

In [85]:
class NCF(pl.LightningModule):
    """ Neural Collaborative Filtering (NCF)
    
        Args:
            num_users (int): Number of unique users
            num_items (int): Number of unique items
            ratings (pd.DataFrame): Dataframe containing the movie ratings for training
            all_movieIds (list): List containing all movieIds (train + test)
    """
    
    def __init__(self, num_users, num_items, ratings, all_articleIds):
        super().__init__()
        self.user_embedding = nn.Embedding(num_embeddings=num_users, embedding_dim=16)
        self.item_embedding = nn.Embedding(num_embeddings=num_items, embedding_dim=16)
        self.fc1 = nn.Linear(in_features=16, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.output = nn.Linear(in_features=32, out_features=1)
        self.ratings = ratings
        self.all_articleIds = all_articleIds
        
    def forward(self, user_input, item_input):
        
        # Pass through embedding layers
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)

        # Concat the two embedding layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)

        # Pass through dense layer
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))

        # Output layer
        pred = nn.Sigmoid()(self.output(vector))

        return pred
    
    def training_step(self, batch, batch_idx):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(MINDTrainDataset(self.ratings, self.all_articleIds),
                          batch_size=512, num_workers=0)

In [86]:
num_users = len(set(behaviours_train_indexed_df['UserID'])) + 1
num_items = len(set(behaviours_train_indexed_df['NewsID'])) + 1

all_articleIds = np.array(list(set(filtered_articles['NewsID'].apply((lambda x: np.int64(x[1:]))))))

model = NCF(num_users, num_items, behaviours_train_indexed_df, all_articleIds)

TypeError: __init__() missing 1 required positional argument: 'embedding_dim'

In [87]:
num_users

In [88]:
num_items

34443

In [89]:
behaviours_train_indexed_df.dtypes

UserID    int64
NewsID    int64
rating    int64
dtype: object

In [90]:
trainer = pl.Trainer(max_epochs=5, reload_dataloaders_every_epoch=True,
                     progress_bar_refresh_rate=50, logger=False, checkpoint_callback=False)

trainer.fit(model)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name           | Type      | Params
---------------------------------------------
0 | user_embedding | Embedding | 635 K 
1 | item_embedding | Embedding | 551 K 
2 | fc1            | Linear    | 1.1 K 
3 | fc2            | Linear    | 2.1 K 
4 | output         | Linear    | 33    
---------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.759     Total estimated model params size (MB)


HBox(children=(HTML(value='Training'), FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), max…

IndexError: index out of range in self