# News Recommender System: Data Loading and Preprocessing

This is a Google Colab notebook for our project in the AI course at UCU, 2021.

**Authors**: Dmytro Lopushanskyy, Volodymyr Savchuk.

The report for this project will be attached separately on CMS.

Here is a list of materials that helped us create this project:

* [MIND Data set](https://msnews.github.io/)
* [Build Recommendation Engine](https://realpython.com/build-recommendation-engine-collaborative-filtering/)
* [Recommender Systems in Python](https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101#Recommender-Systems-in-Python-101)
* [MIND Recommendation Notebook](https://www.kaggle.com/accountstatus/mind-microsoft-news-recommendation-v2/notebook#Text-Preprocessing)
* [Evaluating Recommender Systems](http://fastml.com/evaluating-recommender-systems/)

## Imports

In [8]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [9]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/vozak16/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vozak16/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/vozak16/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load The Data

We use the MIND dataset for our recommender system. It contains two main files: user behaviors (`behaviors.tsv`) and English news articles (`news.tsv`).

In [10]:
behaviors_header = ['ImpressionID', 'UserID', 'Time', 'History', 'Impressions']
behaviors_df = pd.read_csv('MINDsmall_dev/behaviors.tsv', sep='\t', names=behaviors_header)
#test_behaviors_df = pd.read_csv('/content/drive/MyDrive/Data/test_behaviors.tsv', sep='\t', names=behaviors_header)

news_header = ['NewsID', 'Category', 'SubCategory', 'Title', 'Abstract', 'URL', 'TitleEntities', 'AbstractEntities']
articles_df = pd.read_csv('MINDsmall_dev/news.tsv', sep='\t', names=news_header)
#test_articles_df = pd.read_csv('/content/drive/MyDrive/Data/test_news.tsv', sep='\t', names=news_header)

# articles_df += test_articles_df

behaviors_df.head(5)

| | ImpressionID | UserID | Time | History | Impressions |
|---|---|---|---|---|---|
| 0 | 1 | U80234 | 11/15/2019 12:37:50 PM | N55189 N46039 N51741 N53234 N11276 N264 N40716... | N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5... |
| 1 | 2 | U60458 | 11/15/2019 7:11:50 AM | N58715 N32109 N51180 N33438 N54827 N28488 N611... | N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N... |
| 2 | 3 | U44190 | 11/15/2019 9:55:12 AM | N56253 N1150 N55189 N16233 N61704 N51706 N5303... | N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5... |
| 3 | 4 | U87380 | 11/15/2019 3:12:46 PM | N63554 N49153 N28678 N23232 N43369 N58518 N444... | N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248... |
| 4 | 5 | U9444 | 11/15/2019 8:25:46 AM | N51692 N18285 N26015 N22679 N55556 | N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3... |
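Each token in the `Impressions` column is a news ID followed by a click label, where `-1` marks a clicked article and `-0` a shown but not clicked one. As a minimal sketch, the column could be parsed as below; the helper name `parse_impressions` is our own and is not used elsewhere in this notebook.

```python
def parse_impressions(impressions):
    """Split a string like 'N1-0 N2-1' into [('N1', 0), ('N2', 1)]."""
    pairs = []
    for token in impressions.split():
        # rsplit guards against any hyphen inside the ID itself
        news_id, label = token.rsplit('-', 1)
        pairs.append((news_id, int(label)))
    return pairs


sample = "N28682-0 N48740-0 N31958-1"
parsed = parse_impressions(sample)

# Keep only the articles the user actually clicked
clicked = [news_id for news_id, label in parsed if label == 1]
print(clicked)
```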


In [11]:
articles_df.head()

| | NewsID | Category | SubCategory | Title | Abstract | URL | TitleEntities | AbstractEntities |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 1 | N18955 | health | medical | Dispose of unwanted prescription drugs during ... | | https://assets.msn.com/labs/mind/AAISxPN.html | [{"Label": "Drug Enforcement Administration", ... | [] |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... | https://assets.msn.com/labs/mind/AACk2N6.html | [] | [{"Label": "National Basketball Association", ... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... | https://assets.msn.com/labs/mind/AAAKEkt.html | [{"Label": "Skin tag", "Type": "C", "WikidataI... | [{"Label": "Skin tag", "Type": "C", "WikidataI... |


Filter out rows with a missing click history (NA) and rows whose click history is short (5 articles or fewer; the code keeps histories longer than 5).

In [12]:
filtered_behaviors = behaviors_df[behaviors_df.History.str.split().str.len() > 5]

f"Number of rows before: {behaviors_df.shape[0]}. After: {filtered_behaviors.shape[0]}"


'Number of rows before: 73152. After: 60170'
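The single comparison above removes both short and missing histories, because `.str.len()` yields `NaN` for a missing value and `NaN > 5` evaluates to `False`. A small sketch on made-up rows (the toy frame is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'UserID': ['U1', 'U2', 'U3'],
    'History': ['N1 N2 N3 N4 N5 N6', 'N1 N2', np.nan],
})

# Number of articles per history; a missing history stays NaN
history_len = toy['History'].str.split().str.len()

# NaN > 5 is False, so the row with a missing history is dropped too
kept = toy[history_len > 5]
print(kept['UserID'].tolist())
```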

In [13]:
# Drop articles with duplicate titles
filtered_articles = articles_df.drop_duplicates(subset=['Title'])

# Drop articles with empty Abstract
filtered_articles = filtered_articles[filtered_articles.Abstract.notna()]

# Drop articles whose titles have fewer than 4 words
filtered_articles = filtered_articles[filtered_articles['Title'].apply(lambda x: len(x.split()) >= 4)]

f"Number of rows before: {articles_df.shape[0]}. After: {filtered_articles.shape[0]}"


'Number of rows before: 42416. After: 39726'
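The cell above reports only the total number of dropped articles. To see how many rows each filter removes, the same three steps could be applied one at a time, as in this sketch on a made-up frame (the `steps` dictionary and the toy data are assumptions, not part of the project code):

```python
import pandas as pd

toy_articles = pd.DataFrame({
    'Title': [
        'A long enough title here',
        'A long enough title here',          # duplicate title
        'Too short',                         # fewer than 4 words
        'Another sufficiently long title',   # missing abstract
    ],
    'Abstract': ['text', 'other', 'text', None],
})

# The same filters as above, applied in the same order
steps = {
    'duplicate title': lambda df: df.drop_duplicates(subset=['Title']),
    'missing abstract': lambda df: df[df['Abstract'].notna()],
    'short title': lambda df: df[df['Title'].str.split().str.len() >= 4],
}

removed = {}
current = toy_articles
for name, step in steps.items():
    after = step(current)
    removed[name] = len(current) - len(after)
    current = after

print(removed)
```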

In [14]:
filtered_articles.head()

| | NewsID | Category | SubCategory | Title | Abstract | URL | TitleEntities | AbstractEntities |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... | https://assets.msn.com/labs/mind/AACk2N6.html | [] | [{"Label": "National Basketball Association", ... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... | https://assets.msn.com/labs/mind/AAAKEkt.html | [{"Label": "Skin tag", "Type": "C", "WikidataI... | [{"Label": "Skin tag", "Type": "C", "WikidataI... |
| 5 | N2073 | sports | football_nfl | Should NFL be able to fine players for critici... | Several fines came down against NFL players fo... | https://assets.msn.com/labs/mind/AAJ4lap.html | [{"Label": "National Football League", "Type":... | [{"Label": "National Football League", "Type":... |


In [15]:
# Keep only the first five columns: NewsID, Category, SubCategory, Title, Abstract
filtered_articles = filtered_articles.iloc[:,:5]
filtered_articles.head()

| | NewsID | Category | SubCategory | Title | Abstract |
|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... |
| 5 | N2073 | sports | football_nfl | Should NFL be able to fine players for critici... | Several fines came down against NFL players fo... |


## Split into train and test data sets

In [19]:
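# Copy first: assigning a column to a filtered slice raises SettingWithCopyWarning
filtered_behaviors = filtered_behaviors.copy()
# Union of all history articles per user, repeated on each of the user's rows
filtered_behaviors['All_History'] = filtered_behaviors.groupby(['UserID']).History.transform(lambda x: ' '.join(x)).transform(lambda x: list(set(x.split())))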

In [20]:
all_history = filtered_behaviors.drop_duplicates(subset=['UserID'])
all_history = all_history.filter(['UserID', 'All_History'])
all_history = all_history.set_index('UserID')
all_history

| UserID | All_History |
|---|---|
| U80234 | [N47686, N6616, N38895, N53234, N63573, N55189... |
| U60458 | [N54827, N6778, N61186, N34775, N32109, N51180... |
| U44190 | [N51706, N56253, N55189, N15634, N61704, N3259... |
| U87380 | [N29361, N2597, N34452, N28926, N53033, N63554... |
| U69606 | [N879, N19591, N37394, N63054, N34140, N21503,... |
| ... | ... |
| U11 | [N5905, N33271, N31820, N49023, N4647, N18870] |
| U77536 | [N20078, N8224, N7884, N46259, N8024, N11605, ... |
| U56193 | [N26026, N38311, N28257, N58782, N4705, N28088... |
| U16799 | [N15670, N42078, N40826, N64536, N15295, N5229... |
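The two cells above merge every history string of a user into one list of unique article IDs. The sketch below shows the same idea on a made-up frame, going straight to one row per user with `groupby(...).apply`; sorting the IDs is our addition to make the result deterministic, since `set` does not preserve order.

```python
import pandas as pd

toy = pd.DataFrame({
    'UserID': ['U1', 'U1', 'U2'],
    'History': ['N1 N2', 'N2 N3', 'N4'],
})

# Join all history strings of a user, split into IDs, and deduplicate
per_user = toy.groupby('UserID')['History'].apply(
    lambda rows: sorted(set(' '.join(rows).split()))
)
print(per_user)
```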


In [21]:
expanded_behaviors = all_history.explode('All_History').reset_index() 
expanded_behaviors.rename(columns={'All_History': 'NewsID'}, inplace=True)
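`explode` turns each list element into its own row and repeats the index value, which gives one (user, article) interaction per row. A toy illustration with assumed data:

```python
import pandas as pd

all_history_toy = pd.DataFrame(
    {'All_History': [['N1', 'N2'], ['N3']]},
    index=pd.Index(['U1', 'U2'], name='UserID'),
)

# One row per (UserID, NewsID) pair
long_form = (
    all_history_toy.explode('All_History')
    .reset_index()
    .rename(columns={'All_History': 'NewsID'})
)
print(long_form)
```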

In [22]:
behaviours_train_df, behaviours_test_df = train_test_split(expanded_behaviors,
                                   stratify=expanded_behaviors['UserID'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(behaviours_train_df))
print('# interactions on Test set: %d' % len(behaviours_test_df))

# interactions on Train set: 983294
# interactions on Test set: 245824
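Passing `stratify=expanded_behaviors['UserID']` asks scikit-learn to keep each user's share of interactions roughly equal in both parts, so about 20% of every user's history ends up in the test set. Note that stratification fails if any user has only a single interaction. A minimal sketch on made-up interactions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical users with 10 interactions each
interactions = pd.DataFrame({
    'UserID': ['U1'] * 10 + ['U2'] * 10,
    'NewsID': [f'N{i}' for i in range(20)],
})

train_part, test_part = train_test_split(
    interactions,
    stratify=interactions['UserID'],
    test_size=0.20,
    random_state=42,
)

# Number of test interactions per user
test_counts = test_part['UserID'].value_counts().to_dict()
print(test_counts)
```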


In [23]:
# Indexing by UserID to speed up the searches during evaluation
behaviours_full_indexed_df = expanded_behaviors.set_index('UserID')
behaviours_train_indexed_df = behaviours_train_df.set_index('UserID')
behaviours_test_indexed_df = behaviours_test_df.set_index('UserID')

In [24]:
# Group by UserID again: one list of NewsIDs per user
history_train_indexed_df = behaviours_train_indexed_df.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
history_train_indexed_df.rename(columns={'NewsID': 'All_History'}, inplace=True)

history_test_indexed_df = behaviours_test_indexed_df.groupby(['UserID'])['NewsID'].apply(list).reset_index().set_index('UserID')
history_test_indexed_df.rename(columns={'NewsID': 'All_History'}, inplace=True)

In [25]:
history_train_indexed_df.index.values

array(['U1', 'U10', 'U10000', ..., 'U9986', 'U9998', 'U9999'],
      dtype=object)

In [26]:
# Keep only test users that also appear in the train set
history_test_indexed_df = history_test_indexed_df[history_test_indexed_df.index.isin(history_train_indexed_df.index.values.tolist())]
behaviours_test_indexed_df = behaviours_test_indexed_df[behaviours_test_indexed_df.index.isin(history_train_indexed_df.index.values.tolist())]

In [27]:
print('Full size: ', behaviours_full_indexed_df.size)
print('Train size: ', behaviours_train_indexed_df.size)
print('Test size: ', behaviours_test_indexed_df.size)

Full size:  1229118
Train size:  983294
Test size:  245824


## Writing processed datasets to files

In [28]:
import os
os.makedirs('files', exist_ok=True)  # to_csv does not create missing directories
filtered_behaviors.to_csv('files/filtered_behaviours.csv', sep='\t')
behaviours_train_indexed_df.to_csv('files/train_filtered_behaviours.csv', sep='\t')
behaviours_test_indexed_df.to_csv('files/test_filtered_behaviours.csv', sep='\t')
filtered_articles.to_csv('files/filtered_articles.csv', sep='\t')
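One caveat when reading these files back: a column of Python lists, such as `All_History`, is written as its string representation, so `read_csv` returns plain strings. A possible way to restore the lists is a `converters` entry with `ast.literal_eval`, sketched here with an in-memory buffer instead of the real files:

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({'UserID': ['U1'], 'All_History': [['N1', 'N2']]})

# Write to an in-memory buffer the same way as above (tab-separated)
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)
text = buffer.getvalue()

# Without a converter the list comes back as the string "['N1', 'N2']"
raw = pd.read_csv(io.StringIO(text), sep='\t')

# literal_eval safely parses the string back into a Python list
restored = pd.read_csv(
    io.StringIO(text),
    sep='\t',
    converters={'All_History': ast.literal_eval},
)
print(type(raw.loc[0, 'All_History']), restored.loc[0, 'All_History'])
```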