# NEWS RECOMMENDATION

Welcome to the notebook on news recommendation using NLP and recommendation systems! In today's world, there is an abundance of news articles available online, making it challenging for users to find the content that is most relevant to their interests. This is where news recommendation systems come in, providing personalized article suggestions to users based on their reading history and preferences.

In this notebook, we will be using the Microsoft MIND dataset and the Microsoft Recommenders library to build a news recommendation system. Specifically, we will be implementing the LSTUR algorithm, which combines natural language processing techniques with recommendation systems to provide a more accurate and personalized experience for users.

By the end of this notebook, you will have a deeper understanding of news recommendation systems and the tools and techniques used to build them. So, let's get started and dive into the world of news recommendation using NLP and recommendation systems!

## Installing required packages

In [None]:
# installing microsoft recommenders library 
!pip install recommenders scrapbook

## Importing Required Packages

In [2]:
import sys
import os
import numpy as np
import pandas as pd
import zipfile
from tqdm import tqdm
import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.9.16 (main, Dec  7 2022, 01:11:51) 
[GCC 9.4.0]
Tensorflow version: 2.11.0


## Collecting and Loading Data

In [3]:
# Initializing parameters
epochs = 5
seed = 40
batch_size = 32

# Options: demo, small, large
MIND_type = 'large'

In [None]:
# Getting data
# First, we create a temporary directory to store the data for this notebook.
tmpdir = TemporaryDirectory()
data_path = tmpdir.name

# We then define the file paths for the train news and behavior files, as well as the valid news and behavior files.
train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')

# We also define the file paths for the word embeddings, user dictionary, word dictionary, and yaml file.
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'lstur.yaml')

# We then call the get_mind_data_set function to get the MIND dataset, train and dev data, and utils file path.
mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

# We then check if the train and valid news files exist. If they don't, we call the download_deeprec_resources
# function to download the MIND dataset train and dev files to the corresponding directories.
if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)
    
if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

# We also check if the yaml file exists. If it doesn't, we call the download_deeprec_resources function to download
# the lstur.yaml file to the utils directory.
if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', \
                               os.path.join(data_path, 'utils'), mind_utils)


## Exploratory Data Analysis

In [10]:
# Loading news dataset, format: [News ID] [Category] [Subcategory] [News Title] [News Abstrct] [News Url] [Entities in News Title] [Entities in News Abstract]
df_news = pd.read_table(train_news_file)
# Display the top 4 rows 
print(df_news.head(4).to_string(header=False))

0  N10399     news          newsworld                      The Cost of Trump's Aid Freeze in the Trenches of Ukraine's War  Lt. Ivan Molchanets peeked over a parapet of sand bags at the front line of the war in Ukraine. Next to him was an empty helmet propped up to trick snipers, already perforated with multiple holes.                                       https://www.msn.com/en-us/news/world/the-cost-of-trumps-aid-freeze-in-the-trenches-of-ukraines-war/ar-AAJgNsz?ocid=chopendata                                                                                                                                                 []                                                                                                                                                                                                                                                                                                                    [{"Label": "Ukraine", "Type": "G", "WikidataId": "Q212", "Con

In [11]:
# loading behaviors dataset, format: [Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]
df_behaviors = pd.read_table(train_behaviors_file)
# Display the top 4 rows 
print(df_behaviors.head(4).to_string(header=False))

0  2  U84185  11/12/2019 10:36:47 AM                                                                                                                                                                                                                                                                                                                                                           N27209 N11723 N4617 N12320 N11333 N24461 N22111 N14026 N21705 N17551 N17039                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             N

## Model Training

In [64]:
# Hypter parameters
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs)
print(hparams)

HParams object with values {'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 4, 'head_dim': 100, 'filter_num': 400, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 32, 'show_step': 100000, 'title_size': 30, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 300, 'cnn_activation': 'relu', 'model_type': 'lstur', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/tmp/tmpp60vqkoq/utils/embedding.npy', 'wordDict_file': '/tmp/tmpp60vqkoq/utils/word_dict.pkl', 'userDict_file': '/tmp/tmpp60vqkoq/utils/uid2index.pkl'}


In [15]:
# model initializaing using the hyperparameters
model = LSTURModel(hparams, MINDIterator, seed=seed)

Tensor("conv1d/Relu:0", shape=(None, 30, 400), dtype=float32)
Tensor("att_layer2/Sum_1:0", shape=(None, 400), dtype=float32)


  super().__init__(name, **kwargs)


In [16]:
print(model.run_eval(valid_news_file, valid_behaviors_file))

  updates=self.state_updates,
586it [00:10, 54.94it/s] 
236it [00:08, 27.87it/s]
7538it [00:00, 8271.07it/s]


{'group_auc': 0.5201, 'mean_mrr': 0.2214, 'ndcg@5': 0.2292, 'ndcg@10': 0.2912}


In [17]:
%%time
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

1086it [03:19,  5.44it/s]
586it [00:00, 707.51it/s]
236it [00:06, 34.21it/s]
7538it [00:01, 4366.11it/s]


at epoch 1
train info: logloss loss:1.490429998322306
eval info: group_auc:0.5931, mean_mrr:0.257, ndcg@10:0.3458, ndcg@5:0.2824
at epoch 1 , train time: 199.5 eval time: 15.9


1086it [03:07,  5.79it/s]
586it [00:00, 721.44it/s]
236it [00:06, 34.16it/s]
7538it [00:00, 8904.79it/s]


at epoch 2
train info: logloss loss:1.4036107010604268
eval info: group_auc:0.62, mean_mrr:0.2787, ndcg@10:0.371, ndcg@5:0.3063
at epoch 2 , train time: 187.5 eval time: 15.0


1086it [03:07,  5.80it/s]
586it [00:00, 687.19it/s]
236it [00:06, 33.90it/s]
7538it [00:00, 8924.65it/s]


at epoch 3
train info: logloss loss:1.3566746225453652
eval info: group_auc:0.6313, mean_mrr:0.2919, ndcg@10:0.3834, ndcg@5:0.3227
at epoch 3 , train time: 187.1 eval time: 15.1


1086it [03:06,  5.82it/s]
586it [00:00, 724.82it/s]
236it [00:06, 33.73it/s]
7538it [00:00, 8486.13it/s]


at epoch 4
train info: logloss loss:1.324074923992157
eval info: group_auc:0.6343, mean_mrr:0.2898, ndcg@10:0.3829, ndcg@5:0.3201
at epoch 4 , train time: 186.7 eval time: 15.3


1086it [03:07,  5.80it/s]
586it [00:01, 492.45it/s]
236it [00:06, 35.22it/s]
7538it [00:00, 8976.10it/s]


at epoch 5
train info: logloss loss:1.2885026645814077
eval info: group_auc:0.6409, mean_mrr:0.2971, ndcg@10:0.3916, ndcg@5:0.3265
at epoch 5 , train time: 187.4 eval time: 16.1
CPU times: user 18min 49s, sys: 52.2 s, total: 19min 42s
Wall time: 17min 5s


<recommenders.models.newsrec.models.lstur.LSTURModel at 0x7f8aa8f5b490>

In [18]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)

586it [00:00, 625.97it/s]
236it [00:06, 35.43it/s]
7538it [00:01, 4977.84it/s]


{'group_auc': 0.6409, 'mean_mrr': 0.2971, 'ndcg@5': 0.3265, 'ndcg@10': 0.3916}
CPU times: user 14.8 s, sys: 1.81 s, total: 16.6 s
Wall time: 15.8 s


In [19]:
sb.glue("res_syn", res_syn)

## Saving the model

In [20]:
#Setting the path for the model
model_path = os.path.join(data_path, "model")

#Creating the directory if it does not exist
os.makedirs(model_path, exist_ok=True)

#Saving the weights of the model at the specified path
model.model.save_weights(os.path.join(model_path, "lstur_ckpt"))

## Predictions

In [21]:
# Load the pre-trained model and evaluate on the validation data
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

# Open a file to write the prediction results
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    # Iterate through each impression index and the corresponding predictions
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        # Increment the impression index by 1 (since impression indexes start from 0)
        impr_index += 1
        # Compute the predicted ranking for the current set of predictions
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        # Format the predicted ranking as a string in the required format
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        # Write the impression index and predicted ranking to the output file
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')


586it [00:00, 601.25it/s]
236it [00:06, 36.56it/s]
7538it [00:02, 3624.13it/s]
7538it [00:00, 27741.56it/s]


In [44]:
# Create a new zip file to store the prediction file
f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)

# Add the prediction file to the zip archive with a specified arcname
f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

# Close the zip file
f.close()


## Summary and Conclusion
In this notebook, we have explored the use of NLP and recommendation systems to build a news recommendation engine. Using the Microsoft MIND dataset and the Microsoft Recommenders library, we implemented the LSTUR algorithm to learn user representations and predict personalized news recommendations for each user.

After training the model, we obtained a group_auc of 0.6409, mean_mrr of 0.2971, ndcg@5 of 0.3265, and ndcg@10 of 0.3916. These evaluation metrics indicate that our model is performing reasonably well in terms of accuracy and personalization of recommendations.

In conclusion, the use of NLP and recommendation systems is a powerful tool for building news recommendation engines that can provide personalized recommendations to users. Our implementation of the LSTUR algorithm has shown promising results, and there is room for further improvement and exploration in this field. With the growing importance of news consumption and the abundance of news articles available online, news recommendation engines have become increasingly important in helping users find the content that is most relevant to their interests.