<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# NRMS: Neural News Recommendation with Multi-Head Self-Attention
NRMS \[1\] is a neural news recommendation approach with multi-head selfattention. The core of NRMS is a news encoder and a user encoder. In the newsencoder, a multi-head self-attentions is used to learn news representations from news titles by modeling the interactions between words. In the user encoder, we learn representations of users from their browsed news and use multihead self-attention to capture the relatedness between the news. Besides, we apply additive
attention to learn more informative news and user representations by selecting important words and news.

## Properties of NRMS:
- NRMS is a content-based neural news recommendation approach.
- It uses multi-self attention to learn news representations by modeling the iteractions between words and learn user representations by capturing the relationship between user browsed news.
- NRMS uses additive attentions to learn informative news and user representations by selecting important words and news.

## Data format:
For quicker training and evaluaiton, we sample MINDdemo dataset of 5k users from [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to try experiments on MINDsmall and MINDlarge, please change the dowload source. Select the MIND_type parameter from ['large', 'small', 'demo'] to choose dataset.
 
**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Training data and evaluation data are composed of a news file and a behaviors file. You can find more detailed data description in [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)

### news data
This file contains news information including newsid, category, subcatgory, news title, news abstarct, news url and entities in news title, entities in news abstarct.
One simple example: <br>

`N46466	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]`
<br>

In general, each line in data file represents information of one piece of news: <br>

`[News ID] [Category] [Subcategory] [News Title] [News Abstrct] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

<br>

We generate a word_dict file to transform words in news title to word indexes, and a embedding matrix is initted from pretrained glove embeddings.

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line in data file represents one instance of an impression. The format is like: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History is the user historical clicked news before Impression Time. Impression News is the displayed news in an impression, which format is:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label represents whether the news is clicked by the user. All information of news in User Click History and Impression News can be found in news data file.

## Global settings and imports

In [26]:
import sys
import os
import numpy as np
import pandas as pd
import zipfile
from tqdm import tqdm
import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))


System version: 3.9.12 (main, Apr  5 2022, 01:53:17) 
[Clang 12.0.0 ]
Tensorflow version: 2.12.0


## Prepare parameters

In [3]:
epochs = 5
seed = 42
batch_size = 32

# Options: demo, small, large
MIND_type = 'demo'

## Download and load data

In [24]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'nrms.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)
    
if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, \
                               os.path.join(data_path, 'valid'), mind_dev_dataset)
if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', \
                               os.path.join(data_path, 'utils'), mind_utils)

100%|██████████| 17.0k/17.0k [00:02<00:00, 7.84kKB/s]
100%|██████████| 9.84k/9.84k [00:01<00:00, 6.56kKB/s]
100%|██████████| 95.0k/95.0k [00:04<00:00, 19.2kKB/s]


## Create hyper-parameters

In [5]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          show_step=10)
print(hparams)

HParams object with values {'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 20, 'head_dim': 20, 'filter_num': 200, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 32, 'show_step': 10, 'title_size': 30, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 300, 'model_type': 'nrms', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmp0rkqfvdu/utils/embedding.npy', 'wordDict_file': '/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmp0rkqfvdu/utils/word_dict.pkl', 'userDict_file': '/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmp0rkqfvdu/utils/uid2index.pkl'}


## Train the NRMS model

In [6]:
iterator = MINDIterator

In [7]:
model = NRMSModel(hparams, iterator, seed=seed)

2023-12-06 14:41:44.862156: W tensorflow/c/c_api.cc:300] Operation '{name:'embedding/embeddings/Assign' id:26 op device:{requested: '', assigned: ''} def:{{{node embedding/embeddings/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](embedding/embeddings, embedding/embeddings/Initializer/stateless_random_uniform)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
  super().__init__(name, **kwargs)


In [8]:
valid_behaviors_file

'/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmp0rkqfvdu/valid/behaviors.tsv'

In [18]:
print(model.run_eval(valid_news_file, valid_behaviors_file))

  updates=self.state_updates,
2023-11-29 13:57:30.780700: W tensorflow/c/c_api.cc:300] Operation '{name:'att_layer2_3/b/Assign' id:2814 op device:{requested: '', assigned: ''} def:{{{node att_layer2_3/b/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](att_layer2_3/b, att_layer2_3/b/Initializer/zeros)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
586it [00:06, 87.21it/s]
0it [00:00, ?it/s]2023-11-29 13:57:37.326249: W tensorflow/c/c_api.cc:300] Operation '{name:'att_layer2_3/Sum_1' id:2875 op device:{requested: '', assigned: ''} def:{{{node att_layer2_3/Sum_1}} = Sum[T=DT_FLOAT, Tidx=DT_INT32, _has_manual_control_dependencies=true, keep_dims=false](att_layer2_3/mul, att_layer2_3/Sum_1/reduction_indices)}}' was changed by setting attribute after it was run by a session

{'group_auc': 0.4792, 'mean_mrr': 0.2059, 'ndcg@5': 0.2045, 'ndcg@10': 0.2701}


In [9]:
%%time
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

0it [00:00, ?it/s]2023-11-26 15:15:30.074502: W tensorflow/c/c_api.cc:300] Operation '{name:'loss/mul' id:2002 op device:{requested: '', assigned: ''} def:{{{node loss/mul}} = Mul[T=DT_FLOAT, _has_manual_control_dependencies=true](loss/mul/x, loss/activation_loss/value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-11-26 15:15:30.183277: W tensorflow/c/c_api.cc:300] Operation '{name:'training/Adam/att_layer2_1/b/v/Assign' id:2807 op device:{requested: '', assigned: ''} def:{{{node training/Adam/att_layer2_1/b/v/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](training/Adam/att_layer2_1/b/v, training/Adam/att_layer2_1/b/v/Initializer/zeros)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an er

at epoch 1
train info: logloss loss:1.5137877671758115
eval info: group_auc:0.5785, mean_mrr:0.2436, ndcg@10:0.3298, ndcg@5:0.2578
at epoch 1 , train time: 1449.7 eval time: 118.6


step 1080 , total_loss: 1.4181, data_loss: 1.3567: : 1086it [23:37,  1.31s/it]
586it [00:06, 97.22it/s] 
236it [01:47,  2.20it/s]
7538it [00:00, 18310.20it/s]


at epoch 2
train info: logloss loss:1.4181900829021883
eval info: group_auc:0.6004, mean_mrr:0.2552, ndcg@10:0.346, ndcg@5:0.2711
at epoch 2 , train time: 1417.5 eval time: 117.4


step 1080 , total_loss: 1.3796, data_loss: 1.1591: : 1086it [23:47,  1.31s/it]
586it [00:05, 98.63it/s]
236it [01:46,  2.22it/s]
7538it [00:00, 18000.46it/s]


at epoch 3
train info: logloss loss:1.3794216780372746
eval info: group_auc:0.6083, mean_mrr:0.2652, ndcg@10:0.3567, ndcg@5:0.2854
at epoch 3 , train time: 1427.3 eval time: 116.7


step 1080 , total_loss: 1.3537, data_loss: 1.1905: : 1086it [23:45,  1.31s/it]
586it [00:05, 98.78it/s] 
236it [01:47,  2.20it/s]
7538it [00:00, 17152.78it/s]


at epoch 4
train info: logloss loss:1.3537926673340315
eval info: group_auc:0.6102, mean_mrr:0.2684, ndcg@10:0.3596, ndcg@5:0.287
at epoch 4 , train time: 1425.8 eval time: 117.3


step 1080 , total_loss: 1.3297, data_loss: 1.3614: : 1086it [23:44,  1.31s/it]
586it [00:05, 99.35it/s]
236it [01:46,  2.21it/s]
7538it [00:00, 18618.19it/s]


at epoch 5
train info: logloss loss:1.3298480067244332
eval info: group_auc:0.6136, mean_mrr:0.2698, ndcg@10:0.3617, ndcg@5:0.2901
at epoch 5 , train time: 1424.3 eval time: 116.7
CPU times: user 13h 17min 40s, sys: 13min 15s, total: 13h 30min 55s
Wall time: 2h 8min 51s


<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7faf3d1de9a0>

In [19]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)


586it [00:06, 96.85it/s]
236it [01:48,  2.18it/s]
7538it [00:00, 15671.23it/s]


{'group_auc': 0.4792, 'mean_mrr': 0.2059, 'ndcg@5': 0.2045, 'ndcg@10': 0.2701}
CPU times: user 9min 33s, sys: 6.29 s, total: 9min 39s
Wall time: 1min 58s


In [20]:
sb.glue("res_syn", res_syn)

## Save the model

In [28]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)
model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

In [11]:
checkpoint_dir = "/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmp44t_e121/model/"
os.listdir(checkpoint_dir)

['nrms_ckpt.data-00000-of-00001']

In [14]:
latest = tf.train.latest_checkpoint(checkpoint_dir)
latest

In [15]:
# Load the previously saved weights
model.model.load_weights(latest)

# Re-evaluate the model
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)

AttributeError: 'NoneType' object has no attribute 'endswith'

## Output Prediction File

This code segment is used to generate the prediction.zip file, which is in the same format in [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html).

In [19]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

  updates=self.state_updates,
2023-12-10 16:53:40.657385: W tensorflow/c/c_api.cc:300] Operation '{name:'self_attention_1/WV/Assign' id:676 op device:{requested: '', assigned: ''} def:{{{node self_attention_1/WV/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](self_attention_1/WV, self_attention_1/WV/Initializer/random_uniform)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
586it [00:12, 47.16it/s]
0it [00:00, ?it/s]2023-12-10 16:53:52.633669: W tensorflow/c/c_api.cc:300] Operation '{name:'att_layer2_1/Sum_1' id:864 op device:{requested: '', assigned: ''} def:{{{node att_layer2_1/Sum_1}} = Sum[T=DT_FLOAT, Tidx=DT_INT32, _has_manual_control_dependencies=true, keep_dims=false](att_layer2_1/mul, att_layer2_1/Sum_1/reduction_indices)}}' was changed by setting attribute af

In [38]:
len(group_preds)

7538

In [43]:
valid_news_file

'/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmp44t_e121/valid/news.tsv'

In [48]:
import pandas as pd
news = pd.read_table(valid_news_file)
beh = pd.read_table(valid_behaviors_file)

In [50]:
beh

Unnamed: 0,1,U41827,11/15/2019 2:41:03 PM,N15366 N12202 N27489 N19773 N21134 N18191 N6863 N11838 N22940 N11838 N7373 N14672 N20561 N16234 N12038 N25392 N26417 N19248 N7214 N20951 N27993 N21 N16998 N20854 N15119 N12563 N2694 N26376 N6178 N11315 N695 N17348 N16106 N18994 N16620 N19892 N3941 N13178 N17369,N23699-0 N21291-0 N1901-0 N27292-0 N17443-0 N18282-0 N6849-0 N13690-0 N12794-0 N8620-1 N28138-0 N23829-0 N13670-0 N6240-0 N1807-0 N3390-0 N17774-0 N14623-0 N26916-0 N13512-0 N21852-0 N28510-0 N27423-0 N21645-0 N26670-0 N21815-0 N9181-0 N6782-0
0,2,U61881,11/15/2019 10:31:42 AM,N16469 N4202 N4202 N21816 N12992 N24242 N7366 ...,N26916-0 N4641-0 N25522-0 N14893-0 N19035-0 N3...
1,3,U54180,11/15/2019 5:36:17 AM,N22427 N16386 N24242 N4385 N14672 N12242 N1852...,N13528-0 N27689-0 N10879-0 N11662-0 N14409-0 N...
2,4,U41164,11/15/2019 9:13:44 AM,N13065 N5748 N12658 N276 N7395 N16010 N13761 N...,N20150-0 N1807-1 N26916-0 N28138-0 N9576-0 N19...
3,5,U8588,11/15/2019 5:39:04 AM,N6629 N4958 N10917 N27079 N828,N21325-0 N5982-0 N19737-1 N9576-0 N20150-0 N25...
4,6,U73735,11/15/2019 11:16:40 AM,N8549 N9763 N28048 N5221 N24080 N5358 N11073 N...,N9576-0 N20946-0 N12037-0 N21325-0 N7667-0 N18...
...,...,...,...,...,...
7532,7534,U23841,11/15/2019 9:07:10 AM,N14405 N417 N16730 N23945 N15405 N25316 N13678...,N26256-0 N28117-0 N2718-0 N16798-0 N27689-0 N6...
7533,7535,U28014,11/15/2019 2:06:47 PM,N1053 N26708 N26570 N12924 N24356 N11647 N2759...,N26670-0 N12794-0 N3390-0 N17443-0 N27292-0 N2...
7534,7536,U89684,11/15/2019 9:07:58 AM,N25457 N27206 N23987 N26009 N545 N2573 N22908 ...,N17443-0 N16798-0 N24553-0 N26096-0 N15927-0 N...
7535,7537,U92611,11/15/2019 11:52:28 AM,N107 N15735 N11987 N16777 N20030 N26557 N8238 ...,N14850-0 N26647-0 N272-0 N22751-0 N21398-0 N26...


In [14]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f: 
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

7538it [00:00, 60006.42it/s]


In [68]:
(np.argsort(np.argsort(group_preds[0])[::-1]) + 1).tolist()

[15,
 7,
 11,
 5,
 21,
 24,
 17,
 16,
 26,
 13,
 18,
 22,
 12,
 2,
 23,
 28,
 8,
 25,
 19,
 10,
 4,
 20,
 1,
 9,
 6,
 14,
 3,
 27]

In [22]:
data_path

'/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmpyjb6wdcd'

In [15]:
f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)
f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')
f.close()

In [16]:
a = ""
for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
    impr_index += 1
    pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
    pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
    a += ' '.join([str(impr_index), pred_rank])+ '\n'

7538it [00:00, 57803.29it/s]


In [16]:
#a

In [27]:
import os
import tempfile
import urllib
import zipfile

# Temporary folder for data we need during execution of this notebook (we'll clean up
# at the end, we promise)
temp_dir = os.path.join(tempfile.gettempdir(), 'mind')
os.makedirs(temp_dir, exist_ok=True)

# The dataset is split into training and validation set, each with a large and small version.
# The format of the four files are the same.
# For demonstration purpose, we will use small version validation set only.
base_url = 'https://mind201910small.blob.core.windows.net/release'
training_small_url = f'{base_url}/MINDsmall_train.zip'
validation_small_url = f'{base_url}/MINDsmall_dev.zip'
training_large_url = f'{base_url}/MINDlarge_train.zip'
validation_large_url = f'{base_url}/MINDlarge_dev.zip'

def download_url(url,
                 destination_filename=None,
                 progress_updater=None,
                 force_download=False,
                 verbose=True):
    """
    Download a URL to a temporary file
    """
    if not verbose:
        progress_updater = None
    # This is not intended to guarantee uniqueness, we just know it happens to guarantee
    # uniqueness for this application.
    if destination_filename is None:
        url_as_filename = url.replace('://', '_').replace('/', '_')
        destination_filename = \
            os.path.join(temp_dir,url_as_filename)
    if (not force_download) and (os.path.isfile(destination_filename)):
        if verbose:
            print('Bypassing download of already-downloaded file {}'.format(
                os.path.basename(url)))
        return destination_filename
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),
                                                 destination_filename),
              end='')
    urllib.request.urlretrieve(url, destination_filename, progress_updater)
    assert (os.path.isfile(destination_filename))
    nBytes = os.path.getsize(destination_filename)
    if verbose:
        print('...done, {} bytes.'.format(nBytes))
    return destination_filename

zip_path = download_url(validation_small_url, verbose=True)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

os.listdir(temp_dir)

news_path = os.path.join(temp_dir, 'news.tsv')
news_df = pd.read_table(news_path,
              header=None,
              names=[
                  'id', 'category', 'subcategory', 'title', 'abstract', 'url',
                  'title_entities', 'abstract_entities'
              ])
news_df.set_index("id", inplace=True)

Bypassing download of already-downloaded file MINDsmall_dev.zip


In [21]:
import torch
import torchtext

glove = torchtext.vocab.GloVe(name="6B", dim=50)

In [28]:
def get_category_similarity(nid_1, nid_2):
    if nid_1 not in news_df.index or nid_2 not in news_df.index:
        return 0
    
    cat1 = news_df.loc[nid_1]["category"]
    cat2 = news_df.loc[nid_2]["category"]
    
    return 1-torch.cosine_similarity(glove[cat1].unsqueeze(0), glove[cat2].unsqueeze(0)).item()


def diversity_user(recs, pred_scores, k):
    score = 0.0
    count = 0.0
    combined_data = list(zip(recs, pred_scores))
    sorted_data = sorted(combined_data, key=lambda x: x[1])

    topk_pairs = sorted_data[:k]
    for i in range(len(topk_pairs)):
        for j in range(i+1, len(topk_pairs)):
            count += 1.0
            score += (1-get_category_similarity(topk_pairs[i][0], topk_pairs[j][0]))
    return score/count
        

def diversity_eval(df, top_k=5):
    diversity_score = 0.0
    count = 0.0
    for index, row in df.iterrows():
        diversity_score += diversity_user(row['news_id'], row['pred'],top_k)
        count += 1.0
    return diversity_score/count

In [29]:
user_ids = []
news_rec_lists = []
pred_prob = []
i = 0
with open(valid_behaviors_file, 'r') as rd:
        impr_index = 0
        for line in rd:
            uid, time, history, impr = line.strip("\n").split('\t')[-4:]

            impr_news = [i.split("-")[0] for i in impr.split()]
            user_ids.append(uid)
            news_rec_lists.append(impr_news)
            pred_prob.append(group_preds[i])
            i+=1
user_rec_df = pd.DataFrame({'user_id' : user_ids, 'news_id': news_rec_lists, 'pred' : pred_prob})
user_rec_df

Unnamed: 0,user_id,news_id,pred
0,U41827,"[N23699, N21291, N1901, N27292, N17443, N18282...","[-0.016029868, 0.028213965, 0.00374203, 0.0324..."
1,U61881,"[N26916, N4641, N25522, N14893, N19035, N3877,...","[-0.08410066, -0.07030145, -0.14174062, -0.028..."
2,U54180,"[N13528, N27689, N10879, N11662, N14409, N6849...","[-0.027243517, -0.02682752, 0.014758418, -0.04..."
3,U41164,"[N20150, N1807, N26916, N28138, N9576, N19737,...","[-0.07029299, -0.06975028, -0.051979825, -0.05..."
4,U8588,"[N21325, N5982, N19737, N9576, N20150, N25701,...","[-0.009216756, -0.0034534195, 0.00040009664, -..."
...,...,...,...
7533,U23841,"[N26256, N28117, N2718, N16798, N27689, N6280,...","[-0.04028006, -0.06562622, 0.016713316, -0.003..."
7534,U28014,"[N26670, N12794, N3390, N17443, N27292, N21852...","[-0.0031647116, -0.019335303, -0.042360865, -0..."
7535,U89684,"[N17443, N16798, N24553, N26096, N15927, N2625...","[-0.026421571, -0.008083752, -0.014075289, -0...."
7536,U92611,"[N14850, N26647, N272, N22751, N21398, N26916,...","[-0.039575763, -0.094993755, -0.020876817, -0...."


In [30]:
print(diversity_eval(user_rec_df, 5))

0.8884769746792819


In [31]:
def normalize_array(arr):
    return (arr - np.min(arr)) / (np.max(arr) - np.min(arr))

user_rec_norm_df = user_rec_df
user_rec_norm_df['pred'] = user_rec_norm_df['pred'].apply(lambda x: normalize_array(x))
user_rec_norm_df

Unnamed: 0,user_id,news_id,pred
0,U41827,"[N23699, N21291, N1901, N27292, N17443, N18282...","[0.62596655, 0.81248397, 0.7093183, 0.8301947,..."
1,U61881,"[N26916, N4641, N25522, N14893, N19035, N3877,...","[0.39407927, 0.4602602, 0.11763841, 0.66133386..."
2,U54180,"[N13528, N27689, N10879, N11662, N14409, N6849...","[0.564781, 0.5672953, 0.8186482, 0.46154946, 0..."
3,U41164,"[N20150, N1807, N26916, N28138, N9576, N19737,...","[0.15490122, 0.1590584, 0.29518163, 0.23938416..."
4,U8588,"[N21325, N5982, N19737, N9576, N20150, N25701,...","[0.50092185, 0.8000171, 1.0, 0.6226852, 0.7715..."
...,...,...,...
7533,U23841,"[N26256, N28117, N2718, N16798, N27689, N6280,...","[0.456568, 0.29370573, 0.82278013, 0.6918397, ..."
7534,U28014,"[N26670, N12794, N3390, N17443, N27292, N21852...","[0.9205003, 0.54074275, 0.0, 0.70027363, 0.941..."
7535,U89684,"[N17443, N16798, N24553, N26096, N15927, N2625...","[0.30298424, 0.67076075, 0.5505967, 0.5666304,..."
7536,U92611,"[N14850, N26647, N272, N22751, N21398, N26916,...","[0.38258368, 0.15424332, 0.45962948, 0.3384243..."


In [32]:
# nid_1 and nid_2 are news ids (strings)
# returns similarity of categories of nid_1 and nid_2 using cosine similarity
# if categories don't exist returns 0
def get_category_similarity(nid_1, nid_2):
    if nid_1 not in news_df.index or nid_2 not in news_df.index:
        return 0
    
    cat1 = news_df.loc[nid_1]["category"]
    cat2 = news_df.loc[nid_2]["category"]
    
    return 1 - torch.cosine_similarity(glove[cat1].unsqueeze(0), glove[cat2].unsqueeze(0)).item()


# Calculates mmr score for a given item
# item: news id
# pred: relevance of item
# recs_so_far: list of news ids recommended so far
def mmr_item(item, pred, recs_so_far, lamda):
    return (lamda * pred) - (1 - lamda) * np.max([1 - get_category_similarity(item, x) for x in recs_so_far])

# Calculates list of recommendations
# recs is a list of news ids
# pred scores is a list of relevance scores, same order as recs
# lamda is a weight parameter
# k is how many items should be in the recommendation; assume k >= 1
def mmr_user(recs, pred_scores, lamda, k):
    list_so_far = [recs[0]]
    while len(list_so_far) < k:
        max_mmr = -2 # mmr can range from -1 to 1
        max_mmr_id = ''
        for i in range(0, len(recs)): #should be a better way to do this
            if recs[i] not in list_so_far:
                mmr_score = mmr_item(recs[i], pred_scores[i], list_so_far, lamda)
                if mmr_score > max_mmr:
                    max_mmr = mmr_score
                    max_mmr_id = recs[i]
        list_so_far.append(max_mmr_id)
    return list_so_far
        
# Calculates recommendations according to mmr for all users
# df is a Pandas dataframe with cols user, news_id, pred where pred[i] is the relevance score for news_id[i]
# lamda is a weight parameter
# k is how many items should be in the recommendation; assume k >= 1
def mmr_all(df, lamda, k):
    result_df = {}
    for index, row in df.iterrows():
        result_df[index] = mmr_user(row['news_id'], row['pred'], lamda, k)
    return result_df


In [33]:
mmr_rerank_data = mmr_all(user_rec_norm_df, 0, 5)
mmr_rerank_df = pd.DataFrame(list(mmr_rerank_data.items()), columns=['user_id', 'news_id'])

mmr_rerank_df

Unnamed: 0,user_id,news_id
0,0,"[N23699, N13670, N1901, N28138, N27292]"
1,1,"[N26916, N4641, N25522, N14893, N19035]"
2,2,"[N13528, N10879, N10908, N13690, N20309]"
3,3,"[N20150, N28138, N19737, N1807, N26916]"
4,4,"[N21325, N10908, N20150, N25701, N13530]"
...,...,...
7533,7533,"[N26256, N28117, N2718, N16798, N27689]"
7534,7534,"[N26670, N12794, N3390, N17443, N27292]"
7535,7535,"[N17443, N23094, N25522, N20309, N18356]"
7536,7536,"[N14850, N27137, N272, N22130, N13670]"


In [34]:
def diversity_user_mmr(recs):
    score = 0.0
    count = 0.0
    for i in range(len(recs)):
        for j in range(i+1, len(recs)):
            count += 1.0
            score += 1-get_category_similarity(recs[i], recs[j])
    return score/count

def diversity_eval_mmr(df):
    diversity_score = 0.0
    count = 0.0
    for index, row in df.iterrows():
        diversity_score += diversity_user_mmr(row['news_id'])
        count += 1.0
    return diversity_score/count
        

In [35]:
print(diversity_eval_mmr(mmr_rerank_df))

0.6016681731163404


## Reference
\[1\] Wu et al. "Neural News Recommendation with Multi-Head Self-Attention." in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)<br>
\[2\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/