<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# LSTUR: Neural News Recommendation with Long- and Short-term User Representations
LSTUR \[1\] is a news recommendation approach capturing users' both long-term preferences and short-term interests. The core of LSTUR is a news encoder and a user encoder.  In the news encoder, we learn representations of news from their titles. In user encoder, we propose to learn long-term
user representations from the embeddings of their IDs. In addition, we propose to learn short-term user representations from their recently browsed news via GRU network. Besides, we propose two methods to combine
long-term and short-term user representations. The first one is using the long-term user representation to initialize the hidden state of the GRU network in short-term user representation. The second one is concatenating both
long- and short-term user representations as a unified user vector.

## Properties of LSTUR:
- LSTUR captures users' both long-term and short term preference.
- It uses embeddings of users' IDs to learn long-term user representations.
- It uses users' recently browsed news via GRU network to learn short-term user representations.

## Data format:
For quicker training and evaluaiton, we sample MINDdemo dataset of 5k users from [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to try experiments on MINDsmall and MINDlarge, please change the dowload source. Select the MIND_type parameter from ['large', 'small', 'demo'] to choose dataset.
 
**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Training data and evaluation data are composed of a news file and a behaviors file. You can find more detailed data description in [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)

### news data
This file contains news information including newsid, category, subcatgory, news title, news abstarct, news url and entities in news title, entities in news abstarct.
One simple example: <br>

`N46466	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]`
<br>

In general, each line in data file represents information of one piece of news: <br>

`[News ID] [Category] [Subcategory] [News Title] [News Abstrct] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

<br>

We generate a word_dict file to tranform words in news title to word indexes, and a embedding matrix is initted from pretrained glove embeddings.

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line in data file represents one instance of an impression. The format is like: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History is the user historical clicked news before Impression Time. Impression News is the displayed news in an impression, which format is:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label represents whether the news is clicked by the user. All information of news in User Click History and Impression News can be found in news data file.

## Global settings and imports

In [3]:
import sys
import os
import numpy as np
import zipfile
from tqdm import tqdm
import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))


System version: 3.9.12 (main, Apr  5 2022, 01:53:17) 
[Clang 12.0.0 ]
Tensorflow version: 2.12.0


In [56]:
import pandas as pd

# Prepare Parameters

In [5]:
epochs = 5
seed = 40
batch_size = 32

# Options: demo, small, large
MIND_type = 'demo'

## Download and load data

In [6]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'lstur.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)
    
if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, \
                               os.path.join(data_path, 'valid'), mind_dev_dataset)
if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', \
                               os.path.join(data_path, 'utils'), mind_utils)

100%|██████████| 17.0k/17.0k [00:03<00:00, 4.92kKB/s]
100%|██████████| 9.84k/9.84k [00:01<00:00, 7.98kKB/s]
100%|██████████| 95.0k/95.0k [00:12<00:00, 7.73kKB/s]


## Create hyper-parameters

In [7]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs)
print(hparams)

HParams object with values {'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 4, 'head_dim': 100, 'filter_num': 400, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 32, 'show_step': 100000, 'title_size': 30, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 300, 'cnn_activation': 'relu', 'model_type': 'lstur', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmpm5ej7ep3/utils/embedding.npy', 'wordDict_file': '/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmpm5ej7ep3/utils/word_dict.pkl', 'userDict_file': '/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmpm5ej7ep3/utils/uid2index.pkl'}


In [8]:
iterator = MINDIterator

## Train the LSTUR model

In [9]:
model = LSTURModel(hparams, iterator, seed=seed)

2023-12-06 14:44:00.699947: W tensorflow/c/c_api.cc:300] Operation '{name:'embedding/embeddings/Assign' id:27 op device:{requested: '', assigned: ''} def:{{{node embedding/embeddings/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](embedding/embeddings, embedding/embeddings/Initializer/stateless_random_uniform)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


Tensor("conv1d/Relu:0", shape=(None, 30, 400), dtype=float32)
Tensor("att_layer2/Sum_1:0", shape=(None, 400), dtype=float32)


  super().__init__(name, **kwargs)


In [10]:
print(model.run_eval(valid_news_file, valid_behaviors_file))

  updates=self.state_updates,
2023-12-06 14:44:03.849637: W tensorflow/c/c_api.cc:300] Operation '{name:'conv1d/bias/Assign' id:62 op device:{requested: '', assigned: ''} def:{{{node conv1d/bias/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](conv1d/bias, conv1d/bias/Initializer/zeros)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
586it [00:09, 62.14it/s]
0it [00:00, ?it/s]2023-12-06 14:44:13.105162: W tensorflow/c/c_api.cc:300] Operation '{name:'gru/strided_slice_2' id:958 op device:{requested: '', assigned: ''} def:{{{node gru/strided_slice_2}} = StridedSlice[Index=DT_INT32, T=DT_FLOAT, _has_manual_control_dependencies=true, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1](gru/TensorArrayV2Stack/TensorListStack, gru/strided_slice_2/s

{'group_auc': 0.5201, 'mean_mrr': 0.2214, 'ndcg@5': 0.2292, 'ndcg@10': 0.2912}


In [10]:
%%time
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

0it [00:00, ?it/s]2023-11-29 15:39:04.049869: W tensorflow/c/c_api.cc:300] Operation '{name:'loss/mul' id:2061 op device:{requested: '', assigned: ''} def:{{{node loss/mul}} = Mul[T=DT_FLOAT, _has_manual_control_dependencies=true](loss/mul/x, loss/activation_loss/value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-11-29 15:39:04.319160: W tensorflow/c/c_api.cc:300] Operation '{name:'training/Adam/gru/gru_cell/recurrent_kernel/v/Assign' id:2750 op device:{requested: '', assigned: ''} def:{{{node training/Adam/gru/gru_cell/recurrent_kernel/v/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](training/Adam/gru/gru_cell/recurrent_kernel/v, training/Adam/gru/gru_cell/recurrent_kernel/v/Initializer/zeros)}}' was changed by setting attribute after it was run by a sessio

at epoch 1
train info: logloss loss:1.487765562688009
eval info: group_auc:0.5924, mean_mrr:0.2564, ndcg@10:0.3438, ndcg@5:0.2809
at epoch 1 , train time: 1423.5 eval time: 81.8


1086it [22:56,  1.27s/it]
586it [00:04, 127.10it/s]
236it [01:18,  3.01it/s]
7538it [00:00, 16086.97it/s]


at epoch 2
train info: logloss loss:1.4049629973004096
eval info: group_auc:0.6169, mean_mrr:0.2777, ndcg@10:0.3681, ndcg@5:0.3046
at epoch 2 , train time: 1376.5 eval time: 87.1


1086it [48:05,  2.66s/it]
586it [00:04, 132.57it/s]
236it [01:15,  3.12it/s]
7538it [00:00, 15095.72it/s]


at epoch 3
train info: logloss loss:1.3577124978626631
eval info: group_auc:0.6285, mean_mrr:0.2877, ndcg@10:0.3799, ndcg@5:0.3163
at epoch 3 , train time: 2885.3 eval time: 84.4


1086it [23:03,  1.27s/it]
586it [00:04, 129.52it/s]
236it [01:17,  3.03it/s]
7538it [00:00, 16914.32it/s]


at epoch 4
train info: logloss loss:1.320989074188921
eval info: group_auc:0.6275, mean_mrr:0.2833, ndcg@10:0.3769, ndcg@5:0.3135
at epoch 4 , train time: 1383.6 eval time: 86.6


1086it [2:16:19,  7.53s/it] 
586it [00:04, 129.15it/s]
236it [01:14,  3.18it/s]
7538it [00:00, 15665.07it/s]


at epoch 5
train info: logloss loss:1.2866800133255525
eval info: group_auc:0.6424, mean_mrr:0.2949, ndcg@10:0.3901, ndcg@5:0.327
at epoch 5 , train time: 8179.2 eval time: 83.0
CPU times: user 13h 53min 21s, sys: 12min 55s, total: 14h 6min 16s
Wall time: 4h 21min 10s


<recommenders.models.newsrec.models.lstur.LSTURModel at 0x7fa90e391b20>

In [11]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)

586it [00:04, 131.36it/s]
236it [01:14,  3.17it/s]
7538it [00:00, 18118.93it/s]


{'group_auc': 0.6424, 'mean_mrr': 0.2949, 'ndcg@5': 0.327, 'ndcg@10': 0.3901}
CPU times: user 8min 48s, sys: 6.29 s, total: 8min 54s
Wall time: 1min 23s


In [12]:
sb.glue("res_syn", res_syn)

## Save the model

In [13]:
model_path = os.path.join(data_path, "model")

In [19]:
model_path

'/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmprxpytgkb/model'

In [20]:
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "lstur_ckpt"))

## Output Prediction File
This code segment is used to generate the prediction.zip file, which is in the same format in [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html).

In [17]:
len(valid_news_file) # 18723 news
valid_behaviors_file # 7538 users


'/var/folders/sv/970s472x5tz75kxxbpmggty40000gn/T/tmpm5ej7ep3/valid/behaviors.tsv'

In [11]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

586it [00:08, 70.38it/s]
236it [02:51,  1.37it/s]
7538it [00:00, 7911.75it/s]


In [1]:
#model.train_iterator.uid2index

In [71]:
user_ids = []
news_rec_lists = []
user_clicks = defaultdict(())
with open(valid_behaviors_file, 'r') as rd:
        impr_index = 0
        for line in rd:
            uid, time, history, impr = line.strip("\n").split('\t')[-4:]
            impr_news = [i.split("-")[0] for i in impr.split()]
            click_news = ()
            for i in impr.split():
                impr_news.append(i.split("-")[0])
                if i.split("-")[1] == '1':
                    click_news.add(i.split("-")[0])
            user_ids.append(uid)
            news_rec_lists.append(impr_news)
            user_clicks[uid] = click_news
user_rec_df = pd.DataFrame({'user_id' : user_ids, 'news_id': news_rec_lists})
user_rec_df

Unnamed: 0,user_id,news_id
0,U41827,"[N23699, N21291, N1901, N27292, N17443, N18282..."
1,U61881,"[N26916, N4641, N25522, N14893, N19035, N3877,..."
2,U54180,"[N13528, N27689, N10879, N11662, N14409, N6849..."
3,U41164,"[N20150, N1807, N26916, N28138, N9576, N19737,..."
4,U8588,"[N21325, N5982, N19737, N9576, N20150, N25701,..."
...,...,...
7533,U23841,"[N26256, N28117, N2718, N16798, N27689, N6280,..."
7534,U28014,"[N26670, N12794, N3390, N17443, N27292, N21852..."
7535,U89684,"[N17443, N16798, N24553, N26096, N15927, N2625..."
7536,U92611,"[N14850, N26647, N272, N22751, N21398, N26916,..."


In [115]:
user_train_ids = []
news_rec_train_lists = []
news_rec_imp_train_lists = []
with open(train_behaviors_file, 'r') as rd:
        impr_index = 0
        for line in rd:
            uid, time, history, impr = line.strip("\n").split('\t')[-4:]
            impr_news = [i.split("-")[0] for i in impr.split()]
            impr_news_click = [i.split("-")[1] for i in impr.split()]
            
            user_train_ids.append(uid)
            news_rec_lists.append(impr_news)
            news_rec_imp_train_lists.append(impr_news_click)
user_rec_impr_df = pd.DataFrame({'user_id' : user_train_ids, 'news_id': news_rec_lists, 'click' : impr_news_click})
user_rec_impr_df

ValueError: All arrays must be of the same length

In [94]:
import os
import tempfile
import urllib
import zipfile

# Temporary folder for data we need during execution of this notebook (we'll clean up
# at the end, we promise)
temp_dir = os.path.join(tempfile.gettempdir(), 'mind')
os.makedirs(temp_dir, exist_ok=True)

# The dataset is split into training and validation set, each with a large and small version.
# The format of the four files are the same.
# For demonstration purpose, we will use small version validation set only.
base_url = 'https://mind201910small.blob.core.windows.net/release'
training_small_url = f'{base_url}/MINDsmall_train.zip'
validation_small_url = f'{base_url}/MINDsmall_dev.zip'
training_large_url = f'{base_url}/MINDlarge_train.zip'
validation_large_url = f'{base_url}/MINDlarge_dev.zip'

def download_url(url,
                 destination_filename=None,
                 progress_updater=None,
                 force_download=False,
                 verbose=True):
    """
    Download a URL to a temporary file
    """
    if not verbose:
        progress_updater = None
    # This is not intended to guarantee uniqueness, we just know it happens to guarantee
    # uniqueness for this application.
    if destination_filename is None:
        url_as_filename = url.replace('://', '_').replace('/', '_')
        destination_filename = \
            os.path.join(temp_dir,url_as_filename)
    if (not force_download) and (os.path.isfile(destination_filename)):
        if verbose:
            print('Bypassing download of already-downloaded file {}'.format(
                os.path.basename(url)))
        return destination_filename
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),
                                                 destination_filename),
              end='')
    urllib.request.urlretrieve(url, destination_filename, progress_updater)
    assert (os.path.isfile(destination_filename))
    nBytes = os.path.getsize(destination_filename)
    if verbose:
        print('...done, {} bytes.'.format(nBytes))
    return destination_filename

zip_path = download_url(validation_small_url, verbose=True)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

os.listdir(temp_dir)

news_path = os.path.join(temp_dir, 'news.tsv')
news_df = pd.read_table(news_path,
              header=None,
              names=[
                  'id', 'category', 'subcategory', 'title', 'abstract', 'url',
                  'title_entities', 'abstract_entities'
              ])
news_df.set_index("id", inplace=True)

Bypassing download of already-downloaded file MINDsmall_dev.zip


In [80]:
import torch
import torchtext

glove = torchtext.vocab.GloVe(name="6B", dim=50)

.vector_cache/glove.6B.zip: 862MB [02:39, 5.42MB/s]                               
100%|█████████▉| 399999/400000 [00:16<00:00, 24486.24it/s]


In [168]:
def get_category_similarity(nid_1, nid_2):
    if nid_1 not in news_df.index or nid_2 not in news_df.index:
        return 0
    
    cat1 = news_df.loc[nid_1]["category"]
    cat2 = news_df.loc[nid_2]["category"]
    if cat1 == cat2:
        return 0
    return 1-torch.cosine_similarity(glove[cat1].unsqueeze(0), glove[cat2].unsqueeze(0)).item()


def diversity_user(user_id, recs, pred_scores, k):
    score = 0.0
    count = 0.0
    combined_data = list(zip(recs, pred_scores))
    sorted_data = sorted(combined_data, key=lambda x: x[1])

    topk_pairs = sorted_data[:k]
    actual_set = user_clicks[user_id]
    predicted_set = set(topk_pairs)
    intersection_size = len(actual_set.intersection(predicted_set))
    union_size = len(actual_set.union(predicted_set))
    
    accuracy = intersection_size / union_size if union_size > 0 else 1.0
    for i in range(len(topk_pairs)):
        for j in range(i+1, len(topk_pairs)):
            count += 1.0
            score += (get_category_similarity(topk_pairs[i][0], topk_pairs[j][0]))
    return (score/count, accuracy)
        

def diversity_eval(df, top_k=5):
    diversity_score = 0.0
    accuracy_score = 0.0
    count = 0.0
    for index, row in df.iterrows():
        div_score, accuracy = diversity_user(row['user_id'], row['news_id'], row['pred'],top_k)
        diversity_score += div_score
        accuracy_score += accuracy
        
        count += 1.0
    return (diversity_score/count, accuracy_score/count)


In [107]:
# news_file = pd.read_csv(valid_news_file, sep='\t', header=None)
# news_dict = {}
# for i in range(len(news_file[0])):
#     news_dict[i] = news_file[0][i]
# news_dict

In [137]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

7538it [00:00, 30719.83it/s]


In [23]:
f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)
f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')
f.close()

In [24]:
a = ""
for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
    impr_index += 1
    pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
    pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
    a += ' '.join([str(impr_index), pred_rank])+ '\n'

7538it [00:00, 37373.05it/s]


In [112]:
#a

In [145]:
user_ids = []
news_rec_lists = []
pred_prob = []
i = 0
with open(valid_behaviors_file, 'r') as rd:
        impr_index = 0
        for line in rd:
            uid, time, history, impr = line.strip("\n").split('\t')[-4:]

            impr_news = [i.split("-")[0] for i in impr.split()]
            user_ids.append(uid)
            news_rec_lists.append(impr_news)
            pred_prob.append(group_preds[i])
            i+=1
user_rec_df = pd.DataFrame({'user_id' : user_ids, 'news_id': news_rec_lists, 'pred' : pred_prob})
user_rec_df

Unnamed: 0,user_id,news_id,pred
0,U41827,"[N23699, N21291, N1901, N27292, N17443, N18282...","[-0.23009975, -0.29446548, -0.31063515, -0.284..."
1,U61881,"[N26916, N4641, N25522, N14893, N19035, N3877,...","[-0.12344916, -0.19023736, -0.11021778, -0.353..."
2,U54180,"[N13528, N27689, N10879, N11662, N14409, N6849...","[-0.20686844, -0.24127425, -0.4198683, -0.4167..."
3,U41164,"[N20150, N1807, N26916, N28138, N9576, N19737,...","[0.0710382, -0.3026214, -0.18058312, -0.148491..."
4,U8588,"[N21325, N5982, N19737, N9576, N20150, N25701,...","[0.045522336, -0.15960245, -0.11037267, -0.096..."
...,...,...,...
7533,U23841,"[N26256, N28117, N2718, N16798, N27689, N6280,...","[-0.40440995, -0.27444282, -0.17943214, -0.174..."
7534,U28014,"[N26670, N12794, N3390, N17443, N27292, N21852...","[-0.51609087, -0.42063886, -0.37046754, -0.493..."
7535,U89684,"[N17443, N16798, N24553, N26096, N15927, N2625...","[-0.26791376, -0.15503119, -0.21906719, -0.547..."
7536,U92611,"[N14850, N26647, N272, N22751, N21398, N26916,...","[-0.21738777, -0.31038123, -0.06966715, -0.147..."


In [169]:
print(diversity_eval(user_rec_df, 5))

0.15471620098955108


In [125]:
arr = user_rec_df[user_rec_df['user_id'] == 'U41827']['pred'][0]

In [135]:
(arr)

array([-0.23009975, -0.29446548, -0.31063515, -0.28458637, -0.52098835,
       -0.40606982, -0.3717472 , -0.42147577, -0.4371817 , -0.21698667,
       -0.18725532, -0.23469843, -0.4772066 , -0.31384993, -0.33005893,
       -0.25722098, -0.05341837, -0.2671283 , -0.24351975, -0.30540362,
       -0.38846433, -0.26063126, -0.28985044, -0.16100723, -0.4308247 ,
       -0.14284346, -0.08029664, -0.3594852 ], dtype=float32)

In [131]:
np.argsort(arr)

array([ 4, 12,  8, 24,  7,  5, 20,  6, 27, 14, 13,  2, 19,  1, 22,  3, 17,
       21, 15, 18, 11,  0,  9, 10, 23, 25, 26, 16])

In [130]:
np.argsort(arr)[::-1]

array([16, 26, 25, 23, 10,  9,  0, 11, 18, 15, 21, 17,  3, 22,  1, 19,  2,
       13, 14, 27,  6, 20,  5,  7, 24,  8, 12,  4])

In [128]:
(np.argsort(np.argsort(arr)[::-1]) + 1)

array([ 7, 15, 17, 13, 28, 23, 21, 24, 26,  6,  5,  8, 27, 18, 19, 10,  1,
       12,  9, 16, 22, 11, 14,  4, 25,  3,  2, 20])

In [146]:
def normalize_array(arr):
    return (arr - np.min(arr)) / (np.max(arr) - np.min(arr))

user_rec_norm_df = user_rec_df
user_rec_norm_df['pred'] = user_rec_norm_df['pred'].apply(lambda x: normalize_array(x))
user_rec_norm_df

Unnamed: 0,user_id,news_id,pred
0,U41827,"[N23699, N21291, N1901, N27292, N17443, N18282...","[0.6221285, 0.48446837, 0.44988602, 0.505597, ..."
1,U61881,"[N26916, N4641, N25522, N14893, N19035, N3877,...","[0.7075086, 0.6030081, 0.72821116, 0.34779397,..."
2,U54180,"[N13528, N27689, N10879, N11662, N14409, N6849...","[0.50076586, 0.45842782, 0.23865907, 0.2425101..."
3,U41164,"[N20150, N1807, N26916, N28138, N9576, N19737,...","[1.0, 0.21484806, 0.47128087, 0.5387131, 0.522..."
4,U8588,"[N21325, N5982, N19737, N9576, N20150, N25701,...","[0.9240295, 0.27624315, 0.43171135, 0.47573537..."
...,...,...,...
7533,U23841,"[N26256, N28117, N2718, N16798, N27689, N6280,...","[0.20291752, 0.3977028, 0.54009795, 0.5481512,..."
7534,U28014,"[N26670, N12794, N3390, N17443, N27292, N21852...","[0.017089143, 0.32717785, 0.49016613, 0.089560..."
7535,U89684,"[N17443, N16798, N24553, N26096, N15927, N2625...","[0.3492988, 0.49014106, 0.4102441, 0.0, 0.3336..."
7536,U92611,"[N14850, N26647, N272, N22751, N21398, N26916,...","[0.3207426, 0.16066132, 0.5750326, 0.4410269, ..."


In [174]:
# nid_1 and nid_2 are news ids (strings)
# returns similarity of categories of nid_1 and nid_2 using cosine similarity
# if categories don't exist returns 0
def get_category_similarity(nid_1, nid_2):
    if nid_1 not in news_df.index or nid_2 not in news_df.index:
        return 0
    
    cat1 = news_df.loc[nid_1]["category"]
    cat2 = news_df.loc[nid_2]["category"]
    if cat1 == cat2:
        return 0
    return 1 - torch.cosine_similarity(glove[cat1].unsqueeze(0), glove[cat2].unsqueeze(0)).item()


# Calculates mmr score for a given item
# item: news id
# pred: relevance of item
# recs_so_far: list of news ids recommended so far
def mmr_item(item, pred, recs_so_far, lamda):
    return (lamda * pred) - (1 - lamda) * np.max([1 - get_category_similarity(item, x) for x in recs_so_far])

# Calculates list of recommendations
# recs is a list of news ids
# pred scores is a list of relevance scores, same order as recs
# lamda is a weight parameter
# k is how many items should be in the recommendation; assume k >= 1
def mmr_user(recs, pred_scores, lamda, k):
    combined_data = list(zip(recs, pred_scores))
    sorted_data = sorted(combined_data, key=lambda x: x[1], reverse=True)
    
    list_so_far = [sorted_data[0][0]]
    while len(list_so_far) < k:
        max_mmr = -2 # mmr can range from -1 to 1
        max_mmr_id = ''
        for i in range(0, len(recs)): #should be a better way to do this
            if recs[i] not in list_so_far:
                mmr_score = mmr_item(recs[i], pred_scores[i], list_so_far, lamda)
                if mmr_score > max_mmr:
                    max_mmr = mmr_score
                    max_mmr_id = recs[i]
        list_so_far.append(max_mmr_id)
    return list_so_far
        
# Calculates recommendations according to mmr for all users
# df is a Pandas dataframe with cols user, news_id, pred where pred[i] is the relevance score for news_id[i]
# lamda is a weight parameter
# k is how many items should be in the recommendation; assume k >= 1
def mmr_all(df, lamda, k):
    result_df = {}
    for index, row in df.iterrows():
        result_df[row['user_id']] = mmr_user(row['news_id'], row['pred'], lamda, k)
    return result_df


In [175]:
mmr_rerank_data = mmr_all(user_rec_norm_df, 0, 5)
mmr_rerank_df = pd.DataFrame(list(mmr_rerank_data.items()), columns=['user_id', 'news_id'])

mmr_rerank_df

Unnamed: 0,user_id,news_id
0,0,"[N17774, N1901, N23699, N21291, N13670]"
1,1,"[N21325, N14893, N19829, N25522, N3877]"
2,2,"[N28493, N10879, N20309, N28275, N13690]"
3,3,"[N20150, N28138, N19737, N1807, N26916]"
4,4,"[N20150, N10908, N21325, N5982, N19737]"
...,...,...
7533,7533,"[N28275, N28117, N18356, N27689, N697]"
7534,7534,"[N9576, N26670, N12794, N3390, N17443]"
7535,7535,"[N28275, N20309, N18356, N23094, N22207]"
7536,7536,"[N28275, N27137, N272, N18356, N13670]"


In [170]:
def diversity_user_mmr(recs):
    score = 0.0
    count = 0.0
    for i in range(len(recs)):
        for j in range(i+1, len(recs)):
            count += 1.0
            score += get_category_similarity(recs[i], recs[j])
    return score/count

def diversity_eval_mmr(df):
    diversity_score = 0.0
    count = 0.0
    for index, row in df.iterrows():
        diversity_score += diversity_user_mmr(row['news_id'])
        count += 1.0
    return diversity_score/count
        

In [176]:
print(diversity_eval_mmr(mmr_rerank_df))

0.4740145313376873


In [None]:
accuracies = []

for i, row in enumerate(mmr_rerank_df.iterrows()):
    actual_set = user_clicks[row['user_id']]
    predicted_set = set(row['news_id'])
    
    intersection_size = len(actual_set.intersection(predicted_set))
    union_size = len(actual_set.union(predicted_set))
    
    accuracy = intersection_size / union_size if union_size > 0 else 1.0
    accuracies.append(accuracy)

# Calculate mean accuracy
mean_accuracy = np.mean(accuracies)
print(mean_accuracy)

In [177]:
mmr_rerank_data_75 = mmr_all(user_rec_norm_df, 0.75, 5)
mmr_rerank_df_75 = pd.DataFrame(list(mmr_rerank_data_75.items()), columns=['user_id', 'news_id'])

mmr_rerank_df_75

Unnamed: 0,user_id,news_id
0,0,"[N17774, N9181, N21815, N14623, N23699]"
1,1,"[N21325, N19829, N27355, N18356, N27787]"
2,2,"[N28493, N21780, N28275, N21325, N20150]"
3,3,"[N20150, N21325, N28138, N9576, N19737]"
4,4,"[N20150, N21325, N10908, N9576, N19737]"
...,...,...
7533,7533,"[N28275, N20150, N21325, N18356, N7067]"
7534,7534,"[N9576, N21815, N23829, N26916, N21645]"
7535,7535,"[N28275, N20150, N21325, N20309, N18356]"
7536,7536,"[N28275, N24445, N21325, N27137, N18356]"


In [178]:
print(diversity_eval_mmr(mmr_rerank_df_75))

0.28146573441342115


In [None]:
accuracies = []

for i, row in enumerate(mmr_rerank_df_75.iterrows()):
    actual_set = user_clicks[row['user_id']]
    predicted_set = set(row['news_id'])
    
    intersection_size = len(actual_set.intersection(predicted_set))
    union_size = len(actual_set.union(predicted_set))
    
    accuracy = intersection_size / union_size if union_size > 0 else 1.0
    accuracies.append(accuracy)

# Calculate mean accuracy
mean_accuracy = np.mean(accuracies)
print(mean_accuracy)

## Reference
\[1\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>
\[2\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/