<a href="https://colab.research.google.com/github/cs229-mind/mind/blob/main/NRMS/NRMS_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NRMS: Neural News Recommendation with Multi-Head Self-Attention
The description below is adapted from the public MIND documentation.

[NRMS](https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf) is a neural news recommendation approach built on multi-head self-attention. Its core consists of a news encoder and a user encoder. In the news encoder, multi-head self-attention learns news representations from news titles by modeling the interactions between words. In the user encoder, user representations are learned from the news each user has browsed, with multi-head self-attention capturing the relatedness between those news items. In addition, additive attention is applied to learn more informative news and user representations by selecting important words and news.

## Properties of NRMS:
- NRMS is a content-based neural news recommendation approach.
- It uses multi-head self-attention to learn news representations by modeling the interactions between words, and to learn user representations by capturing the relationships between the news a user has browsed.
- NRMS uses additive attentions to learn informative news and user representations by selecting important words and news.
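
To make the additive attention step concrete, here is a minimal pure-Python sketch of attention pooling: each vector gets a score `q · tanh(W h + b)`, the scores are normalized with a softmax, and the output is the weighted sum of the inputs. The names `additive_attention`, `W`, `b` and `q` are illustrative and are not taken from the `mind_model` package; the toy values are made up.

```python
import math

def additive_attention(vectors, W, b, q):
    """Pool a list of d-dim vectors into a single d-dim vector.

    score_i = q . tanh(W h_i + b), weights = softmax(scores),
    output  = sum_i weights_i * h_i
    """
    scores = []
    for h in vectors:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(qj * uj for qj, uj in zip(q, hidden)))

    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(vectors[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, vectors)) for k in range(dim)]
    return pooled, weights

# Toy input: three 2-dim "word" vectors (hypothetical values).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # projection matrix (assumed shape: hidden x dim)
b = [0.0, 0.0]                 # projection bias
q = [1.0, 1.0]                 # attention query vector
pooled, weights = additive_attention(words, W, b, q)
print(weights)
```

In the news encoder the pooled vectors would be word representations, and in the user encoder they would be news representations; the same pooling applies to both.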

## Data format:
For quicker training and evaluation, we sample the MINDdemo dataset of 5k users from the [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to run experiments on MINDsmall or MINDlarge, change the download source by setting the MIND_type parameter to one of ['large', 'small', 'demo'].
 
**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Both the training data and the evaluation data consist of a news file and a behaviors file. A more detailed description of the data is available in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).

### news data
This file contains news information, including the news ID, category, subcategory, news title, news abstract, news URL, entities in the news title, and entities in the news abstract.
One simple example: <br>

`N46466	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]`
<br>

In general, each line in the data file describes one piece of news: <br>

`[News ID] [Category] [Subcategory] [News Title] [News Abstract] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

<br>
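
As a rough illustration of this layout, the sketch below splits one tab-separated news line into named fields and decodes the two entity columns as JSON. The column names and the `parse_news_line` helper are hypothetical, not part of `mind_model`; the sample is the example line above, abbreviated to a single title entity.

```python
import json

# Hypothetical column names for the eight tab-separated fields.
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]

def parse_news_line(line):
    """Split one news.tsv line into a dict and decode the entity lists."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_COLUMNS, fields))
    for key in ("title_entities", "abstract_entities"):
        raw = record.get(key, "")
        record[key] = json.loads(raw) if raw else []
    return record

sample = "\t".join([
    "N46466",
    "lifestyle",
    "lifestyleroyals",
    "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "Shop the notebooks, jackets, and more that the royals can't live without.",
    "https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata",
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]',
    "[]",
])

news = parse_news_line(sample)
print(news["news_id"], news["category"], news["title_entities"][0]["Label"])
```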

We generate a word_dict file to transform the words in news titles into word indexes, and an embedding matrix is initialized from pretrained GloVe embeddings.
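
The sketch below shows one way such a word dictionary and embedding matrix could be built. It assumes whitespace tokenization, reserves index 0 for padding and unknown words, and uses a tiny in-memory dictionary in place of real GloVe vectors; all function names and values are illustrative rather than the actual MIND preprocessing code.

```python
import random

def build_word_dict(titles, min_count=1):
    """Map each sufficiently frequent lowercased word to an index >= 1."""
    counts = {}
    for title in titles:
        for word in title.lower().split():
            counts[word] = counts.get(word, 0) + 1
    word_dict = {}
    for word in sorted(counts):
        if counts[word] >= min_count:
            word_dict[word] = len(word_dict) + 1  # index 0 is kept for padding
    return word_dict

def init_embedding_matrix(word_dict, pretrained, dim, seed=0):
    """Copy pretrained vectors where available, else use small random values."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_dict) + 1)]
    for word, idx in word_dict.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            matrix[idx] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return matrix

def encode_title(title, word_dict, max_len):
    """Convert a title to a fixed-length list of word indexes (0 = pad/unknown)."""
    ids = [word_dict.get(w, 0) for w in title.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

titles = ["The Brands Queen", "queen of brands"]
word_dict = build_word_dict(titles)
toy_pretrained = {"queen": [1.0, 2.0]}  # stand-in for GloVe vectors
embedding = init_embedding_matrix(word_dict, toy_pretrained, dim=2)
print(word_dict)
print(encode_title("Queen of Hearts", word_dict, max_len=5))
```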

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line in the data file represents one impression, in the following format: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History is the list of news the user clicked before Impression Time. Impression News is the list of news displayed in the impression, in the following format:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label indicates whether the user clicked the news (1) or not (0). All information about the news in User Click History and Impression News can be found in the news data file.
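
A behaviors line can be decoded along these lines. This is a minimal sketch: `parse_behavior_line` and its output keys are hypothetical names, and the sample is the example above shortened to three history items and three impression items.

```python
def parse_behavior_line(line):
    """Split one behaviors.tsv line into its five tab-separated fields."""
    impression_id, user_id, time_str, history, impressions = (
        line.rstrip("\n").split("\t")
    )
    candidates, labels = [], []
    for item in impressions.split():
        # Each impression item looks like "N15368-1": news ID, dash, click label.
        news_id, label = item.rsplit("-", 1)
        candidates.append(news_id)
        labels.append(int(label))
    return {
        "impression_id": int(impression_id),
        "user_id": user_id,
        "time": time_str,
        "history": history.split(),  # empty list if the user has no history
        "candidates": candidates,
        "labels": labels,
    }

sample = "1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621 N12917\tN13390-0 N7180-0 N15368-1"
behavior = parse_behavior_line(sample)
print(behavior["history"], behavior["candidates"], behavior["labels"])
```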




## Global settings and imports

In [1]:

import sys
#define util module path
path = '/home/rain/mind/NRMS'
sys.path.append(path)
data_path = '/home/rain/data'

import os
import numpy as np
import zipfile
from tqdm import tqdm
#import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from mind_model.dataloader import DataLoaderTrain, DataLoaderTest, test_iterator, train_iterator
from torch.utils.data import Dataset, DataLoader
from mind_model.preprocess import read_news, read_news_bert, get_doc_input, get_doc_input_bert
from mind_model.model_bert import ModelBert
from mind_model.parameters import parse_args
import torch.optim as optim
from pathlib import Path
import mind_model.utils as utils
import csv
import logging
import datetime
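# AdamW and get_scheduler are referenced by the optimizer setup in the training cell.
from transformers import AutoTokenizer, AutoModel, AutoConfig, AdamW, get_scheduler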
import torch
from torch import nn
from mind_model.nrms import TwoTowerModel as TwoTower

from mind_model.metrics import roc_auc_score, ndcg_score, mrr_score, ctr_score, cal_metric
import time

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))
print("Pytorch version: {}".format(torch.__version__))



System version: 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0]
Tensorflow version: 2.4.1
Pytorch version: 1.11.0.dev20211124+cu113


## Prepare parameters

In [10]:
utils.setuplogger()
args = parse_args()
args.enable_hvd = False
args.batch_size = 64
args.news_dim = 400
args.word_emb_size = 768
#data_path = '/content/gdrive/My Drive' # '/content/drive/MyDrive'
train_dir = os.path.join(data_path, 'MINDsmall_train')
train_news_file = os.path.join(train_dir, r'news.tsv')
train_behaviors_file = os.path.join(train_dir, r'behaviors.tsv')

valid_dir = os.path.join(data_path, 'MINDsmall_dev')
valid_news_file = os.path.join(valid_dir, r'news.tsv')
valid_behaviors_file = os.path.join(valid_dir, r'behaviors.tsv')

test_dir = os.path.join(data_path, 'MINDsmall_test')
test_news_file = os.path.join(test_dir, r'news.tsv')
test_behaviors_file = os.path.join(test_dir, r'behaviors.tsv')
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert_model = AutoModel.from_pretrained("bert-base-uncased",config=config)
bert_model = bert_model.cuda()

finetuneset={
'encoder.layer.10.attention.self.query.weight',
'encoder.layer.10.attention.self.query.bias',
'encoder.layer.10.attention.self.key.weight',
'encoder.layer.10.attention.self.key.bias',
'encoder.layer.10.attention.self.value.weight',
'encoder.layer.10.attention.self.value.bias',
'encoder.layer.10.attention.output.dense.weight',
'encoder.layer.10.attention.output.dense.bias',
'encoder.layer.10.attention.output.LayerNorm.weight',
'encoder.layer.10.attention.output.LayerNorm.bias',
'encoder.layer.10.intermediate.dense.weight',
'encoder.layer.10.intermediate.dense.bias',
'encoder.layer.10.output.dense.weight',
'encoder.layer.10.output.dense.bias',
'encoder.layer.10.output.LayerNorm.weight',
'encoder.layer.10.output.LayerNorm.bias',
'encoder.layer.11.attention.self.query.weight',
'encoder.layer.11.attention.self.query.bias',
'encoder.layer.11.attention.self.key.weight',
'encoder.layer.11.attention.self.key.bias',
'encoder.layer.11.attention.self.value.weight',
'encoder.layer.11.attention.self.value.bias',
'encoder.layer.11.attention.output.dense.weight',
'encoder.layer.11.attention.output.dense.bias',
'encoder.layer.11.attention.output.LayerNorm.weight',
'encoder.layer.11.attention.output.LayerNorm.bias',
'encoder.layer.11.intermediate.dense.weight',
'encoder.layer.11.intermediate.dense.bias',
'encoder.layer.11.output.dense.weight',
'encoder.layer.11.output.dense.bias',
'encoder.layer.11.output.LayerNorm.weight',
'encoder.layer.11.output.LayerNorm.bias',
'pooler.dense.weight',
'pooler.dense.bias',
'rel_pos_bias.weight',
'classifier.weight',
'classifier.bias'}
for name,param in bert_model.named_parameters():
    if name not in finetuneset:
        param.requires_grad = False


[INFO 2021-11-24 18:29:03,730] Namespace(mode='test', root_data_dir='~/mind/', dataset='MIND', train_dir='train', test_dir='test', filename_pat='behaviors*.tsv', model_dir='./model', batch_size=512, npratio=1, enable_gpu=True, enable_hvd=True, shuffle_buffer_size=10000, num_workers=2, filter_num=0, log_steps=100, epochs=4, lr=0.0001, news_attributes=['title'], process_uet=False, process_bing=False, num_words_title=24, num_words_abstract=50, num_words_body=50, num_words_uet=16, num_words_bing=26, user_log_length=50, word_embedding_dim=768, embedding_source='random', freeze_embedding=False, use_padded_news_embedding=False, padded_news_different_word_index=False, news_dim=64, news_query_vector_dim=200, user_query_vector_dim=200, num_attention_heads=20, user_log_mask=True, drop_rate=0.2, save_steps=500, max_steps_per_epoch=200000, load_ckpt_name='epoch-1-40000.pt', title_share_encoder=False, use_pretrain_news_encoder=False, pretrain_news_encoder_path='.', uet_agg_method='attention', model_

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
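
The hand-written `finetuneset` above follows a regular pattern: every weight and bias of the last two encoder layers, plus a few extra names. As a hedged sketch, the same set could be generated from the layer indexes; `bert_layer_param_names` and `build_finetune_set` are hypothetical helpers, and the submodule list is copied from the names in the cell above.

```python
def bert_layer_param_names(layer_idx):
    """Return the 16 parameter names of one encoder layer, as listed above."""
    prefix = "encoder.layer.{}".format(layer_idx)
    modules = [
        "attention.self.query", "attention.self.key", "attention.self.value",
        "attention.output.dense", "attention.output.LayerNorm",
        "intermediate.dense", "output.dense", "output.LayerNorm",
    ]
    return {
        "{}.{}.{}".format(prefix, module, kind)
        for module in modules
        for kind in ("weight", "bias")
    }

def build_finetune_set(layers=(10, 11)):
    """Union of the chosen layers' parameters and the extra names from the cell."""
    names = set()
    for layer_idx in layers:
        names |= bert_layer_param_names(layer_idx)
    names |= {
        "pooler.dense.weight", "pooler.dense.bias",
        "rel_pos_bias.weight", "classifier.weight", "classifier.bias",
    }
    return names

generated = build_finetune_set()
print(len(generated))
```

Changing `layers` to, say, `(9, 10, 11)` would unfreeze one more encoder layer without editing a long literal set.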


## Download and load data

In [3]:
import transformers
print(transformers.__version__)

4.12.5


In [4]:
# def train_iterator(train_dir, args, tokenizer):
    
#     train_news_file = os.path.join(train_dir, r'news.tsv')
    
#     news, news_index, category_dict, domain_dict, subcategory_dict = read_news_bert(
#         train_news_file, 
#         args,
#         tokenizer
#     )

#     news_title, news_title_type, news_title_attmask, \
#     news_abstract, news_abstract_type, news_abstract_attmask, \
#     news_body, news_body_type, news_body_attmask, \
#     news_category, news_domain, news_subcategory = get_doc_input_bert(
#         news, news_index, category_dict, domain_dict, subcategory_dict, args)

#     word_dict = None
#     #args.enable_hvd = False
#     hvd_size, hvd_rank, hvd_local_rank = utils.init_hvd_cuda(
#             args.enable_hvd, args.enable_gpu)
#     news_combined = np.concatenate([
#             x for x in
#             [news_title, news_title_type, news_title_attmask, \
#                 news_abstract, news_abstract_type, news_abstract_attmask, \
#                 news_body, news_body_type, news_body_attmask, \
#                 news_category, news_domain, news_subcategory]
#             if x is not None], axis=1)
#     dataloader = DataLoaderTrain(
#             news_index=news_index,
#             news_combined=news_combined,
#             word_dict=word_dict,
#             data_dir=train_dir,
#             filename_pat=args.filename_pat,
#             args=args,
#             worker_size=hvd_size,
#             worker_rank=hvd_rank,
#             cuda_device_idx=hvd_local_rank,
#             enable_prefetch=True,
#             enable_shuffle=True,
#             enable_gpu=args.enable_gpu,
#         )
#     return {"iterator": dataloader,
#             "category_dict":category_dict,
#             "domain_dict":domain_dict,
#             "subcategory_dict":subcategory_dict,
#            }


51282it [00:08, 5906.40it/s]
100%|██████████| 51282/51282 [00:00<00:00, 149789.37it/s]


## Train Model

In [5]:
train_load = train_iterator(train_dir=train_dir, args=args, tokenizer=tokenizer)
category_dict = train_load["category_dict"]
domain_dict = train_load["domain_dict"]
subcategory_dict = train_load["subcategory_dict"]
dataloader = train_load["iterator"]
#from mind_model.nrms import TwoTowerModel as TwoTower
model = TwoTower(args, bert_model, len(category_dict), len(domain_dict), len(subcategory_dict))
args.optimizer = 'Adam'
args.enable_lr_scheduler = False
if args.enable_gpu:
    model = model.cuda()
if args.optimizer == 'Adam':
    optimizer = optim.Adam(model.parameters(), lr=args.lr)
elif args.optimizer == 'AdamW':
    optimizer = AdamW(model.parameters(), lr=args.lr, weight_decay=args.weight_decay, correct_bias=args.correct_bias)
else:
    optimizer = AdamW(model.parameters(), lr=args.lr)
if args.enable_lr_scheduler:
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=num_training_steps
    )
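logging.info('Training...')
LOSS, ACC = [], []
args.epochs=1
args.max_steps_per_epoch = 1000
# hvd_rank labels the progress bar below. Assumption: Horovod is disabled
# (args.enable_hvd = False), so this is a single-worker run with rank 0.
hvd_rank = 0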
for ep in range(args.epochs):
    loss = 0.0
    accuracy = 0.0
    tqdm_util = tqdm(enumerate(dataloader))
    for cnt, (log_ids, log_mask, input_ids, targets) in tqdm_util:
        if cnt > args.max_steps_per_epoch:
            break

        if args.enable_gpu:
            log_ids = log_ids.cuda(non_blocking=True)
            log_mask = log_mask.cuda(non_blocking=True)
            input_ids = input_ids.cuda(non_blocking=True)
            #user_ids = user_ids.cuda(non_blocking=True)
            targets = targets.cuda(non_blocking=True)

        bz_loss, y_hat = model(input_ids, log_ids, log_mask, targets)
        # summary(model, [input_ids.shape, log_ids.shape, log_mask.shape, targets.shape], batch_size=16, device='cuda' if args.enable_gp else 'cpu')
        optimizer.zero_grad()
        bz_loss.backward()
        optimizer.step()
        if args.enable_lr_scheduler:
            lr_scheduler.step()

        loss += (bz_loss.data.float() - loss) / (cnt + 1)
        accuracy += (utils.acc(targets, y_hat) - accuracy) / (cnt + 1)
        if cnt % args.log_steps == 0:
            LOSS.append(loss.data)
            ACC.append(accuracy)
            #logging.info(
            tqdm_util.set_description(
                '[{}] Ed: {}, train_loss: {:.5f}, acc: {:.5f}'.format(
                    hvd_rank, cnt * args.batch_size, loss.data,
                    accuracy))

        # save model minibatch
        #logging.info('[{}] Ed: {} {} {}'.format(hvd_rank, cnt, args.save_steps, cnt % args.save_steps))
    logging.info('epoch: {} loss: {:.5f} accuracy {:.5f}'.format(ep + 1, loss, accuracy))


[INFO 2021-11-24 16:47:50,467] Training...
[INFO 2021-11-24 16:47:50,468] DataLoader __iter__()
get_files: /home/rain/data/MINDsmall_train, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/rain/data/MINDsmall_train/behaviors.tsv']
[INFO 2021-11-24 16:47:50,472] worker_rank:0, worker_size:1, shuffle:True, seed:0, directory:/home/rain/data/MINDsmall_train, files:['/home/rain/data/MINDsmall_train/behaviors.tsv']
[INFO 2021-11-24 16:47:50,493] data_paths: ['/home/rain/data/MINDsmall_train/behaviors.tsv']
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'


  user_feature_batch = torch.LongTensor(user_feature_batch).cuda()
[0] Ed: 2000, train_loss: 0.92485, acc: 0.53447: : 1001it [01:48,  9.20it/s]

[INFO 2021-11-24 16:49:39,250] epoch: 1 loss: 0.92485 accuracy 0.53447
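
The training loop tracks loss and accuracy with an incremental mean, `loss += (bz_loss - loss) / (cnt + 1)`, instead of storing every batch value. The small sketch below checks that this update reproduces the ordinary average; `update_running_mean` and the batch values are made up for illustration.

```python
def update_running_mean(mean, value, step):
    """Incremental mean update; `step` is the 0-based index of `value`."""
    return mean + (value - mean) / (step + 1)

batch_losses = [0.9, 1.1, 1.0, 0.8]  # hypothetical per-batch losses
running = 0.0
for cnt, value in enumerate(batch_losses):
    running = update_running_mean(running, value, cnt)

plain_mean = sum(batch_losses) / len(batch_losses)
print(running, plain_mean)
```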





In [None]:
#args.log_steps

In [11]:
# #args.max_steps_per_epoch
# class NewsDataset(Dataset):
#     def __init__(self, data):
#         self.data = data

#     def __getitem__(self, idx):
#         return self.data[idx]

#     def __len__(self):
#         return self.data.shape[0]
    
    
# def news_collate_fn(arr):
#     arr = torch.LongTensor(arr)
#     return arr


# def test_iterator(model, valid_dir, args, tokenizer):
#     valid_news_file = os.path.join(valid_dir, r'news.tsv')
#     news, news_index, category_dict, domain_dict, subcategory_dict = read_news_bert(
#         valid_news_file, 
#         args,
#         tokenizer
#     )

#     news_title, news_title_type, news_title_attmask, \
#     news_abstract, news_abstract_type, news_abstract_attmask, \
#     news_body, news_body_type, news_body_attmask, \
#     news_category, news_domain, news_subcategory = get_doc_input_bert(
#         news, news_index, category_dict, domain_dict, subcategory_dict, args)

#     word_dict = None
#     #args.enable_hvd = False
#     hvd_size, hvd_rank, hvd_local_rank = utils.init_hvd_cuda(
#             args.enable_hvd, args.enable_gpu)
#     news_combined = np.concatenate([
#             x for x in
#             [news_title, news_title_type, news_title_attmask, \
#                 news_abstract, news_abstract_type, news_abstract_attmask, \
#                 news_body, news_body_type, news_body_attmask, \
#                 news_category, news_domain, news_subcategory]
#             if x is not None], axis=1)

#     news_dataset = NewsDataset(news_combined)
#     news_dataloader = DataLoader(news_dataset,
#                                 batch_size=args.batch_size * 4,
#                                 num_workers=args.num_workers,
#                                 collate_fn=news_collate_fn)

#     news_scoring = []
#     with torch.no_grad():
#         for input_ids in tqdm(news_dataloader):
#             if args.enable_gpu:
#                 input_ids = input_ids.cuda()
#             news_vec = model.news_encoder(input_ids)
#             news_vec = news_vec.to(torch.device("cpu")).detach().numpy()
#             news_scoring.extend(news_vec)

#     news_scoring = np.array(news_scoring)    

#     dev_dataloader = DataLoaderTest(
#             news_index=news_index,
#             news_scoring=news_scoring,
#             word_dict=word_dict,
#             data_dir=valid_dir,
#             filename_pat=args.filename_pat,
#             args=args,
#             worker_size=hvd_size,
#             worker_rank=hvd_rank,
#             cuda_device_idx=hvd_local_rank,
#             enable_prefetch=True,
#             enable_shuffle=True,
#             enable_gpu=args.enable_gpu,
#         )
#     return {"iterator": dev_dataloader,
#             "category_dict":category_dict,
#             "domain_dict":domain_dict,
#             "subcategory_dict":subcategory_dict,
#            }


42416it [00:07, 5939.14it/s]
100%|██████████| 42416/42416 [00:00<00:00, 144090.45it/s]
100%|██████████| 166/166 [00:27<00:00,  5.95it/s]


In [32]:
news_scoring.shape

(42417, 400)

In [None]:

start_time = time.time()

AUC, MRR, nDCG5, nDCG10, SCORE = [], [], [], [], []
count = 0
# Assumption: single-process run (Horovod disabled), so the local rank is 0.
hvd_local_rank = 0
outfile = os.path.join(valid_dir, "prediction_{}_{}.tsv".format(datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S"), hvd_local_rank))


# def get_mean(arr):
#     return [np.array(i).mean() for i in arr]

# def write_score(SCORE, outfile):
#     # format the score: ImpressionID [Rank-of-News1,Rank-of-News2,...,Rank-of-NewsN]
#     for score in tqdm(SCORE):
#         argsort = np.argsort(-score[1])
#         ranks = np.empty_like(argsort)
#         ranks[argsort] = np.arange(len(score[1]))
#         score[1] = (ranks + 1).tolist()

#     # save the prediction result
#     def write_tsv(score):
#         with open(outfile, 'a') as out_file:
#             tsv_writer = csv.writer(out_file, delimiter='\t')
#             tsv_writer.writerows(score)
#     write_tsv(SCORE)
# tqdm_util_eval = tqdm(enumerate(dev_dataloader))
# def print_metrics(hvd_local_rank, cnt, x):
#     tqdm_util.set_description("[{}] Ed: {}: {}".format(hvd_local_rank, cnt, \
#         '\t'.join(["{:0.2f}".format(i * 100) for i in x])))
# def cal_metric(labels, preds, metrics):
#     """Calculate metrics.

#     Available options are: `auc`, `rmse`, `logloss`, `acc` (accurary), `f1`, `mean_mrr`,
#     `ndcg` (format like: ndcg@2;4;6;8), `hit` (format like: hit@2;4;6;8), `group_auc`.

#     Args:
#         labels (array-like): Labels.
#         preds (array-like): Predictions.
#         metrics (list): List of metric names.

#     Return:
#         dict: Metrics.

#     Examples:
#         >>> cal_metric(labels, preds, ["ndcg@2;4;6", "group_auc"])
#         {'ndcg@2': 0.4026, 'ndcg@4': 0.4953, 'ndcg@6': 0.5346, 'group_auc': 0.8096}

#     """
#     res = {}
#     for metric in metrics:
#         if metric == "auc":
#             auc = roc_auc_score(np.asarray(labels), np.asarray(preds))
#             res["auc"] = round(auc, 4)
#         elif metric == "mean_mrr":
#             mean_mrr = np.mean(
#                 [
#                     mrr_score(each_labels, each_preds)
#                     for each_labels, each_preds in zip(labels, preds)
#                 ]
#             )
#             res["mean_mrr"] = round(mean_mrr, 4)
#         elif metric.startswith("ndcg"):  # format like:  ndcg@2;4;6;8
#             ndcg_list = [1, 2]
#             ks = metric.split("@")
#             if len(ks) > 1:
#                 ndcg_list = [int(token) for token in ks[1].split(";")]
#             for k in ndcg_list:
#                 ndcg_temp = np.mean(
#                     [
#                         ndcg_score(each_labels, each_preds, k)
#                         for each_labels, each_preds in zip(labels, preds)
#                     ]
#                 )
#                 res["ndcg@{0}".format(k)] = round(ndcg_temp, 4)
#         elif metric == "group_auc":
#             group_auc = np.mean(
#                 [
#                     roc_auc_score(each_labels, each_preds)
#                     for each_labels, each_preds in zip(labels, preds)
#                 ]
#             )
#             res["group_auc"] = round(group_auc, 4)
#         else:
#             raise ValueError("Metric {0} not defined".format(metric))
#     return res

test_load = test_iterator(model=model, valid_dir=valid_dir, args=args, tokenizer=tokenizer)

dev_dataloader = test_load["iterator"]


eval_metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']
group_impr_indexes = []
group_labels = []
group_preds = []

tqdm_util_eval = tqdm(enumerate(dev_dataloader))

with torch.no_grad():    
    for cnt, (impression_ids, log_vecs, log_mask, news_vecs, news_bias, labels) in tqdm_util_eval:
        count = cnt
        if args.enable_gpu:
            #user_ids = user_ids.cuda(non_blocking=True)
            log_vecs = log_vecs.cuda(non_blocking=True)
            log_mask = log_mask.cuda(non_blocking=True)
        user_vecs = model.user_encoder(log_vecs, log_mask).to(torch.device("cpu")).detach().numpy()

        for impression_id, user_vec, news_vec, bias, label in zip(
                impression_ids, user_vecs, news_vecs, news_bias, labels):
            
            score = np.dot(
                news_vec, user_vec
            )
            group_impr_indexes.append(impression_id)
            group_labels.append(label)
            group_preds.append(score)
            
    # print and save metrics
    #logging.info("Print final metrics")
    print("Print final metrics")
    eval_res = cal_metric(labels=group_labels, preds=group_preds, metrics=eval_metrics)
    eval_info = ", ".join(
                [
                    str(item[0]) + ":" + str(item[1])
                    for item in sorted(eval_res.items(), key=lambda x: x[0])
                ]
            )

    logging.info(f"Time taken: {time.time() - start_time}")
    print(f"eval info: {eval_info} and Time taken: {time.time() - start_time}")

[INFO 2021-11-24 22:14:55,895] DataLoader __iter__()
[INFO 2021-11-24 22:14:55,898] shut down pool.
get_files: /home/rain/data/MINDsmall_dev, behaviors*.tsv

0it [00:00, ?it/s]


files: ['/home/rain/data/MINDsmall_dev/behaviors.tsv']
[INFO 2021-11-24 22:14:55,908] worker_rank:0, worker_size:1, shuffle:True, seed:2, directory:/home/rain/data/MINDsmall_dev, files:['/home/rain/data/MINDsmall_dev/behaviors.tsv']
[INFO 2021-11-24 22:14:55,910] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-11-24 22:14:55,912] [StreamReader] path_len:1, paths: ['/home/rain/data/MINDsmall_dev/behaviors.tsv']


1143it [02:42,  7.25it/s]
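
The evaluation reports `group_auc`, `mean_mrr` and `ndcg@5;10`, each computed per impression and then averaged. The pure-Python sketch below follows the commonly used definitions of these metrics; the implementation inside `mind_model.metrics` is not shown in this notebook, so treat the formulas here as an assumption rather than a copy of it. Each impression needs at least one clicked and one non-clicked item for AUC to be defined.

```python
import math

def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mrr(labels, scores):
    """Sum of reciprocal ranks of clicked items, divided by the number of clicks."""
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda t: -t[0])]
    return sum(l / (i + 1) for i, l in enumerate(ranked)) / sum(labels)

def dcg(labels, scores, k):
    """Discounted cumulative gain of the top-k items ordered by score."""
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda t: -t[0])][:k]
    return sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ranked))

def ndcg(labels, scores, k):
    """DCG normalized by the DCG of the ideal ordering."""
    return dcg(labels, scores, k) / dcg(labels, labels, k)

def mean_over_impressions(metric, group_labels, group_preds):
    values = [metric(l, s) for l, s in zip(group_labels, group_preds)]
    return sum(values) / len(values)

# Two toy impressions (hypothetical labels and scores).
toy_labels = [[0, 0, 1, 0], [1, 0]]
toy_preds = [[0.1, 0.4, 0.3, 0.2], [0.9, 0.2]]
results = {
    "group_auc": mean_over_impressions(auc, toy_labels, toy_preds),
    "mean_mrr": mean_over_impressions(mrr, toy_labels, toy_preds),
    "ndcg@5": mean_over_impressions(lambda l, s: ndcg(l, s, 5), toy_labels, toy_preds),
}
print(results)
```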

In [27]:
#np.array(news_vecs).shape, user_vecs.shape

for new_vec in news_vecs[:10]:
    print(new_vec.shape)
user_vecs.shape

(20, 400)
(9, 400)
(54, 400)
(55, 400)
(26, 400)
(9, 400)
(7, 400)
(53, 400)
(36, 400)
(19, 400)


(64, 400)

In [29]:
(news_vecs[0] @ user_vecs[0]).shape

(20,)

In [24]:
user_vecs.shape
#score = torch.bmm(news_vec, user_vecs.expand_dims(-1)).squeeze(dim=-1)

(64, 400)

In [25]:

#loss = criterion(score, targets)


np.dot(
                np.stack([i for i in news_vecs], axis=0),
                user_vecs,
            )

ValueError: all input arrays must have the same shape
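
The `ValueError` above arises because each impression has a different number of candidate news items, so the per-impression arrays cannot be stacked into one tensor. A minimal sketch of the per-impression alternative, using random stand-in vectors (the shapes and the `score_impressions` name are illustrative):

```python
import numpy as np

def score_impressions(news_vecs, user_vecs):
    """Score each impression separately.

    news_vecs: list of arrays, one per impression, each of shape (n_i, d)
    user_vecs: array of shape (batch, d)
    Returns a list of score arrays, the i-th of shape (n_i,).
    """
    return [np.dot(cands, user) for cands, user in zip(news_vecs, user_vecs)]

rng = np.random.default_rng(0)
dim = 4
toy_news_vecs = [rng.normal(size=(n, dim)) for n in (3, 5, 2)]  # ragged candidates
toy_user_vecs = rng.normal(size=(3, dim))

toy_scores = score_impressions(toy_news_vecs, toy_user_vecs)
print([s.shape for s in toy_scores])  # [(3,), (5,), (2,)]
```

Batching these products into a single call would require padding every impression to the same number of candidates and masking the padded scores afterwards.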

In [None]:
loss.data

## Train the NRMS model

In [None]:
iterator = MINDIterator

In [None]:
model = NRMSModel(hparams, iterator, seed=seed)

In [None]:
_= model.test_iterator.load_data_from_file(valid_news_file, valid_behaviors_file)

In [None]:
dir(model)

In [None]:
dir(model.test_iterator)


In [None]:
??model.test_iterator.load_data_from_file

In [None]:
#news_file
model.test_iterator.init_news(valid_news_file)

In [None]:
print(model.run_eval(valid_news_file, valid_behaviors_file))

In [None]:
%%time
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

In [None]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)


In [None]:
sb.glue("res_syn", res_syn)

## Save the model

In [None]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

## Output Prediction File
This code segment generates the prediction.zip file, which follows the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html).

In [None]:
test_news_file = os.path.join(data_path, 'MINDlarge_test', r'news.tsv')
test_behaviors_file = os.path.join(data_path, 'MINDlarge_test', r'behaviors.tsv')


In [None]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(test_news_file, test_behaviors_file)

In [None]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')
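
The double `argsort` in the cell above converts scores into 1-based ranks, where rank 1 is the highest score. The pure-Python sketch below spells out the same conversion and the resulting line format; the helper names are hypothetical, and ties may be ordered differently than with NumPy.

```python
def scores_to_ranks(scores):
    """Return the 1-based rank of each score (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

def format_prediction_line(impression_id, scores):
    """Format one line as: ImpressionID [rank1,rank2,...]"""
    ranks = scores_to_ranks(scores)
    return "{} [{}]".format(impression_id, ",".join(str(r) for r in ranks))

line = format_prediction_line(1, [0.1, 0.7, 0.4])
print(line)  # 1 [3,1,2]
```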

In [None]:
# The context manager closes the archive even if writing fails.
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

## Reference
\[1\] Wu, Chuhan, et al. "Neural News Recommendation with Multi-Head Self-Attention." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).<br>
\[2\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/