
# NRMS: Neural News Recommendation with Multi-Head Self-Attention
The following description is adapted from the public MIND documentation.

[NRMS](https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf) is a neural news recommendation approach with multi-head self-attention. The core of NRMS is a news encoder and a user encoder. In the news encoder, multi-head self-attention is used to learn news representations from news titles by modeling the interactions between words. In the user encoder, we learn representations of users from their browsed news and use multi-head self-attention to capture the relatedness between the news. In addition, we apply additive attention to learn more informative news and user representations by selecting important words and news.

## Properties of NRMS:
- NRMS is a content-based neural news recommendation approach.
- It uses multi-head self-attention to learn news representations by modeling the interactions between words, and to learn user representations by capturing the relationships between the news a user has browsed.
- NRMS uses additive attentions to learn informative news and user representations by selecting important words and news.
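
To make the additive-attention pooling step concrete, here is a minimal pure-Python sketch. It assumes the usual formulation, where each vector gets a score `q · tanh(W h + b)` and the scores are normalised with a softmax. The function name `additive_attention_pool` and the toy weights are illustrative only and are not taken from the `mind_model` package used later in this notebook.

```python
import math

def additive_attention_pool(vectors, W, b, q):
    """Pool a list of vectors into one vector with additive attention.

    vectors: list of n vectors, each of length d
    W: projection matrix as a list of k rows, each of length d
    b: bias vector of length k
    q: query vector of length k
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in vectors:
        # Project h, apply tanh, then score against the query vector.
        proj = [math.tanh(sum(w_j * h_j for w_j, h_j in zip(row, h)) + b_i)
                for row, b_i in zip(W, b)]
        scores.append(sum(q_i * p_i for q_i, p_i in zip(q, proj)))

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the input vectors.
    dim = len(vectors[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


# Toy example: three 2-d "word" vectors with an identity projection.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
q = [1.0, 1.0]
pooled, weights = additive_attention_pool(words, W, b, q)
```

In this toy case the third vector receives the largest weight because both of its components contribute to the score. In NRMS the same kind of pooling is applied twice: over words to form a news vector, and over browsed news to form a user vector.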

## Data format:
For quicker training and evaluation, we sample the MINDdemo dataset of 5k users from the [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. To run experiments on MINDsmall or MINDlarge, change the download source by setting the MIND_type parameter to one of ['large', 'small', 'demo'].
 
**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Training data and evaluation data each consist of a news file and a behaviors file. You can find a more detailed data description in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).

### news data
This file contains news information including the news ID, category, subcategory, news title, news abstract, news URL, and the entities in the news title and news abstract.
One simple example: <br>

`N46466	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]`
<br>

In general, each line in the data file describes one news article: <br>

`[News ID] [Category] [Subcategory] [News Title] [News Abstract] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

<br>
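
As a rough illustration of the layout above, the following sketch splits one tab-separated news line into named fields and decodes the two entity columns as JSON. The helper `parse_news_line` and the column names are hypothetical, and the sample line is a shortened version of the example above with a placeholder URL.

```python
import json

NEWS_COLUMNS = ["news_id", "category", "subcategory", "title",
                "abstract", "url", "title_entities", "abstract_entities"]

def parse_news_line(line):
    """Split one tab-separated news line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_COLUMNS, fields))
    # The two entity columns hold JSON lists; decode them when present.
    for key in ("title_entities", "abstract_entities"):
        record[key] = json.loads(record.get(key) or "[]")
    return record


sample = "\t".join([
    "N46466", "lifestyle", "lifestyleroyals",
    "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "Shop the notebooks, jackets, and more that the royals can't live without.",
    "https://example.com/placeholder-url",  # placeholder, not the real URL
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], '
    '"SurfaceForms": ["Queen Elizabeth"]}]',
    "[]",
])
news = parse_news_line(sample)
```

Each decoded entity is a plain dict, so fields such as `news["title_entities"][0]["WikidataId"]` can be read directly.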

We generate a word_dict file to transform the words in news titles into word indexes, and an embedding matrix is initialized from pretrained GloVe embeddings.
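
The sketch below shows one way such a word dictionary and embedding matrix could be built. It is an illustration under stated assumptions, not the pipeline used by MIND: index 0 is reserved for padding, unknown words get small random vectors, and a tiny in-memory dict named `pretrained` stands in for vectors read from a GloVe file. All function names are hypothetical.

```python
import random

def build_word_dict(titles):
    """Map each lowercased title word to an index; 0 is reserved for padding."""
    word_dict = {}
    for title in titles:
        for word in title.lower().split():
            if word not in word_dict:
                word_dict[word] = len(word_dict) + 1
    return word_dict

def build_embedding_matrix(word_dict, pretrained, dim, seed=0):
    """Row i holds the vector for the word with index i.

    Words found in `pretrained` copy that vector; the rest get small
    random values. Row 0 (padding) stays all zeros.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_dict) + 1)]
    for word, idx in word_dict.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            matrix[idx] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return matrix


titles = ["Prince Charles visits", "Prince Philip"]
# Stand-in for vectors loaded from a pretrained GloVe text file.
pretrained = {"prince": [0.1, 0.2, 0.3], "charles": [0.4, 0.5, 0.6]}
word_dict = build_word_dict(titles)
embeddings = build_embedding_matrix(word_dict, pretrained, dim=3)
title_ids = [word_dict[w] for w in "prince philip".split()]
```

Note that the cells further down in this notebook tokenize titles with a BERT tokenizer instead, so this dictionary step applies to the GloVe-based setup described here.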

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line in the data file represents one impression. The format is: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History is the list of news the user clicked before Impression Time. Impression News is the list of news displayed in an impression, in the following format:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label indicates whether the user clicked the news. Full information for every news item in User Click History and Impression News can be found in the news data file.
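
A minimal sketch of reading one behaviors line follows. The helper `parse_behavior_line` and the dictionary keys are hypothetical names, and the sample is a shortened version of the example line above.

```python
def parse_behavior_line(line):
    """Split one tab-separated behaviors line into its five fields."""
    impression_id, user_id, time_str, history, impressions = \
        line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # Each item looks like "N15368-1"; split on the last hyphen.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time_str,
        "history": history.split(),
        "candidates": candidates,
    }


sample = ("1\tU82271\t11/11/2019 3:28:58 PM\t"
          "N3130 N11621 N12917\t"
          "N13390-0 N15368-1 N481-0")
row = parse_behavior_line(sample)
clicked = [news_id for news_id, label in row["candidates"] if label == 1]
```

An empty click history simply yields an empty `history` list, since splitting an empty string on whitespace returns no items.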




## Global settings and imports

In [1]:

import sys
# Define the util module path
path = '/home/jupyter/mind/NRMS'
sys.path.append(path)
data_path = '/home/jupyter/data'


# Standard library
import os
import csv
import time
import logging
import datetime
import zipfile
from pathlib import Path
from tempfile import TemporaryDirectory

# Third-party packages
import numpy as np
from tqdm import tqdm
#import scrapbook as sb
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig, get_scheduler, AdamW

# Project modules (found through the path appended above)
import mind_model.utils as utils
from mind_model.dataloader import DataLoaderTrain, DataLoaderTest, test_iterator, train_iterator
from mind_model.preprocess import read_news, read_news_bert, get_doc_input, get_doc_input_bert
from mind_model.model_bert import ModelBert
from mind_model.parameters import parse_args
from mind_model.nrms import TwoTowerModel as TwoTower
from mind_model.metrics import roc_auc_score, ndcg_score, mrr_score, ctr_score, cal_metric

print("System version: {}".format(sys.version))
print("Transformers version: {}".format(transformers.__version__))
print("Tensorflow version: {}".format(tf.__version__))
print("Pytorch version: {}".format(torch.__version__))



System version: 3.7.10 | packaged by conda-forge | (default, Oct 13 2021, 20:51:14) 
[GCC 9.4.0]
Transformers version: 4.12.5
Tensorflow version: 2.7.0
Pytorch version: 1.9.0


## Prepare parameters

In [2]:
model_path = os.path.join(data_path, "model_chkpt")
os.makedirs(model_path, exist_ok=True)


class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.counter = 0
        self.best_loss = np.inf  # np.Inf was removed in NumPy 2.0

    def __call__(self, val_loss):
        """
        if you use other metrics where a higher value is better, e.g. accuracy,
        call this with its corresponding negative value
        """
        if val_loss < self.best_loss:
            early_stop = False
            get_better = True
            self.counter = 0
            self.best_loss = val_loss
        else:
            get_better = False
            self.counter += 1
            if self.counter >= self.patience:
                early_stop = True
            else:
                early_stop = False

        return early_stop, get_better

    
def run_eval(dev_dataloader, model, run_eval_metric=True, early_stopping=None):
    eval_metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']
    group_impr_indexes = []
    group_labels = []
    group_preds = []
    start_time = time.time()

    tqdm_util_eval = tqdm(enumerate(dev_dataloader))
    early_stop, get_better = False, False

    with torch.no_grad():    
        for cnt, (impression_ids, user_ids, log_vecs, log_mask, news_vecs, news_bias, labels) in tqdm_util_eval:

            #count = cnt
            if args.enable_gpu:
                #user_ids = user_ids.cuda(non_blocking=True)
                log_vecs = log_vecs.cuda(non_blocking=True)
                log_mask = log_mask.cuda(non_blocking=True)
            user_vecs = model.user_encoder(log_vecs, log_mask).to(torch.device("cpu")).detach().numpy()

            for impression_id, user_vec, news_vec, bias, label in zip(
                    impression_ids, user_vecs, news_vecs, news_bias, labels):

                score = np.dot(
                    news_vec, user_vec
                )
                group_impr_indexes.append(impression_id)
                group_labels.append(label)
                group_preds.append(score)
    if run_eval_metric:
        print("Print final metrics")
        eval_res = cal_metric(labels=group_labels, preds=group_preds, metrics=eval_metrics)
        eval_info = ", ".join(
                    [
                        str(item[0]) + ":" + str(item[1])
                        for item in sorted(eval_res.items(), key=lambda x: x[0])
                    ]
                )

        logging.info(f"Time taken: {time.time() - start_time :.0f}s")
        print(f"Eval info: {eval_info} and time taken: {time.time() - start_time:.0f}s")
        if early_stopping is not None:
            early_stop, get_better = early_stopping(-eval_res["group_auc"])
            
    else:
        print(f"Time taken: {time.time() - start_time}s")
    return group_impr_indexes, group_labels, group_preds, (early_stop, get_better)

#group_impr_indexes, group_labels, group_preds = run_eval(dev_dataloader, model)

def model_fit(model, optimizer, train_dataloader, valid_dir, args, tokenizer, lr_scheduler=None, early_stopping=None):
    logging.info('Training...')
    LOSS, ACC = [], []

    for ep in range(args.epochs):
        loss = 0.0
        accuracy = 0.0
        tqdm_util = tqdm(enumerate(train_dataloader))
        model_name = "bert"
        #series = 0
        model_entity = os.path.join(model_path, f"model_chkpt{datetime.datetime.utcnow().strftime('%Y%m%d')}{model_name}{ep}")
        torch.save(model, model_entity)
        model.train()
        for cnt, (user_ids, log_ids, log_mask, input_ids, targets) in tqdm_util:

            if cnt > args.max_steps_per_epoch and args.subsampling:
                break

            if args.enable_gpu:
                log_ids = log_ids.cuda(non_blocking=True)
                log_mask = log_mask.cuda(non_blocking=True)
                input_ids = input_ids.cuda(non_blocking=True)
                #user_ids = user_ids.cuda(non_blocking=True)
                targets = targets.cuda(non_blocking=True)

            bz_loss, y_hat = model(input_ids, log_ids, log_mask, targets)
            optimizer.zero_grad()
            bz_loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()
            # Only step the scheduler when one was actually passed in
            if args.enable_lr_scheduler and lr_scheduler is not None:
                lr_scheduler.step()

            loss += (bz_loss.data.float() - loss) / (cnt + 1)
            accuracy += (utils.acc(targets, y_hat) - accuracy) / (cnt + 1)
            if loss.isnan() or accuracy.isnan():
                break
            if cnt % args.log_steps == 0:
                #LOSS.append(loss.data)
                #ACC.append(accuracy)
                #logging.info(
                tqdm_util.set_description(
                    '[{}] Ed: {}, train_loss: {:.5f}, acc: {:.5f}'.format(
                        hvd_rank, cnt * args.batch_size, loss.data,
                        accuracy))

        print('epoch: {} loss: {:.5f} accuracy {:.5f}'.format(ep + 1, loss, accuracy))
        
        dev_load = test_iterator(model=model, valid_dir=valid_dir, args=args, tokenizer=tokenizer)
        dev_dataloader = dev_load["iterator"]  
        model.eval()
        # Pass early_stopping by keyword; positionally it would bind to run_eval_metric
        _, _, _, (early_stop, get_better) = run_eval(dev_dataloader, model, early_stopping=early_stopping)
        if get_better:
            try:
                print("Save better model")
                torch.save(
                    {
                        'model_state_dict': model.state_dict(),
                        'optimizer_state_dict': optimizer.state_dict(),
                        'step':cnt * args.batch_size,
                        # After an improvement, best_loss holds -group_auc
                        'early_stop_value': early_stopping.best_loss
                    }, os.path.join(model_path, f"{datetime.datetime.utcnow().strftime('%Y%m%d')}ckpt-{cnt * args.batch_size}.pth"))
            except OSError as error:
                print(f"OS error: {error}")
        if early_stop:
            print("Early stopping triggered")
            break
    return model
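
The docstring of `EarlyStopping` says to pass the negative of any metric where higher is better. The following self-contained sketch shows that convention with AUC. It uses a condensed, standard-library-only restatement of the class, and the AUC values are made up for illustration.

```python
class EarlyStopping:
    """Condensed restatement of the notebook's helper, stdlib only."""

    def __init__(self, patience=5):
        self.patience = patience
        self.counter = 0
        self.best_loss = float("inf")

    def __call__(self, val_loss):
        if val_loss < self.best_loss:
            # Improvement: reset the patience counter.
            self.counter = 0
            self.best_loss = val_loss
            return False, True          # (early_stop, get_better)
        self.counter += 1
        return self.counter >= self.patience, False


# Hypothetical validation AUCs from five consecutive evaluations.
aucs = [0.66, 0.67, 0.669, 0.668, 0.667]
stopper = EarlyStopping(patience=3)
history = [stopper(-auc) for auc in aucs]   # negate: higher AUC is better
```

Here the first two evaluations count as improvements, and the third consecutive evaluation without improvement sets `early_stop` to `True`.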



In [3]:
#args.clip_grad

In [4]:
utils.setuplogger()
args = parse_args()
args.enable_hvd = False
args.batch_size = 48
args.news_dim = 400
args.word_emb_size = 768
args.user_attributes = None


train_dir = os.path.join(data_path, 'MINDlarge_train')
train_news_file = os.path.join(train_dir, r'news.tsv')
train_behaviors_file = os.path.join(train_dir, r'behaviors.tsv')

valid_dir = os.path.join(data_path, 'MINDlarge_dev')
valid_news_file = os.path.join(valid_dir, r'news.tsv')
valid_behaviors_file = os.path.join(valid_dir, r'behaviors.tsv')



tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert_model = AutoModel.from_pretrained("bert-base-uncased",config=config)
bert_model = bert_model.cuda()

finetuneset={
'encoder.layer.10.attention.self.query.weight',
'encoder.layer.10.attention.self.query.bias',
'encoder.layer.10.attention.self.key.weight',
'encoder.layer.10.attention.self.key.bias',
'encoder.layer.10.attention.self.value.weight',
'encoder.layer.10.attention.self.value.bias',
'encoder.layer.10.attention.output.dense.weight',
'encoder.layer.10.attention.output.dense.bias',
'encoder.layer.10.attention.output.LayerNorm.weight',
'encoder.layer.10.attention.output.LayerNorm.bias',
'encoder.layer.10.intermediate.dense.weight',
'encoder.layer.10.intermediate.dense.bias',
'encoder.layer.10.output.dense.weight',
'encoder.layer.10.output.dense.bias',
'encoder.layer.10.output.LayerNorm.weight',
'encoder.layer.10.output.LayerNorm.bias',
'encoder.layer.11.attention.self.query.weight',
'encoder.layer.11.attention.self.query.bias',
'encoder.layer.11.attention.self.key.weight',
'encoder.layer.11.attention.self.key.bias',
'encoder.layer.11.attention.self.value.weight',
'encoder.layer.11.attention.self.value.bias',
'encoder.layer.11.attention.output.dense.weight',
'encoder.layer.11.attention.output.dense.bias',
'encoder.layer.11.attention.output.LayerNorm.weight',
'encoder.layer.11.attention.output.LayerNorm.bias',
'encoder.layer.11.intermediate.dense.weight',
'encoder.layer.11.intermediate.dense.bias',
'encoder.layer.11.output.dense.weight',
'encoder.layer.11.output.dense.bias',
'encoder.layer.11.output.LayerNorm.weight',
'encoder.layer.11.output.LayerNorm.bias',
'pooler.dense.weight',
'pooler.dense.bias',
'rel_pos_bias.weight',
'classifier.weight',
'classifier.bias'}
for name,param in bert_model.named_parameters():
    if name not in finetuneset:
        param.requires_grad = False

        


[INFO 2021-12-01 04:17:40,952] Namespace(batch_size=512, config_name='../Turing/unilm2-base-uncased-config.json', dataset='MIND', do_lower_case=True, drop_rate=0.2, embedding_source='random', enable_gpu=True, enable_hvd=True, epochs=4, f='/home/jupyter/.local/share/jupyter/runtime/kernel-e581a6b9-c751-44da-a76b-d951640bb674.json', filename_pat='behaviors*.tsv', filter_num=0, freeze_embedding=False, load_ckpt_name='epoch-1-40000.pt', log_steps=100, lr=0.0001, max_steps_per_epoch=200000, mode='test', model_dir='./model', model_name_or_path='../Turing/unilm2-base-uncased.bin', model_type='tnlrv3', news_attributes=['title'], news_dim=64, news_query_vector_dim=200, npratio=1, num_attention_heads=20, num_words_abstract=50, num_words_bing=26, num_words_body=50, num_words_title=24, num_words_uet=16, num_workers=2, padded_news_different_word_index=False, pretrain_news_encoder_path='.', process_bing=False, process_uet=False, root_data_dir='~/mind/', save_steps=500, shuffle_buffer_size=10000, tes

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
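
The hand-written `finetuneset` above lists every weight and bias in encoder layers 10 and 11, plus a few pooler and head parameters. As a sketch, the same names can be generated programmatically, which makes it easier to change which layers stay trainable. The helper `build_finetune_set` and the constant names are hypothetical.

```python
PER_LAYER_MODULES = [
    "attention.self.query", "attention.self.key", "attention.self.value",
    "attention.output.dense", "attention.output.LayerNorm",
    "intermediate.dense", "output.dense", "output.LayerNorm",
]
EXTRA_PARAMS = {
    "pooler.dense.weight", "pooler.dense.bias",
    "rel_pos_bias.weight", "classifier.weight", "classifier.bias",
}

def build_finetune_set(layers=(10, 11)):
    """Return the parameter names that should stay trainable."""
    names = set(EXTRA_PARAMS)
    for layer in layers:
        for module in PER_LAYER_MODULES:
            for kind in ("weight", "bias"):
                names.add(f"encoder.layer.{layer}.{module}.{kind}")
    return names


finetuneset = build_finetune_set()
```

The resulting set can be used with the same `named_parameters()` loop as in the cell above: any parameter whose name is not in the set gets `requires_grad = False`.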


## Train Model

In [5]:

hvd_size, hvd_rank, hvd_local_rank = utils.init_hvd_cuda(
    args.enable_hvd, args.enable_gpu)

##Load Train data 
train_load = train_iterator(train_dir=train_dir, args=args, tokenizer=tokenizer)
category_dict = train_load["category_dict"]
domain_dict = train_load["domain_dict"]
subcategory_dict = train_load["subcategory_dict"]
train_dataloader = train_load["iterator"]


model = TwoTower(args, bert_model, len(category_dict), len(domain_dict), len(subcategory_dict))
args.optimizer = 'Adam'
args.enable_lr_scheduler = False
args.epochs=1
args.max_steps_per_epoch = 8000
args.subsampling = True

if args.enable_gpu:
    model = model.cuda()
    
if args.optimizer == 'Adam':
    optimizer = optim.Adam(model.parameters(), lr=args.lr)
elif args.optimizer == 'AdamW':
    optimizer = AdamW(model.parameters(), lr=args.lr, weight_decay=args.weight_decay, correct_bias=args.correct_bias)
else:
    optimizer = AdamW(model.parameters(), lr=args.lr)
    

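if args.enable_lr_scheduler:
    # num_training_steps was undefined in the original cell; this assumes
    # every epoch runs for max_steps_per_epoch optimizer steps.
    num_training_steps = args.epochs * args.max_steps_per_epoch
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=num_training_steps
    )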

    
    

early_stopping = EarlyStopping()


101527it [00:15, 6610.56it/s]
100%|██████████| 101527/101527 [00:01<00:00, 98064.80it/s]
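
The cell above requests a `"linear"` schedule when `args.enable_lr_scheduler` is set. As a rough illustration of what a linear schedule with warm-up does, the sketch below computes the multiplier applied to the base learning rate at a given step: it ramps from 0 to 1 during warm-up, then decays linearly back to 0. The function `linear_schedule_factor` is a hypothetical stand-in written for this note, not the library's implementation, and the step counts are arbitrary.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warm-up: ramp from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: ramp from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-4
factors = [linear_schedule_factor(s, 100, 1000) for s in (0, 50, 100, 550, 1000)]
lrs = [base_lr * f for f in factors]
```

With 100 warm-up steps out of 1000, the learning rate peaks at step 100 and is back to half of the base rate by step 550.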


In [None]:
## Initial fit
args.epochs=1
args.max_steps_per_epoch = 10
args.subsampling = True
model = model_fit(model=model, optimizer=optimizer, 
          train_dataloader=train_dataloader, 
          valid_dir=valid_dir, args=args, 
          tokenizer=tokenizer, lr_scheduler=None,)

In [22]:
## Continue fit
test_model = torch.load(model_path + "/model_chkpt20211127bert")
optimizer = optim.Adam(test_model.parameters(), lr=0.00001)
early_stopping = EarlyStopping()



args.epochs = 1
args.max_steps_per_epoch = 20000
args.subsampling = True
test_model = model_fit(model=test_model, optimizer=optimizer, 
                       train_dataloader=train_dataloader, 
                       valid_dir=valid_dir, args=args, 
                       tokenizer=tokenizer, lr_scheduler=None, 
                       early_stopping=early_stopping)
# Print optimizer's state_dict
print("Finished!")
#for var_name in optimizer.state_dict():
#    print(var_name, "\t", optimizer.state_dict()[var_name])




[INFO 2021-11-29 08:09:19,847] Training...
[INFO 2021-11-29 08:09:19,849] DataLoader __iter__()
[INFO 2021-11-29 08:09:19,853] shut down pool.
get_files: /home/jupyter/data/MINDlarge_train, behaviors*.tsv
files: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']
[INFO 2021-11-29 08:09:19,858] worker_rank:0, worker_size:1, shuffle:True, seed:2, directory:/home/jupyter/data/MINDlarge_train, files:['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


0it [00:00, ?it/s]

[INFO 2021-11-29 08:09:19,861] data_paths: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


[0] Ed: 960000, train_loss: 0.50733, acc: 0.74781: : 20001it [6:48:39,  1.23s/it]


epoch: 1 loss: 0.50733 accuracy 0.74781


72023it [00:12, 5712.64it/s]
100%|██████████| 72023/72023 [00:00<00:00, 86142.68it/s]
  0%|          | 0/376 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 1/376 [00:00<01:18,  4.80it/s]



100%|██████████| 376/376 [00:28<00:00, 13.03it/s]


[INFO 2021-11-29 14:58:42,065] DataLoader __iter__()
get_files: /home/jupyter/data/MINDlarge_dev, behaviors*.tsv
files: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-11-29 14:58:42,071] worker_rank:0, worker_size:1, shuffle:False, seed:0, directory:/home/jupyter/data/MINDlarge_dev, files:['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']


0it [00:00, ?it/s]

[INFO 2021-11-29 14:58:42,074] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

7844it [17:22,  7.53it/s]


Print final metrics
[INFO 2021-11-29 15:21:13,075] Time taken: 1351s
Eval info: group_auc:0.6724, mean_mrr:0.3198, ndcg@10:0.417, ndcg@5:0.3523 and time taken: 1351s
[INFO 2021-11-29 15:21:13,221] DataLoader __iter__()
[INFO 2021-11-29 15:21:13,227] shut down pool.
get_files: /home/jupyter/data/MINDlarge_train, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']
[INFO 2021-11-29 15:21:13,234] worker_rank:0, worker_size:1, shuffle:True, seed:3, directory:/home/jupyter/data/MINDlarge_train, files:['/home/jupyter/data/MINDlarge_train/behaviors.tsv']
[INFO 2021-11-29 15:21:13,252] data_paths: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']

[0] Ed: 960000, train_loss: 0.49871, acc: 0.75381: : 20001it [6:48:14,  1.22s/it]


epoch: 2 loss: 0.49871 accuracy 0.75381


72023it [00:11, 6119.62it/s]
100%|██████████| 72023/72023 [00:00<00:00, 87562.81it/s]
  0%|          | 0/376 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 376/376 [00:29<00:00, 12.91it/s]


[INFO 2021-11-29 22:10:10,510] DataLoader __iter__()
get_files: /home/jupyter/data/MINDlarge_dev, behaviors*.tsv
files: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-11-29 22:10:10,516] worker_rank:0, worker_size:1, shuffle:False, seed:0, directory:/home/jupyter/data/MINDlarge_dev, files:['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']


0it [00:00, ?it/s]

[INFO 2021-11-29 22:10:10,519] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

7844it [18:28,  7.08it/s]


Print final metrics
[INFO 2021-11-29 22:33:56,422] Time taken: 1426s
Eval info: group_auc:0.669, mean_mrr:0.3154, ndcg@10:0.4124, ndcg@5:0.3461 and time taken: 1426s
Finished!
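
The evaluation lines above report `mean_mrr` and `ndcg@k` alongside `group_auc`. For readers unfamiliar with these ranking metrics, here is a pure-Python sketch of how MRR and nDCG@k are commonly computed for a single impression. It assumes the widespread formulation with gain `2^label - 1` and a `log2` position discount; the actual `mind_model.metrics` implementation may differ in details.

```python
import math

def _ranked_labels(labels, scores):
    """Labels reordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def mrr_score(labels, scores):
    """Reciprocal ranks of the clicked items, averaged over the clicks."""
    ranked = _ranked_labels(labels, scores)
    rr = sum(label / (rank + 1) for rank, label in enumerate(ranked))
    return rr / sum(labels)

def dcg_score(labels, scores, k):
    """Discounted cumulative gain over the top-k ranked items."""
    ranked = _ranked_labels(labels, scores)[:k]
    return sum((2 ** label - 1) / math.log2(rank + 2)
               for rank, label in enumerate(ranked))

def ndcg_score(labels, scores, k):
    """DCG normalised by the DCG of a perfect ranking."""
    return dcg_score(labels, scores, k) / dcg_score(labels, labels, k)


# One impression: four candidates, the second one was clicked.
labels = [0, 1, 0, 0]
scores = [0.9, 0.7, 0.2, 0.1]
mrr = mrr_score(labels, scores)
ndcg5 = ndcg_score(labels, scores, 5)
```

In this example the clicked item is ranked second, so its reciprocal rank is 1/2. The reported `mean_mrr` and `ndcg@k` values are averages of such per-impression scores over all impressions in the dev set.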


In [6]:
## Continue fit
test_model = torch.load(model_path + "/model_chkpt20211201bert1")
optimizer = optim.Adam(test_model.parameters(), lr=0.00001)
early_stopping = EarlyStopping()



args.epochs = 1
args.max_steps_per_epoch = 5000
args.subsampling = True
test_model = model_fit(model=test_model, optimizer=optimizer, 
                       train_dataloader=train_dataloader, 
                       valid_dir=valid_dir, args=args, 
                       tokenizer=tokenizer, lr_scheduler=None, 
                       early_stopping=early_stopping)
# Print optimizer's state_dict
print("Finished!")
#for var_name in optimizer.state_dict():
#    print(var_name, "\t", optimizer.state_dict()[var_name])



[INFO 2021-12-01 04:18:32,907] Training...
[INFO 2021-12-01 04:18:32,907] DataLoader __iter__()
get_files: /home/jupyter/data/MINDlarge_train, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']
[INFO 2021-12-01 04:18:32,911] worker_rank:0, worker_size:1, shuffle:True, seed:0, directory:/home/jupyter/data/MINDlarge_train, files:['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


2021-12-01 04:18:33.185907: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero


[INFO 2021-12-01 04:18:33,222] data_paths: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


2021-12-01 04:18:33.209356: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-01 04:18:33.209985: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-01 04:18:33.279489: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] Ed: 240000, train_loss: 0.49448, acc: 0.75585: : 5001it [1:43:24,  1.24s/it]


epoch: 1 loss: 0.49448 accuracy 0.75585


72023it [00:11, 6293.23it/s]
100%|██████████| 72023/72023 [00:00<00:00, 94590.01it/s]
  0%|          | 1/376 [00:00<01:06,  5.60it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|█████████▉| 375/376 [00:28<00:00, 13.09it/s]



100%|██████████| 376/376 [00:28<00:00, 13.07it/s]


[INFO 2021-12-01 06:02:38,463] DataLoader __iter__()
get_files: /home/jupyter/data/MINDlarge_dev, behaviors*.tsv
files: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-12-01 06:02:38,466] worker_rank:0, worker_size:1, shuffle:False, seed:0, directory:/home/jupyter/data/MINDlarge_dev, files:['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']


0it [00:00, ?it/s]

[INFO 2021-12-01 06:02:38,467] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-12-01 06:02:38,468] [StreamReader] path_len:1, paths: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']


7844it [17:51,  7.32it/s]


Print final metrics
[INFO 2021-12-01 06:25:34,326] Time taken: 1376s
Eval info: group_auc:0.6712, mean_mrr:0.32, ndcg@10:0.4168, ndcg@5:0.3518 and time taken: 1376s
Finished!
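
The `huggingface/tokenizers` warning in the log above names the `TOKENIZERS_PARALLELISM` environment variable as one way to silence it. A minimal sketch, assuming the variable is set in a cell that runs before the tokenizer is first used (choosing `false` here is an assumption; it trades tokenizer-level parallelism for fork safety in the data-loading workers):

```python
import os

# Set this before `tokenizers`/`transformers` is first used in the process,
# otherwise the warning can still appear once worker processes are forked.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```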


In [None]:
# Assumes early_stopping was created in an earlier cell; uncomment to re-create it.
#early_stopping = EarlyStopping()

args.epochs = 2
args.max_steps_per_epoch = 5000
args.subsampling = True
test_model = model_fit(model=test_model, optimizer=optimizer,
                       train_dataloader=train_dataloader,
                       valid_dir=valid_dir, args=args,
                       tokenizer=tokenizer, lr_scheduler=None,
                       early_stopping=early_stopping)
print("Finished!")
# Uncomment to print the optimizer's state_dict
#for var_name in optimizer.state_dict():
#    print(var_name, "\t", optimizer.state_dict()[var_name])




[INFO 2021-11-30 23:28:25,459] Training...
[INFO 2021-11-30 23:28:25,460] DataLoader __iter__()
[INFO 2021-11-30 23:28:25,464] shut down pool.
get_files: /home/jupyter/data/MINDlarge_train, behaviors*.tsv
files: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']
[INFO 2021-11-30 23:28:25,468] worker_rank:0, worker_size:1, shuffle:True, seed:1, directory:/home/jupyter/data/MINDlarge_train, files:['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


0it [00:00, ?it/s]

[INFO 2021-11-30 23:28:25,469] data_paths: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


[0] Ed: 240000, train_loss: 0.49696, acc: 0.75564: : 5001it [1:43:52,  1.25s/it]


epoch: 1 loss: 0.49696 accuracy 0.75564


72023it [00:11, 6157.79it/s]
100%|██████████| 72023/72023 [00:00<00:00, 89695.42it/s]
  0%|          | 0/376 [00:00<?, ?it/s]



100%|█████████▉| 375/376 [00:29<00:00, 13.03it/s]



100%|██████████| 376/376 [00:29<00:00, 12.82it/s]


[INFO 2021-12-01 01:13:00,054] DataLoader __iter__()
get_files: /home/jupyter/data/MINDlarge_dev, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-12-01 01:13:00,059] worker_rank:0, worker_size:1, shuffle:False, seed:0, directory:/home/jupyter/data/MINDlarge_dev, files:['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-12-01 01:13:00,059] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-12-01 01:13:00,060] [StreamReader] path_len:1, paths: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']


7844it [17:56,  7.29it/s]


Print final metrics
[INFO 2021-12-01 01:36:08,425] Time taken: 1388s
Eval info: group_auc:0.6639, mean_mrr:0.3155, ndcg@10:0.411, ndcg@5:0.3455 and time taken: 1388s
[INFO 2021-12-01 01:36:08,586] DataLoader __iter__()
[INFO 2021-12-01 01:36:08,591] shut down pool.
get_files: /home/jupyter/data/MINDlarge_train, behaviors*.tsv
files: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']
[INFO 2021-12-01 01:36:08,594] worker_rank:0, worker_size:1, shuffle:True, seed:2, directory:/home/jupyter/data/MINDlarge_train, files:['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


0it [00:00, ?it/s]

[INFO 2021-12-01 01:36:08,595] data_paths: ['/home/jupyter/data/MINDlarge_train/behaviors.tsv']


[0] Ed: 240000, train_loss: 0.49436, acc: 0.75674: : 5001it [1:43:23,  1.24s/it]


epoch: 2 loss: 0.49436 accuracy 0.75674


72023it [00:10, 6594.95it/s]
100%|██████████| 72023/72023 [00:00<00:00, 92668.14it/s]
  0%|          | 0/376 [00:00<?, ?it/s]



100%|█████████▉| 375/376 [00:28<00:00, 13.10it/s]



100%|██████████| 376/376 [00:29<00:00, 12.87it/s]


[INFO 2021-12-01 03:20:13,348] DataLoader __iter__()
get_files: /home/jupyter/data/MINDlarge_dev, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-12-01 03:20:13,356] worker_rank:0, worker_size:1, shuffle:False, seed:0, directory:/home/jupyter/data/MINDlarge_dev, files:['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']
[INFO 2021-12-01 03:20:13,356] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-12-01 03:20:13,357] [StreamReader] path_len:1, paths: ['/home/jupyter/data/MINDlarge_dev/behaviors.tsv']


2704it [05:46,  8.20it/s]
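
The training cell passes an `early_stopping` object to `model_fit`, but the `EarlyStopping` class itself is not shown in this part of the notebook. As a rough illustration of the general idea only, the sketch below stops once a validation metric has failed to improve for a fixed number of evaluations. The class name `PatienceEarlyStopping`, its `patience` and `min_delta` parameters, the `step` method, and the metric values are all hypothetical and may differ from what the notebook actually uses.

```python
class PatienceEarlyStopping:
    """Hypothetical helper: stop when the metric has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_rounds = 0

    def step(self, metric):
        """Record a new validation metric (higher is better); return True to stop."""
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


stopper = PatienceEarlyStopping(patience=2)
history = [0.60, 0.65, 0.64, 0.63]  # made-up validation scores, one per evaluation
decisions = [stopper.step(m) for m in history]
print(decisions)
```

With `patience=2`, the stop signal is raised on the second consecutive evaluation that fails to beat the best score seen so far.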

## Save the model

In [37]:
model.eval()
_, _, _, res = run_eval(dev_dataloader, model, run_eval_metric=True, early_stopping=early_stopping)

[INFO 2021-11-28 20:10:01,600] DataLoader __iter__()
[INFO 2021-11-28 20:10:01,602] shut down pool.
get_files: /datadrive/mind/mind/MINDsmall_dev, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/datadrive/mind/mind/MINDsmall_dev/behaviors.tsv']
[INFO 2021-11-28 20:10:01,606] worker_rank:0, worker_size:1, shuffle:False, seed:6, directory:/datadrive/mind/mind/MINDsmall_dev, files:['/datadrive/mind/mind/MINDsmall_dev/behaviors.tsv']
[INFO 2021-11-28 20:10:01,606] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-11-28 20:10:01,607] [StreamReader] path_len:1, paths: ['/datadrive/mind/mind/MINDsmall_dev/behaviors.tsv']


2286it [02:08, 17.82it/s]


Print final metrics
[INFO 2021-11-28 20:13:20,862] Time taken: 199s
Eval info: group_auc:0.6509, mean_mrr:0.3002, ndcg@10:0.393, ndcg@5:0.3271 and time taken: 199s
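
The `Eval info` line reports `group_auc`, `mean_mrr`, `ndcg@5` and `ndcg@10`. The body of `run_eval` is not shown here, so the sketch below is only one common per-impression formulation of these ranking metrics, written in plain Python; the function names and the sample labels and scores are illustrative assumptions. The reported numbers are presumably averages of such per-impression values over all impressions in the dev set.

```python
import math


def group_auc(labels, scores):
    """AUC for one impression: share of (clicked, non-clicked) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one clicked and one non-clicked candidate")
    # A tie between a clicked and a non-clicked candidate counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_labels(labels, scores):
    """Labels reordered so the highest-scoring candidate comes first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr(labels, scores):
    """Reciprocal rank averaged over the clicked candidates of one impression."""
    ranked = rank_labels(labels, scores)
    return sum(y / (pos + 1) for pos, y in enumerate(ranked)) / sum(labels)


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain of the top-k positions."""
    return sum((2 ** y - 1) / math.log2(pos + 2) for pos, y in enumerate(ranked_labels[:k]))


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    return dcg_at_k(rank_labels(labels, scores), k) / dcg_at_k(sorted(labels, reverse=True), k)


# Made-up impression: the clicked candidate (label 1) received the lowest score
labels = [1, 0, 0]
scores = [0.2, 0.9, 0.5]
auc_value = group_auc(labels, scores)
mrr_value = mrr(labels, scores)
ndcg5_value = ndcg_at_k(labels, scores, 5)
print(auc_value, mrr_value, ndcg5_value)
```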


In [21]:
model_path = os.path.join(data_path, "model_chkpt")
os.makedirs(model_path, exist_ok=True)
model_name = "bert"
series = "adam"
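# Checkpoint file name: "model_chkpt" + UTC date (YYYYMMDD) + model name + series
model_entity = os.path.join(model_path, f"model_chkpt{datetime.datetime.utcnow().strftime('%Y%m%d')}{model_name}{series}")
#model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))
# Pickles the entire model object (architecture and weights)
torch.save(model, model_entity)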
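
The cell above saves the whole model object with `torch.save(model, ...)`, which ties the checkpoint to the exact class definitions available when it was written. A commonly used alternative is to save only the `state_dict` and load it into a freshly constructed model. The sketch below illustrates that pattern with a small `nn.Linear` stand-in, since the notebook's model class is not shown here; the file name `model_state.pt` is also an assumption.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in model for illustration only; replace with the notebook's own model class.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "model_state.pt")
    # Save only the learned parameters and buffers
    torch.save(model.state_dict(), ckpt_path)

    # Rebuild the same architecture, then load the saved parameters into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(ckpt_path, map_location="cpu"))

restored.eval()  # switch to inference mode before evaluating
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```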



## Output Prediction File

In [9]:
test_dir = os.path.join(data_path, 'MINDsmall_test')
test_news_file = os.path.join(test_dir, r'news.tsv')
test_behaviors_file = os.path.join(test_dir, r'behaviors.tsv')

test_load = test_iterator(model=model, valid_dir=test_dir, args=args, tokenizer=tokenizer)
test_dataloader = test_load["iterator"]

72023it [00:11, 6076.04it/s]
100%|██████████| 72023/72023 [00:00<00:00, 138716.10it/s]
100%|██████████| 563/563 [00:50<00:00, 11.17it/s]


In [10]:
#group_impr_indexes, group_labels, group_preds = model.run_fast_eval(test_news_file, test_behaviors_file)

group_impr_indexes, group_labels, group_preds = run_eval(test_dataloader, model, run_eval_metric=False)

[INFO 2021-11-25 14:09:41,468] DataLoader __iter__()
get_files: /home/rain/data/MINDsmall_test, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/rain/data/MINDsmall_test/behaviors.tsv']
[INFO 2021-11-25 14:09:41,473] worker_rank:0, worker_size:1, shuffle:False, seed:0, directory:/home/rain/data/MINDsmall_test, files:['/home/rain/data/MINDsmall_test/behaviors.tsv']
[INFO 2021-11-25 14:09:41,474] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-11-25 14:09:41,475] [StreamReader] path_len:1, paths: ['/home/rain/data/MINDsmall_test/behaviors.tsv']


32it [00:02, 14.21it/s]

Time taken: 2.257753372192383s





In [11]:
group_impr_indexes1, group_labels1, group_preds1 = run_eval(test_dataloader, test_model, run_eval_metric=False)

[INFO 2021-11-25 14:09:43,732] DataLoader __iter__()
[INFO 2021-11-25 14:09:43,733] shut down pool.
get_files: /home/rain/data/MINDsmall_test, behaviors*.tsv


0it [00:00, ?it/s]

files: ['/home/rain/data/MINDsmall_test/behaviors.tsv']
[INFO 2021-11-25 14:09:43,737] worker_rank:0, worker_size:1, shuffle:False, seed:1, directory:/home/rain/data/MINDsmall_test, files:['/home/rain/data/MINDsmall_test/behaviors.tsv']
[INFO 2021-11-25 14:09:43,738] visible_devices:[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[INFO 2021-11-25 14:09:43,738] [StreamReader] path_len:1, paths: ['/home/rain/data/MINDsmall_test/behaviors.tsv']


32it [00:02, 13.40it/s]

Time taken: 2.394899606704712s





In [26]:

predfile = f'prediction{series}.txt'
zip_file = f'prediction{series}.zip'
with open(os.path.join(data_path, predfile), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        #impr_index += 1
        # Rank candidates by descending score: the highest-scoring candidate gets rank 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join(str(i) for i in pred_rank) + ']'
        # One line per impression: "<impression index> [rank_1,rank_2,...]"
        f.write(' '.join([str(impr_index), pred_rank]) + '\n')

1000it [00:00, 33231.69it/s]
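
The expression `np.argsort(np.argsort(preds)[::-1]) + 1` in the cell above converts scores into ranks. The small worked example below, using made-up scores and a made-up impression index, shows the two steps separately: the inner `argsort` (reversed) lists candidate indices from best to worst, and the outer `argsort` inverts that ordering so each candidate receives its own rank.

```python
import numpy as np

preds = np.array([0.1, 0.9, 0.5])  # made-up scores for three candidates

order = np.argsort(preds)[::-1]  # candidate indices, best score first: [1, 2, 0]
ranks = np.argsort(order) + 1    # rank of each candidate, in original order: [3, 1, 2]

# One output line as written to the prediction file; 7 is a made-up impression index
line = f"{7} [{','.join(str(r) for r in ranks.tolist())}]"
print(line)  # 7 [3,1,2]
```

The candidate with score 0.9 receives rank 1 and the candidate with score 0.1 receives rank 3, while the list keeps the candidates in their original order.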


In [28]:
import zipfile

# Package the predictions as prediction.zip; inside the archive the file is named prediction.txt
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, predfile), arcname='prediction.txt')
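
Before uploading the archive, it can help to confirm that it contains `prediction.txt` and that every line holds a valid ranking. The following is a minimal sketch of such a check, not part of the original notebook: the helper name `check_prediction_zip` is hypothetical, and the demo runs on a throwaway archive with made-up rankings rather than on real predictions.

```python
import os
import tempfile
import zipfile


def check_prediction_zip(zip_path, member="prediction.txt"):
    """Return the number of impressions if every line is '<id> [r1,r2,...]' with a valid ranking."""
    count = 0
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            for raw in handle:
                impr_id, rank_text = raw.decode("utf-8").strip().split(" ", 1)
                ranks = [int(r) for r in rank_text.strip("[]").split(",")]
                # A valid ranking uses each of 1..n exactly once
                if sorted(ranks) != list(range(1, len(ranks) + 1)):
                    raise ValueError(f"impression {impr_id}: not a permutation of 1..{len(ranks)}")
                count += 1
    return count


# Demo on a temporary archive with made-up rankings
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "prediction.zip")
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("prediction.txt", "1 [3,1,2]\n2 [1,2]\n")
    n_checked = check_prediction_zip(demo_zip)

print(n_checked)  # 2
```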

## Reference
\[1\] Wu, Chuhan, et al. "Neural News Recommendation with Multi-Head Self-Attention." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). <br>
\[2\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] Pennington, Jeffrey, et al. "GloVe: Global Vectors for Word Representation." https://nlp.stanford.edu/projects/glove/