# Introduction

This repository is the implementation of [KRED: Knowledge-Aware Document Representation for News Recommendations](https://arxiv.org/abs/1910.11494) [1].


## Model description



KRED is a knowledge-enhanced framework that enriches a document embedding with knowledge information for multiple news recommendation tasks. The framework consists mainly of two parts: a representation enhancement part (left) and a multi-task training part (right).

![](./framework.PNG)

## Dataset description and download

The MIND dataset [2] is a large-scale English news dataset collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content, including title, abstract, body, category and entities. Each impression log contains the clicked events, the non-clicked events and the historical news click behaviors of the user before that impression.

For quicker training and evaluation, we sample the MINDdemo dataset of 5k users from the MINDsmall dataset. The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. To run experiments on MINDsmall or MINDlarge, change the download source by setting the MIND_type parameter to one of ['large', 'small', 'demo'].

MINDdemo_train is used for training, and MINDdemo_dev is used for evaluation. The training data and the evaluation data each consist of a news file and a behaviors file. You can find a more detailed data description in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).
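
To make the behaviors file format concrete, here is a minimal sketch of parsing one line of `behaviors.tsv`. It assumes the five tab-separated columns described in the MIND documentation (impression ID, user ID, time, click history, impressions) and that labels are present, as they are in the train and dev splits. The `Impression` type, the `parse_behaviors_line` helper and the sample line are illustrative and are not part of this repository.

```python
from typing import List, NamedTuple, Tuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    time: str
    history: List[str]                 # news IDs clicked before this impression
    candidates: List[Tuple[str, int]]  # (news ID, label), label 1 means clicked


def parse_behaviors_line(line: str) -> Impression:
    """Parse one tab-separated line of a MIND behaviors.tsv file."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # Each item looks like "N55689-1": news ID, dash, click label.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return Impression(impression_id, user_id, time, history.split(), candidates)


# Illustrative line in the documented format (not taken from the dataset files).
sample = "1\tU13740\t11/11/2019 9:05:58 AM\tN55189 N42782\tN55689-1 N35729-0"
parsed = parse_behaviors_line(sample)
print(parsed.history)
print(parsed.candidates)
```

A user with no click history has an empty fourth column, which this sketch turns into an empty list.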

In [None]:
!pip install -U sentence-transformers

In [None]:

import os
from utils.util import *
from train_test import *

MIND_type = 'small'  # one of 'large', 'small', 'demo'
data_path = "./data/"

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
knowledge_graph_file = os.path.join(data_path, 'kg/wikidata-graph', r'wikidata-graph.tsv')
entity_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'entity2vecd100.vec')
relation_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'relation2vecd100.vec')

mind_url, mind_train_dataset, mind_dev_dataset, _ = get_mind_data_set(MIND_type)


if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

kg_url = "https://kredkg.blob.core.windows.net/wikidatakg/"

if not os.path.exists(knowledge_graph_file):
    download_deeprec_resources(kg_url, os.path.join(data_path, 'kg'), "kg.zip")

100%|██████████| 17.0k/17.0k [00:03<00:00, 4.73kKB/s]
100%|██████████| 9.84k/9.84k [00:02<00:00, 3.72kKB/s]
100%|██████████| 3.22M/3.22M [34:50<00:00, 1.54kKB/s] 
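
Because the knowledge graph download can take a long time, it may be worth confirming that every expected file is in place before moving on. The following is a minimal sketch that reuses the paths defined above. The `missing_files` helper is hypothetical and is not provided by this repository.

```python
import os
from typing import Dict, List


def missing_files(paths: Dict[str, str]) -> List[str]:
    """Return the names whose path is not an existing, non-empty file."""
    missing = []
    for name, path in paths.items():
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


data_path = "./data/"
kg_dir = os.path.join(data_path, "kg/wikidata-graph")
expected = {
    "train_news": os.path.join(data_path, "train", "news.tsv"),
    "train_behaviors": os.path.join(data_path, "train", "behaviors.tsv"),
    "valid_news": os.path.join(data_path, "valid", "news.tsv"),
    "valid_behaviors": os.path.join(data_path, "valid", "behaviors.tsv"),
    "knowledge_graph": os.path.join(kg_dir, "wikidata-graph.tsv"),
    "entity_embedding": os.path.join(kg_dir, "entity2vecd100.vec"),
    "relation_embedding": os.path.join(kg_dir, "relation2vecd100.vec"),
}

# An empty list means all files were downloaded and unpacked.
print(missing_files(expected))
```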


## Load config

In [None]:
import sys
import os
#sys.path.append('')

import argparse
from parse_config import ConfigParser


parser = argparse.ArgumentParser(description='KRED')

# Placeholder for the `-f <connection file>` argument that Jupyter passes to the kernel
parser.add_argument('-f')


parser.add_argument('-c', '--config', default="./config.json", type=str,
                    help='config file path (default: ./config.json)')
parser.add_argument('-r', '--resume', default=None, type=str,
                    help='path to latest checkpoint (default: None)')
parser.add_argument('-d', '--device', default=None, type=str,
                    help='indices of GPUs to enable (default: all)')

config = ConfigParser.from_args(parser)




SystemExit: 2
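
`argparse` exits with status 2 when it meets an argument it does not recognize. Inside a notebook the kernel is started with a `-f <connection file>` argument, so a parser that does not declare `-f` can fail this way. The sketch below reproduces both cases with plain `argparse`. The `build_parser` helper and the connection file path are made up for illustration, and the sketch leaves out the repository's `ConfigParser`.

```python
import argparse


def build_parser(accept_kernel_flag: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="KRED")
    if accept_kernel_flag:
        # Placeholder that swallows Jupyter's "-f <connection file>" argument.
        parser.add_argument("-f")
    parser.add_argument("-c", "--config", default="./config.json", type=str)
    return parser


# Hypothetical argument list resembling what a notebook kernel receives.
kernel_argv = ["-f", "/tmp/kernel-1234.json"]

try:
    build_parser(accept_kernel_flag=False).parse_args(kernel_argv)
    exit_code = None
except SystemExit as exc:
    exit_code = exc.code  # argparse reports usage errors with status 2

args = build_parser(accept_kernel_flag=True).parse_args(kernel_argv)
print(exit_code, args.config)
```

An alternative that avoids the placeholder is `parser.parse_known_args()`, which returns unrecognized arguments instead of exiting. Whether `ConfigParser.from_args` can be adapted to use it depends on its implementation.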



## Create hyper-parameters

In [None]:
epochs = 5
batch_size = 64
train_type = "single_task"
task = "user2item" # task must be one of: user2item, item2item, vert_classify, pop_predict

config['trainer']['epochs'] = epochs
config['data_loader']['batch_size'] = batch_size
config['trainer']['training_type'] = train_type
config['trainer']['task'] = task

#config['data']['knowledge_graph'] 
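
A typo in the task name would otherwise surface only once data processing or training has started. As a hedged sketch, the settings can be checked up front. The task names come from the comment above, while the `check_settings` helper is hypothetical and is not part of this repository.

```python
VALID_TASKS = ("user2item", "item2item", "vert_classify", "pop_predict")


def check_settings(task: str, epochs: int, batch_size: int) -> None:
    """Raise ValueError early instead of failing in the middle of training."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    if epochs < 1 or batch_size < 1:
        raise ValueError("epochs and batch_size must be positive integers")


check_settings(task="user2item", epochs=5, batch_size=64)  # passes silently
```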



## Process dataset

Since the MIND dataset does not contain users' location information, we cannot use the local news task.


In [None]:
data = load_data_mind(config)

constructing embedding ... 
constructing adjacency matrix ... 
constructing news features ... 
constructing user2item dataset ... 
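
The "constructing adjacency matrix" step turns the knowledge graph into per-entity neighbor lists. The following is only a sketch of that idea and not the repository's loader. It assumes each line of the graph file is a tab-separated `head`, `relation`, `tail` triple, and both the `build_adjacency` helper and the `max_neighbors` cap are hypothetical. Check the actual file format and the code behind `load_data_mind` before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_adjacency(lines: Iterable[str],
                    max_neighbors: int = 20) -> Dict[str, List[Tuple[str, str]]]:
    """Map each head entity to at most max_neighbors (relation, tail) pairs."""
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip rows that do not match the assumed triple format
        head, relation, tail = parts
        if len(adjacency[head]) < max_neighbors:
            adjacency[head].append((relation, tail))
    return dict(adjacency)


# Toy triples with made-up identifiers.
toy_graph = ["Q1\tP31\tQ5", "Q1\tP27\tQ30", "Q2\tP31\tQ5", "malformed row"]
print(build_adjacency(toy_graph, max_neighbors=1))
```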


## Train the KRED model

In [None]:
if train_type == "single_task":
    single_task_training(config, data)
else:
    multi_task_training(config, data)

at epoch 1
train info: logloss loss:1.49154
eval info: auc:0.5789
at epoch 2
train info: logloss loss:1.41551
eval info: auc:0.607
at epoch 3
train info: logloss loss:1.36690
eval info: auc:0.6199
at epoch 4
train info: logloss loss:1.34379
eval info: auc:0.6329
at epoch 5
train info: logloss loss:1.31986
eval info: auc:0.6272
{eval info: 'auc': 0.6272, 'ndcg@10': 0.376}


## Evaluate the KRED model

In [None]:
test_data = data[-1]
testing(test_data, config)

{eval info: 'auc': 0.6272, 'ndcg@10': 0.376}
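
For readers unfamiliar with the two reported metrics, here is a minimal pure-Python sketch of AUC and NDCG@10 for a single impression with binary click labels. It illustrates the standard definitions only. How the repository computes and aggregates these numbers, for example per impression or over all samples, is not shown here and may differ.

```python
import math
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int = 10) -> float:
    """DCG of the top-k items by score, divided by the DCG of an ideal ranking."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(l / math.log2(rank + 2) for rank, l in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy impression: two clicked and two non-clicked candidates.
labels = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.5]
print(auc(labels, scores))  # 3 of the 4 clicked/non-clicked pairs are ordered correctly: 0.75
print(ndcg_at_k(labels, scores, k=10))
```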


## Performance on MINDlarge

We tested the performance on the MINDlarge dev dataset for your reference:

| Models | AUC | NDCG@10 |
| :------- | :------- | :------- |
| KRED (single-task training) | 0.6702 | 0.4018 |
| KRED (multi-task training) | 0.6731 | 0.4039 |


## References

[1] Liu, Danyang, et al. "KRED: Knowledge-Aware Document Representation for News Recommendations." Fourteenth ACM Conference on Recommender Systems. 2020.

[2] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.