# Introduction

This repository is the implementation of [KRED: Knowledge-Aware Document Representation for News Recommendations](https://arxiv.org/abs/1910.11494) [1].


## Model description



KRED is a knowledge-enhanced framework that enriches a document embedding with knowledge information for multiple news recommendation tasks. The framework consists of two main parts: a representation enhancement part (left) and a multi-task training part (right).

![](./framework.PNG)

## Dataset description and download

The MIND dataset [2] is a large-scale English news dataset collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content, including title, abstract, body, category and entities. Each impression log contains the click events, the non-clicked events and the historical news click behaviors of the user before the impression.

For quicker training and evaluation, we sample a MINDdemo dataset of 5k users from the MINDsmall dataset. The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. To run experiments on MINDsmall or MINDlarge, change the download source by setting the MIND_type parameter to one of ['large', 'small', 'demo'].

MINDdemo_train is used for training, and MINDdemo_dev is used for evaluation. The training data and the evaluation data each consist of a news file and a behaviors file. A more detailed data description is available in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).
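To make the behaviors file more concrete, here is a minimal sketch of parsing one of its lines. It assumes the five tab-separated columns described in the MIND documentation (impression ID, user ID, time, click history, impressions), with each impression entry written as `<news id>-<label>`. The `Impression` type, the `parse_behavior_line` helper and the sample row are hypothetical and are not part of this repository.

```python
from typing import List, NamedTuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    time: str
    history: List[str]      # news IDs the user clicked before this impression
    clicked: List[str]      # candidate news labelled 1
    not_clicked: List[str]  # candidate news labelled 0


def parse_behavior_line(line: str) -> Impression:
    """Parse one tab-separated line of a behaviors.tsv file (assumed layout)."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked, not_clicked = [], []
    for item in impressions.split():
        # Each entry looks like "N10-1": news ID, dash, click label.
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else not_clicked).append(news_id)
    return Impression(impression_id, user_id, time, history.split(), clicked, not_clicked)


# Made-up example row, not taken from the dataset.
sample = "1\tU100\t11/15/2019 8:30:00 AM\tN1 N2 N3\tN10-1 N11-0 N12-0"
parsed = parse_behavior_line(sample)
print(parsed.history, parsed.clicked, parsed.not_clicked)
```

The clicked and non-clicked candidates are split apart here because the recommendation tasks treat them as positive and negative examples for the same user history.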

In [1]:

import torch

print(f"Is CUDA supported by this system? {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

# The device queries below raise an error on machines without CUDA, so guard them.
if torch.cuda.is_available():
    # Store the ID of the current CUDA device and reuse it.
    cuda_id = torch.cuda.current_device()
    print(f"ID of current CUDA device: {cuda_id}")
    print(f"Name of current CUDA device:{torch.cuda.get_device_name(cuda_id)}")

Is CUDA supported by this system? True
CUDA version: 11.6
ID of current CUDA device: 0
Name of current CUDA device:NVIDIA RTX A4000


In [2]:
%pip install -U ipykernel 
%pip install -U sentence-transformers
%pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

Collecting ipykernel
  Downloading ipykernel-6.19.2-py3-none-any.whl (145 kB)
Collecting traitlets>=5.4.0
  Downloading traitlets-5.7.1-py3-none-any.whl (109 kB)
Collecting comm>=0.1.1
  Downloading comm-0.1.2-py3-none-any.whl (6.5 kB)
Installing collected packages: traitlets, comm, ipykernel
  Attempting uninstall: traitlets
    Found existing installation: traitlets 5.3.0
    Uninstalling traitlets-5.3.0:
      Successfully uninstalled traitlets-5.3.0
  Attempting uninstall: ipykernel
    Found existing installation: ipykernel 6.15.1
    Uninstalling ipykernel-6.15.1:
      Successfully uninstalled ipykernel-6.15.1
Successfully installed comm-0.1.2 ipykernel-6.19.2 traitlets-5.7.1
Note: you may need to restart the kernel to 

In [None]:
/datasets/mind_kg

In [2]:
#from .autonotebook import tqdm as notebook_tqdm
import os
from utils.util import *
from train_test import *

MIND_type = 'small'
data_path = "/datasets/"

train_news_file = os.path.join(data_path, 'mind_train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'mind_train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
knowledge_graph_file = os.path.join(data_path, 'kg/wikidata-graph', r'wikidata-graph.tsv')
entity_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'entity2vecd100.vec')
relation_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'relation2vecd100.vec')

mind_url, mind_train_dataset, mind_dev_dataset, _ = get_mind_data_set(MIND_type)


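# Download into the same 'mind_train' folder that train_news_file points to;
# otherwise the existence check never finds the downloaded file.
if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'mind_train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)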

kg_url = "https://kredkg.blob.core.windows.net/wikidatakg/"

if not os.path.exists(knowledge_graph_file):
    download_deeprec_resources(kg_url, os.path.join(data_path, 'kg'), "kg.zip")

In [None]:
import tqdm as notebook_tqdm

## Load config

In [6]:
import sys
import os
#sys.path.append('')

import argparse
from parse_config import ConfigParser


parser = argparse.ArgumentParser(description='KRED')




parser.add_argument('-c', '--config', default="./config.json", type=str,
                    help='config file path (default: ./config.json)')
parser.add_argument('-r', '--resume', default=None, type=str,
                    help='path to latest checkpoint (default: None)')
parser.add_argument('-d', '--device', default=None, type=str,
                    help='indices of GPUs to enable (default: all)')

parser.add_argument("-f", "--fff", help="a dummy argument to fool ipython", default="1")
parser.add_argument("-i", "--ip", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-s", "--stdin", help="a dummy argument to fool ipython", default="1")
parser.add_argument("-cc", "--control", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-b", "--hb", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-K", "--Session.key", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-S", "--Session.signature_scheme", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-l", "--shell", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-t", "--transport", help="a dummy argument to fool ipython", default="1")
parser.add_argument(
            "-o", "--iopub", help="a dummy argument to fool ipython", default="1")
#print(parser)
#args, unknown = parser.parse_known_args()
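config = ConfigParser.from_args(parser)
# Build the config once and reuse it: a second from_args(parser) call may try to
# create the same run directory again and fail with FileExistsError.
print(config['trainer'])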

FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'out\\saved\\models\\KRED\\1125_173321'
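The dummy arguments in the cell above exist so that argparse does not reject the extra flags Jupyter passes to the kernel. The commented-out `parse_known_args` line hints at an alternative, sketched below with the standard library only. The simulated `argv` list and the kernel file path are made up for illustration, and whether `ConfigParser.from_args` can work with pre-parsed arguments is not shown in this notebook, so treat this as a demonstration of argparse itself rather than a drop-in replacement.

```python
import argparse

parser = argparse.ArgumentParser(description="KRED")
parser.add_argument("-c", "--config", default="./config.json", type=str)
parser.add_argument("-r", "--resume", default=None, type=str)
parser.add_argument("-d", "--device", default=None, type=str)

# Simulated command line: one declared flag plus a kernel-style flag
# that the parser does not declare (hypothetical path).
argv = ["-c", "my_config.json", "-f", "/tmp/kernel-1234.json"]

# parse_known_args returns the parsed namespace and the list of leftovers
# instead of exiting with an "unrecognized arguments" error.
args, unknown = parser.parse_known_args(argv)
print(args.config)
print(unknown)
```

With this approach the parser only declares the options the project actually uses, and anything injected by the notebook environment ends up in `unknown`.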

## Create hyper-parameters

In [13]:
epochs = 5
batch_size = 64
train_type = "single_task"
task = "user2item" # task should be within: user2item, item2item, vert_classify, pop_predict

config['trainer']['epochs'] = epochs
config['data_loader']['batch_size'] = batch_size
config['trainer']['training_type'] = train_type
config['trainer']['task'] = task

#config['data']['knowledge_graph'] 

In [14]:
config['trainer']['task']

'user2item'
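Because the hyper-parameters are written straight into the config, a typo in the task name would only surface later during training. The following is a minimal sketch of an early check, assuming the four task names listed in the comment of the hyper-parameter cell are the only valid ones. The `check_hyper_parameters` helper is hypothetical and is not part of this repository.

```python
# Task names taken from the comment in the hyper-parameter cell.
VALID_TASKS = ("user2item", "item2item", "vert_classify", "pop_predict")


def check_hyper_parameters(epochs: int, batch_size: int, task: str) -> None:
    """Raise ValueError for values that cannot describe a valid run."""
    if epochs < 1:
        raise ValueError(f"epochs must be >= 1, got {epochs}")
    if batch_size < 1:
        raise ValueError(f"batch_size must be >= 1, got {batch_size}")
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")


# The values used in this notebook pass without raising.
check_hyper_parameters(epochs=5, batch_size=64, task="user2item")
```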

## Process dataset

Since the MIND dataset does not contain users' location information, we cannot use local news 


In [15]:
data = load_data_mind(config)

constructing adjacency matrix ...
Load pretrained SentenceTransformer: distilbert-base-nli-stsb-mean-tokens




Use pytorch device: cpu



## Train the KRED model

In [None]:
print(1)

In [None]:
if train_type == "single_task":
    single_task_training(config, data)
else:
    multi_task_training(config, data)

## Evaluate the KRED model

In [None]:
test_data = data[-1]
testing(test_data, config)

## Performance on MINDlarge

We tested the performance on the MINDlarge dev dataset for your reference:

| Models | AUC | NDCG@10 |
| :------- | :------- | :------- |
| KRED (single-task training) | 0.6702 | 0.4018 |
| KRED (multi-task training) | 0.6731 | 0.4039 |
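For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how AUC and NDCG@10 can be computed for a single impression with binary click labels. The notebook does not show how the repository computes them, so the function names, the tie handling and the per-impression setup are assumptions made for illustration.

```python
import math
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one clicked and one non-clicked item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int = 10) -> float:
    """DCG of the top-k items by score, divided by the DCG of the ideal ordering."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order[:k]))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up impression: four candidates, two of them clicked.
labels = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.5]
print(auc(labels, scores))        # 3 of the 4 clicked/non-clicked pairs are ordered correctly
print(ndcg_at_k(labels, scores))
```

A dataset-level number would then typically be the average of these per-impression values, which is again an assumption about the evaluation protocol rather than something stated here.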


## References

[1] Liu, Danyang, et al. "KRED: Knowledge-Aware Document Representation for News Recommendations." Fourteenth ACM Conference on Recommender Systems. 2020.

[2] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.