<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# DKN : Deep Knowledge-Aware Network for News Recommendation

DKN \[1\] is a deep learning model that incorporates information from a knowledge graph for better news recommendation. Specifically, DKN uses the TransX \[2\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. 

## Properties of DKN:

- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. 
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.


## Data format

DKN takes several files as input as follows:

- **training / validation / test files**: each line in these files represents one instance. `impressionid` is used to evaluate performance within an impression session, so it is needed only for evaluation; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 

- **user history file**: each line in this file represents one user's click history. Set the `history_size` parameter in the config file to the maximum number of clicks to keep per user. If a user has more than `history_size` clicks, only the last `history_size` are kept; if a user has fewer, the history is automatically padded with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 

- **document feature file**: it contains the word and entity features of news articles. News articles are represented by aligned title words and title entities. For example, for the news title <i>"Trump to deliver State of the Union address next week"</i>, the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entity value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, because only the word "Trump" maps to an entity. Word and entity values are hashed to integers from 1 to `n` (where `n` is the number of distinct words or entities). Each feature has a fixed length k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad with 0 at the end. 
The format is: <br> 
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`

- **word embedding / entity embedding / context embedding files**: these are `*.npy` files of pretrained embeddings. After loading, each file is a two-dimensional `[n+1,k]` matrix, where n is the number of words (or entities) in the hash dictionary and k is the embedding dimension. Row 0 is reserved for zero padding. 
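
As a minimal sketch of the text formats above, the following shows how an instance line splits into its fields and how a click history or a feature list could be fixed to a set length. The helper names `parse_instance`, `pad_history` and `pad_doc` are hypothetical and not part of the `recommenders` API, and padding the history at the end is an assumption (the description above only says that it is padded with 0).

```python
def parse_instance(line):
    """Split '[label] [userid] [CandidateNews]%[impressionid]' into fields."""
    label, userid, rest = line.strip().split(" ")
    # The impression ID only matters for evaluation; default to "0" if absent.
    news_id, _, impression_id = rest.partition("%")
    return int(label), userid, news_id, impression_id or "0"


def pad_history(clicks, history_size, pad="0"):
    """Keep the last `history_size` clicks; pad with `pad` if there are fewer.

    Assumption: padding is appended at the end of the list.
    """
    clicks = clicks[-history_size:]
    return clicks + [pad] * (history_size - len(clicks))


def pad_doc(feature_ids, doc_size):
    """Truncate to `doc_size` IDs, or pad with 0 at the end."""
    feature_ids = feature_ids[:doc_size]
    return feature_ids + [0] * (doc_size - len(feature_ids))


print(parse_instance("1 train_U1 N1%0"))
print(pad_history(["N1", "N2"], history_size=4))
print(pad_doc([45], doc_size=10))
```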

In this experiment, we used GloVe\[4\] vectors to initialize the word embeddings. We trained the entity embeddings using TransE\[2\] on the knowledge graph; the context embedding of an entity is the average of the embeddings of its neighbors in the knowledge graph.<br>
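
To illustrate the `[n+1,k]` layout and the neighbor-averaged context embedding, here is a small NumPy sketch on toy data. The adjacency dictionary `neighbors` and the random vectors are made up for illustration; real entity vectors would come from TransE training.

```python
import numpy as np

n, k = 4, 3  # toy sizes: 4 entities, 3-dimensional embeddings
rng = np.random.default_rng(0)

# Row 0 is reserved for padding, so the matrix has n + 1 rows.
entity_emb = np.zeros((n + 1, k), dtype=np.float32)
entity_emb[1:] = rng.normal(size=(n, k))

# Hypothetical adjacency: entity index -> indices of its graph neighbors.
neighbors = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}

context_emb = np.zeros_like(entity_emb)
for entity, nbrs in neighbors.items():
    # Context embedding = mean of the neighbors' entity embeddings.
    context_emb[entity] = entity_emb[nbrs].mean(axis=0)

# A padded entity ID list looks up all-zero rows wherever the ID is 0.
title_entities = [1, 0, 0]
looked_up = entity_emb[title_entities]
print(looked_up.shape)
```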

## MIND dataset

The MIND dataset\[3\] is a large-scale English news dataset. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-click events and historical news click behaviors of the user before the impression.

A smaller version, [MIND-small](https://azure.microsoft.com/en-us/services/open-datasets/catalog/microsoft-news-dataset/), was built by randomly sampling 50,000 users and their behavior logs from the MIND dataset.

The dataset contains these files for both training and validation data:

#### behaviors.tsv

The behaviors.tsv file contains the impression logs and users' news click histories. It has 5 columns separated by tabs:

+ Impression ID. The ID of an impression.
+ User ID. The anonymous ID of a user.
+ Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
+ History. The news click history (ID list of clicked news) of this user before this impression.
+ Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).

One simple example: 

`1    U82271    11/11/2019 3:28:58 PM    N3130 N11621 N12917 N4574 N12140 N9748    N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0 `
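
As a rough sketch of how one such line breaks down, the snippet below parses a shortened version of the example above. The function name `parse_behavior_line` and the dictionary keys are illustrative choices, not part of the dataset tooling.

```python
def parse_behavior_line(line):
    """Parse one tab-separated behaviors.tsv line into a dictionary."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked, non_clicked = [], []
    for item in impressions.split():
        # Each impression item is "<news_id>-<label>", label 1 = click.
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else non_clicked).append(news_id)
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),  # empty list when there is no history
        "clicked": clicked,
        "non_clicked": non_clicked,
    }


# Shortened version of the example line (most impressions omitted).
sample = "1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621\tN13390-0 N15368-1 N481-0"
parsed = parse_behavior_line(sample)
print(parsed["clicked"], parsed["non_clicked"])
```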

#### news.tsv

The news.tsv file contains detailed information about the news articles that appear in the behaviors.tsv file. It has 8 columns separated by tabs:

+ News ID
+ Category
+ SubCategory
+ Title
+ Abstract
+ URL
+ Title Entities (entities contained in the title of this news)
+ Abstract Entities (entities contained in the abstract of this news)

One simple example: 

`N46466    lifestyle    lifestyleroyals    The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By    Shop the notebooks, jackets, and more that the royals can't live without.    https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata    [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]    [] `
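
The entity columns are JSON lists, so they can be decoded with the standard `json` module. The following is a minimal sketch on a shortened version of the example row (one title entity kept, URL abbreviated); the column names in `NEWS_COLUMNS` and the function name are assumptions made for this illustration.

```python
import json

# Assumed column names, in the order listed above.
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def parse_news_line(line):
    """Parse one tab-separated news.tsv line and decode the entity columns."""
    row = dict(zip(NEWS_COLUMNS, line.rstrip("\n").split("\t")))
    for col in ("title_entities", "abstract_entities"):
        # An empty field is treated as "no entities".
        row[col] = json.loads(row[col]) if row[col] else []
    return row


sample = "\t".join([
    "N46466", "lifestyle", "lifestyleroyals",
    "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "Shop the notebooks, jackets, and more that the royals can't live without.",
    "https://www.msn.com/",  # abbreviated URL for the sketch
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]',
    "[]",
])
news = parse_news_line(sample)
entity_ids = [e["WikidataId"] for e in news["title_entities"]]
print(entity_ids)
```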

#### entity_embedding.vec & relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations learned from the subgraph (from the WikiData knowledge graph) by the TransE method. In both files, the first column is the ID of the entity or relation, and the remaining columns are the embedding vector values.

One simple example: 

`Q42306013  0.014516 -0.106958 0.024590 ... -0.080382`
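
A file in this layout can be read line by line into a dictionary of vectors. The sketch below is one possible loader, not the one used by `recommenders`; `load_vec` is a hypothetical name, and the sample keeps only the four values visible in the example instead of all 100 dimensions.

```python
import io

import numpy as np


def load_vec(handle):
    """Read '<id> <v1> <v2> ...' lines into a dict of float32 vectors."""
    embeddings = {}
    for line in handle:
        parts = line.split()  # splits on tabs or spaces
        if not parts:
            continue  # skip blank lines
        embeddings[parts[0]] = np.array([float(v) for v in parts[1:]], dtype=np.float32)
    return embeddings


# Toy input: the four values shown in the example, not a full 100-dim vector.
sample_file = io.StringIO("Q42306013  0.014516 -0.106958 0.024590 -0.080382\n")
vectors = load_vec(sample_file)
print(vectors["Q42306013"].shape)
```

With a real file, the same function could be called as `load_vec(open(path, encoding="utf-8"))`.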


## DKN architecture

The following figure shows the architecture of DKN.

![](https://recodatasets.z20.web.core.windows.net/images/dkn_architecture.png)

DKN takes one piece of candidate news and one piece of a user’s clicked news as input. For each piece of news, a specially designed KCNN is used to process its title and generate an embedding vector. KCNN is an extension of traditional CNN that allows flexibility in incorporating symbolic knowledge from a knowledge graph into sentence representation learning. 

With the KCNN, we obtain a set of embedding vectors for a user's click history. To get the final embedding of the user with respect to the current candidate news, we use an attention-based method to automatically match the candidate news to each piece of the user's clicked news, and aggregate the user's historical interests with different weights. The candidate news embedding and the user embedding are concatenated and fed into a deep neural network (DNN) to calculate the predicted probability that the user will click the candidate news.
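
The attention step can be sketched in NumPy as follows. This is a simplified illustration with random, untrained weights and a single hidden layer; the function name, layer sizes and weight names are assumptions, and the real model learns these parameters jointly with KCNN.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_user_embedding(clicked, candidate, w1, b1, w2):
    """Aggregate clicked-news vectors [history, d] with respect to a candidate [d]."""
    # Pair every clicked-news vector with the candidate vector.
    tiled = np.tile(candidate, (len(clicked), 1))
    pairs = np.concatenate([clicked, tiled], axis=1)
    hidden = np.maximum(pairs @ w1 + b1, 0.0)  # one ReLU hidden layer
    weights = softmax(hidden @ w2)             # one weight per clicked news
    return weights @ clicked, weights


rng = np.random.default_rng(0)
d, history, hidden_dim = 8, 5, 16  # toy sizes
clicked = rng.normal(size=(history, d))
candidate = rng.normal(size=d)
w1 = rng.normal(size=(2 * d, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=hidden_dim)

user, weights = attention_user_embedding(clicked, candidate, w1, b1, w2)
# The user and candidate embeddings are concatenated before the final DNN.
dnn_input = np.concatenate([user, candidate])
print(weights.round(3), dnn_input.shape)
```

Because the weights depend on the candidate, the same click history yields a different user embedding for each candidate news item.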

## Global settings and imports

In [1]:
import os
import sys
from tempfile import TemporaryDirectory
import logging

import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.datasets.download_utils import maybe_download
from recommenders.datasets.mind import (download_mind, 
                                     extract_mind, 
                                     read_clickhistory, 
                                     get_train_input, 
                                     get_valid_input, 
                                     get_user_history,
                                     get_words_and_entities,
                                     generate_embeddings) 
from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.models.dkn import DKN
from recommenders.models.deeprec.io.dkn_iterator import DKNTextIterator
from recommenders.utils.notebook_utils import store_metadata

print(f"System version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")


System version: 3.9.13 (main, Apr 27 2024, 15:35:35) 
[GCC 9.4.0]
Tensorflow version: 2.11.0


In [2]:
# Temp dir
tmpdir = TemporaryDirectory()


In [3]:
# DKN parameters
epochs = 10
history_size = 50
batch_size = 100


In [6]:
# Mind parameters
# MIND_SIZE = "small"


# Paths
# data_path = os.path.join(tmpdir.name, "mind-dkn")
# train_file = os.path.join(data_path, "train_mind.txt")
# valid_file = os.path.join(data_path, "valid_mind.txt")
# user_history_file = os.path.join(data_path, "user_history.txt")
# infer_embedding_file = os.path.join(data_path, "infer_embedding.txt")


## Data preparation

In this example, we walk through a real case of applying DKN to a raw news dataset from the very beginning. We will download a copy of the open-source MIND dataset in its original raw format, then process the raw data files into DKN's input data format described above. 

In [None]:
# train_zip, valid_zip = download_mind(size=MIND_SIZE, dest_path=data_path)
# train_path, valid_path = extract_mind(train_zip, valid_zip)

In [4]:
import pandas as pd
import random
import json

from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split

In [5]:
def read_behavior(df):  
    userid_history = {}
    sessions = []
    
    for userid, imp_time, click, imps in zip(
        df.user_id, 
        df.time,
        df.clicked_news,
        df.impressions
    ):
        clicks = click.split(" ")
        pos = []
        neg = []
        imps = imps.split(" ")
        for imp in imps:
            if imp.split("-")[1] == "1":
                pos.append(imp.split("-")[0])
            else:
                neg.append(imp.split("-")[0])
        userid_history[userid] = clicks
        sessions.append([userid, clicks, pos, neg])
    return sessions, userid_history

In [6]:
def read_testing_behavior(df):  
    userid_history = {}
    sessions = []
    
    for userid, imp_time, click, imps in zip(
        df.user_id, 
        df.time,
        df.clicked_news,
        df.impressions
    ):
        clicks = click.split(" ")
        pos = []
        neg = []
        imps = imps.split(" ")
        for imp in imps:
            # if imp.split("-")[1] == "1":
            pos.append(imp)
            # else:
            #     neg.append(imp.split("-")[0])
        userid_history[userid] = clicks
        sessions.append([userid, clicks, pos, neg])
    return sessions, userid_history

In [7]:
def _read_news(df, news_words, news_entities, tokenizer):
    for news_id, category, subcategory, title, abstract, URL, title_entities, abstract_entities in zip(
        df.news_id,
        df.category,
        df.subcategory,
        df.title,
        df.abstract,
        df.URL,
        df.title_entities,
        df.abstract_entities
    ):
        news_words[news_id] = tokenizer.tokenize(title.lower())
        news_entities[news_id] = []
        for entity in json.loads(title_entities):
            news_entities[news_id].append(
                (entity["SurfaceForms"], entity["WikidataId"])
            )
    return news_words, news_entities

In [8]:
def get_words_and_entities(df):
    """Load words and entities.

    Args:
        df (pandas.DataFrame): News data with the columns read by `_read_news`.

    Returns:
        dict, dict: Words and entities dictionaries.
    """
    news_words = {}
    news_entities = {}
    tokenizer = RegexpTokenizer(r"\w+")
    news_words, news_entities = _read_news(
        df, news_words, news_entities, tokenizer
    )
    # news_words, news_entities = _read_news(
    #     valid_new_df, news_words, news_entities, tokenizer
    # )
    return news_words, news_entities

In [9]:
def _newsample(nnn, ratio):
    if ratio > len(nnn):
        return random.sample(nnn * (ratio // len(nnn) + 1), ratio)
    else:
        return random.sample(nnn, ratio)
        
def get_train_input(session, train_file_path, npratio=4):
    """Generate train file.

    Args:
        session (list): List of user session with user_id, clicks, positive and negative interactions.
        train_file_path (str): Path to file.
        npratio (int): Number of negative samples drawn per positive instance.
    """
    fp_train = open(train_file_path, "w", encoding="utf-8")
    for sess_id in range(len(session)):
        sess = session[sess_id]
        userid, _, poss, negs = sess
        for i in range(len(poss)):
            pos = poss[i]
            neg = _newsample(negs, npratio)
            fp_train.write("1 " + "train_" + userid + " " + pos + "\n")
            for neg_ins in neg:
                fp_train.write("0 " + "train_" + userid + " " + neg_ins + "\n")
    fp_train.close()
    if os.path.isfile(train_file_path):
        print(f"Train file {train_file_path} successfully generated")
    else:
        raise FileNotFoundError(f"Error when generating {train_file_path}")

def get_val_input(session, train_file_path, npratio=4):
    """Generate validation file.

    Args:
        session (list): List of user session with user_id, clicks, positive and negative interactions.
        train_file_path (str): Path to the output file.
        npratio (int): Number of negative samples drawn per positive instance.
    """
    fp_train = open(train_file_path, "w", encoding="utf-8")
    for sess_id in range(len(session)):
        sess = session[sess_id]
        userid, _, poss, negs = sess
        for i in range(len(poss)):
            pos = poss[i]
            neg = _newsample(negs, npratio)
            fp_train.write("1 " + "valid_" + userid + " " + pos + "\n")
            for neg_ins in neg:
                fp_train.write("0 " + "valid_" + userid + " " + neg_ins + "\n")
    fp_train.close()
    if os.path.isfile(train_file_path):
        print(f"Train file {train_file_path} successfully generated")
    else:
        raise FileNotFoundError(f"Error when generating {train_file_path}")
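
The data format section notes that the impression ID is used to evaluate performance within an impression session, while the functions above write lines without the `%[impressionid]` suffix. As a hedged sketch, evaluation lines that carry an impression ID could be generated as shown below. The name `eval_lines` is hypothetical, and using the session index as the impression ID is an assumption.

```python
def eval_lines(sessions, prefix="valid_"):
    """Yield '[label] [userid] [CandidateNews]%[impressionid]' lines.

    Each session is [userid, clicks, positives, negatives], the structure
    returned by read_behavior. Assumption: the session index serves as the
    impression ID, and every displayed item is kept (no negative sampling).
    """
    for impression_id, (userid, _, poss, negs) in enumerate(sessions):
        for label, news_ids in (("1", poss), ("0", negs)):
            for news_id in news_ids:
                yield f"{label} {prefix}{userid} {news_id}%{impression_id}"


# Toy sessions for illustration only.
toy_sessions = [
    ["U1", ["N1"], ["N2"], ["N3", "N4"]],
    ["U2", ["N5"], ["N6"], ["N7"]],
]
lines = list(eval_lines(toy_sessions))
print("\n".join(lines))
```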

In [10]:
def get_test_input(session, train_file_path, npratio=4):
    """Generate test file.

    Args:
        session (list): List of user session with user_id, clicks, positive and negative interactions.
        train_file_path (str): Path to the output file.
        npratio (int): Unused, because negative sampling is disabled for test data.
    """
    fp_train = open(train_file_path, "w", encoding="utf-8")
    for sess_id in range(len(session)):
        sess = session[sess_id]
        userid, _, poss, negs = sess
        for i in range(len(poss)):
            pos = poss[i]
            # neg = _newsample(negs, npratio)
            fp_train.write("1 " + "test_" + userid + " " + pos + "\n")
            # for neg_ins in neg:
            #     fp_train.write("0 " + "test_" + userid + " " + neg_ins + "\n")
    fp_train.close()
    if os.path.isfile(train_file_path):
        print(f"Train file {train_file_path} successfully generated")
    else:
        raise FileNotFoundError(f"Error when generating {train_file_path}")

In [21]:
def get_user_history(train_history, valid_history, test_history, user_history_path):
    """Generate user history file.

    Args:
        train_history (dict): Train history, mapping user ID to clicked news IDs.
        valid_history (dict): Validation history.
        test_history (dict): Test history.
        user_history_path (str): Path to file.
    """
    fp_user_history = open(user_history_path, "w", encoding="utf-8")
    for userid in train_history:
        fp_user_history.write(
            "train_" + userid + " " + ",".join(train_history[userid]) + "\n"
        )
    for userid in valid_history:
        fp_user_history.write(
            "valid_" + userid + " " + ",".join(valid_history[userid]) + "\n"
        )
    for userid in test_history:
        fp_user_history.write(
            "test_" + userid + " " + ",".join(test_history[userid]) + "\n"
        )
    fp_user_history.close()
    if os.path.isfile(user_history_path):
        print(f"User history file {user_history_path} successfully generated")
    else:
        raise FileNotFoundError(f"Error when generating {user_history_path}")

In [71]:
df = pd.read_csv("./train/train_behaviors.tsv", sep='\t')
train_df, valid_df = train_test_split(df, test_size=0.2)
# train_df = train_df.iloc[:500]
# valid_df = valid_df.iloc[:500]

In [72]:
test_df = pd.read_csv("./test/test_behaviors.tsv", sep='\t')

In [73]:
train_session, train_history = read_behavior(train_df)
val_session, val_history = read_behavior(valid_df)

In [74]:
test_session, test_history = read_testing_behavior(test_df)

In [75]:
get_train_input(train_session, "./train/train.txt")
get_val_input(val_session, "./train/val.txt")

Train file ./train/train.txt successfully generated
Train file ./train/val.txt successfully generated


In [76]:
get_test_input(test_session, "./test/test.txt")

Train file ./test/test.txt successfully generated


In [77]:
get_user_history(train_history, val_history, test_history, "./train/user_history.txt")

User history file ./train/user_history.txt successfully generated


In [78]:
user_history_file = "./train/user_history.txt"

In [79]:
train_news = pd.read_csv("./train/train_news.tsv", sep='\t')
# display(train_news.query("title_entities.isnull()"))
train_news["title_entities"] = train_news["title_entities"].fillna("[]")
# train_news = train_news.query("~title_entities.isnull()")

In [80]:
test_news = pd.read_csv("./test/test_news.tsv", sep='\t')
test_news["title_entities"] = test_news["title_entities"].fillna("[]")

In [81]:
# train_news = pd.read_csv("./train/train_news.tsv", sep='\t')

train_news_df, valid_news_df = train_test_split(train_news, test_size=0.2)
news_words, news_entities = get_words_and_entities(pd.concat([train_news_df, valid_news_df, test_news]))

In [82]:
train_entities = os.path.join("./train/train_entity_embedding.vec")
valid_entities = os.path.join("./train/train_entity_embedding.vec")
news_feature_file, word_embeddings_file, entity_embeddings_file = generate_embeddings(
    "./train",
    news_words,
    news_entities,
    train_entities,
    valid_entities,
    max_sentence=10,
    word_embedding_dim=100,
)

In [83]:
yaml_file = maybe_download(url="https://recodatasets.z20.web.core.windows.net/deeprec/deeprec/dkn/dkn_MINDsmall.yaml", 
                           work_directory="./train")
hparams = prepare_hparams(yaml_file,
                          news_feature_file=news_feature_file,
                          user_history_file=user_history_file,
                          wordEmb_file=word_embeddings_file,
                          entityEmb_file=entity_embeddings_file,
                          epochs=epochs,
                          history_size=history_size,
                          batch_size=batch_size)

In [84]:
model = DKN(hparams, DKNTextIterator)


In [85]:
model.fit("train/train.txt", "train/val.txt")

step 10000 , total_loss: 0.4534, data_loss: 0.4530
at epoch 1
train info: logloss loss:0.4612756512581832
eval info: auc:0.711, group_auc:0.711, mean_mrr:0.0001, ndcg@10:0.8416, ndcg@5:0.8539
at epoch 1 , train time: 1027.5 eval time: 73.4


KeyboardInterrupt: 

In [86]:
model.predict('test/test.txt', 'result')

<recommenders.models.deeprec.models.dkn.DKN at 0x7f19e41cd550>

In [87]:
result_list = []
with open('result','r') as f:
    for line in f:
        result_list.append(line.split()[0])

In [88]:
len(result_list)

694980

In [89]:
694980/15

46332.0

In [93]:
import numpy as np
result_df = pd.DataFrame(np.array(result_list).reshape(len(result_list) // 15, 15), columns=[f'p{i+1}' for i in range(15)])
result_df['id'] = range(len(result_df))
display(result_df)

|  | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | p10 | p11 | p12 | p13 | p14 | p15 | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.36175606 | 0.04509889 | 0.17911519 | 0.53075224 | 0.06669196 | 0.083261505 | 0.29081 | 0.057372935 | 0.0811602 | 0.047518328 | 0.11509859 | 0.15095384 | 0.12012795 | 0.11362242 | 0.5369745 | 0 |
| 1 | 0.18135929 | 0.10899555 | 0.059672 | 0.16693313 | 0.26880142 | 0.094061226 | 0.18493365 | 0.21089295 | 0.45597854 | 0.22970092 | 0.21434873 | 0.0970766 | 0.35896006 | 0.1618471 | 0.19513726 | 1 |
| 2 | 0.119015545 | 0.3693068 | 0.4820874 | 0.059106633 | 0.28944468 | 0.12457587 | 0.4640087 | 0.14587691 | 0.2636757 | 0.18052788 | 0.29806885 | 0.54376644 | 0.3656495 | 0.09000298 | 0.15022962 | 2 |
| 3 | 0.46198213 | 0.031377327 | 0.17136249 | 0.26808223 | 0.26628342 | 0.5194608 | 0.23092362 | 0.5174912 | 0.14719352 | 0.2061417 | 0.06942703 | 0.23925517 | 0.30125222 | 0.2624396 | 0.33627108 | 3 |
| 4 | 0.031812295 | 0.4957133 | 0.12268076 | 0.18156593 | 0.09164113 | 0.19375113 | 0.15019779 | 0.26275617 | 0.12086844 | 0.122133136 | 0.053790092 | 0.08981569 | 0.37976313 | 0.10551541 | 0.23934072 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46327 | 0.06995096 | 0.049044095 | 0.30138427 | 0.19764103 | 0.19944179 | 0.23753273 | 0.0858201 | 0.24321157 | 0.44900772 | 0.23109291 | 0.09677192 | 0.57003915 | 0.3928931 | 0.12638804 | 0.15141757 | 46327 |
| 46328 | 0.03916842 | 0.11566103 | 0.2755931 | 0.14698729 | 0.43751428 | 0.098587655 | 0.21313521 | 0.13319813 | 0.24292974 | 0.32554367 | 0.12606357 | 0.43438226 | 0.073093645 | 0.026386108 | 0.12904869 | 46328 |
| 46329 | 0.121474564 | 0.24804963 | 0.6095199 | 0.16438577 | 0.07415296 | 0.33927238 | 0.055135734 | 0.0981009 | 0.5243851 | 0.12745003 | 0.4761079 | 0.09974697 | 0.063629456 | 0.3440112 | 0.3333002 | 46329 |
| 46330 | 0.5147637 | 0.1249591 | 0.1431812 | 0.4633263 | 0.2303275 | 0.2269931 | 0.24251199 | 0.25169572 | 0.29808298 | 0.1235264 | 0.0899706 | 0.12469008 | 0.0955164 | 0.25364244 | 0.10863317 | 46330 |


In [94]:
result_df.to_csv('result.csv', index=False)
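
The reshape above assumes that every test impression has exactly 15 candidates (694980 / 15 = 46332). The following minimal sketch, with hypothetical helper names `to_impressions` and `rank_candidates`, checks that assumption before reshaping and converts each row of scores into ranks. It runs on toy scores, not on the model output.

```python
import numpy as np


def to_impressions(scores, candidates_per_impression=15):
    """Reshape a flat score list into one row per impression."""
    scores = np.asarray(scores, dtype=np.float64)
    if scores.size % candidates_per_impression != 0:
        raise ValueError(
            f"{scores.size} scores cannot be split into rows of "
            f"{candidates_per_impression}"
        )
    return scores.reshape(-1, candidates_per_impression)


def rank_candidates(matrix):
    """Return ranks per row, where rank 1 is the highest score."""
    order = np.argsort(-matrix, axis=1)  # column indices, best first
    ranks = np.empty_like(order)
    rows = np.arange(matrix.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, matrix.shape[1] + 1)
    return ranks


# Toy scores: two impressions with three candidates each.
toy = to_impressions([0.1, 0.9, 0.5, 0.7, 0.2, 0.3], candidates_per_impression=3)
toy_ranks = rank_candidates(toy)
print(toy_ranks)
```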