<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# DKN : Deep Knowledge-Aware Network for News Recommendation

DKN \[1\] is a deep learning model that incorporates information from a knowledge graph for better news recommendation. Specifically, DKN uses a TransX \[2\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer.

## Properties of DKN:

- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. 
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.
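
To make the attention idea in the last property concrete, here is a minimal sketch in plain Python. It assumes a plain dot product as the scoring function, which is a simplification chosen for illustration: DKN itself learns the scorer with a neural network. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def user_representation(candidate, clicked):
    """Weight each clicked article by its (assumed dot-product) affinity to the
    candidate, then return the weighted sum of the clicked article vectors."""
    weights = softmax([dot(candidate, c) for c in clicked])
    dim = len(candidate)
    user_vec = [sum(w * c[d] for w, c in zip(weights, clicked)) for d in range(dim)]
    return user_vec, weights

# Toy 2-dimensional embeddings (made up for illustration).
candidate = [1.0, 0.0]
clicked = [[1.0, 0.0], [0.0, 1.0]]
user_vec, weights = user_representation(candidate, clicked)
print(weights, user_vec)
```

The clicked article that points in the same direction as the candidate receives the larger weight, so the user vector changes with each candidate rather than being a fixed average of the history.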


## Data format:

### DKN takes several files as input as follows:

- **training / validation / test files**: each line in these files represents one instance. `impressionid` is used to evaluate performance within an impression session, so it is only used during evaluation; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 

- **user history file**: each line in this file represents a user's click history. Set the `history_size` parameter in the config file to the maximum number of clicked items to keep per user. If a user's click history is longer than `history_size`, only the last `history_size` clicks are kept; if it is shorter, it is automatically padded with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 

- **document feature file**: it contains the word and entity features for news articles. News articles are represented by aligned title words and title entities. As a quick example, for the news title <i>"Trump to deliver State of the Union address next week"</i>, the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entities value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, because of the word "Trump". Title word and entity values are hashed to integers from 1 to `n` (where `n` is the number of distinct words or entities). Each feature length should be fixed at k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad with 0 at the end. 
The format is: <br> 
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`

- **word embedding / entity embedding / context embedding files**: these are `*.npy` files of pretrained embeddings. After loading, each file is a two-dimensional `[n+1, k]` matrix, where n is the number of words (or entities) in the hash dictionary and k is the embedding dimension. Row 0 is reserved for zero padding. 

In this experiment, we used GloVe \[4\] vectors to initialize the word embeddings. We trained entity embeddings using TransE \[2\] on the knowledge graph, and the context embedding of an entity is the average of its neighbors' embeddings in the knowledge graph.<br>
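
The line formats described above can be parsed with a few lines of Python. This is a minimal sketch, not the library's reader: the helper names are hypothetical, a single space is assumed as the field separator, and padding the click history at the end is an assumption (the text only says it is padded with 0).

```python
def pad_or_truncate(ids, size):
    # Keep at most `size` ids and pad with 0 at the end if there are fewer.
    return ids[:size] + [0] * max(0, size - len(ids))

def parse_instance(line):
    """Parse '[label] [userid] [CandidateNews]%[impressionid]'."""
    label, user_id, news_and_imp = line.strip().split(" ")
    news_id, impression_id = news_and_imp.split("%")
    return int(label), user_id, news_id, int(impression_id)

def parse_history(line, history_size):
    """Parse '[Userid] [newsid1,newsid2...]', keeping the last `history_size` clicks."""
    user_id, clicks = line.strip().split(" ")
    news_ids = clicks.split(",")[-history_size:]
    # Assumption: pad at the end with the string "0".
    return user_id, news_ids + ["0"] * (history_size - len(news_ids))

def parse_doc_feature(line, doc_size):
    """Parse '[Newsid] [w1,w2,...] [e1,e2,...]' into fixed-length id lists."""
    news_id, words, entities = line.strip().split(" ")
    word_ids = [int(w) for w in words.split(",")]
    entity_ids = [int(e) for e in entities.split(",")]
    return news_id, pad_or_truncate(word_ids, doc_size), pad_or_truncate(entity_ids, doc_size)

print(parse_instance("1 train_U1 N1%0"))
print(parse_history("train_U1 N1,N2", history_size=3))
print(parse_doc_feature("N1 34,45,334 45,0,0", doc_size=5))
```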

## MIND dataset

The MIND dataset \[3\] is a large-scale English news dataset. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of the user before that impression.

In this notebook we use a subset of the MIND dataset, **MIND demo**. MIND demo contains 500 users, 9,432 news articles and 6,134 impression logs. 

For this quick start notebook, we directly provide all the necessary word embedding, entity embedding and context embedding files.

## Global settings and imports

In [1]:
import platform
print(platform.python_version())

3.9.18


In [2]:
import warnings
warnings.filterwarnings("ignore")

import os
import sys
from tempfile import TemporaryDirectory
import numpy as np
import pandas as pd
import scrapbook as sb
import tensorflow as tf
tf.get_logger().setLevel("ERROR") # only show error messages
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources, prepare_hparams
from recommenders.models.deeprec.models.dkn import DKN
from recommenders.models.deeprec.io.dkn_iterator import DKNTextIterator

print(f"System version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")

System version: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]
Tensorflow version: 2.14.0


## Download and load data

In [3]:
tmpdir = TemporaryDirectory()
data_path = os.path.join(tmpdir.name, "mind-demo-dkn")

yaml_file = os.path.join(data_path, "dkn.yaml")
train_file = os.path.join(data_path, "train_mind_demo.txt")
valid_file = os.path.join(data_path, "valid_mind_demo.txt")
test_file = os.path.join(data_path, "test_mind_demo.txt")
news_feature_file = os.path.join(data_path, "doc_feature.txt")
user_history_file = os.path.join(data_path, "user_history.txt")
wordEmb_file = os.path.join(data_path, "word_embeddings_100.npy")
entityEmb_file = os.path.join(data_path, "TransE_entity2vec_100.npy")
contextEmb_file = os.path.join(data_path, "TransE_context2vec_100.npy")
if not os.path.exists(yaml_file):
    download_deeprec_resources("https://recodatasets.z20.web.core.windows.net/deeprec/", tmpdir.name, "mind-demo-dkn.zip")
    

100%|██████████| 11.3k/11.3k [00:01<00:00, 6.54kKB/s]


## Create hyper-parameters

In [4]:
EPOCHS = 10
HISTORY_SIZE = 50
BATCH_SIZE = 500

In [5]:
hparams = prepare_hparams(yaml_file,
                          news_feature_file = news_feature_file,
                          user_history_file = user_history_file,
                          wordEmb_file=wordEmb_file,
                          entityEmb_file=entityEmb_file,
                          contextEmb_file=contextEmb_file,
                          epochs=EPOCHS,
                          history_size=HISTORY_SIZE,
                          batch_size=BATCH_SIZE)
print(hparams)

HParams object with values {'use_entity': True, 'use_context': True, 'cross_activation': 'identity', 'user_dropout': False, 'dropout': [0.0], 'attention_dropout': 0.0, 'load_saved_model': False, 'fast_CIN_d': 0, 'use_Linear_part': False, 'use_FM_part': False, 'use_CIN_part': False, 'use_DNN_part': False, 'init_method': 'uniform', 'init_value': 0.1, 'embed_l2': 1e-06, 'embed_l1': 0.0, 'layer_l2': 1e-06, 'layer_l1': 0.0, 'cross_l2': 0.0, 'cross_l1': 0.0, 'reg_kg': 0.0, 'learning_rate': 0.0005, 'lr_rs': 1, 'lr_kg': 0.5, 'kg_training_interval': 5, 'max_grad_norm': 2, 'is_clip_norm': 0, 'dtype': 32, 'optimizer': 'adam', 'epochs': 10, 'batch_size': 500, 'enable_BN': True, 'show_step': 10000, 'save_model': False, 'save_epoch': 2, 'write_tfevents': False, 'train_num_ngs': 4, 'need_sample': True, 'embedding_dropout': 0.0, 'EARLY_STOP': 100, 'min_seq_length': 1, 'slots': 5, 'cell': 'SUM', 'doc_size': 10, 'history_size': 50, 'word_size': 12600, 'entity_size': 3987, 'data_format': 'dkn', 'metrics'

## Train the DKN model

In [6]:
model = DKN(hparams, DKNTextIterator)

In [8]:
print(model.run_eval(valid_file))

{'auc': 0.4415, 'group_auc': 0.4549, 'mean_mrr': 0.1441, 'ndcg@5': 0.1179, 'ndcg@10': 0.1823}


In [9]:
model.fit(train_file, valid_file)

at epoch 1
train info: logloss loss:0.6918459994097551
eval info: auc:0.5729, group_auc:0.5506, mean_mrr:0.1835, ndcg@10:0.2462, ndcg@5:0.1898
at epoch 1 , train time: 222.6 eval time: 48.8
at epoch 2
train info: logloss loss:0.6521421199043592
eval info: auc:0.586, group_auc:0.5525, mean_mrr:0.1768, ndcg@10:0.2414, ndcg@5:0.1821
at epoch 2 , train time: 246.1 eval time: 29.2
at epoch 3
train info: logloss loss:0.6361930264780918
eval info: auc:0.5979, group_auc:0.5596, mean_mrr:0.1879, ndcg@10:0.2519, ndcg@5:0.1916
at epoch 3 , train time: 227.8 eval time: 34.6
at epoch 4
train info: logloss loss:0.6199349015951157
eval info: auc:0.6016, group_auc:0.5652, mean_mrr:0.1856, ndcg@10:0.2547, ndcg@5:0.188
at epoch 4 , train time: 228.2 eval time: 35.5


KeyboardInterrupt: 

## Evaluate the DKN model

Now we can check the performance on the test set:

In [None]:
res = model.run_eval(test_file)
print(res)

{'auc': 0.6019, 'group_auc': 0.5905, 'mean_mrr': 0.2122, 'ndcg@5': 0.2169, 'ndcg@10': 0.2768}


In [None]:
# write predictions to a separate file so that the test file is not overwritten
preds = model.predict(infile_name=test_file, outfile_name="test_preds.csv")

In [4]:
df = pd.read_csv(test_file, sep=" ", header=None).rename(columns={0: "label", 1: "user_id", 2: "news_id_and_impression_id"})
df["news_id"] = df["news_id_and_impression_id"].apply(lambda x: str(x).split("%")[0])
df["impression_id"] = df["news_id_and_impression_id"].apply(lambda x: str(x).split("%")[1])
df = df.drop(["news_id_and_impression_id"], axis=1)
df

Unnamed: 0,label,user_id,news_id,impression_id
0,1,test_U1,N7076,0
1,1,test_U1,N7102,0
2,1,test_U1,N7716,0
3,1,test_U1,N7490,0
4,0,test_U1,N5282,0
...,...,...,...,...
20417,0,test_U498,N7102,423
20418,0,test_U498,N7283,423
20419,0,test_U498,N7413,423
20420,0,test_U498,N7209,423


In [5]:
preds = pd.read_csv("test_preds.csv", header = None).rename(columns={0: "pred"})
full_df = pd.concat([df, preds], axis=1)
full_df

Unnamed: 0,label,user_id,news_id,impression_id,pred
0,1,test_U1,N7076,0,0.177532
1,1,test_U1,N7102,0,0.523582
2,1,test_U1,N7716,0,0.539351
3,1,test_U1,N7490,0,0.076353
4,0,test_U1,N5282,0,0.210769
...,...,...,...,...,...
20417,0,test_U498,N7102,423,0.321095
20418,0,test_U498,N7283,423,0.446874
20419,0,test_U498,N7413,423,0.203353
20420,0,test_U498,N7209,423,0.357812


In [50]:
full_df["pred"]

3595    0.675842
3507    0.599789
3558    0.578578
3532    0.311621
3584    0.310242
          ...   
3       0.076353
67      0.072969
53      0.071585
60      0.060192
30      0.037560
Name: pred, Length: 20422, dtype: float64

In [1]:
from sklearn.metrics import roc_auc_score, ndcg_score
auc_score = roc_auc_score(full_df["label"], full_df["pred"])
print(f"ROC AUC: {auc_score:.4f}")

NameError: name 'full_df' is not defined

In [67]:
full_df.sort_values(["user_id", "pred"], ascending=False)

Unnamed: 0,label,user_id,news_id,impression_id,pred,correct
3595,0,test_U99,N5390,72,0.675842,False
3507,0,test_U99,N32,71,0.599789,False
3558,0,test_U99,N7525,71,0.578578,False
3532,0,test_U99,N8416,71,0.311621,False
3584,0,test_U99,N5510,72,0.310242,False
...,...,...,...,...,...,...
3,1,test_U1,N7490,0,0.076353,False
67,0,test_U1,N5532,0,0.072969,False
53,0,test_U1,N7534,0,0.071585,False
60,0,test_U1,N9110,0,0.060192,False


In [69]:
df = full_df.sort_values(["user_id", "pred"], ascending=False)
df = df.groupby("user_id").agg({"news_id": list, "pred": list, "label": list})
df

Unnamed: 0_level_0,news_id,pred,label
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
test_U1,"[N7525, N6047, N32, N8046, N7035, N6323, N7727...","[0.9402764, 0.8622448, 0.85336107, 0.7870782, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..."
test_U10,"[N7325, N7765, N5777, N7283, N8423, N8423, N68...","[0.5895697, 0.5687929, 0.5297649, 0.51531017, ...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
test_U100,"[N32, N7419, N7102, N5883, N6672, N8219, N7716...","[0.57836336, 0.33195335, 0.24049465, 0.2226193...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
test_U101,"[N32, N6369, N7419, N7716, N7283, N6672, N6417...","[0.32904655, 0.11326947, 0.11131217, 0.1069895...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
test_U102,"[N7325, N7325, N7206, N7206, N5405, N5405, N55...","[0.63844717, 0.63844717, 0.60617846, 0.6061784...","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...
test_U94,"[N6121, N6883, N5710, N32, N32, N32, N8167, N6...","[0.80225945, 0.79154, 0.71537554, 0.6667554, 0...","[1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
test_U96,"[N7525, N6883, N32, N32, N6323, N6218, N6622, ...","[0.90868694, 0.8126242, 0.7471411, 0.7471411, ...","[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
test_U97,"[N32, N7525, N6323, N5510, N5405, N7359, N7035...","[0.75885576, 0.6719303, 0.48816162, 0.42869627...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
test_U98,"[N32, N32, N32, N7525, N7525, N8046, N6622, N6...","[0.86075646, 0.86075646, 0.86075646, 0.8007161...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [91]:
avg_ndcg = 0
count = 0
for i in range(len(df)):  # one row per user
    temp = df.iloc[i]
    ndcg = ndcg_score([temp["label"]], [temp["pred"]], k=5)
    avg_ndcg += ndcg
    count += 1
print(f"Average NDCG@5: {avg_ndcg/count}")

Average NDCG@5: 0.15771405697384241


How do they actually construct their lists? What threshold do they use? Or do they just take all the predictions for a user, sort them, and keep the top 5 or 10?

I tried the approach below, but does this mean the recommender is badly wrong, since all of the top items have label 0?

In [6]:
#full_df[full_df["user_id"] == "test_U1"].sort_values("pred", ascending=False).head()
full_df[full_df["user_id"] == "test_U1"].nlargest(5, columns="pred")

Unnamed: 0,label,user_id,news_id,impression_id,pred
51,0,test_U1,N7525,0,0.940276
21,0,test_U1,N6047,0,0.862245
71,0,test_U1,N32,0,0.853361
12,0,test_U1,N8046,0,0.787078
31,0,test_U1,N7035,0,0.700595
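
On the question of how the lists are built: the data format section says the impression id is used to evaluate performance within an impression session, so one plausible reading is that the ranking metrics are computed per impression rather than over all of a user's rows. The sketch below shows what NDCG@k per group looks like in plain Python. The function names and the toy rows are made up for illustration, and this is not claimed to be the library's exact implementation.

```python
import math
from collections import defaultdict

def dcg_at_k(labels_in_rank_order, k):
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(label / math.log2(rank + 2)
               for rank, label in enumerate(labels_in_rank_order[:k]))

def ndcg_at_k(labels, preds, k):
    # Order the labels by descending prediction score.
    ranked = [l for _, l in sorted(zip(preds, labels), key=lambda t: t[0], reverse=True)]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

def mean_ndcg_by_group(rows, k):
    """rows: iterable of (group_id, label, pred); group_id could be an impression id."""
    groups = defaultdict(lambda: ([], []))
    for group_id, label, pred in rows:
        groups[group_id][0].append(label)
        groups[group_id][1].append(pred)
    scores = [ndcg_at_k(labels, preds, k) for labels, preds in groups.values()]
    return sum(scores) / len(scores)

# Toy data: impression 0 ranks its click first, impression 1 ranks it second.
rows = [
    (0, 1, 0.9), (0, 0, 0.2), (0, 0, 0.1),
    (1, 0, 0.8), (1, 1, 0.3),
]
result = mean_ndcg_by_group(rows, k=5)
print(f"Mean NDCG@5 over groups: {result:.4f}")
```

Grouping by user instead of by impression mixes candidates from different sessions into one list, which may explain part of the gap between the hand-computed NDCG@5 and the value reported by `run_eval`.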


In [7]:
full_df = full_df.sort_values(["user_id", "pred"], ascending=False)
full_df

Unnamed: 0,label,user_id,news_id,impression_id,pred
3595,0,test_U99,N5390,72,0.675842
3507,0,test_U99,N32,71,0.599789
3558,0,test_U99,N7525,71,0.578578
3532,0,test_U99,N8416,71,0.311621
3584,0,test_U99,N5510,72,0.310242
...,...,...,...,...,...
3,1,test_U1,N7490,0,0.076353
67,0,test_U1,N5532,0,0.072969
53,0,test_U1,N7534,0,0.071585
60,0,test_U1,N9110,0,0.060192


In [8]:
# construct lists
user_recs = full_df.groupby("user_id").agg({"news_id": list, "pred": list})
user_recs

Unnamed: 0_level_0,news_id,pred
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
test_U1,"[N7525, N6047, N32, N8046, N7035, N6323, N7727...","[0.9402764, 0.8622448, 0.85336107, 0.7870782, ..."
test_U10,"[N7325, N7765, N5777, N7283, N8423, N8423, N68...","[0.5895697, 0.5687929, 0.5297649, 0.51531017, ..."
test_U100,"[N32, N7419, N7102, N5883, N6672, N8219, N7716...","[0.57836336, 0.33195335, 0.24049465, 0.2226193..."
test_U101,"[N32, N6369, N7419, N7716, N7283, N6672, N6417...","[0.32904655, 0.11326947, 0.11131217, 0.1069895..."
test_U102,"[N7325, N7325, N7206, N7206, N5405, N5405, N55...","[0.63844717, 0.63844717, 0.60617846, 0.6061784..."
...,...,...
test_U94,"[N6121, N6883, N5710, N32, N32, N32, N8167, N6...","[0.80225945, 0.79154, 0.71537554, 0.6667554, 0..."
test_U96,"[N7525, N6883, N32, N32, N6323, N6218, N6622, ...","[0.90868694, 0.8126242, 0.7471411, 0.7471411, ..."
test_U97,"[N32, N7525, N6323, N5510, N5405, N7359, N7035...","[0.75885576, 0.6719303, 0.48816162, 0.42869627..."
test_U98,"[N32, N32, N32, N7525, N7525, N8046, N6622, N6...","[0.86075646, 0.86075646, 0.86075646, 0.8007161..."


In [45]:
full_df["correct"] = full_df.apply(lambda x: x["pred"] > 0.6 and x["label"] == 1, axis=1)
num_correct = full_df.groupby("user_id")["correct"].mean()
num_correct.mean()

0.0039401217472708

Question for our re-ranker: do we assume that baseline takes the top k recommendations based on their prediction score, and we also take those top k and just rearrange them (wouldn't affect accuracy but would affect ndcg)? Or, do we take all the potential recommendations (which would then actually be completely independent from the baseline model) and rank them ourselves? The second, then, is not so much a re-ranking as just a ranking. Or a different model just based on diversity....


## Re-ranker

In [9]:
import os
import tempfile
import urllib.request
import zipfile

# Temporary folder for data we need during execution of this notebook (we'll clean up
# at the end, we promise)
temp_dir = os.path.join(tempfile.gettempdir(), 'mind')
os.makedirs(temp_dir, exist_ok=True)

# The dataset is split into a training and a validation set, each with a large and a small version.
# The format of the four files is the same.
# For demonstration purposes, we will use only the small validation set.
base_url = 'https://mind201910small.blob.core.windows.net/release'
training_small_url = f'{base_url}/MINDsmall_train.zip'
validation_small_url = f'{base_url}/MINDsmall_dev.zip'
training_large_url = f'{base_url}/MINDlarge_train.zip'
validation_large_url = f'{base_url}/MINDlarge_dev.zip'

def download_url(url,
                 destination_filename=None,
                 progress_updater=None,
                 force_download=False,
                 verbose=True):
    """
    Download a URL to a temporary file
    """
    if not verbose:
        progress_updater = None
    # This is not intended to guarantee uniqueness, we just know it happens to guarantee
    # uniqueness for this application.
    if destination_filename is None:
        url_as_filename = url.replace('://', '_').replace('/', '_')
        destination_filename = \
            os.path.join(temp_dir,url_as_filename)
    if (not force_download) and (os.path.isfile(destination_filename)):
        if verbose:
            print('Bypassing download of already-downloaded file {}'.format(
                os.path.basename(url)))
        return destination_filename
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),
                                                 destination_filename),
              end='')
    urllib.request.urlretrieve(url, destination_filename, progress_updater)
    assert (os.path.isfile(destination_filename))
    nBytes = os.path.getsize(destination_filename)
    if verbose:
        print('...done, {} bytes.'.format(nBytes))
    return destination_filename

zip_path = download_url(validation_small_url, verbose=True)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

os.listdir(temp_dir)

news_path = os.path.join(temp_dir, 'news.tsv')
news_df = pd.read_table(news_path,
              header=None,
              names=[
                  'id', 'category', 'subcategory', 'title', 'abstract', 'url',
                  'title_entities', 'abstract_entities'
              ])



Bypassing download of already-downloaded file MINDsmall_dev.zip


In [10]:
news_df.set_index("id", inplace=True)
news_df.head()

Unnamed: 0_level_0,category,subcategory,title,abstract,url,title_entities,abstract_entities
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
N18955,health,medical,Dispose of unwanted prescription drugs during ...,,https://assets.msn.com/labs/mind/AAISxPN.html,"[{""Label"": ""Drug Enforcement Administration"", ...",[]
N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [11]:
news_df.loc["N55528"]["title_entities"]

'[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]'

# MMR implementation

This follows the approach from "Same Same but Different", but uses only the news categories:
https://pure.tudelft.nl/ws/portalfiles/portal/52000140/Same_Same_but_Different.pdf
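
For reference, the scoring rule implemented by `mmr_item` below corresponds to the standard maximal marginal relevance form. Here $R$ is the candidate pool, $S$ the set of items selected so far, $\mathrm{rel}(i)$ the model's prediction score, and $\mathrm{sim}(i, j)$ the category similarity (1 minus the value returned by `get_category_similarity`). The symbol names are chosen here for illustration and are not taken from the paper.

$$
i^{*} = \arg\max_{i \in R \setminus S} \left[ \lambda \, \mathrm{rel}(i) - (1 - \lambda) \max_{j \in S} \mathrm{sim}(i, j) \right]
$$

With $\lambda = 1$ the selection depends on relevance only; with $\lambda = 0$ it depends only on dissimilarity to the items already selected.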

In [16]:
import torch
import torchtext

glove = torchtext.vocab.GloVe(name="6B", dim=50)

In [17]:
torch.cosine_similarity(glove["foodanddrink"].unsqueeze(0), glove["sports"].unsqueeze(0)).item()
# cosine similarity: 1 = similar, 0 = different
# (get_category_similarity below returns 1 - cosine, so sports vs. sports gives 1 - 1 = 0)

0.0

In [100]:
# nid_1 and nid_2 are news ids (strings)
# despite the name, this returns a distance: 1 - cosine similarity of the
# GloVe vectors of the two articles' categories
# if either news id is missing from news_df, returns 0
# 0 is very similar
# 1 is very different
def get_category_similarity(nid_1, nid_2):
    if nid_1 not in news_df.index or nid_2 not in news_df.index:
        return 0
    
    cat1 = news_df.loc[nid_1]["category"]
    cat2 = news_df.loc[nid_2]["category"]
    
    return 1 - torch.cosine_similarity(glove[cat1].unsqueeze(0), glove[cat2].unsqueeze(0)).item()


# Calculates mmr score for a given item
# item: news id
# pred: relevance of item
# recs_so_far: list of news ids recommended so far
def mmr_item(item, pred, recs_so_far, lamda):
    return (lamda * pred) - ((1 - lamda) * np.max([1 - get_category_similarity(item, x) for x in recs_so_far]))

# Calculates list of recommendations
# recs is a list of news ids
# pred scores is a list of relevance scores, same order as recs
# lamda is a weight parameter
# k is how many items should be in the recommendation; assume k >= 1
def mmr_user(recs, pred_scores, lamda, k):
    list_so_far = [recs[0]]
    preds_so_far = [pred_scores[0]]
    while len(list_so_far) < k:
        max_mmr = -2 # mmr can range from -1 to 1
        max_mmr_id = None
        for i in range(len(recs)):
            if recs[i] not in list_so_far:
                mmr_score = mmr_item(recs[i], pred_scores[i], list_so_far, lamda)
                if mmr_score > max_mmr:
                    max_mmr = mmr_score
                    max_mmr_id = recs[i]
        if max_mmr_id is None:
            break # fewer than k distinct candidates: stop instead of appending a placeholder
        list_so_far.append(max_mmr_id)
        preds_so_far.append(max_mmr) # note: this stores the MMR score, not the raw relevance
    return list_so_far, preds_so_far
        
# Calculates recommendations according to mmr for all users
# df is a Pandas dataframe with cols user, news_id, pred where pred[i] is the relevance score for news_id[i]
# lamda is a weight parameter
# k is how many items should be in the recommendation; assume k >= 1
def mmr_all(df, lamda, k):
    result_df = {}
    for index, row in df.iterrows():
        news_ids, preds = mmr_user(row['news_id'], row['pred'], lamda, k)
        result_df[index] = {"news_id": news_ids, "pred": preds}
    return result_df


In [101]:
one_user = user_recs.head(1)
mmr_all(one_user, 0.75, 5)

{'test_U1': {'news_id': ['N7525', 'N32', 'N6047', 'N8046', 'N7283'],
  'pred': [0.9402764,
   0.6400208025,
   0.49023565297470095,
   0.46304756270160674,
   0.4195598269914437]}}

In [None]:
one_user

Unnamed: 0_level_0,news_id,pred
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
test_U1,"[N7525, N6047, N32, N8046, N7035, N6323, N7727...","[0.9402764, 0.8622448, 0.85336107, 0.7870782, ..."


Example output  
test_U1  

Categories of top 5 by relevance: news, sports, foodanddrink, finance, [no category]  
Categories of top 5 by diversity: news, foodanddrink, foodanddrink, autos, lifestyle   
Categories of top 5 lambda = 0.5: news, foodanddrink, autos, finance, sports   
Categories of top 5 lambda = 0.75: news, foodanddrink, sports, finance, autos

**Interesting note:** foodanddrink returns a score of 0 from GloVe, even against itself (presumably because it is not in the GloVe vocabulary and so maps to a zero vector). That's why foodanddrink appears twice. Could additionally check if the categories match?
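
One way to act on that note is to check for an exact category match before falling back to embeddings. The sketch below is a hypothetical helper that works on category strings rather than news ids; the name `category_distance` and the optional `embedding_distance` callback are assumptions for illustration, not part of the notebook's code.

```python
def category_distance(cat1, cat2, embedding_distance=None):
    """Return 0.0 for identical categories, otherwise a distance in [0, 1].

    embedding_distance is an optional callable (cat1, cat2) -> float, for
    example 1 - cosine similarity of GloVe vectors. Without it, any two
    different categories are treated as maximally different (distance 1.0).
    """
    if cat1 is None or cat2 is None:
        return 0.0  # assumption: mirror get_category_similarity, which returns 0 when data is missing
    if cat1 == cat2:
        return 0.0  # exact match, even for out-of-vocabulary categories
    if embedding_distance is not None:
        return embedding_distance(cat1, cat2)
    return 1.0

print(category_distance("foodanddrink", "foodanddrink"))  # identical categories
print(category_distance("news", "sports"))                # different categories, no embeddings
```

With this check, two foodanddrink articles are treated as identical regardless of what the embedding lookup returns.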

# Diversity

In [95]:
k = 5

In [19]:
def diversity_user(recs):
    score = 0.0
    count = 0.0
    for i in range(len(recs)):
        for j in range(i+1, len(recs)):
            count += 1.0
            score += get_category_similarity(recs[i], recs[j])
    return score/count
        

def diversity_eval(df):
    diversity_score = 0.0
    count = 0.0
    for index, row in df.iterrows():
        diversity_score += diversity_user(row['news_id'])
        count += 1.0
    return diversity_score/count

## Baseline diversity

In [37]:
baseline_df = user_recs.drop(['pred'], axis=1)
baseline_df = baseline_df.reset_index()
baseline_df["news_id"] = baseline_df["news_id"].apply(lambda x: x[:k])
baseline_df

Unnamed: 0,user_id,news_id
0,test_U1,"[N7525, N6047, N32, N8046, N7035]"
1,test_U10,"[N7325, N7765, N5777, N7283, N8423]"
2,test_U100,"[N32, N7419, N7102, N5883, N6672]"
3,test_U101,"[N32, N6369, N7419, N7716, N7283]"
4,test_U102,"[N7325, N7325, N7206, N7206, N5405]"
...,...,...
256,test_U94,"[N6121, N6883, N5710, N32, N32]"
257,test_U96,"[N7525, N6883, N32, N32, N6323]"
258,test_U97,"[N32, N7525, N6323, N5510, N5405]"
259,test_U98,"[N32, N32, N32, N7525, N7525]"


In [38]:
print(diversity_eval(baseline_df))

0.4037543211831672


## MMR diversity

In [110]:
lamda=0.75
mmr_rerank_data = mmr_all(user_recs, lamda, k)
mmr_rerank_df = pd.DataFrame.from_dict(mmr_rerank_data, orient="index").reset_index()
mmr_rerank_df

Unnamed: 0,index,news_id,pred
0,test_U1,"[N7525, N32, N6047, N8046, N7283]","[0.9402764, 0.6400208025, 0.49023565297470095,..."
1,test_U10,"[N7325, N7765, N5777, N7283, N8423]","[0.5895697, 0.17659467500000003, 0.147323675, ..."
2,test_U100,"[N32, N7419, N5883, N7283, N7102]","[0.57836336, 0.24896501249999997, 0.0699872487..."
3,test_U101,"[N32, N6369, N7283, N7419, N7209]","[0.32904655, 0.0849521025, 0.02333403020928381..."
4,test_U102,"[N7325, N7206, N5405, N5592, N5969]","[0.63844717, 0.20463384499999998, 0.1466304499..."
...,...,...,...
256,test_U94,"[N6121, N32, N6883, N5710, N8220]","[0.80225945, 0.50006655, 0.45277082410335545, ..."
257,test_U96,"[N7525, N32, N6883, N8220, N6323]","[0.90868694, 0.560355825, 0.4685839741033554, ..."
258,test_U97,"[N32, N7525, N6323, N7283, N7976]","[0.75885576, 0.503947725, 0.20967326797470093,..."
259,test_U98,"[N32, N7525, N8046, N6622, N7283]","[0.86075646, 0.6005370750000001, 0.41640909020..."


In [111]:
print(diversity_eval(mmr_rerank_df.drop(['pred'], axis=1)))

0.6183990380145128


In [118]:
mmr_rerank_df = mmr_rerank_df.rename({"index": "user_id"}, axis=1)
split_df = mmr_rerank_df.set_index(["user_id"]).apply(lambda x: x.explode()).reset_index()
split_df = split_df.rename({"pred": "mmr_pred"}, axis=1)
split_df

Unnamed: 0,user_id,news_id,mmr_pred
0,test_U1,N7525,0.940276
1,test_U1,N32,0.640021
2,test_U1,N6047,0.490236
3,test_U1,N8046,0.463048
4,test_U1,N7283,0.41956
...,...,...,...
1300,test_U99,N5390,0.675842
1301,test_U99,N32,0.449842
1302,test_U99,N7525,0.277486
1303,test_U99,N7735,0.111944


In [121]:
mmr_labels = pd.merge(full_df, split_df, on=["user_id", "news_id"], how="right")
mmr_labels

Unnamed: 0,label,user_id,news_id,impression_id,pred,correct,mmr_pred
0,0,test_U1,N7525,0,0.940276,False,0.940276
1,0,test_U1,N32,0,0.853361,False,0.640021
2,0,test_U1,N6047,0,0.862245,False,0.490236
3,0,test_U1,N8046,0,0.787078,False,0.463048
4,0,test_U1,N7283,0,0.669152,False,0.41956
...,...,...,...,...,...,...,...
1451,0,test_U99,N5390,72,0.675842,False,0.675842
1452,0,test_U99,N32,71,0.599789,False,0.449842
1453,0,test_U99,N7525,71,0.578578,False,0.277486
1454,0,test_U99,N7735,71,0.149259,False,0.111944


In [127]:
auc_score = roc_auc_score(mmr_labels["label"], mmr_labels["mmr_pred"])
print(f"ROC AUC: {auc_score:.4f}")

ROC AUC: 0.5015


In [131]:
mmr_labels_lists = mmr_labels.groupby("user_id").agg({"label": list, "mmr_pred": list})
mmr_labels_lists

Unnamed: 0_level_0,label,mmr_pred
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
test_U1,"[0, 0, 0, 0, 0]","[0.9402764, 0.6400208025, 0.49023565297470095,..."
test_U10,"[0, 0, 0, 0, 1, 0]","[0.5895697, 0.17659467500000003, 0.147323675, ..."
test_U100,"[0, 0, 0, 0, 0]","[0.57836336, 0.24896501249999997, 0.0699872487..."
test_U101,"[1, 0, 0, 0, 0]","[0.32904655, 0.0849521025, 0.02333403020928381..."
test_U102,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]","[0.63844717, 0.63844717, 0.20463384499999998, ..."
...,...,...
test_U94,"[1, 0, 0, 0, 0, 1, 0]","[0.80225945, 0.50006655, 0.50006655, 0.5000665..."
test_U96,"[0, 0, 0, 0, 0, 1]","[0.90868694, 0.560355825, 0.560355825, 0.46858..."
test_U97,"[0, 0, 0, 0, 0]","[0.75885576, 0.503947725, 0.20967326797470093,..."
test_U98,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[0.86075646, 0.86075646, 0.86075646, 0.6005370..."


In [133]:
avg_ndcg = 0
count = 0
for i in range(len(mmr_labels_lists)):  # one row per user
    temp = mmr_labels_lists.iloc[i]
    ndcg = ndcg_score([temp["label"]], [temp["mmr_pred"]], k=5)
    avg_ndcg += ndcg
    count += 1
print(f"Average NDCG@5: {avg_ndcg/count}")

Average NDCG@5: 0.211004037646523


In [41]:
lamda = 0

In [42]:
mmr_rerank_data = mmr_all(user_recs, lamda, k)
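# mmr_all returns {user_id: {"news_id": [...], "pred": [...]}}; keep only the news id lists
mmr_rerank_df = pd.DataFrame.from_dict(mmr_rerank_data, orient="index").reset_index().rename(columns={"index": "user_id"}).drop(columns=["pred"])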

mmr_rerank_df

Unnamed: 0,user_id,news_id
0,test_U1,"[N7525, N32, N5582, N7283, N7716]"
1,test_U10,"[N7325, N7765, N5777, N7283, N8423]"
2,test_U100,"[N32, N7419, N7283, N5883, N7102]"
3,test_U101,"[N32, N6369, N7283, N7419, N7209]"
4,test_U102,"[N7325, N7206, N5405, N5592, N5969]"
...,...,...
256,test_U94,"[N6121, N32, N8220, N5506, N7735]"
257,test_U96,"[N7525, N32, N8220, N5506, N7735]"
258,test_U97,"[N32, N7525, N7735, N6466, N7762]"
259,test_U98,"[N32, N7525, N6926, N7172, N7283]"


In [43]:
print(diversity_eval(mmr_rerank_df))

0.7885455176177153


TODO
- convert ranked list back into format with labels and run through ndcg
- make some visualizations?
- thinking about this more, what does food and drink being "more different" from news than sports really mean? Instead of using word embeddings, wouldn't it make more sense to simply give a 0 if the categories match and a 1 otherwise?
-  experiment with different features instead of categories

DONE
- also realized that preds do not line up with the prediction scores. At least, if they are sorting those by highest to lowest, they don't match. This could be because: a. they are doing something different or b. (more likely) the predictions output is in a different order/I read it in wrong. (Yooo I'm so dumb. Tell me why I thought the order of the test file was the order that the ranking put the recommendations in??)
- ran code on small sample and seems to be working so far
- implemented word embedding

Other diversity metric options 
from Recommenders with a Mission https://dl.acm.org/doi/pdf/10.1145/3406522.3446019

calibration: distribution of categories across user's history, distribution of categories in current pool
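
As a rough sketch of what a calibration score could look like, the code below compares the category distribution of a user's history with that of a recommendation list using KL divergence. The choice of KL divergence, the smoothing scheme and the function names are assumptions made for illustration and may not match the paper's exact definition.

```python
import math
from collections import Counter

def category_distribution(categories):
    """Turn a list of category labels into a probability distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def kl_divergence(p, q, alpha=0.01):
    """KL(p || q) over the categories in p.

    q is smoothed towards p with weight alpha so that categories missing
    from q do not produce a division by zero (an assumed smoothing choice).
    """
    total = 0.0
    for c, p_c in p.items():
        q_c = (1 - alpha) * q.get(c, 0.0) + alpha * p_c
        total += p_c * math.log(p_c / q_c)
    return total

# Toy example: the history is mixed, the recommendation list is all news.
history = ["news", "news", "sports", "finance"]
recs = ["news", "news", "news", "news"]
p = category_distribution(history)
q = category_distribution(recs)
score = kl_divergence(p, q)
print(f"Calibration divergence: {score:.4f}")
```

A score of 0 means the recommended categories match the history exactly; larger values mean the list drifts further from what the user has read.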

It seems to me that the other diversity metrics presented in Recommenders with a Mission require additional information that you need to figure out from the text, and the other features used in Same Same but Different are all provided to them with their dataset.

If we don't just want to do categories, we could try doing similarity with the abstracts or the titles or the title entities or abstract entities. Or all of them together, somehow...
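
As one possible starting point for the entity-based option, the sketch below computes a Jaccard distance between the sets of `WikidataId` values found in two `title_entities` (or `abstract_entities`) strings, which are JSON lists as shown earlier for `N55528`. The helper names are hypothetical, and returning 0 when neither article has entities is an assumption that mirrors `get_category_similarity`.

```python
import json

def entity_ids(entities_json):
    """Extract the set of WikidataId values from a MIND entities JSON string."""
    if not isinstance(entities_json, str) or not entities_json:
        return set()  # missing values (e.g. NaN) are treated as no entities
    return {e["WikidataId"] for e in json.loads(entities_json) if "WikidataId" in e}

def entity_distance(json_1, json_2):
    """Jaccard distance between entity sets: 0 = same entities, 1 = disjoint."""
    a, b = entity_ids(json_1), entity_ids(json_2)
    union = a | b
    if not union:
        return 0.0  # assumption: no information is treated like a missing category
    return 1 - len(a & b) / len(union)

# Shortened examples in the same shape as the title_entities column.
royals = '[{"Label": "Elizabeth II", "WikidataId": "Q9682"}, {"Label": "Charles, Prince of Wales", "WikidataId": "Q43274"}]'
queen = '[{"Label": "Elizabeth II", "WikidataId": "Q9682"}]'
print(entity_distance(royals, queen))
print(entity_distance(royals, "[]"))
```

A function like this could replace `get_category_similarity` inside `mmr_item` and `diversity_user` after looking up the entity strings for each news id.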