# **POPULARITY MODEL**

### **Initial Setup**

In [1]:
!git clone https://github.com/microsoft/recommenders.git

Cloning into 'recommenders'...
remote: Enumerating objects: 37174, done.
remote: Counting objects: 100% (706/706), done.
remote: Compressing objects: 100% (244/244), done.
remote: Total 37174 (delta 469), reused 602 (delta 446), pack-reused 36468
Receiving objects: 100% (37174/37174), 205.13 MiB | 16.43 MiB/s, done.
Resolving deltas: 100% (25113/25113), done.


In [2]:
%cd recommenders

/content/recommenders


In [3]:
!pip install retrying

Collecting retrying
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Installing collected packages: retrying
Successfully installed retrying-1.3.4


In [4]:
!pip install scrapbook

Collecting scrapbook
  Downloading scrapbook-0.5.0-py3-none-any.whl (34 kB)
Collecting papermill (from scrapbook)
  Downloading papermill-2.4.0-py3-none-any.whl (38 kB)
Collecting jedi>=0.16 (from ipython->scrapbook)
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 44.4 MB/s eta 0:00:00
Collecting ansiwrap (from papermill->scrapbook)
  Downloading ansiwrap-0.8.4-py2.py3-none-any.whl (8.5 kB)
Collecting textwrap3>=0.9.2 (from ansiwrap->papermill->scrapbook)
  Downloading textwrap3-0.9.2-py2.py3-none-any.whl (12 kB)
Installing collected packages: textwrap3, jedi, ansiwrap, papermill, scrapbook
Successfully installed ansiwrap-0.8.4 jedi-0.18.2 papermill-2.4.0 scrapbook-0.5.0 textwrap3-0.9.2


### **Importing the needed libraries**

In [5]:
from google.colab import drive
drive.mount('/content/drive')

import sys
import os
import numpy as np
import pandas as pd
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
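# NDCG is computed with the local dcg_score/ndcg_score helpers defined below,
# so sklearn.metrics.ndcg_score (which they would shadow) is not imported.
from recommenders.evaluation.python_evaluation import ndcg_at_k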

import warnings
# Avoid printing some FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)


print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

Mounted at /content/drive
System version: 3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0]
Tensorflow version: 2.12.0


### **Loading the behavior and news dataframes**

In [6]:
# Options: demo, small, large
MIND_type = 'small'

In [8]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

100%|██████████| 51.7k/51.7k [00:03<00:00, 14.1kKB/s]
100%|██████████| 30.2k/30.2k [00:02<00:00, 10.4kKB/s]


-------------

## **POPULARITY MODEL: Choosing the most viewed articles**

**Auxiliary functions**

In [9]:
def dcg_score(y_true, y_score, k=10):
    """Compute the discounted cumulative gain (DCG) at k.

    Args:
        y_true (np.ndarray): Ground-truth labels.
        y_score (np.ndarray): Predicted scores; higher means ranked earlier.
        k (int): Number of top-ranked items to consider.

    Returns:
        float: DCG score.
    """
    k = min(np.shape(y_true)[-1], k)
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])
    gains = 2 ** y_true - 1
    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gains / discounts)

In [10]:
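def ndcg_score(y_true, y_score, k):
    """Compute the normalized discounted cumulative gain (NDCG) at k.

    Args:
        y_true (np.ndarray): Ground-truth labels.
        y_score (np.ndarray): Predicted scores; higher means ranked earlier.
        k (int): Number of top-ranked items to consider.

    Returns:
        float: NDCG score (DCG of the predicted order divided by the ideal DCG).
    """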
    best = dcg_score(y_true, y_true, k)
    actual = dcg_score(y_true, y_score, k)
    return actual / best
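
As a quick sanity check of the two helpers above, the sketch below evaluates them on a small hand-made impression. The labels and scores are invented for illustration, and the two functions are repeated only so that the snippet runs on its own.

```python
import numpy as np


def dcg_score(y_true, y_score, k=10):
    """DCG at k, same logic as the helper defined above."""
    k = min(np.shape(y_true)[-1], k)
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])
    gains = 2 ** y_true - 1
    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gains / discounts)


def ndcg_score(y_true, y_score, k):
    """NDCG at k: DCG of the predicted order divided by the ideal DCG."""
    return dcg_score(y_true, y_score, k) / dcg_score(y_true, y_true, k)


# Hypothetical impression with five candidates, two of them clicked.
toy_labels = np.array([1, 0, 0, 1, 0])
# Hypothetical scores without ties; the ranking is candidates 2, 4, 3, 5, 1.
toy_scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])

# Top 3 by score: one click, at rank 2, so DCG = 1 / log2(3).
toy_dcg = dcg_score(toy_labels, toy_scores, k=3)
# Ideal top 3: clicks at ranks 1 and 2, so DCG = 1 + 1 / log2(3).
toy_ideal = dcg_score(toy_labels, toy_labels, k=3)
toy_ndcg = ndcg_score(toy_labels, toy_scores, k=3)

print(round(toy_dcg, 4), round(toy_ideal, 4), round(toy_ndcg, 4))
```

Only one of the two clicked candidates reaches the top 3, and it sits at rank 2 rather than rank 1, which gives an NDCG@3 of roughly 0.387.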

In [11]:
def process_impression(impression_list):
    """
    Process the impression list and extract click and non-click information.

    Args:
        impression_list (str): List of impressions in string format.

    Returns:
        tuple: A tuple containing two lists - click and non-click.
    """
    list_of_strings = impression_list.split()
    click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '1']
    non_click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '0']
    return click,non_click
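
A short usage sketch for `process_impression`, assuming the `<news_id>-<label>` impression format seen in `behaviors.tsv`. The news IDs are made up, and the function is repeated so that the snippet is standalone.

```python
def process_impression(impression_list):
    """Same logic as the helper above, repeated so the snippet runs on its own."""
    list_of_strings = impression_list.split()
    click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '1']
    non_click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '0']
    return click, non_click


# Made-up impression string: four candidates, two of them clicked.
toy_impression = "N101-0 N102-1 N103-0 N104-1"

clicked, not_clicked = process_impression(toy_impression)
print(clicked)      # ['N102', 'N104']
print(not_clicked)  # ['N101', 'N103']
```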

In [12]:
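def generate_new_array(arr):
    """
    Rank the candidates of an impression by their value, from largest to smallest.

    Args:
        arr (list): Values to rank (here, the click count of each candidate).

    Returns:
        list: 1-based positions of the candidates, ordered from the largest value to the smallest.
    """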
    indexed_array = [(value, index) for index, value in enumerate(arr)]
    sorted_array = sorted(indexed_array, key=lambda x: x[0], reverse=True)
    new_array = [item[1] + 1 for item in sorted_array]
    return new_array
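
To make the output of `generate_new_array` concrete, the sketch below applies it to the click counts of a seven-candidate impression (illustrative values) and compares the result with a NumPy formulation. The NumPy version is an assumption about an equivalent approach, not part of the original notebook.

```python
import numpy as np


def generate_new_array(arr):
    """Same logic as the helper above, repeated so the snippet runs on its own."""
    indexed_array = [(value, index) for index, value in enumerate(arr)]
    sorted_array = sorted(indexed_array, key=lambda x: x[0], reverse=True)
    return [item[1] + 1 for item in sorted_array]


# Click counts of the seven candidates in one impression (illustrative values).
counts = [3826, 2900, 473, 284, 414, 4688, 8042]

ranking = generate_new_array(counts)
print(ranking)  # [7, 6, 1, 2, 3, 5, 4]

# Assumed equivalent: a stable descending argsort, shifted to 1-based positions.
ranking_np = (np.argsort(-np.asarray(counts), kind="stable") + 1).tolist()
print(ranking_np == ranking)
```

The first entry, 7, means that the seventh candidate has the highest count (8042); the last entry, 4, points at the candidate with the lowest count (284).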

--------------

In the next cell, the validation `behaviors.tsv` file is loaded into a DataFrame, `behav_df_demo`, using `pd.read_csv` with a tab separator. The file has no header row, so the column names are set explicitly.

A subset of this DataFrame, containing only the 'Impression_ID', 'User_ID', and 'Impressions' columns, is stored in the `behav_pop_df` DataFrame.

The 'Impressions' column in `behav_pop_df` is then parsed into two new columns: 'Viewed Impressions' and 'Impressions labels'. 'Viewed Impressions' contains the IDs of the clicked news articles, while 'Impressions labels' contains one binary value per candidate in the impression (1 for clicked, 0 for not clicked).

In [13]:
# Read the behaviors file (tab-separated, no header row)
behav_df_demo = pd.read_csv(
    valid_behaviors_file,
    sep='\t',
    header=None,
    names=['Impression_ID', 'User_ID', 'Time', 'History', 'Impressions'],
)
# Select a subset (explicit copy, so new columns are added to an independent DataFrame)
behav_pop_df = behav_df_demo.loc[:, ["Impression_ID", "User_ID", "Impressions"]].copy()
# Split impressions into clicked news IDs and per-candidate binary labels
behav_pop_df["Viewed Impressions"] = behav_pop_df["Impressions"].str.split().apply(lambda x: [item.split("-")[0] for item in x if item.split("-")[1] == "1"])
behav_pop_df["Impressions labels"] = behav_pop_df["Impressions"].str.split().apply(lambda x: [int(item.split("-")[1]) for item in x])

# Display
behav_pop_df.head()

Unnamed: 0,Impression_ID,User_ID,Impressions,Viewed Impressions,Impressions labels
0,1,U80234,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...,[N31958],"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,2,U60458,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...,[N23513],"[0, 1, 0, 0, 0, 0, 0]"
2,3,U44190,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...,[N5940],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..."
3,4,U87380,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...,[N15347],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ..."
4,5,U9444,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...,"[N5940, N31958]","[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]"


Counting the number of times each news article was clicked across all validation impressions

In [14]:
viewed_impressions_column = behav_pop_df['Viewed Impressions']

# Split viewed impression IDs and create a list of all viewed impressions
all_viewed_impressions = [impression_id for impressions in viewed_impressions_column for impression_id in impressions]

# Count the occurrences of each viewed impression ID
viewed_impression_counts = pd.Series(all_viewed_impressions).value_counts()

# Print the viewed impression counts
viewed_impression_counts.head()

N31958    8042
N36779    4688
N5940     4191
N20036    3826
N23513    2900
dtype: int64
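
The counting step can be tried on a toy column first. The sketch below uses made-up news IDs and compares the list-comprehension approach from the cell above with `Series.explode()`, which is suggested here as a possible alternative and is not used in the notebook itself.

```python
import pandas as pd

# Toy stand-in for the 'Viewed Impressions' column (hypothetical news IDs).
toy_viewed = pd.Series([["N1", "N2"], ["N1"], ["N3", "N1"]])

# Flatten with a list comprehension, as in the cell above.
flat = [news_id for clicked in toy_viewed for news_id in clicked]
counts_loop = pd.Series(flat).value_counts()

# Alternative: let pandas flatten the lists with Series.explode().
counts_explode = toy_viewed.explode().value_counts()

print(counts_loop.to_dict())
print(counts_explode.to_dict())
```

Both approaches count N1 three times and N2 and N3 once each.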

In [15]:
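# Candidate news IDs of each impression, in their original order
behav_pop_df["Impressions array"] = behav_pop_df["Impressions"].str.split().apply(lambda x: [item.split("-")[0] for item in x])
# Global click count of each candidate (0 if it was never clicked)
behav_pop_df['Codes_Count'] = behav_pop_df['Impressions array'].map(lambda x: [viewed_impression_counts.get(code, 0) for code in x])
# 1-based candidate positions, ordered from most to least clicked
behav_pop_df['Popular prediction'] = behav_pop_df['Codes_Count'].apply(generate_new_array)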

behav_pop_df.head()

Unnamed: 0,Impression_ID,User_ID,Impressions,Viewed Impressions,Impressions labels,Impressions array,Codes_Count,Popular prediction
0,1,U80234,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...,[N31958],"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[N28682, N48740, N31958, N34130, N6916, N5472,...","[420, 892, 8042, 706, 987, 2135, 1091, 2247, 1...","[3, 12, 8, 6, 18, 15, 9, 21, 22, 16, 7, 5, 13,..."
1,2,U60458,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...,[N23513],"[0, 1, 0, 0, 0, 0, 0]","[N20036, N23513, N32536, N46976, N35216, N3677...","[3826, 2900, 473, 284, 414, 4688, 8042]","[7, 6, 1, 2, 3, 5, 4]"
2,3,U44190,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...,[N5940],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[N36779, N62365, N58098, N5472, N13408, N55036...","[4688, 593, 423, 2135, 263, 152, 1562, 288, 38...","[12, 1, 14, 9, 17, 15, 4, 21, 20, 7, 23, 16, 1..."
3,4,U87380,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...,[N15347],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[N6950, N60215, N6074, N11930, N6916, N24802, ...","[326, 487, 349, 485, 987, 2247, 892, 159, 185,...","[6, 25, 20, 19, 14, 5, 16, 7, 10, 18, 24, 23, ..."
4,5,U9444,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...,"[N5940, N31958]","[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[N5940, N23513, N49285, N23355, N19990, N31958...","[4191, 2900, 1219, 21, 1562, 8042, 322, 2853, ...","[6, 13, 1, 14, 2, 8, 5, 3, 11, 12, 10, 9, 7, 4]"


In [16]:
# Add the non-clicked news IDs and the number of clicks of each impression
behav_pop_df['Not clicked'] = behav_pop_df['Impressions'].map(lambda x: process_impression(x)[1])
behav_pop_df["Clicks count"] = behav_pop_df["Viewed Impressions"].apply(len)

In [17]:
behav_pop_df.head()

Unnamed: 0,Impression_ID,User_ID,Impressions,Viewed Impressions,Impressions labels,Impressions array,Codes_Count,Popular prediction,Not clicked,Clicks count
0,1,U80234,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...,[N31958],"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[N28682, N48740, N31958, N34130, N6916, N5472,...","[420, 892, 8042, 706, 987, 2135, 1091, 2247, 1...","[3, 12, 8, 6, 18, 15, 9, 21, 22, 16, 7, 5, 13,...","[N28682, N48740, N34130, N6916, N5472, N50775,...",1
1,2,U60458,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...,[N23513],"[0, 1, 0, 0, 0, 0, 0]","[N20036, N23513, N32536, N46976, N35216, N3677...","[3826, 2900, 473, 284, 414, 4688, 8042]","[7, 6, 1, 2, 3, 5, 4]","[N20036, N32536, N46976, N35216, N36779, N31958]",1
2,3,U44190,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...,[N5940],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[N36779, N62365, N58098, N5472, N13408, N55036...","[4688, 593, 423, 2135, 263, 152, 1562, 288, 38...","[12, 1, 14, 9, 17, 15, 4, 21, 20, 7, 23, 16, 1...","[N36779, N62365, N58098, N5472, N13408, N55036...",1
3,4,U87380,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...,[N15347],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[N6950, N60215, N6074, N11930, N6916, N24802, ...","[326, 487, 349, 485, 987, 2247, 892, 159, 185,...","[6, 25, 20, 19, 14, 5, 16, 7, 10, 18, 24, 23, ...","[N6950, N60215, N6074, N11930, N6916, N24802, ...",1
4,5,U9444,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...,"[N5940, N31958]","[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[N5940, N23513, N49285, N23355, N19990, N31958...","[4191, 2900, 1219, 21, 1562, 8042, 322, 2853, ...","[6, 13, 1, 14, 2, 8, 5, 3, 11, 12, 10, 9, 7, 4]","[N23513, N49285, N23355, N19990, N29393, N3029...",2


Two new columns, 'Labels' and 'Sorted labels', are added to the `behav_pop_df` DataFrame.

'Labels' is a list of ones (length equal to 'Clicks count') followed by zeros, padded so that its total length matches 'Popular prediction'. Because 'Popular prediction' is ordered from most to least popular, the ones line up with the most popular candidates of the impression.

'Sorted labels' puts these values back into the original impression order. The `zip()` function pairs each candidate position in 'Popular prediction' with its value in 'Labels', and `sorted()` orders the pairs by position. The result is a binary prediction per candidate: 1 if the candidate is among the 'Clicks count' most popular articles of the impression, 0 otherwise.


In [18]:
# Create the new array column
behav_pop_df['Labels'] = behav_pop_df.apply(lambda row: [1] * row['Clicks count'] + [0] * (len(row['Popular prediction']) - row['Clicks count']), axis=1)
behav_pop_df['Sorted labels'] = behav_pop_df.apply(lambda row: [x for _, x in sorted(zip(row['Popular prediction'], row['Labels']))], axis=1)

# Print the updated DataFrame
behav_pop_df.head()

Unnamed: 0,Impression_ID,User_ID,Impressions,Viewed Impressions,Impressions labels,Impressions array,Codes_Count,Popular prediction,Not clicked,Clicks count,Labels,Sorted labels
0,1,U80234,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...,[N31958],"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[N28682, N48740, N31958, N34130, N6916, N5472,...","[420, 892, 8042, 706, 987, 2135, 1091, 2247, 1...","[3, 12, 8, 6, 18, 15, 9, 21, 22, 16, 7, 5, 13,...","[N28682, N48740, N34130, N6916, N5472, N50775,...",1,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,2,U60458,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...,[N23513],"[0, 1, 0, 0, 0, 0, 0]","[N20036, N23513, N32536, N46976, N35216, N3677...","[3826, 2900, 473, 284, 414, 4688, 8042]","[7, 6, 1, 2, 3, 5, 4]","[N20036, N32536, N46976, N35216, N36779, N31958]",1,"[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1]"
2,3,U44190,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...,[N5940],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[N36779, N62365, N58098, N5472, N13408, N55036...","[4688, 593, 423, 2135, 263, 152, 1562, 288, 38...","[12, 1, 14, 9, 17, 15, 4, 21, 20, 7, 23, 16, 1...","[N36779, N62365, N58098, N5472, N13408, N55036...",1,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ..."
3,4,U87380,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...,[N15347],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[N6950, N60215, N6074, N11930, N6916, N24802, ...","[326, 487, 349, 485, 987, 2247, 892, 159, 185,...","[6, 25, 20, 19, 14, 5, 16, 7, 10, 18, 24, 23, ...","[N6950, N60215, N6074, N11930, N6916, N24802, ...",1,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,5,U9444,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...,"[N5940, N31958]","[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[N5940, N23513, N49285, N23355, N19990, N31958...","[4191, 2900, 1219, 21, 1562, 8042, 322, 2853, ...","[6, 13, 1, 14, 2, 8, 5, 3, 11, 12, 10, 9, 7, 4]","[N23513, N49285, N23355, N19990, N29393, N3029...",2,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]"
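
The construction of 'Labels' and 'Sorted labels' can be traced by hand on a single impression. The sketch below uses the values of a seven-candidate impression with one click; the variable names and the set-based check at the end are illustrative additions, not part of the notebook.

```python
# One impression with seven candidates and a single click (illustrative values).
popular_prediction = [7, 6, 1, 2, 3, 5, 4]  # 1-based positions, most popular first
clicks_count = 1

# Ones for the top `clicks_count` ranks, zeros for the rest.
labels = [1] * clicks_count + [0] * (len(popular_prediction) - clicks_count)

# Sorting the (position, label) pairs by position restores the impression order.
sorted_labels = [label for _, label in sorted(zip(popular_prediction, labels))]
print(sorted_labels)  # [0, 0, 0, 0, 0, 0, 1]

# Equivalent formulation: flag the positions that appear among the top ranks.
top_positions = set(popular_prediction[:clicks_count])
flags = [1 if pos in top_positions else 0
         for pos in range(1, len(popular_prediction) + 1)]
print(flags == sorted_labels)
```

Only the seventh candidate, the most popular one in this impression, is flagged as a predicted click.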


-----------------

### **MODEL SCORING**

In [19]:
new_preds = behav_pop_df["Sorted labels"]

In [20]:
true_labels = behav_pop_df["Impressions labels"]

In [21]:
ndcg_list = [5]
for k in ndcg_list:
    ndcg_temp= np.mean(
        [
            ndcg_score(each_labels, each_preds, k)
            for each_labels, each_preds in zip(true_labels, new_preds)
        ]
    )

In [22]:
print(f'The ndcg@5 for the popularity model is {ndcg_temp}')

The ndcg@5 for the popularity model is 0.2819127378815838


In [23]:
ndcg_list = [10]
for k in ndcg_list:
    ndcg_temp= np.mean(
        [
            ndcg_score(each_labels, each_preds, k)
            for each_labels, each_preds in zip(true_labels, new_preds)
        ]
    )

In [24]:
print(f'The ndcg@10 for the popularity model is {ndcg_temp}')

The ndcg@10 for the popularity model is 0.3388937271997671


-----------

| Model      | group_auc | mean_mrr | ndcg@5 | ndcg@10 |
|------------|-----------|----------|--------|---------|
| Popularity | -         | -        | 0.2819 | 0.3389  |