### Shell

This cell installs the packages the notebook needs. We recommend running it inside a Conda virtual environment to ensure reliability.

In [69]:
%pip install pytorch_lightning
%pip install torchmetrics
%pip install --upgrade tensorboard
%pip install pandas
%pip install nbconvert

Note: you may need to restart the kernel to use updated packages.


### Imports

Importing libraries for data manipulation, neural network building, and training.

In [70]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch.nn as nn
import pytorch_lightning as pl
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch
import os
from collections import Counter

### Import for TensorBoard

This part sets up the TensorBoard logger, which is used to visualise and monitor the model's training progress.

In [71]:
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger("tb_logs", name="my_model")

### Data Preprocessing

In this section, we load and preprocess the data. It includes loading data from Parquet files, joining tables, generating binary labels, building indexes for items and users, and splitting the data into train and validation sets.

First, the behaviour data is loaded. We concatenate the training and validation sets so that we can choose our own split ratio later.

In [72]:
# Load EBNeRD behaviors dataset for both train and validation
train_behaviour = pd.read_parquet("./ebnerd_small/train/behaviors.parquet")
valid_behaviour = pd.read_parquet("./ebnerd_small/validation/behaviors.parquet")
behaviors = pd.concat([train_behaviour, valid_behaviour], ignore_index=True)

behaviors.head()

Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
0,149474,,2023-05-24 07:47:53,13.0,,2,"[9778623, 9778682, 9778669, 9778657, 9778736, ...",[9778657],139836,False,,,,False,759,7.0,22.0
1,150528,,2023-05-24 07:33:25,25.0,,2,"[9778718, 9778728, 9778745, 9778669, 9778657, ...",[9778623],143471,False,,,,False,1240,287.0,100.0
2,153068,9778682.0,2023-05-24 07:09:04,78.0,100.0,1,"[9778657, 9778669, 9772866, 9776259, 9756397, ...",[9778669],151570,False,,,,False,1976,45.0,100.0
3,153070,9777492.0,2023-05-24 07:13:14,26.0,100.0,1,"[9020783, 9778444, 9525589, 7213923, 9777397, ...",[9778628],151570,False,,,,False,1976,4.0,18.0
4,153071,9778623.0,2023-05-24 07:11:08,125.0,100.0,1,"[9777492, 9774568, 9565836, 9335113, 9771223, ...",[9777492],151570,False,,,,False,1976,26.0,100.0


Next, the history data is loaded. We concatenate the training and validation sets so that we can choose our own split ratio later.

In [73]:
# Load EBNeRD history dataset for both train and validation
train_history = pd.read_parquet("./ebnerd_small/train/history.parquet")
valid_history = pd.read_parquet("./ebnerd_small/validation/history.parquet")
history = pd.concat([train_history, valid_history], ignore_index=True)

history.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed
0,13538,"[2023-04-27T10:17:43.000000, 2023-04-27T10:18:...","[100.0, 35.0, 100.0, 24.0, 100.0, 23.0, 100.0,...","[9738663, 9738569, 9738663, 9738490, 9738663, ...","[17.0, 12.0, 4.0, 5.0, 4.0, 9.0, 5.0, 46.0, 11..."
1,14241,"[2023-04-27T09:40:18.000000, 2023-04-27T09:40:...","[100.0, 46.0, 100.0, 70.0, 100.0, 100.0, 100.0...","[9738557, 9738528, 9738533, 9738684, 9739035, ...","[8.0, 9.0, 28.0, 17.0, 91.0, 21.0, 14.0, 27.0,..."
2,20396,"[2023-04-27T12:30:44.000000, 2023-04-27T12:31:...","[100.0, 59.0, nan, nan, 100.0, 100.0, nan, nan...","[9738760, 9738355, 9738355, 9739864, 9741788, ...","[49.0, 34.0, 0.0, 60.0, 180.0, 49.0, 0.0, 0.0,..."
3,34912,"[2023-04-29T07:12:49.000000, 2023-04-29T13:01:...","[100.0, 35.0, 44.0, 31.0, 100.0, 100.0, 100.0,...","[9741802, 9741804, 9741803, 9740087, 9742039, ...","[153.0, 7.0, 5.0, 6.0, 44.0, 44.0, 108.0, 10.0..."
4,37953,"[2023-04-27T19:17:10.000000, 2023-04-27T19:17:...","[14.0, 28.0, 29.0, nan, 36.0, 33.0, 50.0, 100....","[9739205, 9739202, 9737084, 9739274, 9739358, ...","[4.0, 16.0, 4.0, 0.0, 5.0, 5.0, 25.0, 48.0, 6...."


News data is loaded.

In [74]:
# Load EBNeRD news dataset
news = pd.read_parquet("./ebnerd_small/articles.parquet")

news.head()

Unnamed: 0,article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,...,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
0,3001353,Natascha var ikke den første,"Politiet frygter nu, at Nataschas bortfører ha...",2023-06-29 06:20:33,False,Sagen om den østriske Natascha og hendes bortf...,2006-08-31 08:06:45,[3150850],article_default,https://ekstrabladet.dk/krimi/article3001353.ece,...,[],"[Kriminalitet, Personfarlig kriminalitet]",140,[],krimi,,,,0.9955,Negative
1,3003065,Kun Star Wars tjente mere,Biografgængerne strømmer ind for at se 'Da Vin...,2023-06-29 06:20:35,False,Vatikanet har opfordret til at boykotte filmen...,2006-05-21 16:57:00,[3006712],article_default,https://ekstrabladet.dk/underholdning/filmogtv...,...,[],"[Underholdning, Film og tv, Økonomi]",414,"[433, 434]",underholdning,,,,0.846,Positive
2,3012771,Morten Bruun fyret i SønderjyskE,FODBOLD: Morten Bruun fyret med øjeblikkelig v...,2023-06-29 06:20:39,False,Kemien mellem spillerne i Superligaklubben Søn...,2006-05-01 14:28:40,[3177953],article_default,https://ekstrabladet.dk/sport/fodbold/dansk_fo...,...,[],"[Erhverv, Kendt, Sport, Fodbold, Ansættelsesfo...",142,"[196, 199]",sport,,,,0.8241,Negative
3,3023463,Luderne flytter på landet,I landets tyndest befolkede områder skyder bor...,2023-06-29 06:20:43,False,Det frække erhverv rykker på landet. I den tyn...,2007-03-24 08:27:59,[3184029],article_default,https://ekstrabladet.dk/nyheder/samfund/articl...,...,[],"[Livsstil, Erotik]",118,[133],nyheder,,,,0.7053,Neutral
4,3032577,Cybersex: Hvornår er man utro?,En flirtende sms til den flotte fyr i regnskab...,2023-06-29 06:20:46,False,"De fleste af os mener, at et tungekys er utros...",2007-01-18 10:30:37,[3030463],article_default,https://ekstrabladet.dk/sex_og_samliv/article3...,...,[],"[Livsstil, Partnerskab]",565,[],sex_og_samliv,,,,0.9307,Neutral


### Join history and behaviour tables

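We join the history and behaviour tables on `user_id` so that all our data sits in one dataframe. We are not entirely sure this is good practice: because `history` concatenates the train and validation histories, a user can have more than one history row, and the left join then repeats that user's impressions (impression 150528 appears twice in the preview below).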

In [75]:
# Left join on 'user_id'
behaviour_history_merged= pd.merge(behaviors, history, on='user_id', how='left')

# Display the merged data
behaviour_history_merged.head()

Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,...,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed
0,149474,,2023-05-24 07:47:53,13.0,,2,"[9778623, 9778682, 9778669, 9778657, 9778736, ...",[9778657],139836,False,...,,,False,759,7.0,22.0,"[2023-05-03T19:04:15.000000, 2023-05-03T19:05:...","[100.0, 89.0, 27.0, 33.0, 100.0, 75.0, 39.0, 2...","[9745590, 9748574, 9748432, 9748080, 9750687, ...","[60.0, 11.0, 1.0, 15.0, 37.0, 15.0, 4.0, 8.0, ..."
1,150528,,2023-05-24 07:33:25,25.0,,2,"[9778718, 9778728, 9778745, 9778669, 9778657, ...",[9778623],143471,False,...,,,False,1240,287.0,100.0,"[2023-04-27T08:05:09.000000, 2023-04-27T10:05:...","[21.0, 100.0, 34.0, 85.0, 92.0, 75.0, 52.0, 66...","[9737881, 9738659, 9738569, 9738490, 9738528, ...","[7.0, 24.0, 28.0, 65.0, 16.0, 41.0, 59.0, 24.0..."
2,150528,,2023-05-24 07:33:25,25.0,,2,"[9778718, 9778728, 9778745, 9778669, 9778657, ...",[9778623],143471,False,...,,,False,1240,287.0,100.0,"[2023-05-04T07:10:24.000000, 2023-05-04T07:10:...","[77.0, 80.0, 28.0, 11.0, 94.0, 54.0, 74.0, 30....","[9748977, 9748976, 9747490, 9745484, 9747959, ...","[3.0, 29.0, 2.0, 3.0, 16.0, 30.0, 4.0, 3.0, 4...."
3,153068,9778682.0,2023-05-24 07:09:04,78.0,100.0,1,"[9778657, 9778669, 9772866, 9776259, 9756397, ...",[9778669],151570,False,...,,,False,1976,45.0,100.0,"[2023-04-27T14:07:16.000000, 2023-04-27T14:08:...","[100.0, nan, 100.0, 14.0, 100.0, 100.0, 100.0,...","[9738303, 9738993, 9738303, 9738902, 9738303, ...","[59.0, 1.0, 2.0, 8.0, 4.0, 28.0, 51.0, 7.0, 7...."
4,153068,9778682.0,2023-05-24 07:09:04,78.0,100.0,1,"[9778657, 9778669, 9772866, 9776259, 9756397, ...",[9778669],151570,False,...,,,False,1976,45.0,100.0,"[2023-05-04T20:50:44.000000, 2023-05-04T20:51:...","[100.0, nan, 100.0, 100.0, 100.0, 18.0, 100.0,...","[9750389, 9749756, 9750389, 9750318, 9749582, ...","[27.0, 8.0, 10.0, 24.0, 13.0, 7.0, 5.0, 34.0, ..."


Every row in `history` has an `article_id_fixed` entry. With the code below, we confirm that after the join every behaviour row has received the history data of its user.

In [76]:
# Check if every row has successfully been merged with the correct user information
article_id_fixed_null = behaviour_history_merged['article_id_fixed'].isnull().any()

if article_id_fixed_null:
    print("The 'article_id_fixed' column contains null values in some rows.")
else:
    print("The 'article_id_fixed' column does not contain null values in any row.")


The 'article_id_fixed' column does not contain null values in any row.
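
The null check above confirms that every impression found a history row, but not whether an impression matched more than one. As a minimal sketch on toy data (the frames and values below are invented for illustration, and the de-duplication step is one possible remedy rather than what this notebook does), the `validate` argument of `pd.merge` can surface such duplicates:

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
behaviors_toy = pd.DataFrame({"impression_id": [1, 2, 3], "user_id": [10, 20, 20]})
history_toy = pd.DataFrame({
    "user_id": [10, 20, 20],  # user 20 has two history rows
    "article_id_fixed": [[100], [200], [201]],
})

# A plain left join silently repeats the impressions of user 20.
merged = pd.merge(behaviors_toy, history_toy, on="user_id", how="left")

# validate="many_to_one" raises if user_id is not unique in the right table.
try:
    pd.merge(behaviors_toy, history_toy, on="user_id", how="left",
             validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

# Possible remedy (an assumption): keep a single history row per user.
history_unique = history_toy.drop_duplicates(subset="user_id", keep="last")
merged_unique = pd.merge(behaviors_toy, history_unique, on="user_id",
                         how="left", validate="many_to_one")
```

Which history row to keep per user (or whether to combine them) is a modelling decision; `keep="last"` here is only a placeholder.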


### Generate Binary Labels

Generating binary labels lets us treat the task as a binary classification problem. For each impression, we go through the articles that were shown to the user (`article_ids_inview`) and mark which of them were clicked (`article_ids_clicked`). The resulting list is stored in a new `labels` column. For example, if article 2 is clicked and `article_ids_inview` is (1, 2, 3, 4), the `labels` entry is (0, 1, 0, 0). The function also shuffles the rows and adds a `labels_len` column. Note that this operation could be optimised.

In [77]:
# Function to create binary labels column
def create_binary_labels_column(df):
    # Define the column names
    clicked_col = "article_ids_clicked"
    inview_col = "article_ids_inview"
    labels_col = "labels"

    # Create a new column with binary labels
    df[labels_col] = df.apply(lambda row: [1 if article_id in row[clicked_col] else 0 for article_id in row[inview_col]], axis=1)

    # Shuffle the data
    df = df.sample(frac=1, random_state=123)

    # Add a column with the length of the labels list
    df[labels_col + "_len"] = df[labels_col].apply(len)

    return df

# Apply the function to your merged dataset
behaviour_history_merged = create_binary_labels_column(behaviour_history_merged)

# Display the updated dataset
behaviour_history_merged.head()

Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,...,is_subscriber,session_id,next_read_time,next_scroll_percentage,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed,labels,labels_len
578909,182440184,,2023-05-28 09:02:31,7.0,,2,"[9784044, 9784679, 9784058, 9142564, 9782809, ...",[9784591],437088,False,...,False,1626986,84.0,100.0,"[2023-05-20T21:32:46.000000, 2023-05-20T21:32:...","[36.0, 100.0, 20.0, 100.0, 100.0, 100.0, 100.0...","[9774079, 9774074, 9772453, 9774120, 9773638, ...","[6.0, 39.0, 7.0, 39.0, 71.0, 8.0, 99.0, 16.0, ...","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]",10
200232,263268931,,2023-05-22 05:21:33,19.0,,1,"[9754160, 9775430, 9774595, 9775402, 7460419, ...",[9775402],1327305,False,...,False,1519807,4.0,17.0,"[2023-05-08T05:34:30.000000, 2023-05-10T07:40:...","[16.0, 48.0, 26.0, 52.0, 100.0, 100.0, 100.0, ...","[9753521, 9757183, 9759154, 9759355, 9759418, ...","[3.0, 8.0, 6.0, 22.0, 32.0, 95.0, 7.0, 87.0, 3...","[0, 0, 0, 1, 0, 0, 0]",7
194891,258249876,,2023-05-21 15:40:24,46.0,,1,"[9774598, 9770028, 9774404, 9774708, 9746360, ...",[9774015],720141,False,...,False,375748,28.0,40.0,"[2023-05-04T06:50:57.000000, 2023-05-04T06:51:...","[56.0, 13.0, 26.0, 70.0, 28.0, 25.0, nan, 26.0...","[9748977, 9745484, 9747490, 9748918, 9748942, ...","[20.0, 27.0, 8.0, 8.0, 12.0, 20.0, 22.0, 17.0,...","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...",32
503093,87132561,,2023-05-26 17:42:02,25.0,,1,"[9782616, 9780651, 9783043, 9782495, 9783056, ...",[9783043],1447383,False,...,False,1162569,21.0,37.0,"[2023-04-27T12:47:48.000000, 2023-04-27T12:48:...","[21.0, 1.0, 70.0, 100.0, 83.0, 79.0, 100.0, 19...","[9733845, 9733713, 9738684, 9738533, 9737521, ...","[21.0, 9.0, 32.0, 115.0, 101.0, 16.0, 22.0, 13...","[0, 0, 1, 0, 0, 0, 0]",7
858950,547070818,,2023-05-29 05:00:09,8.0,,2,"[9785992, 9785835, 9786111, 9785017, 9785986, ...",[9786111],885672,False,...,False,1375847,87.0,100.0,"[2023-04-27T07:30:17.000000, 2023-04-27T09:37:...","[100.0, 94.0, nan, 23.0, 69.0, 15.0, 47.0, 13....","[9738334, 9738569, 9738364, 9738490, 9738760, ...","[1.0, 22.0, 3.0, 4.0, 1276.0, 2.0, 10.0, 4.0, ...","[0, 0, 1, 0, 0, 0]",6
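
To make the labelling rule concrete, here is a small standalone sketch of the worked example from the description above. The helper name `binary_labels` is hypothetical. Converting the clicked IDs to a set is one simple way to speed up the membership test, not necessarily the optimisation the note has in mind:

```python
def binary_labels(inview_ids, clicked_ids):
    """Return 1 for each in-view article that was clicked, else 0."""
    clicked = set(clicked_ids)  # average O(1) membership test
    return [1 if article_id in clicked else 0 for article_id in inview_ids]

# Article 2 is clicked out of the in-view articles (1, 2, 3, 4).
labels = binary_labels([1, 2, 3, 4], [2])
print(labels)  # [0, 1, 0, 0]
```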


We average the sentiment scores of the articles in each user's history (`article_id_fixed`) and append the result to the dataframe as `average_sentiment_score_for_user`. We do this by looking up the sentiment score of each article and calculating the mean.

**IMPORTANT:** Sentiment is not correctly implemented. The value is most likely a single value from a vector (negative score, neutral score, positive score), where vector values add up to 1. If we wanted to fix this, we would add a vector instead, where we add the given value in the appropriate field, and half the value of 1 minus given value between the two other fields. This fix is not implemented because the entire model is replaced.
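
The fix described in the note could look roughly like the sketch below. It assumes that `sentiment_score` belongs to the class named in `sentiment_label` and that the remaining probability mass is split evenly between the other two classes. The helper name `sentiment_vector` and the class order are hypothetical, and this is not part of the notebook's pipeline:

```python
SENTIMENT_LABELS = ("Negative", "Neutral", "Positive")

def sentiment_vector(score, label):
    """Spread a single class score into a 3-way vector that sums to 1."""
    if label not in SENTIMENT_LABELS:
        raise ValueError(f"unknown sentiment label: {label!r}")
    # Split what is left evenly over the two other classes.
    remainder = (1.0 - score) / 2.0
    return [score if name == label else remainder for name in SENTIMENT_LABELS]

# First article in the news preview: score 0.9955, label "Negative".
vec = sentiment_vector(0.9955, "Negative")
```

A user's sentiment could then be the element-wise mean of these vectors over their history, instead of the mean of the raw scalar scores.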

In [78]:
# Create a dictionary mapping article IDs to sentiment scores
sentiment_dict = dict(zip(news['article_id'], news['sentiment_score']))

# Function to map sentiment score based on article ID
def map_sentiment(article_ids):
    # Filter out NaN values and get sentiment scores for clicked articles
    sentiment_scores = [sentiment_dict.get(article_id, np.nan) for article_id in article_ids if not pd.isnull(article_id)]
    # Calculate the average sentiment score if there are sentiment scores available
    if sentiment_scores:
        return np.mean(sentiment_scores)
    else:
        print("Error: Unable to calculate average sentiment score. No sentiment scores available for clicked articles.")
        return np.nan

# Apply the function to create a new column with average sentiment score
behaviour_history_merged['average_sentiment_score_for_user'] = behaviour_history_merged['article_id_fixed'].apply(map_sentiment)

# Display the updated dataframe
behaviour_history_merged.head()


Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,...,session_id,next_read_time,next_scroll_percentage,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed,labels,labels_len,average_sentiment_score_for_user
578909,182440184,,2023-05-28 09:02:31,7.0,,2,"[9784044, 9784679, 9784058, 9142564, 9782809, ...",[9784591],437088,False,...,1626986,84.0,100.0,"[2023-05-20T21:32:46.000000, 2023-05-20T21:32:...","[36.0, 100.0, 20.0, 100.0, 100.0, 100.0, 100.0...","[9774079, 9774074, 9772453, 9774120, 9773638, ...","[6.0, 39.0, 7.0, 39.0, 71.0, 8.0, 99.0, 16.0, ...","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]",10,0.811352
200232,263268931,,2023-05-22 05:21:33,19.0,,1,"[9754160, 9775430, 9774595, 9775402, 7460419, ...",[9775402],1327305,False,...,1519807,4.0,17.0,"[2023-05-08T05:34:30.000000, 2023-05-10T07:40:...","[16.0, 48.0, 26.0, 52.0, 100.0, 100.0, 100.0, ...","[9753521, 9757183, 9759154, 9759355, 9759418, ...","[3.0, 8.0, 6.0, 22.0, 32.0, 95.0, 7.0, 87.0, 3...","[0, 0, 0, 1, 0, 0, 0]",7,0.8639
194891,258249876,,2023-05-21 15:40:24,46.0,,1,"[9774598, 9770028, 9774404, 9774708, 9746360, ...",[9774015],720141,False,...,375748,28.0,40.0,"[2023-05-04T06:50:57.000000, 2023-05-04T06:51:...","[56.0, 13.0, 26.0, 70.0, 28.0, 25.0, nan, 26.0...","[9748977, 9745484, 9747490, 9748918, 9748942, ...","[20.0, 27.0, 8.0, 8.0, 12.0, 20.0, 22.0, 17.0,...","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...",32,0.853615
503093,87132561,,2023-05-26 17:42:02,25.0,,1,"[9782616, 9780651, 9783043, 9782495, 9783056, ...",[9783043],1447383,False,...,1162569,21.0,37.0,"[2023-04-27T12:47:48.000000, 2023-04-27T12:48:...","[21.0, 1.0, 70.0, 100.0, 83.0, 79.0, 100.0, 19...","[9733845, 9733713, 9738684, 9738533, 9737521, ...","[21.0, 9.0, 32.0, 115.0, 101.0, 16.0, 22.0, 13...","[0, 0, 1, 0, 0, 0, 0]",7,0.863627
858950,547070818,,2023-05-29 05:00:09,8.0,,2,"[9785992, 9785835, 9786111, 9785017, 9785986, ...",[9786111],885672,False,...,1375847,87.0,100.0,"[2023-04-27T07:30:17.000000, 2023-04-27T09:37:...","[100.0, 94.0, nan, 23.0, 69.0, 15.0, 47.0, 13....","[9738334, 9738569, 9738364, 9738490, 9738760, ...","[1.0, 22.0, 3.0, 4.0, 1276.0, 2.0, 10.0, 4.0, ...","[0, 0, 1, 0, 0, 0]",6,0.864199


For later ease of use, we generate indices for users and articles. Note that the correctness of this is checked later.

In [79]:
# Build index of items    
ind2article = {idx + 1: itemid for idx, itemid in enumerate(news['article_id'].values)}
article2ind = {itemid: idx for idx, itemid in ind2article.items()}

# Build index of users
unique_userIds = behaviour_history_merged['user_id'].unique()
ind2user = {idx + 1: itemid for idx, itemid in enumerate(unique_userIds)}
user2ind = {itemid: idx for idx, itemid in ind2user.items()}

behaviour_history_merged['userIdx'] = behaviour_history_merged['user_id'].map(lambda x: user2ind.get(x, 0))
behaviour_history_merged['articleIdx'] = behaviour_history_merged['article_id'].map(lambda x: article2ind.get(x, 0))
print(f"We have {len(article2ind)} unique articles in the dataset")
print(f"We have {len(user2ind)} unique users in the dataset")

We have 20738 unique articles in the dataset
We have 18827 unique users in the dataset
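
As a small illustration of the indexing scheme above, the sketch below builds the same kind of 1-based mapping on three article IDs taken from the previews. The helper name `build_index` is hypothetical. Index 0 stays free, so IDs that are missing from the index (for example a missing `article_id`) fall back to 0 via `.get(x, 0)`:

```python
def build_index(ids):
    """Map raw IDs to 1-based indices; 0 is left free for unknown IDs."""
    ind2id = {idx + 1: raw_id for idx, raw_id in enumerate(ids)}
    id2ind = {raw_id: idx for idx, raw_id in ind2id.items()}
    return ind2id, id2ind

ind2article_toy, article2ind_toy = build_index([9778623, 9778682, 9778669])

known = article2ind_toy.get(9778682, 0)    # second ID in the list -> 2
unknown = article2ind_toy.get(1234567, 0)  # not in the index -> 0
```

This is also why the embedding tables later need `len(index) + 1` rows: one extra row for the reserved index 0.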


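We split our data into train and validation sets by time: impressions before the 90th percentile of `impression_time` go to the train set, and the most recent 10% go to the validation set. We will use the train set to train our model and the validation set to evaluate its performance.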

In [80]:
# Split data into train and validation
test_time_threshold = behaviour_history_merged['impression_time'].quantile(0.9)
train_data = behaviour_history_merged[behaviour_history_merged['impression_time'] < test_time_threshold]
valid_data = behaviour_history_merged[behaviour_history_merged['impression_time'] >= test_time_threshold]
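
The same quantile-based split can be checked on a toy frame. The data below is invented for illustration, with ten daily impressions:

```python
import pandas as pd

# Ten impressions, one per day (made-up data).
toy = pd.DataFrame({
    "impression_id": range(10),
    "impression_time": pd.date_range("2023-05-18", periods=10, freq="D"),
})

# The 90th percentile of the timestamps separates the most recent impressions.
threshold = toy["impression_time"].quantile(0.9)
train_toy = toy[toy["impression_time"] < threshold]
valid_toy = toy[toy["impression_time"] >= threshold]

print(len(train_toy), len(valid_toy))
```

Splitting on time rather than at random keeps every validation impression later than every training impression.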

### Dataset Model

We define a PyTorch `Dataset` that wraps the dataframe. It feeds the data loaders that the machine learning model uses later.

In [81]:
class EBNeRDMindDataset(Dataset):
    def __init__(self, df):
        self.data = {
            'userIdx': torch.tensor(df.userIdx.values),
            'articleIdx': torch.tensor(df.articleIdx.values),
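            # NOTE: labels are flattened over all in-view articles, so this tensor is longer than
            # userIdx/articleIdx and labels[idx] does not necessarily belong to impression idx.
            'labels': torch.tensor([item for sublist in df.labels for item in sublist], dtype=torch.float32),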
            'sentiment_score': torch.tensor(df.average_sentiment_score_for_user.values, dtype=torch.float32),
        }

    def __len__(self):
        return len(self.data['userIdx'])

    def __getitem__(self, idx):
        return {
            'userIdx': self.data['userIdx'][idx],
            'articleIdx': self.data['articleIdx'][idx],
            'click': self.data['labels'][idx].long(),
            'noclick': 1 - self.data['labels'][idx].long(),
            'sentiment_score': self.data['sentiment_score'][idx],
        }
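
In the class above, `labels` is flattened over all candidate articles while `userIdx` and `articleIdx` hold one entry per impression, so the three tensors have different lengths. One possible way to keep them aligned is to expand each impression into one row per candidate article before building the tensors. The sketch below does this on toy data. The column name `candidate_article_id` is hypothetical, and exploding several columns at once assumes pandas 1.3 or newer:

```python
import pandas as pd

# Two impressions with three and two in-view articles (made-up values).
toy = pd.DataFrame({
    "userIdx": [1, 2],
    "article_ids_inview": [[11, 12, 13], [21, 22]],
    "labels": [[0, 1, 0], [1, 0]],
})

# Exploding both list columns together yields one row per (user, candidate article).
pairs = toy.explode(["article_ids_inview", "labels"], ignore_index=True)
pairs = pairs.rename(columns={"article_ids_inview": "candidate_article_id"})
pairs["labels"] = pairs["labels"].astype("int64")

print(pairs)
```

With this layout, row `idx` carries a user, a candidate article, and the label for exactly that pair, so a dataset can index all three with the same `idx`.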


We define datasets and dataloaders that will be used in the machine learning model.

In [82]:
# Build datasets and dataloaders for train and validation dataframes
bs = 1024
ds_train = EBNeRDMindDataset(train_data)
train_loader = DataLoader(ds_train, batch_size=bs, shuffle=True)
ds_valid = EBNeRDMindDataset(valid_data)
valid_loader = DataLoader(ds_valid, batch_size=bs, shuffle=False)


### Model 1

This section defines our first neural network model, `NewsMF`, which scores a user-item pair by the dot product of their embeddings. The class specifies the forward pass, the training and validation steps, the metrics (F1 and AUROC), and the optimizer. Note that this is only the first model; it is messier and performs worse than the second model.

In [83]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader
from torchmetrics.classification import BinaryF1Score, BinaryAUROC

class NewsMF(pl.LightningModule):
    def __init__(self, num_users, num_items, dim=10):
        super().__init__()
        self.dim = dim
        self.useremb = nn.Embedding(num_embeddings=num_users, embedding_dim=dim)
        self.itememb = nn.Embedding(num_embeddings=num_items, embedding_dim=dim)

        # BinaryF1Score metric
        self.f1_metric = BinaryF1Score()
        self.train_step_f1_outputs = []
        self.validation_step_f1_outputs = []

        # BinaryAUROC metric
        self.binary_auroc = BinaryAUROC()
        self.train_step_auroc_outputs = []
        self.validation_step_auroc_outputs = []

    def forward(self, user, item, sentiment):
        batch_size = user.size(0)
        uservec = self.useremb(user)
        itemvec = self.itememb(item)
        
        # Concatenate user and item embeddings with sentiment scores
        uservec = torch.cat((uservec, sentiment.unsqueeze(-1)), dim=1)
        itemvec = torch.cat((itemvec, sentiment.unsqueeze(-1)), dim=1)

        score = (uservec * itemvec).sum(-1).unsqueeze(-1)

        return score

    def training_step(self, batch, batch_idx):
        batch_size = batch['userIdx'].size(0)

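        # NOTE: the item argument below is the 0/1 click label, not batch['articleIdx'],
        # so only rows 0 and 1 of the item embedding table are ever looked up.
        score_click = self.forward(batch['userIdx'], batch['click'], batch['sentiment_score'])
        score_noclick = self.forward(batch['userIdx'], batch['noclick'], batch['sentiment_score'])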

        loss = F.cross_entropy(input=torch.cat((score_click, score_noclick), dim=1),
                               target=torch.zeros(batch_size, device=score_click.device).long())

        # Compute F1-score
        f1_click = self.f1_metric(score_click.squeeze(), torch.ones_like(batch['click']))
        f1_noclick = self.f1_metric(score_noclick.squeeze(), torch.zeros_like(batch['noclick']))

        # Average F1-scores
        f1 = (f1_click + f1_noclick) / 2.0

        self.train_step_f1_outputs.append(f1)

        # Calculate Binary AUROC
        binary_auroc_score = self.binary_auroc(torch.cat((score_click, score_noclick), dim=1),
                                                torch.cat((torch.ones_like(batch['click']),
                                                           torch.zeros_like(batch['noclick'])))
                                               )
        
        self.train_step_auroc_outputs.append(binary_auroc_score)

        return {'loss': loss, 'f1': f1, 'auroc': binary_auroc_score}

    def validation_step(self, batch, batch_idx):
        score_click = self.forward(batch['userIdx'], batch['click'], batch['sentiment_score'])
        score_noclick = self.forward(batch['userIdx'], batch['noclick'], batch['sentiment_score'])

        loss = F.cross_entropy(input=torch.cat((score_click, score_noclick), dim=1),
                            target=torch.zeros(batch['userIdx'].size(0), device=score_click.device).long())

        # F1 Score
        f1_click = self.f1_metric(score_click.squeeze(), torch.ones_like(batch['click']))
        f1_noclick = self.f1_metric(score_noclick.squeeze(), torch.zeros_like(batch['noclick']))
        f1 = (f1_click + f1_noclick) / 2.0 # Average F1-scores

        self.validation_step_f1_outputs.append(f1)

        # Calculate Binary AUROC
        binary_auroc_score = self.binary_auroc(torch.cat((score_click, score_noclick), dim=1),
                                                torch.cat((torch.ones_like(batch['click']),
                                                           torch.zeros_like(batch['noclick'])))
                                               )
        
        self.validation_step_auroc_outputs.append(binary_auroc_score)
                
        return {'loss': loss, 'f1': f1, 'auroc': binary_auroc_score}
    
    def on_train_epoch_end(self):
        epoch_average_f1 = torch.stack(self.train_step_f1_outputs).mean()
        print(f'Epoch {self.current_epoch}: Training F1 Score: {epoch_average_f1.item()}')
        self.log("train_epoch_average_f1", epoch_average_f1)
        self.train_step_f1_outputs.clear()  # free memory

        epoch_average_auroc = torch.stack(self.train_step_auroc_outputs).mean()
        print(f'Epoch {self.current_epoch}: Training AUROC Score: {epoch_average_auroc.item()}')
        self.log("train_epoch_average_auroc", epoch_average_auroc)
        self.train_step_auroc_outputs.clear()  # free memory


    def on_validation_epoch_end(self):
        epoch_average_f1 = torch.stack(self.validation_step_f1_outputs).mean()
        print(f'Epoch {self.current_epoch}: Validation F1 Score: {epoch_average_f1.item()}')
        self.log("validation_epoch_average_f1", epoch_average_f1)
        self.validation_step_f1_outputs.clear()  # free memory

        epoch_average_auroc = torch.stack(self.validation_step_auroc_outputs).mean()
        print(f'Epoch {self.current_epoch}: Validation AUROC Score: {epoch_average_auroc.item()}')
        self.log("validation_epoch_average_auroc", epoch_average_auroc)
        self.validation_step_auroc_outputs.clear()  # free memory

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
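
The core of `forward` is a row-wise dot product between user and item vectors. The sketch below reproduces only that step in NumPy, with random arrays standing in for the `nn.Embedding` weights. The sizes and indices are made up, and the sentiment concatenation is left out:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 4, 5, 3

# Random stand-ins for the learned embedding tables.
user_emb = rng.normal(size=(num_users, dim))
item_emb = rng.normal(size=(num_items, dim))

# A batch of three (user, item) index pairs.
user_idx = np.array([0, 2, 3])
item_idx = np.array([1, 1, 4])

# Row-wise dot product, mirroring (uservec * itemvec).sum(-1).unsqueeze(-1).
scores = (user_emb[user_idx] * item_emb[item_idx]).sum(-1)[:, None]
print(scores.shape)  # (3, 1)
```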


We instantiate the model, the trainer, and run the trainer.

In [84]:
# Instantiate the model
ebnerd_model = NewsMF(num_users=len(user2ind) + 1, num_items=len(article2ind) + 1)

# Instantiate the trainer
trainer = pl.Trainer(max_epochs=10, logger=logger)

# Train the model
trainer.fit(model=ebnerd_model, train_dataloaders=train_loader, val_dataloaders=valid_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name         | Type          | Params
-----------------------------------------------
0 | useremb      | Embedding     | 188 K 
1 | itememb      | Embedding     | 207 K 
2 | f1_metric    | BinaryF1Score | 0     
3 | binary_auroc | BinaryAUROC   | 0     
-----------------------------------------------
395 K     Trainable params
0         Non-trainable params
395 K     Total params
1.583     Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

d:\Anaconda\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


Epoch 0: Validation F1 Score: 0.3668864369392395
Epoch 0: Validation AUROC Score: 0.47754335403442383


d:\Anaconda\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 0: Validation F1 Score: 0.39982688426971436
Epoch 0: Validation AUROC Score: 0.49833253026008606
Epoch 0: Training F1 Score: 0.38376784324645996
Epoch 0: Training AUROC Score: 0.5002530217170715


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 1: Validation F1 Score: 0.41821375489234924
Epoch 1: Validation AUROC Score: 0.49845242500305176
Epoch 1: Training F1 Score: 0.40993988513946533
Epoch 1: Training AUROC Score: 0.4999447166919708


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 2: Validation F1 Score: 0.4263119399547577
Epoch 2: Validation AUROC Score: 0.498483806848526
Epoch 2: Training F1 Score: 0.42293581366539
Epoch 2: Training AUROC Score: 0.4995303452014923


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 3: Validation F1 Score: 0.43029481172561646
Epoch 3: Validation AUROC Score: 0.49846577644348145
Epoch 3: Training F1 Score: 0.4295817017555237
Epoch 3: Training AUROC Score: 0.49956706166267395


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 4: Validation F1 Score: 0.432224839925766
Epoch 4: Validation AUROC Score: 0.49845150113105774
Epoch 4: Training F1 Score: 0.4323936104774475
Epoch 4: Training AUROC Score: 0.49963217973709106


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 5: Validation F1 Score: 0.43374064564704895
Epoch 5: Validation AUROC Score: 0.4984639883041382
Epoch 5: Training F1 Score: 0.4337572157382965
Epoch 5: Training AUROC Score: 0.49976715445518494


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 6: Validation F1 Score: 0.4341133236885071
Epoch 6: Validation AUROC Score: 0.49846339225769043
Epoch 6: Training F1 Score: 0.43463417887687683
Epoch 6: Training AUROC Score: 0.499746173620224


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 7: Validation F1 Score: 0.4347049593925476
Epoch 7: Validation AUROC Score: 0.4984622299671173
Epoch 7: Training F1 Score: 0.43513062596321106
Epoch 7: Training AUROC Score: 0.4997369050979614


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 8: Validation F1 Score: 0.4347802996635437
Epoch 8: Validation AUROC Score: 0.4984569847583771
Epoch 8: Training F1 Score: 0.4353675842285156
Epoch 8: Training AUROC Score: 0.499746173620224


Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: Validation F1 Score: 0.4350593090057373
Epoch 9: Validation AUROC Score: 0.4984586834907532
Epoch 9: Training F1 Score: 0.4356866478919983
Epoch 9: Training AUROC Score: 0.4997742474079132


We print the logged metrics (F1 score and AUROC).

In [85]:
# Print the logs
logs = trainer.logged_metrics
print("Training and validation logs:", logs)

Training and validation logs: {'validation_epoch_average_f1': tensor(0.4351), 'validation_epoch_average_auroc': tensor(0.4985), 'train_epoch_average_f1': tensor(0.4357), 'train_epoch_average_auroc': tensor(0.4998)}
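
As a reading aid for the AUROC values above: AUROC can be seen as the share of (positive, negative) pairs in which the positive example gets the higher score, so a value near 0.5 is what an uninformative scorer produces. The pure-Python sketch below illustrates that definition on made-up scores. The helper `pairwise_auroc` is hypothetical and is not how `BinaryAUROC` is implemented:

```python
def pairwise_auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = pairwise_auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])   # every pair ranked correctly
constant = pairwise_auroc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])  # scorer carries no information
print(perfect, constant)  # 1.0 0.5
```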


### Alternative Model

This is our second model, which takes a different approach.

First we define the dataset class. It is nearly identical to the one above, except that `click` and `noclick` are precomputed in the constructor.

In [86]:
class EBNeRDMindDataset(Dataset):
    def __init__(self, df):
        self.data = {
            'userIdx': torch.tensor(df.userIdx.values),
            'articleIdx': torch.tensor(df.articleIdx.values),
            'click': torch.tensor([item for sublist in df.labels for item in sublist], dtype=torch.float32),
            'noclick': 1 - torch.tensor([item for sublist in df.labels for item in sublist], dtype=torch.float32),
            'sentiment_score': torch.tensor(df.average_sentiment_score_for_user.values, dtype=torch.float32),
        }

    def __len__(self):
        return len(self.data['userIdx'])

    def __getitem__(self, idx):
        return {
            'userIdx': self.data['userIdx'][idx],
            'articleIdx': self.data['articleIdx'][idx],
            'click': self.data['click'][idx].long(),
            'noclick': self.data['noclick'][idx].long(),
            'sentiment_score': self.data['sentiment_score'][idx],
        }

Then we define the model itself. Note that the F1 score is broken at the moment.

In [87]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
import pytorch_lightning as pl
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

class RecommenderModel(pl.LightningModule):
    def __init__(self, num_users, num_items, embedding_dim=64):
        super().__init__()
        self.user_embeddings = nn.Embedding(num_users, embedding_dim)
        self.item_embeddings = nn.Embedding(num_items, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim * 2, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)

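    def forward(self, user_idx, item_idx, sentiment_score):
        # NOTE: sentiment_score is accepted but not used in the computation below
        user_embedding = self.user_embeddings(user_idx)
        item_embedding = self.item_embeddings(item_idx)
        # Concatenate user and item embeddings and pass them through the MLP
        x = torch.cat([user_embedding, item_embedding], dim=1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x  # raw logits, shape (batch_size, 1)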

    def training_step(self, batch, batch_idx):
        user_idx = batch['userIdx']
        item_idx = batch['articleIdx']
        sentiment_score = batch['sentiment_score']
        click = batch['click']
        output = self(user_idx, item_idx, sentiment_score).squeeze(-1)  # drop only the last dim so a batch of one keeps its shape

        # Calculate Loss
        loss = F.binary_cross_entropy_with_logits(output, click.float())  # Convert click to float

        # Calculate AUC-ROC
        predicted_probs = torch.sigmoid(output)
        auc_roc = roc_auc_score(click.cpu().numpy(), predicted_probs.cpu().detach().numpy())

        # Calculate F1 score
        predicted_labels = predicted_probs > 0.5
        f1 = f1_score(click.cpu().numpy(), predicted_labels.cpu().numpy())

        # Log
        self.log('train_loss', loss, prog_bar=True)
        self.log('train_auc_roc', auc_roc, prog_bar=True)  # Log AUC-ROC score
        self.log('train_f1_score', f1, prog_bar=True)  # Log F1 score
        return loss

    def validation_step(self, batch, batch_idx):
        user_idx = batch['userIdx']
        item_idx = batch['articleIdx']
        sentiment_score = batch['sentiment_score']
        click = batch['click']
        output = self(user_idx, item_idx, sentiment_score).squeeze(-1)  # drop only the last dim so a batch of one keeps its shape
        loss = F.binary_cross_entropy_with_logits(output, click.float())  # Convert click to float

        # Calculate AUC-ROC
        predicted_probs = torch.sigmoid(output)
        auc_roc = roc_auc_score(click.cpu().numpy(), predicted_probs.cpu().detach().numpy())

        # Calculate F1 score
        predicted_labels = predicted_probs > 0.5
        f1 = f1_score(click.cpu().numpy(), predicted_labels.cpu().numpy())

        self.log('val_loss', loss, prog_bar=True)
        self.log('val_auc_roc', auc_roc, prog_bar=True)  # Log AUC-ROC score
        self.log('val_f1_score', f1, prog_bar=True)  # Log F1 score
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
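
One plausible reason for the F1 problem noted above is that clicks are rare and the sigmoid outputs seldom exceed the fixed 0.5 cut-off, in which case no positives are predicted and F1 collapses to zero. The sketch below illustrates this in plain Python with made-up probabilities (not the notebook's data); `f1_at_threshold` is a hypothetical helper, not part of scikit-learn or Lightning.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for binary labels when predicting 1 for probabilities above threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: report F1 as 0
    return 2 * tp / (2 * tp + fp + fn)

# Illustrative imbalanced batch: 2 clicks out of 10, all scores below 0.5.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_prob = [0.30, 0.05, 0.10, 0.08, 0.25, 0.12, 0.07, 0.20, 0.04, 0.09]

f1_default = f1_at_threshold(y_true, y_prob, 0.5)   # nothing is predicted positive
f1_lowered = f1_at_threshold(y_true, y_prob, 0.22)  # both clicks are recovered
```

Whether a lower threshold helps on the real data is an open question; a threshold would normally be tuned on validation data rather than picked by hand.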


We instantiate the model and trainer, then run the trainer.

In [88]:
model = RecommenderModel(num_users=len(user2ind) + 1, num_items=len(article2ind) + 1)
trainer = pl.Trainer(logger=logger, max_epochs=10)
trainer.fit(model, train_loader, valid_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
d:\Anaconda\Lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:653: Checkpoint directory tb_logs\my_model\version_29\checkpoints exists and is not empty.

  | Name            | Type      | Params
----------------------------------------------
0 | user_embeddings | Embedding | 1.2 M 
1 | item_embeddings | Embedding | 1.3 M 
2 | fc1             | Linear    | 16.5 K
3 | fc2             | Linear    | 8.3 K 
4 | fc3             | Linear    | 65    
----------------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.228    Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

d:\Anaconda\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


The metrics logged by the trainer, including the F1 and AUC-ROC scores, are printed below.

In [89]:
# Print the logs
logs = trainer.logged_metrics
print("Training and validation logs:", logs)

Training and validation logs: {'train_loss': tensor(0.2622), 'train_auc_roc': tensor(0.6321), 'train_f1_score': tensor(0.), 'val_loss': tensor(0.3081), 'val_auc_roc': tensor(0.4984), 'val_f1_score': tensor(0.0018)}
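
The AUC-ROC in the model above is computed separately for each batch, and AUC is undefined for a batch that contains only one class. One common alternative is to accumulate labels and scores over the whole epoch and compute a single value at the end. The sketch below shows the idea in plain Python with toy data; `auc_from_scores` is a hypothetical helper that counts correctly ranked (positive, negative) pairs, not the scikit-learn implementation.

```python
def auc_from_scores(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when only one class is present,
    because AUC is undefined in that case.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two illustrative batches; the second contains no clicks at all.
batches = [
    ([1, 0, 0], [0.9, 0.2, 0.4]),
    ([0, 0, 0], [0.1, 0.95, 0.2]),
]

# Per-batch AUC: the second batch has no defined value.
per_batch = [auc_from_scores(t, s) for t, s in batches]

# Accumulate everything, then compute one AUC for the whole epoch.
all_true = [t for ts, _ in batches for t in ts]
all_score = [s for _, ss in batches for s in ss]
epoch_auc = auc_from_scores(all_true, all_score)
```

The pairwise count is quadratic in the number of samples, so it is only meant to make the definition concrete; a library metric would be the practical choice for real epochs.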


### Prediction test for Model 1

Here we run a prediction test with the first trained model: we pick a user index, score every article for that user, and look up the top-ranked articles. As written, the cell fails with a `TypeError` because `NewsMF.forward()` also requires a `sentiment` argument.

In [90]:
USER_ID = 2350  # Random user id
# Create item_ids and user ids list
item_id = list(ind2article.keys())
userIdx = [USER_ID] * len(item_id)

predictions = ebnerd_model.forward(torch.IntTensor(userIdx), torch.IntTensor(item_id))

# Select the indices of the 10 highest scores
top_index = torch.topk(predictions.flatten(), 10).indices

# Filter for top 10 suggested items
filters = [ind2article[ix.item()] for ix in top_index]
news[news["article_id"].isin(filters)]

TypeError: NewsMF.forward() missing 1 required positional argument: 'sentiment'
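
The `TypeError` arises because the call passes two tensors while `forward` declares a third required parameter, `sentiment`. The sketch below reproduces the situation with a hypothetical `ToyScorer` class in plain Python; its scoring rule is invented for illustration and is not the `NewsMF` implementation. Which sentiment value to pass for a real user (for example that user's average sentiment score) is an assumption that depends on how `NewsMF` was trained.

```python
class ToyScorer:
    """Hypothetical stand-in for a model whose forward takes three inputs."""

    def forward(self, users, items, sentiment):
        # Invented scoring rule, only here so the example returns something.
        return [0.1 * s + 0.001 * (u + i) for u, i, s in zip(users, items, sentiment)]

scorer = ToyScorer()
users = [7, 7, 7]
items = [0, 1, 2]

try:
    scorer.forward(users, items)  # same mistake: the third argument is missing
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Supplying one sentiment value per (user, item) row resolves the error.
sentiment = [0.5] * len(items)  # assumed constant score for this user
scores = scorer.forward(users, items, sentiment)
```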

### Model 1 Save

This section saves the trained model's state dictionary to a specified directory.

In [None]:
# Specify the relative directory path
relative_directory = "Saved_Model/"

# Create the full directory path
directory_path = os.path.join(relative_directory)

# Create the directory if it does not exist
os.makedirs(directory_path, exist_ok=True)

# Save the state dictionary of the model to the specified directory
model_save_path = os.path.join(directory_path, "EBNERD_collaborative_filtering_model.pth")
torch.save(ebnerd_model.state_dict(), model_save_path)

### Model 1 Load

Here, we load the saved model from the directory.

In [None]:
# Re-create the model architecture before loading the saved weights
loaded_model = NewsMF(num_users=len(ind2user)+1, num_items=len(ind2article)+1)

# Load the state dictionary from the relative path used when saving
model_load_path = os.path.join("Saved_Model", "EBNERD_collaborative_filtering_model.pth")
loaded_model.load_state_dict(torch.load(model_load_path))

<All keys matched successfully>

### Loaded Model 1 Single Prediction

This repeats the prediction test using the model loaded from disk and returns the top recommendations for a specific user.

In [None]:
# Specify the user ID for prediction
USER_ID = 1234
PREDICTION_COUNT = 10

# Create item_ids and user ids list
article_id = list(ind2article.keys())
userIdx = [USER_ID] * len(article_id)

# Convert lists to PyTorch tensors
user_tensor = torch.IntTensor(userIdx)
item_tensor = torch.IntTensor(article_id)

# Forward pass to get predictions
predictions = loaded_model.forward(user_tensor, item_tensor)

# Select top 10 indices
top_indices = torch.topk(predictions.flatten(), PREDICTION_COUNT).indices

# Get corresponding item IDs
top_item_ids = [ind2article[ix.item()] for ix in top_indices]

# Filter for top 10 suggested items
recommended_items = news[news["article_id"].isin(top_item_ids)]

# Display the recommended items
recommended_items.head()

       article_id                                              title  \
5433      8904651                     Anastasia, 24 år og fra Herlev   
7499      9344914                     Kimmie, 29 år og fra København   
7608      9358095          Iværksætterskolen: Sådan kommer du i gang   
8798      9482881   FN: 150 skibe med korn hober sig op ved Istanbul   
10618     9623877  Sigtelse: Rapperen Miklo kom med dødstrusler u...   
13762     9728282                       Her er det spiselige batteri   
14122     9733519                          Ukraine slår igen på Krim   
14412     9735665  Kvist Industries øger omsætningen med et tocif...   
18879     9778236                               Brand i større villa   
19149     9781264        Sabbatår i udlandet er også for håndværkere   

                                                subtitle  last_modified_time  \
5433                                                     2023-06-29 06:38:13   
7499                                           
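
Filtering with `isin` returns the matching rows in the table's original order, so the ranking produced by `torch.topk` is not reflected in the displayed result. A minimal plain-Python sketch of keeping the ranked order is shown below; the scores, the `ind2article_demo` mapping, and the `catalog` dictionary (standing in for the `news` DataFrame) are all made up for illustration.

```python
import heapq

# Hypothetical scores per article index and an index-to-id mapping.
scores = {0: 0.12, 1: 0.87, 2: 0.45, 3: 0.91, 4: 0.30}
ind2article_demo = {0: 9001, 1: 9002, 2: 9003, 3: 9004, 4: 9005}
catalog = {9001: "A", 9002: "B", 9003: "C", 9004: "D", 9005: "E"}

# Top-3 indices, highest score first.
top_indices = heapq.nlargest(3, scores, key=scores.get)

# Map indices to article ids, preserving rank order.
top_article_ids = [ind2article_demo[i] for i in top_indices]

# Membership filtering (what isin does) follows catalog order instead.
top_set = set(top_article_ids)
filtered = [aid for aid in catalog if aid in top_set]

# Looking items up in ranked order keeps the recommendation ranking.
ranked_titles = [catalog[aid] for aid in top_article_ids]
```

With pandas, one way to get the same effect is to index the table by `article_id` and select rows with the ranked id list; treat that as a suggestion rather than the notebook's method.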

### TensorBoard

This section loads and starts TensorBoard to visualize training metrics.

In [None]:
# Load the extension and start TensorBoard
%load_ext tensorboard
%tensorboard --logdir tb_logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 27724), started 1:10:08 ago. (Use '!kill 27724' to kill it.)

### Utilities

This section contains various utility functions and commands. It includes converting the notebook to a Python script, getting a random user ID, and validating index mappings.

### Convert to Python Script (not needed right now but keep as utility)

This exports the entire notebook as a Python script.

In [None]:
!python -m nbconvert --to script EBNERD_Notebook.ipynb

[NbConvertApp] Converting notebook EBNERD_Notebook.ipynb to script
[NbConvertApp] Writing 16222 bytes to EBNERD_Notebook.py


### Get random user id

Picks the `user_id` of a randomly chosen row in `behaviors`.

In [None]:
random_user_index = np.random.randint(0, len(behaviors))
random_user_id = behaviors.iloc[random_user_index]['user_id']

print(f"Randomly selected user ID: {random_user_id}")

Randomly selected user ID: 2499828
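
If `behaviors` contains several rows per user, drawing a random row picks users in proportion to how many rows they have. The sketch below contrasts that with a uniform draw over distinct users, using a made-up `user_ids` list in plain Python instead of the DataFrame column.

```python
import random

# Hypothetical user_id column: user 1 has many more rows than the others.
user_ids = [1, 1, 1, 1, 1, 1, 2, 3]

rng = random.Random(0)  # seeded for reproducibility

# Row-level draw: users with more rows are more likely to be picked.
row_level_pick = user_ids[rng.randrange(len(user_ids))]

# User-level draw: every distinct user is equally likely.
unique_users = sorted(set(user_ids))
user_level_pick = rng.choice(unique_users)

# Probability of picking user 1 under each scheme.
p_row_level = user_ids.count(1) / len(user_ids)
p_user_level = 1 / len(unique_users)
```

In pandas the distinct IDs would typically come from `behaviors['user_id'].unique()`.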


### Validate conversion consistency

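Checks that `user2ind`/`ind2user` and `article2ind`/`ind2article` are inverse mappings by round-tripping one randomly chosen user ID and one randomly chosen article ID.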

In [None]:
def validate_mapping_consistency(user2ind, ind2user, article2ind, ind2article):
    # Choose a random user and article ID for validation
    random_user_id = np.random.choice(list(user2ind.keys()))
    random_article_id = np.random.choice(list(article2ind.keys()))
    print(f"Randomly selected user ID: {random_user_id}")
    print(f"Randomly selected article ID: {random_article_id}")

    # Validate user mapping
    user_index = user2ind.get(random_user_id)
    retrieved_user_id = ind2user.get(user_index)
    print(f"User index: {user_index}")
    print(f"Retrieved user ID: {retrieved_user_id}")
    
    user_mapping_consistent = random_user_id == retrieved_user_id

    # Validate article mapping
    article_index = article2ind.get(random_article_id)
    retrieved_article_id = ind2article.get(article_index)
    print(f"Article index: {article_index}")
    print(f"Retrieved article ID: {retrieved_article_id}")

    article_mapping_consistent = random_article_id == retrieved_article_id

    return user_mapping_consistent, article_mapping_consistent

# Perform validation
user_consistency, article_consistency = validate_mapping_consistency(user2ind, ind2user, article2ind, ind2article)

# Print results
print(f"User Mapping Consistency: {user_consistency}")
print(f"Article Mapping Consistency: {article_consistency}")

Randomly selected user ID: 591009
Randomly selected article ID: 8907869
User index: 4882
Retrieved user ID: 591009
Article index: 5462
Retrieved article ID: 8907869
User Mapping Consistency: True
Article Mapping Consistency: True
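
The check above samples a single user and a single article, so it can miss an inconsistency elsewhere in the mappings. A sketch of an exhaustive round-trip check follows; `is_inverse_pair` is a hypothetical helper and the dictionaries are toy data standing in for `user2ind` and `ind2user`.

```python
def is_inverse_pair(forward, backward):
    """Return True if backward undoes forward for every key, and vice versa."""
    if len(forward) != len(backward):
        return False
    return all(backward.get(idx) == key for key, idx in forward.items())

# Toy mappings standing in for user2ind / ind2user.
user2ind_demo = {591009: 1, 2499828: 2, 77: 3}
ind2user_demo = {1: 591009, 2: 2499828, 3: 77}

# A broken variant: index 3 points back to the wrong user.
ind2user_broken = {1: 591009, 2: 2499828, 3: 78}

ok = is_inverse_pair(user2ind_demo, ind2user_demo)
broken = is_inverse_pair(user2ind_demo, ind2user_broken)
```

The same helper could be applied to the article mappings; the equal-length check is what makes a one-directional pass sufficient to confirm that the two dictionaries are inverses.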
