# Content-based system using prebuilt article embeddings

## Idea in brief

- More than 99.9% of the dataset's interactions have a single clicked article from a list of articles "in view", so we can assume that treating the task as a multiclass classification, where the possible classes are the articles in view should yield decent results. Instead of trying to predict a ranking of the articles in view, we try to predict a single article - the one that should be ranked first. 
- The dataset provides multiple embeddings for the articles' content, namely ones generated from word2vec, BERT and RoBERTa models. In this notebook, we'll be using the RoBERTa embeddings as a starting point for a content-based recommender system.
- The system creates a user profile from all of the articles they've visited as per the "history" data. For each interaction, we calculate the cosine similarity between the articles in view and the user's profile. The article in view with the biggest similarity values is the predicted rank 1.
- Future development ideas:
  - Introduce a ML model that uses relevant history fields like timestamps, read times, scroll percentages as features to find optimal weights to apply to the calculated cosine similarities. Other features candidates are auxiliary fields from the articles data like tags, total inviews, total pageview, as well as behaviors data like age, gender, etc.

## Imports

In [28]:
# import random
import json

import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.metrics import accuracy_score 
from scipy.spatial.distance import cosine

## Constants

In [2]:
ROOT_PATH = '../data/small'
ARTICLE_EMBEDDINGS_PATH = f'{ROOT_PATH}/../embeddings/xlm_roberta_base.parquet'
TRAIN_HISTORY_PATH = f'{ROOT_PATH}/train/history.parquet'
VALIDATION_HISTORY_PATH = f'{ROOT_PATH}/validation/history.parquet'
TRAIN_INTERACTIONS_PATH = f'{ROOT_PATH}/train/behaviors.parquet'
VALIDATION_INTERACTIONS_PATH = f'{ROOT_PATH}/validation/behaviors.parquet'

## Data loading

In [3]:
article_embeddings_df = pd.read_parquet(ARTICLE_EMBEDDINGS_PATH)
history_df = pd.read_parquet(TRAIN_HISTORY_PATH)

In [4]:
history_df

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed
0,13538,"[2023-04-27T10:17:43.000000, 2023-04-27T10:18:...","[100.0, 35.0, 100.0, 24.0, 100.0, 23.0, 100.0,...","[9738663, 9738569, 9738663, 9738490, 9738663, ...","[17.0, 12.0, 4.0, 5.0, 4.0, 9.0, 5.0, 46.0, 11..."
1,14241,"[2023-04-27T09:40:18.000000, 2023-04-27T09:40:...","[100.0, 46.0, 100.0, 70.0, 100.0, 100.0, 100.0...","[9738557, 9738528, 9738533, 9738684, 9739035, ...","[8.0, 9.0, 28.0, 17.0, 91.0, 21.0, 14.0, 27.0,..."
2,20396,"[2023-04-27T12:30:44.000000, 2023-04-27T12:31:...","[100.0, 59.0, nan, nan, 100.0, 100.0, nan, nan...","[9738760, 9738355, 9738355, 9739864, 9741788, ...","[49.0, 34.0, 0.0, 60.0, 180.0, 49.0, 0.0, 0.0,..."
3,34912,"[2023-04-29T07:12:49.000000, 2023-04-29T13:01:...","[100.0, 35.0, 44.0, 31.0, 100.0, 100.0, 100.0,...","[9741802, 9741804, 9741803, 9740087, 9742039, ...","[153.0, 7.0, 5.0, 6.0, 44.0, 44.0, 108.0, 10.0..."
4,37953,"[2023-04-27T19:17:10.000000, 2023-04-27T19:17:...","[14.0, 28.0, 29.0, nan, 36.0, 33.0, 50.0, 100....","[9739205, 9739202, 9737084, 9739274, 9739358, ...","[4.0, 16.0, 4.0, 0.0, 5.0, 5.0, 25.0, 48.0, 6...."
...,...,...,...,...,...
15138,1479974,"[2023-05-18T06:03:16.000000, 2023-05-18T06:03:...","[58.0, 100.0, 21.0, 100.0, 100.0, 100.0, 6.0, ...","[9770989, 9769553, 9770882, 9770541, 9770867, ...","[8.0, 124.0, 9.0, 72.0, 52.0, 70.0, 2.0, 40.0,..."
15139,2405403,"[2023-04-30T19:48:27.000000, 2023-04-30T19:48:...","[40.0, 100.0, 100.0, 43.0, 100.0, 39.0]","[9743574, 9740618, 9742401, 9740156, 9742401, ...","[1.0, 64.0, 6.0, 5.0, 2.0, 120.0]"
15140,2454548,"[2023-05-17T09:35:06.000000, 2023-05-17T09:35:...","[12.0, 10.0, 30.0, 100.0, 33.0, 37.0, 87.0, 57...","[9768328, 9769328, 9769414, 9769380, 9769378, ...","[9.0, 6.0, 3.0, 32.0, 101.0, 10.0, 69.0, 10.0,..."
15141,581228,"[2023-05-18T05:24:32.000000, 2023-05-18T05:24:...","[28.0, 60.0, 100.0, 100.0, 49.0, 100.0]","[9770799, 9770726, 9747757, 9769404, 9769366, ...","[12.0, 15.0, 52.0, 700.0, 7.0, 43.0]"


In [5]:
interactions_df = pd.read_parquet(TRAIN_INTERACTIONS_PATH)

In [6]:
interactions_df

Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
0,149474,,2023-05-24 07:47:53,13.0,,2,"[9778623, 9778682, 9778669, 9778657, 9778736, ...",[9778657],139836,False,,,,False,759,7.0,22.0
1,150528,,2023-05-24 07:33:25,25.0,,2,"[9778718, 9778728, 9778745, 9778669, 9778657, ...",[9778623],143471,False,,,,False,1240,287.0,100.0
2,153068,9778682.0,2023-05-24 07:09:04,78.0,100.0,1,"[9778657, 9778669, 9772866, 9776259, 9756397, ...",[9778669],151570,False,,,,False,1976,45.0,100.0
3,153070,9777492.0,2023-05-24 07:13:14,26.0,100.0,1,"[9020783, 9778444, 9525589, 7213923, 9777397, ...",[9778628],151570,False,,,,False,1976,4.0,18.0
4,153071,9778623.0,2023-05-24 07:11:08,125.0,100.0,1,"[9777492, 9774568, 9565836, 9335113, 9771223, ...",[9777492],151570,False,,,,False,1976,26.0,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232882,580099643,9769306.0,2023-05-18 10:01:05,121.0,100.0,3,"[9233208, 9771242, 9767697, 9514481, 9771065, ...",[9770886],2106715,False,,,,False,1416293,121.0,
232883,580099644,9770882.0,2023-05-18 10:05:07,176.0,100.0,3,"[9771065, 9767697, 9770886, 9758882, 9709817, ...",[9769306],2106715,False,,,,False,1416293,148.0,100.0
232884,580099645,9769306.0,2023-05-18 10:11:03,24.0,100.0,3,"[9771042, 9440508, 9486080, 9770997, 9120051, ...",[9771042],2106715,False,,,,False,1416293,4.0,
232885,580100695,9771242.0,2023-05-18 10:00:08,5.0,100.0,1,"[9440508, 9142581, 9769917, 9767697, 9514481, ...",[9767697],2110744,False,,,,False,747086,75.0,100.0


## Embedding similarity algorithm

In [30]:
# Create a dictionary from article_embeddings_df for fast lookups
article_embedding_dict = article_embeddings_df.set_index('article_id')['FacebookAI/xlm-roberta-base'].to_dict()

In [8]:
# Generate a list of unique articles seen in dataset (inview from interactions and visited from history)
article_ids_history = [item for sublist in history_df['article_id_fixed'] for item in sublist]
article_ids_inview = [item for sublist in interactions_df['article_ids_inview'] for item in sublist]
unique_articles_set = set(article_ids_inview)
unique_articles_set.update(article_ids_history)
unique_articles = list(unique_articles_set)
len(unique_articles)

In [10]:
# Cell takes ~15 minutes to run

# Calculate pairwise cosine similarity of the embeddings for each article seen in dataset ahead of time
cosine_similarities = dict()
for i, id in enumerate(tqdm(unique_articles)):
    for next_id in unique_articles[i+1:]:
        similarity = cosine(article_embedding_dict[id], article_embedding_dict[next_id]).sum()
        cosine_similarities[f'{id},{next_id}'] = similarity
        cosine_similarities[f'{next_id},{id}'] = similarity


100%|██████████| 10360/10360 [14:27<00:00, 11.94it/s] 


In [11]:
with open(f'{ROOT_PATH}/article_similarities.json', 'w') as file:
     file.write(json.dumps(cosine_similarities)) 

In [12]:
# dict for quickly getting article ids from the user's history
user_id_to_profile_dict = history_df.set_index('user_id')['article_id_fixed'].to_dict()

In [24]:
# For each article in the interaction's inview list, calculate the sum of its cosine similarity to each article in the user's profile. 
# The most similar inview article is recommended

def get_best_rec(row):
    article_inview_ids = row['article_ids_inview']
    article_profile_ids = user_id_to_profile_dict[row['user_id']]
    
    best_similarity_article_id = article_inview_ids[0]
    best_similarity = 0
    
    for i, article_inview_id in enumerate(article_inview_ids):
        similarity_sum = 0
        for article_profile_id in article_profile_ids:
            similarity = 0
            # FIXME: find out how many KeyErrors we're getting and why
            try:
                similarity = cosine_similarities[f'{article_inview_id},{article_profile_id}']
            except KeyError:
                pass
            similarity_sum = similarity_sum + similarity
        if similarity_sum > best_similarity:
            best_similarity = similarity_sum
            best_similarity_article_id = article_inview_ids[i]
    return best_similarity_article_id

In [26]:
# Cell takes ~15 minutes to execute

tqdm.pandas()
interactions_df['predict'] = interactions_df.progress_apply(get_best_rec, axis = 1)

100%|██████████| 232887/232887 [14:52<00:00, 261.04it/s]


In [29]:
interactions_df['actual'] = interactions_df['article_ids_clicked'].apply(lambda ids: ids[0])
accuracy = accuracy_score(interactions_df['actual'], interactions_df['predict']
print(f'Accuracy: {accuracy}')

0.12274193063588779

In [None]:
# Intended for submission
#def transform_interaction_to_prediction(interaction, test_format=False):
#    prediction_index = interaction['article_ids_inview'].index(interaction['predict'])
#    prediction_list = list(range(2,len(interaction['article_ids_inview'].str.len())) + 1)
#    random.shuffle(prediction_list)
#    prediction_list.insert(prediction_index, 1)
#    list_as_str = ','.join(str(i) for i in prediction_list)
#    return f'{interaction['impression_id']} [{list_as_str}]'
#
#predictions = interactions_df['predict'].apply(transform_interaction_to_prediction)
#
#predictions.to_csv('./test-predictions.txt', index=False, header=False, line_terminator='\n')