### LightFM Data Preparation

To use the LightFM model example, the EBNeRD data first has to be restructured into one row per user-item interaction.

The notebook applies the following steps to both the training and the validation data:
1. Load the history and behaviour data.
2. Build a temporary dataframe that collects all article_ids_inview for every unique user, because they are scattered across the rows of the behaviour dataframe.
3. Append article_ids_inview to the history dataframe, now that they are grouped by user_id.
4. Create a new column for unclicked articles by comparing the clicked and inview articles in the history dataframe.
5. Remove unnecessary columns.
6. Restructure the data so that every interaction (click or no click) has its own row, and generate a rating column (1 for click, 0 for no click).
7. Add the topics column from the article data using article_id (a single article can have multiple topics).
8. Remove rows where topics is empty, keep only the first topic of each article, and store it in a column named genre (for now).
9. Remove duplicate entries for any userID-itemID combination.
10. (Validation only) Some userID-itemID combinations are present in both the training and the validation data. Remove these from the validation data.
11. Save the data as CSV.

The resulting data looks like this:

| userID | itemID | rating | genre  |
|--------|--------|--------|--------|
| 123456 | 518008 |   1    | Sport  |

After every critical step, sanity checks verify that the intermediate result is correct.
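
To make the overall flow easier to follow, here is a minimal sketch of the steps above on toy data. The three small dataframes and all values in them are made up for illustration; only the column names follow the EBNeRD tables used in this notebook, and the real cells below differ in detail (for example, they add sanity checks between steps).

```python
import pandas as pd

# Toy stand-ins for the EBNeRD tables (all values are invented for illustration).
history = pd.DataFrame({
    "user_id": [1, 2],
    "article_id_fixed": [[10, 11], [12]],
})
behaviors = pd.DataFrame({
    "user_id": [1, 1, 2],
    "article_ids_inview": [[10, 20], [20, 21], [12, 22]],
})
articles = pd.DataFrame({
    "article_id": [10, 11, 12, 20, 21, 22],
    "topics": [["Sport"], ["Sport", "Kendt"], ["Politik"], [], ["Kendt"], ["Sport"]],
})

# Collect every in-view article per user and attach it to the history.
inview = (
    behaviors.groupby("user_id")["article_ids_inview"]
    .agg(lambda lists: sorted({a for lst in lists for a in lst}))
    .reset_index()
)
merged = history.merge(inview, on="user_id", how="left")

# One row per (user, article): rating 1 for clicked, 0 for seen but not clicked.
rows = []
for row in merged.itertuples(index=False):
    clicked = set(row.article_id_fixed)
    rows += [(row.user_id, a, 1) for a in clicked]
    rows += [(row.user_id, a, 0) for a in row.article_ids_inview if a not in clicked]
ratings = pd.DataFrame(rows, columns=["userID", "itemID", "rating"])

# Attach topics, drop rows without a topic, keep the first topic as genre.
ratings = ratings.merge(articles, left_on="itemID", right_on="article_id", how="left")
ratings = ratings[ratings["topics"].apply(len) > 0].copy()
ratings["genre"] = ratings["topics"].apply(lambda t: t[0])
ratings = (
    ratings[["userID", "itemID", "rating", "genre"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(ratings)
```

In this toy run, article 20 has no topics, so its row is dropped and five interactions remain.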

### Shell

In [233]:
%pip install pandas
%pip install pyarrow
%pip install fastparquet

Note: you may need to restart the kernel to use updated packages.


### Imports

In [234]:
import pandas as pd

### Load article data

In [235]:
# Load EBNeRD news dataset
news = pd.read_parquet("./ebnerd_demo/articles.parquet")
news.head()

Unnamed: 0,article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,...,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
0,3037230,Ishockey-spiller: Jeg troede jeg skulle dø,ISHOCKEY: Ishockey-spilleren Sebastian Harts h...,2023-06-29 06:20:57,False,Ambitionerne om at komme til USA og spille ish...,2003-08-28 08:55:00,,article_default,https://ekstrabladet.dk/sport/anden_sport/isho...,...,[],"[Kriminalitet, Kendt, Sport, Katastrofe, Mindr...",142,"[327, 334]",sport,,,,0.9752,Negative
1,3044020,Prins Harry tvunget til dna-test,Hoffet tvang Prins Harry til at tage dna-test ...,2023-06-29 06:21:16,False,Den britiske tabloidavis The Sun fortsætter me...,2005-06-29 08:47:00,"[3097307, 3097197, 3104927]",article_default,https://ekstrabladet.dk/underholdning/udlandke...,...,"[PER, PER]","[Kriminalitet, Kendt, Underholdning, Personfar...",414,[432],underholdning,,,,0.7084,Negative
2,3057622,Rådden kørsel på blå plader,Kan ikke straffes: Udenlandske diplomater i Da...,2023-06-29 06:21:24,False,Slingrende spritkørsel. Grove overtrædelser af...,2005-10-10 07:20:00,[3047102],article_default,https://ekstrabladet.dk/nyheder/samfund/articl...,...,[],"[Kriminalitet, Transportmiddel, Bil]",118,[133],nyheder,,,,0.9236,Negative
3,3073151,Mærsk-arvinger i livsfare,FANGET I FLODBØLGEN: Skibsrederens oldebørn må...,2023-06-29 06:21:38,False,To oldebørn af skibsreder Mærsk McKinney Mølle...,2005-01-04 06:59:00,"[3067474, 3067478, 3153705]",article_default,https://ekstrabladet.dk/nyheder/samfund/articl...,...,[],"[Erhverv, Privat virksomhed, Livsstil, Familie...",118,[133],nyheder,,,,0.9945,Negative
4,3193383,Skød svigersøn gennem babydyne,44-årig kvinde tiltalt for drab på ekssvigersø...,2023-06-29 06:22:57,False,En 44-årig mormor blev i dag fremstillet i et ...,2003-09-15 15:30:00,,article_default,https://ekstrabladet.dk/krimi/article3193383.ece,...,[],"[Kriminalitet, Personfarlig kriminalitet]",140,[],krimi,,,,0.9966,Negative


### Preparing train data with sanity checks

#### Load data

In [236]:
# Load the EBNeRD history dataset (training split)
train_history = pd.read_parquet("./ebnerd_demo/train/history.parquet")
train_history.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed
0,13538,"[2023-04-27T10:17:43.000000, 2023-04-27T10:18:...","[100.0, 35.0, 100.0, 24.0, 100.0, 23.0, 100.0,...","[9738663, 9738569, 9738663, 9738490, 9738663, ...","[17.0, 12.0, 4.0, 5.0, 4.0, 9.0, 5.0, 46.0, 11..."
1,58608,"[2023-04-27T18:48:09.000000, 2023-04-27T18:48:...","[37.0, 61.0, 100.0, 100.0, 55.0, 100.0, 100.0,...","[9739362, 9739179, 9738567, 9739344, 9739202, ...","[2.0, 24.0, 72.0, 65.0, 11.0, 4.0, 101.0, 0.0,..."
2,95507,"[2023-04-27T15:20:28.000000, 2023-04-27T15:20:...","[60.0, 100.0, 100.0, 21.0, 29.0, 67.0, 49.0, 5...","[9739035, 9738646, 9634967, 9738902, 9735495, ...","[18.0, 29.0, 51.0, 12.0, 10.0, 10.0, 13.0, 24...."
3,106588,"[2023-04-27T08:29:09.000000, 2023-04-27T08:29:...","[24.0, 57.0, 100.0, nan, nan, 100.0, 100.0, 73...","[9738292, 9738216, 9737266, 9737556, 9737657, ...","[9.0, 15.0, 42.0, 9.0, 3.0, 58.0, 26.0, 214.0,..."
4,617963,"[2023-04-27T14:42:25.000000, 2023-04-27T14:43:...","[100.0, 100.0, nan, 46.0, 23.0, 19.0, 61.0, 70...","[9739035, 9739088, 9738902, 9738968, 9738760, ...","[45.0, 29.0, 116.0, 26.0, 34.0, 42.0, 58.0, 59..."


In [237]:
# Load the EBNeRD behaviors dataset (training split)
train_behaviour = pd.read_parquet("./ebnerd_demo/train/behaviors.parquet")
train_behaviour.head()

Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
0,48401,,2023-05-21 21:06:50,21.0,,2,"[9774516, 9771051, 9770028, 9775402, 9774461, ...",[9759966],22779,False,,,,False,21,16.0,27.0
1,152513,9778745.0,2023-05-24 07:31:26,30.0,100.0,1,"[9778669, 9778736, 9778623, 9089120, 9778661, ...",[9778661],150224,False,,,,False,298,2.0,48.0
2,155390,,2023-05-24 07:30:33,45.0,,1,"[9778369, 9777856, 9778500, 9778021, 9778627, ...",[9777856],160892,False,,,,False,401,215.0,100.0
3,214679,,2023-05-23 05:25:40,33.0,,2,"[9776715, 9776406, 9776566, 9776071, 9776808, ...",[9776566],1001055,False,,,,False,1357,40.0,47.0
4,214681,,2023-05-23 05:31:54,21.0,,2,"[9775202, 9776855, 9776688, 9771995, 9776583, ...",[9776553],1001055,False,,,,False,1358,5.0,49.0


#### Make temporary dataframe and group article_ids_inview for unique user_ids (they are scattered in behaviour dataframe)

In [238]:
# Group by user_id, flatten article_ids_inview lists, and remove duplicates
train_inview_temp = train_behaviour.groupby('user_id')['article_ids_inview'].agg(lambda x: list(set(y for sublist in x for y in sublist))).reset_index()

# Renaming columns
train_inview_temp.columns = ['user_id', 'article_ids_inview']

train_inview_temp.head()

Unnamed: 0,user_id,article_ids_inview
0,11313,"[9776897, 9779713, 9775489, 9779205, 9777036, ..."
1,13538,"[9773070, 9773078, 9775142, 9770028, 9754160, ..."
2,15430,"[9345280, 9775489, 9268227, 9378062, 9778318, ..."
3,19181,"[9771009, 9771523, 9772548, 9779204, 9769996, ..."
4,19568,"[9779713, 9766434, 9780195, 9772227, 9779205, ..."
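
As an alternative to the nested generator inside `agg`, the same grouping could be sketched with `DataFrame.explode`. This is only an illustration on made-up data, not what the notebook runs; the variable name `inview_per_user` is an assumption.

```python
import pandas as pd

# Toy behaviour data: user 7 has two impressions that share article 2.
behaviors = pd.DataFrame({
    "user_id": [7, 7, 8],
    "article_ids_inview": [[1, 2], [2, 3], [4]],
})

inview_per_user = (
    behaviors[["user_id", "article_ids_inview"]]
    .explode("article_ids_inview")              # one row per (impression, article)
    .drop_duplicates()                          # one row per (user, article)
    .groupby("user_id")["article_ids_inview"]
    .agg(list)                                  # back to one list per user
    .reset_index()
)
print(inview_per_user)
```

Unlike the `set`-based version, this variant keeps the articles in order of first appearance.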


##### Sanity checks

In [239]:
# Check for missing user IDs in history and behavior dataframes
# Print the number of unique user IDs in history and behaviors dataframes
print("Number of unique user IDs in history dataframe:", train_history['user_id'].nunique())
print("Number of unique user IDs in behaviors dataframe:", train_behaviour['user_id'].nunique())
print("Number of unique user IDs in inview_temp dataframe:", train_inview_temp['user_id'].nunique())

# Check for duplicate user IDs
duplicate_user_ids = train_history[train_history.duplicated('user_id')]['user_id'].unique()

if len(duplicate_user_ids) > 0:
    print("Duplicate user IDs found in history dataframe:")
    print(duplicate_user_ids)
else:
    print("No duplicate user IDs found in history dataframe.")

# Check if all user_ids match
# Get the set of unique user IDs in inview_temp dataframe
inview_temp_user_ids = set(train_inview_temp['user_id'])

# Get the set of unique user IDs in train_history dataframe
train_history_user_ids = set(train_history['user_id'])

# Check if the sets of unique user IDs are the same
user_ids_match = inview_temp_user_ids == train_history_user_ids

print("Distinct user IDs in inview_temp match with train_history:", user_ids_match)

# Inview_temp checks
# Check 1: Number of rows matches with number of rows in history dataframe
check1 = len(train_inview_temp) == len(train_history)

# Check 2: All user_ids are distinct
check2 = train_inview_temp['user_id'].nunique() == len(train_inview_temp)

# Check 3: Article_ids_inview contain no duplicates
check3 = all(len(set(article_ids)) == len(article_ids) for article_ids in train_inview_temp['article_ids_inview'])

# Printing the results of checks
print("Check 1: Number of rows match with number of rows in history dataframe:", check1)
print("Check 2: All user_ids are distinct:", check2)
print("Check 3: Article_ids_inview contain no duplicates:", check3)

Number of unique user IDs in history dataframe: 1590
Number of unique user IDs in behaviors dataframe: 1590
Number of unique user IDs in inview_temp dataframe: 1590
No duplicate user IDs found in history dataframe.
Distinct user IDs in inview_temp match with train_history: True
Check 1: Number of rows match with number of rows in history dataframe: True
Check 2: All user_ids are distinct: True
Check 3: Article_ids_inview contain no duplicates: True
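
The printed checks above have to be read by eye. If the notebook should instead stop as soon as a check fails, the same conditions could be expressed as assertions. The helper name `check_inview` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_inview(inview: pd.DataFrame, history: pd.DataFrame) -> None:
    """Raise AssertionError if the grouped in-view table is inconsistent with history."""
    assert history["user_id"].is_unique, "duplicate user IDs in history"
    assert inview["user_id"].is_unique, "duplicate user IDs in in-view table"
    assert set(inview["user_id"]) == set(history["user_id"]), "user IDs do not match"
    assert all(
        len(set(ids)) == len(ids) for ids in inview["article_ids_inview"]
    ), "in-view lists contain duplicates"

# Toy data that satisfies all checks (values are invented).
toy_history = pd.DataFrame({"user_id": [1, 2], "article_id_fixed": [[10], [11]]})
toy_inview = pd.DataFrame({"user_id": [1, 2], "article_ids_inview": [[10, 20], [11]]})
check_inview(toy_inview, toy_history)
print("All checks passed.")
```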


#### We can now add article_ids_inview to history, because they are grouped in lists for unique user_ids

In [240]:
train_history_with_inview = pd.merge(train_history, train_inview_temp, on='user_id', how='left')
train_history_with_inview.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed,article_ids_inview
0,13538,"[2023-04-27T10:17:43.000000, 2023-04-27T10:18:...","[100.0, 35.0, 100.0, 24.0, 100.0, 23.0, 100.0,...","[9738663, 9738569, 9738663, 9738490, 9738663, ...","[17.0, 12.0, 4.0, 5.0, 4.0, 9.0, 5.0, 46.0, 11...","[9773070, 9773078, 9775142, 9770028, 9754160, ..."
1,58608,"[2023-04-27T18:48:09.000000, 2023-04-27T18:48:...","[37.0, 61.0, 100.0, 100.0, 55.0, 100.0, 100.0,...","[9739362, 9739179, 9738567, 9739344, 9739202, ...","[2.0, 24.0, 72.0, 65.0, 11.0, 4.0, 101.0, 0.0,...","[9770369, 9769474, 9771523, 9770886, 9769996, ..."
2,95507,"[2023-04-27T15:20:28.000000, 2023-04-27T15:20:...","[60.0, 100.0, 100.0, 21.0, 29.0, 67.0, 49.0, 5...","[9739035, 9738646, 9634967, 9738902, 9735495, ...","[18.0, 29.0, 51.0, 12.0, 10.0, 10.0, 13.0, 24....","[9777156, 9363981, 9758734, 9777693, 9769504, ..."
3,106588,"[2023-04-27T08:29:09.000000, 2023-04-27T08:29:...","[24.0, 57.0, 100.0, nan, nan, 100.0, 100.0, 73...","[9738292, 9738216, 9737266, 9737556, 9737657, ...","[9.0, 15.0, 42.0, 9.0, 3.0, 58.0, 26.0, 214.0,...","[9769474, 9268227, 9777156, 9775621, 9773574, ..."
4,617963,"[2023-04-27T14:42:25.000000, 2023-04-27T14:43:...","[100.0, 100.0, nan, 46.0, 23.0, 19.0, 61.0, 70...","[9739035, 9739088, 9738902, 9738968, 9738760, ...","[45.0, 29.0, 116.0, 26.0, 34.0, 42.0, 58.0, 59...","[9770882, 9774789, 9757574, 9774764, 9774542, ..."


#### Make new column for unclicked articles

In [241]:
# Return the inview articles that the user did not click
def get_non_clicked_articles(inview_articles, clicked_articles):
    clicked = set(clicked_articles)  # set lookup is faster than scanning the list for every article
    return [article for article in inview_articles if article not in clicked]

In [242]:
# Apply the function to create the new column
train_history_with_inview['non_clicked_articles'] = train_history_with_inview.apply(lambda row: get_non_clicked_articles(row['article_ids_inview'], row['article_id_fixed']), axis=1)

# Display the updated dataframe
train_history_with_inview.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed,article_ids_inview,non_clicked_articles
0,13538,"[2023-04-27T10:17:43.000000, 2023-04-27T10:18:...","[100.0, 35.0, 100.0, 24.0, 100.0, 23.0, 100.0,...","[9738663, 9738569, 9738663, 9738490, 9738663, ...","[17.0, 12.0, 4.0, 5.0, 4.0, 9.0, 5.0, 46.0, 11...","[9773070, 9773078, 9775142, 9770028, 9754160, ...","[9773070, 9773078, 9775142, 9770028, 9754160, ..."
1,58608,"[2023-04-27T18:48:09.000000, 2023-04-27T18:48:...","[37.0, 61.0, 100.0, 100.0, 55.0, 100.0, 100.0,...","[9739362, 9739179, 9738567, 9739344, 9739202, ...","[2.0, 24.0, 72.0, 65.0, 11.0, 4.0, 101.0, 0.0,...","[9770369, 9769474, 9771523, 9770886, 9769996, ...","[9770369, 9769474, 9771523, 9770886, 9769996, ..."
2,95507,"[2023-04-27T15:20:28.000000, 2023-04-27T15:20:...","[60.0, 100.0, 100.0, 21.0, 29.0, 67.0, 49.0, 5...","[9739035, 9738646, 9634967, 9738902, 9735495, ...","[18.0, 29.0, 51.0, 12.0, 10.0, 10.0, 13.0, 24....","[9777156, 9363981, 9758734, 9777693, 9769504, ...","[9777156, 9363981, 9758734, 9777693, 9769504, ..."
3,106588,"[2023-04-27T08:29:09.000000, 2023-04-27T08:29:...","[24.0, 57.0, 100.0, nan, nan, 100.0, 100.0, 73...","[9738292, 9738216, 9737266, 9737556, 9737657, ...","[9.0, 15.0, 42.0, 9.0, 3.0, 58.0, 26.0, 214.0,...","[9769474, 9268227, 9777156, 9775621, 9773574, ...","[9769474, 9268227, 9777156, 9775621, 9773574, ..."
4,617963,"[2023-04-27T14:42:25.000000, 2023-04-27T14:43:...","[100.0, 100.0, nan, 46.0, 23.0, 19.0, 61.0, 70...","[9739035, 9739088, 9738902, 9738968, 9738760, ...","[45.0, 29.0, 116.0, 26.0, 34.0, 42.0, 58.0, 59...","[9770882, 9774789, 9757574, 9774764, 9774542, ...","[9770882, 9774789, 9757574, 9774764, 9774542, ..."


#### Remove unnecessary columns

In [243]:
# Selecting only the desired columns
train_history_with_unclicked = train_history_with_inview[['user_id', 'article_id_fixed', 'non_clicked_articles']]

# Display the updated dataframe
train_history_with_unclicked.head()

Unnamed: 0,user_id,article_id_fixed,non_clicked_articles
0,13538,"[9738663, 9738569, 9738663, 9738490, 9738663, ...","[9773070, 9773078, 9775142, 9770028, 9754160, ..."
1,58608,"[9739362, 9739179, 9738567, 9739344, 9739202, ...","[9770369, 9769474, 9771523, 9770886, 9769996, ..."
2,95507,"[9739035, 9738646, 9634967, 9738902, 9735495, ...","[9777156, 9363981, 9758734, 9777693, 9769504, ..."
3,106588,"[9738292, 9738216, 9737266, 9737556, 9737657, ...","[9769474, 9268227, 9777156, 9775621, 9773574, ..."
4,617963,"[9739035, 9739088, 9738902, 9738968, 9738760, ...","[9770882, 9774789, 9757574, 9774764, 9774542, ..."


#### Sanity check

In [244]:
# Find rows where an article appears in both article_id_fixed and non_clicked_articles
rows_with_common_elements = train_history_with_unclicked[
    train_history_with_unclicked.apply(
        lambda row: any(item in row['article_id_fixed'] for item in row['non_clicked_articles']),
        axis=1
    )
]

# Check if there are any rows with common elements
if not rows_with_common_elements.empty:
    print("There are rows where article_id_fixed and non_clicked_articles contain the same element.")
    print(rows_with_common_elements)
else:
    print("No rows where article_id_fixed and non_clicked_articles contain the same element.")


No rows where article_id_fixed and non_clicked_articles contain the same element.


#### Generate ratings and restructure data

In [245]:
# Initialize an empty list to store the data
data = []

# Iterate over each row in train_history_with_unclicked
for index, row in train_history_with_unclicked.iterrows():
    # For each item in article_id_fixed (clicked), add a row with rating 1
    for item in row['article_id_fixed']:
        data.append([row['user_id'], item, 1])

    # For each item in non_clicked_articles, add a row with rating 0
    for item in row['non_clicked_articles']:
        data.append([row['user_id'], item, 0])

# Create the final dataframe from the collected data
train_history_with_rating = pd.DataFrame(data, columns=['userID', 'itemID', 'rating'])

# Display the first few rows of the final dataframe
train_history_with_rating.head()

Unnamed: 0,userID,itemID,rating
0,13538,9738663,1
1,13538,9738569,1
2,13538,9738663,1
3,13538,9738490,1
4,13538,9738663,1
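
Iterating with `iterrows` works but can be slow on larger splits. A possible vectorised variant, sketched here on made-up data, uses `DataFrame.explode` instead. The helper `to_long` and the variable names are assumptions for illustration; note that the row order differs from the loop above (all clicks first, then all non-clicks).

```python
import pandas as pd

# Toy wide-format data (values are invented).
wide = pd.DataFrame({
    "user_id": [1, 2],
    "article_id_fixed": [[10, 11], [12]],
    "non_clicked_articles": [[20], [21, 22]],
})

def to_long(df, column, rating):
    """Turn one list column into (userID, itemID, rating) rows."""
    out = df[["user_id", column]].explode(column)
    out = out.dropna(subset=[column])           # empty lists explode to NaN
    out = out.rename(columns={"user_id": "userID", column: "itemID"})
    out["itemID"] = out["itemID"].astype("int64")
    out["rating"] = rating
    return out

long_df = pd.concat(
    [to_long(wide, "article_id_fixed", 1), to_long(wide, "non_clicked_articles", 0)],
    ignore_index=True,
)
print(long_df)
```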


##### Sanity check

In [246]:
# Calculate the total number of items in article_id_fixed and non_clicked_articles columns
total_items = train_history_with_unclicked['article_id_fixed'].apply(len).sum() + train_history_with_unclicked['non_clicked_articles'].apply(len).sum()

# Check if the total number of items is equal to the number of rows in train_history_with_rating
check = len(train_history_with_rating) == total_items

# Print the result of the check
print("Check if the number of rows in data with rating equals\nthe total number of items in article_id_fixed and non_clicked_articles:", check)

Check if the number of rows in data with rating equals
the total number of items in article_id_fixed and non_clicked_articles: True


#### Add topics

In [247]:
# Merge the ratings dataframe with the article data (itemID matches article_id)
train_history_with_topics = pd.merge(train_history_with_rating, news[['article_id', 'topics']], left_on='itemID', right_on='article_id', how='left')

# Drop the redundant article_id column
train_history_with_topics.drop(columns=['article_id'], inplace=True)

# Display the first few rows of the dataframe with topics
train_history_with_topics.head()

Unnamed: 0,userID,itemID,rating,topics
0,13538,9738663,1,"[Erhverv, Privat virksomhed, Ansættelsesforhol..."
1,13538,9738569,1,"[Erhverv, Samfund, Sport, Bæredygtighed og klima]"
2,13538,9738663,1,"[Erhverv, Privat virksomhed, Ansættelsesforhol..."
3,13538,9738490,1,"[Erhverv, Privat virksomhed, Film og tv, Økono..."
4,13538,9738663,1,"[Erhverv, Privat virksomhed, Ansættelsesforhol..."
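
One thing to keep in mind with a left merge: an itemID that does not occur in the article table gets NaN in topics rather than an empty list, and `len(NaN)` raises a `TypeError`. The sketch below, on invented data, shows one way such rows could be filtered together with the empty lists. Whether this case occurs in the EBNeRD demo data is not established here; `has_topics` is a hypothetical helper.

```python
import pandas as pd

# Toy data: item 99 is missing from the article table, item 11 has no topics.
ratings = pd.DataFrame({"userID": [1, 1, 1], "itemID": [10, 11, 99], "rating": [1, 0, 0]})
articles = pd.DataFrame({"article_id": [10, 11], "topics": [["Sport"], []]})

merged = ratings.merge(
    articles,
    left_on="itemID",
    right_on="article_id",
    how="left",
    validate="many_to_one",   # fails loudly if article_id is not unique
).drop(columns=["article_id"])

def has_topics(value):
    # NaN (a float) marks an article that is missing from the article table.
    return not isinstance(value, float) and len(value) > 0

kept = merged[merged["topics"].apply(has_topics)].copy()
print(kept)
```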


#### Remove rows without topics

In [248]:
# Remove rows where the 'topics' column contains empty arrays
# (.copy() avoids a SettingWithCopyWarning when the genre column is added later)
train_history_with_topics = train_history_with_topics[train_history_with_topics['topics'].apply(len) > 0].copy()

# Check if there are any rows where the 'topics' column contains empty arrays after removal
empty_topics_rows_after_removal = train_history_with_topics['topics'].apply(lambda x: len(x) == 0).any()

# Print the result
if empty_topics_rows_after_removal:
    print("There are still rows where the 'topics' column contains empty arrays after removal.")
else:
    print("All rows with empty arrays in the 'topics' column have been successfully removed.")


All rows with empty arrays in the 'topics' column have been successfully removed.


#### Sanity check

In [249]:
# Filter rows where the 'topics' column contains empty arrays
rows_with_empty_topics = train_history_with_topics[train_history_with_topics['topics'].apply(len) == 0]

# Print the rows
if not rows_with_empty_topics.empty:
    print("Rows where the 'topics' column contains empty arrays:")
    print(rows_with_empty_topics)
else:
    print("No rows where the 'topics' column contains empty arrays.")


No rows where the 'topics' column contains empty arrays.


#### Add genre (first element from topics) and remove topics

In [250]:
def extract_genre(topics):
    if len(topics) > 0:
        return topics[0]
    else:
        return None

# Create the new "genre" column
train_history_with_topics['genre'] = train_history_with_topics['topics'].apply(extract_genre)

# Drop the redundant "topics" column
train_history_with_topics.drop(columns=['topics'], inplace=True)

# Display the first few rows of the updated dataframe
train_history_with_topics.head()

Unnamed: 0,userID,itemID,rating,genre
0,13538,9738663,1,Erhverv
1,13538,9738569,1,Erhverv
2,13538,9738663,1,Erhverv
3,13538,9738490,1,Erhverv
4,13538,9738663,1,Erhverv
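
For reference, taking the first element of each list can also be sketched with the pandas `.str` accessor, which indexes into list-like values as well as strings. This is an alternative to `extract_genre`, shown on made-up data; an empty list yields NaN instead of None.

```python
import pandas as pd

# Toy topics column (values are invented; the last article has no topics).
topics = pd.Series([["Erhverv", "Sport"], ["Kendt"], []])

# .str[0] picks the first element of each list; empty lists become NaN.
genre = topics.str[0]
print(genre)
```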


#### Remove any duplicate rows

In [251]:
# Check for duplicate rows
duplicate_rows = train_history_with_topics.duplicated().sum()

# Output the number of duplicate rows
print("Number of duplicate rows:", duplicate_rows)

# Remove duplicate rows from the dataframe
train_history_with_topics = train_history_with_topics.drop_duplicates()

# Reset the indices
train_history_with_topics.reset_index(drop=True, inplace=True)

# Output the updated dataframe
print("Updated dataframe:")
print(train_history_with_topics)

Number of duplicate rows: 39318
Updated dataframe:
         userID   itemID  rating         genre
0         13538  9738663       1       Erhverv
1         13538  9738569       1       Erhverv
2         13538  9738490       1       Erhverv
3         13538  9738667       1       Erhverv
4         13538  9738528       1      Livsstil
...         ...      ...     ...           ...
413154  2539047  9773045       0       Erhverv
413155  2539047  9770102       0       Erhverv
413156  2539047  9772659       0       Økonomi
413157  2539047  9773306       0         Kendt
413158  2539047  9773307       0  Kriminalitet

[413159 rows x 4 columns]


#### Sanity check

In [252]:
# Group the dataframe by userID and itemID and count occurrences
duplicate_rows = train_history_with_topics.groupby(['userID', 'itemID']).size().reset_index(name='count')

# Filter out rows where count is greater than 1
duplicate_rows = duplicate_rows[duplicate_rows['count'] > 1]

# Output the duplicate rows
print("Duplicate rows with matching userID and itemID:")
print(duplicate_rows)

Duplicate rows with matching userID and itemID:
Empty DataFrame
Columns: [userID, itemID, count]
Index: []
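
Note that `drop_duplicates()` without arguments only removes rows that are identical in every column, while the check above looks at the userID-itemID pair alone. The toy example below (invented values) illustrates the difference and shows how `duplicated(subset=...)` could be used to list conflicting pairs directly.

```python
import pandas as pd

# Toy data: user 1 has item 10 twice, once as click and once as non-click.
df = pd.DataFrame({
    "userID": [1, 1, 2],
    "itemID": [10, 10, 10],
    "rating": [1, 0, 1],
})

# Full-row de-duplication keeps both rows for (1, 10) because the ratings differ.
full_row_dedup = df.drop_duplicates()

# keep=False marks every row that shares its (userID, itemID) pair with another row.
conflicts = df[df.duplicated(subset=["userID", "itemID"], keep=False)]
print(conflicts)
```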


### Save data

In [253]:
# Path of the output CSV file (the exported_data directory must already exist)
file_path = "exported_data/train_data.csv"

# Save the train_history_with_topics dataframe to a CSV file
train_history_with_topics.to_csv(file_path, index=False)

print("Data saved to", file_path)

Data saved to exported_data/train_data.csv
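
`to_csv` does not create missing directories, so the export fails if the target folder does not exist yet. A small sketch of creating it first is shown below; it writes to a temporary location and uses invented data, so the paths are assumptions rather than the notebook's actual output location.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy result table (values are invented).
df = pd.DataFrame({"userID": [1, 2], "itemID": [10, 11], "rating": [1, 0], "genre": ["Sport", "Kendt"]})

# Use a temporary base directory for this sketch instead of the working directory.
out_dir = Path(tempfile.mkdtemp()) / "exported_data"
out_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing

file_path = out_dir / "train_data.csv"
df.to_csv(file_path, index=False)

# Read the file back to confirm the export round-trips.
reloaded = pd.read_csv(file_path)
print(reloaded)
```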


### Repeat the same steps for the validation data

In [254]:
valid_history = pd.read_parquet("./ebnerd_demo/validation/history.parquet")
valid_history.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed
0,750497,"[2023-05-04T09:42:39.000000, 2023-05-04T09:43:...","[100.0, 65.0, 100.0, 100.0, 100.0, 100.0, 100....","[9749224, 9749156, 9749224, 9748948, 9748980, ...","[49.0, 5.0, 7.0, 151.0, 214.0, 199.0, 22.0, 64..."
1,22779,"[2023-05-04T07:53:42.000000, 2023-05-04T15:59:...","[52.0, 39.0, 62.0, 38.0, 74.0, 19.0, 30.0, 56....","[9749025, 9750090, 9750015, 9750161, 9745750, ...","[4.0, 16.0, 2.0, 9.0, 40.0, 7.0, 9.0, 8.0, 18...."
2,373598,"[2023-05-04T07:51:58.000000, 2023-05-04T09:59:...","[nan, nan, nan, 59.0, 33.0, 75.0, nan, nan, 76...","[9514481, 9514481, 9111040, 9750389, 9750307, ...","[0.0, 0.0, 0.0, 3.0, 9.0, 117.0, 39.0, 0.0, 8...."
3,383378,"[2023-05-04T07:27:57.000000, 2023-05-04T07:29:...","[100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100...","[9747490, 9749036, 9749025, 9748792, 9748592, ...","[85.0, 18.0, 133.0, 191.0, 331.0, 56.0, 43.0, ..."
4,411733,"[2023-05-04T17:09:09.000000, 2023-05-04T17:09:...","[20.0, 14.0, 61.0, 55.0, 21.0, 81.0, 100.0, 10...","[9750081, 9750111, 9750039, 9749948, 9749729, ...","[2.0, 4.0, 6.0, 9.0, 1.0, 30.0, 37.0, 5.0, 3.0..."


In [255]:
valid_behaviour = pd.read_parquet("./ebnerd_demo/validation/behaviors.parquet")
valid_behaviour.head()

Unnamed: 0,impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
0,144772,,2023-05-30 14:21:34,29.0,,2,"[9788239, 9780702, 9553264, 9787499, 6741781, ...",[9783042],76658,False,,,,False,29,7.0,59.0
1,144777,,2023-05-30 14:22:11,10.0,,2,"[9788521, 9786217, 9553264, 9788361, 9788352, ...",[9788125],76658,False,,,,False,29,58.0,98.0
2,196487,,2023-05-27 19:54:18,16.0,,2,"[9279095, 9784273, 9784275, 9784506, 9784444, ...",[9782806],760446,False,,,,False,220,64.0,100.0
3,196495,,2023-05-27 19:53:48,25.0,,2,"[9784575, 9784607, 9784559, 9784662, 9783852, ...",[9782656],760446,False,,,,False,220,4.0,56.0
4,196496,,2023-05-27 19:56:28,11.0,,2,"[9784137, 9784298, 9779370, 9782517, 9777324, ...",[9777324],760446,False,,,,False,220,65.0,87.0


In [256]:
valid_inview_temp = valid_behaviour.groupby('user_id')['article_ids_inview'].agg(lambda x: list(set(y for sublist in x for y in sublist))).reset_index()
valid_inview_temp.columns = ['user_id', 'article_ids_inview']
valid_inview_temp.head()

Unnamed: 0,user_id,article_ids_inview
0,19181,"[8952833, 9533957, 9789446, 9786378, 9786381, ..."
1,21271,"[9785350, 9786378, 9339920, 9783824, 9784852, ..."
2,21774,"[9785986, 9785604, 9785992, 9783824, 9785500, ..."
3,22779,"[9786378, 9786381, 9784852, 9782806, 9784344, ..."
4,22895,"[9754112, 9785349, 9533957, 9784839, 9781257, ..."


In [257]:
valid_history_with_inview = pd.merge(valid_history, valid_inview_temp, on='user_id', how='left')
valid_history_with_inview.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed,article_ids_inview
0,750497,"[2023-05-04T09:42:39.000000, 2023-05-04T09:43:...","[100.0, 65.0, 100.0, 100.0, 100.0, 100.0, 100....","[9749224, 9749156, 9749224, 9748948, 9748980, ...","[49.0, 5.0, 7.0, 151.0, 214.0, 199.0, 22.0, 64...","[9783296, 9766912, 9785350, 9781257, 9781262, ..."
1,22779,"[2023-05-04T07:53:42.000000, 2023-05-04T15:59:...","[52.0, 39.0, 62.0, 38.0, 74.0, 19.0, 30.0, 56....","[9749025, 9750090, 9750015, 9750161, 9745750, ...","[4.0, 16.0, 2.0, 9.0, 40.0, 7.0, 9.0, 8.0, 18....","[9786378, 9786381, 9784852, 9782806, 9784344, ..."
2,373598,"[2023-05-04T07:51:58.000000, 2023-05-04T09:59:...","[nan, nan, nan, 59.0, 33.0, 75.0, nan, nan, 76...","[9514481, 9514481, 9111040, 9750389, 9750307, ...","[0.0, 0.0, 0.0, 3.0, 9.0, 117.0, 39.0, 0.0, 8....","[9781763, 9533957, 9785349, 9786378, 9790987, ..."
3,383378,"[2023-05-04T07:27:57.000000, 2023-05-04T07:29:...","[100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100...","[9747490, 9749036, 9749025, 9748792, 9748592, ...","[85.0, 18.0, 133.0, 191.0, 331.0, 56.0, 43.0, ...","[9754112, 9783296, 9774595, 9268227, 9785350, ..."
4,411733,"[2023-05-04T17:09:09.000000, 2023-05-04T17:09:...","[20.0, 14.0, 61.0, 55.0, 21.0, 81.0, 100.0, 10...","[9750081, 9750111, 9750039, 9749948, 9749729, ...","[2.0, 4.0, 6.0, 9.0, 1.0, 30.0, 37.0, 5.0, 3.0...","[9784839, 9785868, 9339920, 9782806, 9784856, ..."


In [258]:
valid_history_with_inview['non_clicked_articles'] = valid_history_with_inview.apply(lambda row: get_non_clicked_articles(row['article_ids_inview'], row['article_id_fixed']), axis=1)
valid_history_with_inview.head()

Unnamed: 0,user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed,article_ids_inview,non_clicked_articles
0,750497,"[2023-05-04T09:42:39.000000, 2023-05-04T09:43:...","[100.0, 65.0, 100.0, 100.0, 100.0, 100.0, 100....","[9749224, 9749156, 9749224, 9748948, 9748980, ...","[49.0, 5.0, 7.0, 151.0, 214.0, 199.0, 22.0, 64...","[9783296, 9766912, 9785350, 9781257, 9781262, ...","[9783296, 9766912, 9785350, 9781257, 9781262, ..."
1,22779,"[2023-05-04T07:53:42.000000, 2023-05-04T15:59:...","[52.0, 39.0, 62.0, 38.0, 74.0, 19.0, 30.0, 56....","[9749025, 9750090, 9750015, 9750161, 9745750, ...","[4.0, 16.0, 2.0, 9.0, 40.0, 7.0, 9.0, 8.0, 18....","[9786378, 9786381, 9784852, 9782806, 9784344, ...","[9786378, 9786381, 9784852, 9782806, 9784344, ..."
2,373598,"[2023-05-04T07:51:58.000000, 2023-05-04T09:59:...","[nan, nan, nan, 59.0, 33.0, 75.0, nan, nan, 76...","[9514481, 9514481, 9111040, 9750389, 9750307, ...","[0.0, 0.0, 0.0, 3.0, 9.0, 117.0, 39.0, 0.0, 8....","[9781763, 9533957, 9785349, 9786378, 9790987, ...","[9781763, 9533957, 9785349, 9786378, 9790987, ..."
3,383378,"[2023-05-04T07:27:57.000000, 2023-05-04T07:29:...","[100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100...","[9747490, 9749036, 9749025, 9748792, 9748592, ...","[85.0, 18.0, 133.0, 191.0, 331.0, 56.0, 43.0, ...","[9754112, 9783296, 9774595, 9268227, 9785350, ...","[9754112, 9783296, 9268227, 9785350, 9781257, ..."
4,411733,"[2023-05-04T17:09:09.000000, 2023-05-04T17:09:...","[20.0, 14.0, 61.0, 55.0, 21.0, 81.0, 100.0, 10...","[9750081, 9750111, 9750039, 9749948, 9749729, ...","[2.0, 4.0, 6.0, 9.0, 1.0, 30.0, 37.0, 5.0, 3.0...","[9784839, 9785868, 9339920, 9782806, 9784856, ...","[9784839, 9785868, 9339920, 9782806, 9784856, ..."


In [259]:
# Selecting only the desired columns
valid_history_with_unclicked = valid_history_with_inview[['user_id', 'article_id_fixed', 'non_clicked_articles']]

# Display the updated dataframe
valid_history_with_unclicked.head()

Unnamed: 0,user_id,article_id_fixed,non_clicked_articles
0,750497,"[9749224, 9749156, 9749224, 9748948, 9748980, ...","[9783296, 9766912, 9785350, 9781257, 9781262, ..."
1,22779,"[9749025, 9750090, 9750015, 9750161, 9745750, ...","[9786378, 9786381, 9784852, 9782806, 9784344, ..."
2,373598,"[9514481, 9514481, 9111040, 9750389, 9750307, ...","[9781763, 9533957, 9785349, 9786378, 9790987, ..."
3,383378,"[9747490, 9749036, 9749025, 9748792, 9748592, ...","[9754112, 9783296, 9268227, 9785350, 9781257, ..."
4,411733,"[9750081, 9750111, 9750039, 9749948, 9749729, ...","[9784839, 9785868, 9339920, 9782806, 9784856, ..."


In [260]:
data = []

for index, row in valid_history_with_unclicked.iterrows():
    for item in row['article_id_fixed']:
        data.append([row['user_id'], item, 1])
    
    for item in row['non_clicked_articles']:
        data.append([row['user_id'], item, 0])

valid_history_with_rating = pd.DataFrame(data, columns=['userID', 'itemID', 'rating'])
valid_history_with_rating.head()

Unnamed: 0,userID,itemID,rating
0,750497,9749224,1
1,750497,9749156,1
2,750497,9749224,1
3,750497,9748948,1
4,750497,9748980,1


In [261]:
valid_history_with_topics = pd.merge(valid_history_with_rating, news[['article_id', 'topics']], left_on='itemID', right_on='article_id', how='left')
valid_history_with_topics.drop(columns=['article_id'], inplace=True)
valid_history_with_topics.head()

Unnamed: 0,userID,itemID,rating,topics
0,750497,9749224,1,"[Kriminalitet, Transportmiddel, Bil]"
1,750497,9749156,1,"[Kriminalitet, Personfarlig kriminalitet]"
2,750497,9749224,1,"[Kriminalitet, Transportmiddel, Bil]"
3,750497,9748948,1,"[Erhverv, Kendt, Sport, Fodbold, Ansættelsesfo..."
4,750497,9748980,1,"[Politik, International politik, Konflikt og k..."


In [262]:
# Remove rows where the 'topics' column contains empty arrays
# (.copy() avoids a SettingWithCopyWarning when the genre column is added later)
valid_history_with_topics = valid_history_with_topics[valid_history_with_topics['topics'].apply(len) > 0].copy()

# Check if there are any rows where the 'topics' column contains empty arrays after removal
empty_topics_rows_after_removal = valid_history_with_topics['topics'].apply(lambda x: len(x) == 0).any()

# Print the result
if empty_topics_rows_after_removal:
    print("There are still rows where the 'topics' column contains empty arrays after removal.")
else:
    print("All rows with empty arrays in the 'topics' column have been successfully removed.")

All rows with empty arrays in the 'topics' column have been successfully removed.


In [263]:
valid_history_with_topics['genre'] = valid_history_with_topics['topics'].apply(extract_genre)
valid_history_with_topics.drop(columns=['topics'], inplace=True)
valid_history_with_topics.head()

Unnamed: 0,userID,itemID,rating,genre
0,750497,9749224,1,Kriminalitet
1,750497,9749156,1,Kriminalitet
2,750497,9749224,1,Kriminalitet
3,750497,9748948,1,Erhverv
4,750497,9748980,1,Politik


In [264]:
# Check for duplicate rows
duplicate_rows = valid_history_with_topics.duplicated().sum()

# Output the number of duplicate rows
print("Number of duplicate rows:", duplicate_rows)

# Remove duplicate rows from the dataframe
valid_history_with_topics = valid_history_with_topics.drop_duplicates()

# Reset the indices
valid_history_with_topics.reset_index(drop=True, inplace=True)

# Output the updated dataframe
print("Updated dataframe:")
print(valid_history_with_topics)

Number of duplicate rows: 35561
Updated dataframe:
         userID   itemID  rating         genre
0        750497  9749224       1  Kriminalitet
1        750497  9749156       1  Kriminalitet
2        750497  9748948       1       Erhverv
3        750497  9748980       1       Politik
4        750497  9748792       1  Kriminalitet
...         ...      ...     ...           ...
414230  2385386  9782103       0       Erhverv
414231  2385386  7594265       0      Livsstil
414232  2385386  9782108       0    Begivenhed
414233  2385386  9782205       0  Kriminalitet
414234  2385386  9782046       0         Kendt

[414235 rows x 4 columns]
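
The cell above drops rows that are identical in every column. A `userID`/`itemID` pair that appears once with rating 1 and once with rating 0 would survive that check. The sketch below, on a toy dataframe with made-up values, shows one way to deduplicate on the key pair instead. Letting a click win over a non-click is an assumption made here for illustration, not a rule taken from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values): the pair (1, 10) appears with both ratings.
toy = pd.DataFrame({
    "userID": [1, 1, 1, 2],
    "itemID": [10, 10, 11, 10],
    "rating": [0, 1, 1, 0],
    "genre":  ["Sport", "Sport", "Politik", "Sport"],
})

# Full-row deduplication keeps both (1, 10) rows because their ratings differ.
full_row = toy.drop_duplicates()

# Assumption: for a repeated pair, keep the row with the highest rating (a click).
by_pair = (
    toy.sort_values("rating", ascending=False)
       .drop_duplicates(subset=["userID", "itemID"], keep="first")
       .sort_values(["userID", "itemID"])
       .reset_index(drop=True)
)

print("rows after full-row dedup:", len(full_row))
print("rows after key-pair dedup:", len(by_pair))
```

Whether to keep the click, the non-click, or the most recent interaction depends on what the rating is supposed to mean downstream, so the `sort_values` step is the part to adapt.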


#### Removing common rows between training and validation from the validation dataset

In [265]:
# Find (userID, itemID) pairs that occur in both the validation and the training data
common_rows = pd.merge(valid_history_with_topics, train_history_with_topics, on=['userID', 'itemID'], how='inner')

if not common_rows.empty:
    print("There are common rows between the two dataframes.")
    print(common_rows)
else:
    print("There are no common rows between the two dataframes.")

# Remove common rows based on userID and itemID combination
common_keys = common_rows.set_index(['userID', 'itemID']).index
valid_keys = valid_history_with_topics.set_index(['userID', 'itemID']).index
valid_history_without_matches = valid_history_with_topics[~valid_keys.isin(common_keys)]

# Reset the index (reassigning avoids modifying a filtered slice in place)
valid_history_without_matches = valid_history_without_matches.reset_index(drop=True)

print(valid_history_without_matches)



There are common rows between the two dataframes.
        userID   itemID  rating_x       genre_x  rating_y       genre_y
0       750497  9749224         1  Kriminalitet         1  Kriminalitet
1       750497  9749156         1  Kriminalitet         1  Kriminalitet
2       750497  9748948         1       Erhverv         1       Erhverv
3       750497  9748980         1       Politik         1       Politik
4       750497  9748792         1  Kriminalitet         1  Kriminalitet
...        ...      ...       ...           ...       ...           ...
169489   40071  9778413         1  Kriminalitet         0  Kriminalitet
169490   40071  9778326         1         Kendt         0         Kendt
169491   40071  9778257         1       Erhverv         0       Erhverv
169492   40071  9778902         1       Erhverv         0       Erhverv
169493   40071  9779860         1       Politik         0       Politik

[169494 rows x 6 columns]
         userID   itemID  rating            genre
0        

#### Sanity check

In [266]:
common_rows_after_drop = pd.merge(valid_history_without_matches, train_history_with_topics, on=['userID', 'itemID'], how='inner')

if not common_rows_after_drop.empty:
    print("There are still common rows between the two dataframes after dropping.")
    print(common_rows_after_drop)
else:
    print("There are no common rows between the two dataframes after dropping.")

There are no common rows between the two dataframes after dropping.
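
The removal and the sanity check can also be expressed as a single anti-join using the `indicator` option of `DataFrame.merge`. This is a minimal sketch on toy frames with hypothetical values, not the notebook's own method; `train`, `valid`, and `valid_only` are illustrative names.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the training and validation frames.
train = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 11, 10], "rating": [1, 0, 1]})
valid = pd.DataFrame({"userID": [1, 2, 3], "itemID": [10, 12, 10], "rating": [1, 0, 1]})

keys = ["userID", "itemID"]

# Left-join against the distinct training keys; `_merge` records where each row was found.
marked = valid.merge(train[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# Keep only validation rows whose key pair never occurs in the training data.
valid_only = (
    marked[marked["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

print(valid_only)
```

Merging against only the key columns avoids the `rating_x`/`rating_y` suffixes seen in the inner-join output, and dropping duplicate keys first keeps the left join from multiplying validation rows.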


In [267]:
file_path = "exported_data/valid_data.csv"
# Save the validation data with the train/validation overlaps removed
valid_history_without_matches.to_csv(file_path, index=False)
print("Data saved to", file_path)

Data saved to exported_data/valid_data.csv


### Combine train and valid data (not used, but kept as a utility)

#### Combine the first 50,000 training rows and the first 50,000 validation rows into a smaller dataset

In [268]:
SPLITNUM = 50000

# Select the first 50,000 rows of train_history_with_topics
train_subset = train_history_with_topics.iloc[:SPLITNUM]

# Select the first 50,000 rows of valid_history_with_topics
valid_subset = valid_history_with_topics.iloc[:SPLITNUM]

# Concatenate the subsets
combined_df = pd.concat([train_subset, valid_subset], ignore_index=True)

# Output the index where valid_history_with_topics starts
valid_start_index = len(train_subset)
print("Index where valid_history_with_topics starts:", valid_start_index)

Index where valid_history_with_topics starts: 50000


#### Sanity check

In [269]:
# Output the row at valid_start_index in combined_df
row_valid_start = combined_df.iloc[valid_start_index]
print("Row at valid_start_index in combined_df:")
print(row_valid_start)

# Output the first row of valid_history_with_topics
first_row_valid = valid_history_with_topics.iloc[0]
print("\nFirst row of valid_history_with_topics:")
print(first_row_valid)

Row at valid_start_index in combined_df:
userID          750497
itemID         9749224
rating               1
genre     Kriminalitet
Name: 50000, dtype: object

First row of valid_history_with_topics:
userID          750497
itemID         9749224
rating               1
genre     Kriminalitet
Name: 0, dtype: object
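
Tracking the boundary by position works as long as the combined frame is never shuffled or filtered. A hedged alternative is to tag each row with its origin before concatenating, so the split can be recovered later. In the sketch below the `split` column name and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the two subsets.
train_part = pd.DataFrame({"userID": [1, 2], "itemID": [10, 11], "rating": [1, 0]})
valid_part = pd.DataFrame({"userID": [3], "itemID": [12], "rating": [1]})

# `split` is a hypothetical column that records where each row came from.
combined = pd.concat(
    [train_part.assign(split="train"), valid_part.assign(split="valid")],
    ignore_index=True,
)

# The first validation row can be found from the label instead of a stored offset.
first_valid_pos = int(combined.index[combined["split"] == "valid"][0])

print(combined)
print("First validation row at index:", first_valid_pos)
```

The extra column would need to be dropped again before the data is passed to a model that expects only `userID`, `itemID`, `rating`, and `genre`.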


#### Save combined data

In [270]:
file_path = "exported_data/combined_data.csv"
combined_df.to_csv(file_path, index=False)
print("Data saved to", file_path)

Data saved to exported_data/combined_data.csv
