# Imports and basic setup

Content

The dataset contains a sample of user interactions (page views) on the G1 news portal from Oct. 1 to 16, 2017, including about 3 million clicks, distributed over more than 1 million sessions from 314,000 users who read more than 46,000 different news articles during that period.

It is composed of three files/folders:

clicks.zip - Folder with CSV files (one per hour), containing user sessions interactions in the news portal.


articles_metadata.csv - CSV file with metadata information about all (364047) published articles


articles_embeddings.pickle - Pickle (Python 3) of a NumPy matrix containing the Article Content Embeddings (250-dimensional vectors), trained on the articles' text and metadata by CHAMELEON's ACR module (see paper for details) for 364047 published articles.


P.s. The full text of news articles could not be provided due to license restrictions, but those embeddings can be used by Neura

In [1]:
import pandas as pd

In [2]:
import pickle

In [3]:
import numpy as np

In [4]:
pickle_file = 'news-portal-user-interactions-by-globocom/articles_embeddings.pickle'

articles_metadata_file = 'news-portal-user-interactions-by-globocom/articles_metadata.csv'

clicks_sample_file = 'news-portal-user-interactions-by-globocom/clicks_sample.zip'

# Data exploration

## Articles Metadata

In [5]:
articles_metadata = pd.read_csv(articles_metadata_file)

In [6]:
articles_metadata.head()

Unnamed: 0,article_id,category_id,created_at_ts,publisher_id,words_count
0,0,0,1513144419000,0,168
1,1,1,1405341936000,0,189
2,2,1,1408667706000,0,250
3,3,1,1408468313000,0,230
4,4,1,1407071171000,0,162


In [7]:
articles_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364047 entries, 0 to 364046
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   article_id     364047 non-null  int64
 1   category_id    364047 non-null  int64
 2   created_at_ts  364047 non-null  int64
 3   publisher_id   364047 non-null  int64
 4   words_count    364047 non-null  int64
dtypes: int64(5)
memory usage: 13.9 MB


In [8]:
articles_metadata.describe()

Unnamed: 0,article_id,category_id,created_at_ts,publisher_id,words_count
count,364047.0,364047.0,364047.0,364047.0,364047.0
mean,182023.0,283.108239,1474070000000.0,0.0,190.897727
std,105091.461061,136.72347,42930380000.0,0.0,59.502766
min,0.0,0.0,1159356000000.0,0.0,0.0
25%,91011.5,199.0,1444925000000.0,0.0,159.0
50%,182023.0,301.0,1489422000000.0,0.0,186.0
75%,273034.5,399.0,1509891000000.0,0.0,218.0
max,364046.0,460.0,1520943000000.0,0.0,6690.0
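The `created_at_ts` values are 13-digit integers, which suggests Unix timestamps in milliseconds. The sketch below converts them to readable dates under that assumption; the toy frame `sample` is hypothetical and only reuses the first two values shown by `head()`.

```python
import pandas as pd

# Toy frame reusing the first two created_at_ts values shown by head().
sample = pd.DataFrame({
    "article_id": [0, 1],
    "created_at_ts": [1513144419000, 1405341936000],
})

# Assumption: created_at_ts is a Unix epoch expressed in milliseconds.
sample["created_at"] = pd.to_datetime(sample["created_at_ts"], unit="ms")

print(sample[["article_id", "created_at"]])
```

The same call could be applied to `articles_metadata["created_at_ts"]` if the millisecond assumption holds for the full file.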


### Duplicate check

In [2]:
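# Count fully duplicated rows. This cell needs articles_metadata from In [5];
# the NameError below comes from running it before that load.
articles_metadata.duplicated().sum()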

NameError: name 'articles_metadata' is not defined

In [None]:
articles_metadata['article_id'].duplicated().sum()

np.int64(0)

I could have guessed this from the summary statistics above (article_id runs evenly from 0 to 364046 over 364047 rows), but it is still worth checking.
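The intuition from the summary statistics can be made explicit: if the IDs are unique integers with minimum 0 and maximum n-1 over n rows, they are exactly 0 to n-1. A minimal sketch of that check follows; the helper `ids_are_contiguous` and the toy data are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd


def ids_are_contiguous(ids: pd.Series) -> bool:
    """Return True if the integer ids are exactly 0, 1, ..., n-1 in some order."""
    return bool(ids.is_unique and ids.min() == 0 and ids.max() == len(ids) - 1)


# Toy data standing in for articles_metadata["article_id"].
toy = pd.DataFrame({"article_id": np.arange(5)})

print(ids_are_contiguous(toy["article_id"]))        # contiguous ids
print(ids_are_contiguous(pd.Series([0, 1, 1])))     # duplicated id
print(ids_are_contiguous(pd.Series([0, 2, 3])))     # gap in the ids
```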

### Empty articles?

In [11]:
articles_metadata[articles_metadata["words_count"] == 0].shape

(35, 5)

## Embeddings

In [12]:
# Shape of the embeddings
with open(pickle_file, "rb") as f:
    embeddings = pickle.load(f)   # should be a np.ndarray

print("Shape embeddings :", embeddings.shape)

# Shape of the metadata
print("Shape metadata :", articles_metadata.shape)


Shape embeddings : (364047, 250)
Shape metadata : (364047, 5)
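Both objects have 364047 rows, which is consistent with one embedding per article. The sketch below shows how that correspondence might be checked and used, assuming row i of the matrix belongs to `article_id` i. That ordering is an assumption, and the toy arrays and the `embedding_for` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for articles_metadata and embeddings (sizes are illustrative).
toy_metadata = pd.DataFrame({"article_id": [0, 1, 2, 3]})
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 250)).astype(np.float32)

# One embedding row per metadata row.
assert toy_embeddings.shape[0] == len(toy_metadata)

# Assumption: row i of the matrix belongs to article_id i.
rows_match_ids = np.array_equal(
    toy_metadata["article_id"].to_numpy(), np.arange(len(toy_metadata))
)


def embedding_for(article_id: int) -> np.ndarray:
    """Look up the embedding vector for an article under the assumption above."""
    return toy_embeddings[article_id]


print(rows_match_ids, embedding_for(2).shape)
```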


## Clicks

In [13]:
unclick = pd.read_csv('news-portal-user-interactions-by-globocom/clicks/clicks_hour_034.csv')

In [14]:
unclick.head()

Unnamed: 0,user_id,session_id,session_start,session_size,click_article_id,click_timestamp,click_environment,click_deviceGroup,click_os,click_country,click_region,click_referrer_type
0,57453,1506947781355815,1506947781000,2,59758,1506947938747,4,1,17,1,25,2
1,57453,1506947781355815,1506947781000,2,336429,1506947968747,4,1,17,1,25,2
2,57454,1506947782193816,1506947782000,2,160974,1506948442748,4,3,2,1,24,6
3,57454,1506947782193816,1506947782000,2,158536,1506948472748,4,3,2,1,24,6
4,14928,1506947782202817,1506947782000,2,160974,1506948022263,4,1,17,1,21,1


In [15]:
unclick.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16209 entries, 0 to 16208
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              16209 non-null  int64
 1   session_id           16209 non-null  int64
 2   session_start        16209 non-null  int64
 3   session_size         16209 non-null  int64
 4   click_article_id     16209 non-null  int64
 5   click_timestamp      16209 non-null  int64
 6   click_environment    16209 non-null  int64
 7   click_deviceGroup    16209 non-null  int64
 8   click_os             16209 non-null  int64
 9   click_country        16209 non-null  int64
 10  click_region         16209 non-null  int64
 11  click_referrer_type  16209 non-null  int64
dtypes: int64(12)
memory usage: 1.5 MB


In [16]:
unclick.describe()

Unnamed: 0,user_id,session_id,session_start,session_size,click_article_id,click_timestamp,click_environment,click_deviceGroup,click_os,click_country,click_region,click_referrer_type
count,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0,16209.0
mean,52831.265285,1506950000000000.0,1506950000000.0,3.799186,178677.879511,1506953000000.0,3.968844,1.402369,15.212104,1.330495,18.220063,1.840953
std,15357.881936,1021561000.0,1021565.0,3.433163,76798.315924,12519630.0,0.256849,0.833933,5.050295,1.676233,7.139908,1.160811
min,34.0,1506948000000000.0,1506948000000.0,2.0,1899.0,1506948000000.0,1.0,1.0,2.0,1.0,1.0,1.0
25%,57609.0,1506949000000000.0,1506949000000.0,2.0,158536.0,1506950000000.0,4.0,1.0,17.0,1.0,13.0,1.0
50%,59075.0,1506950000000000.0,1506950000000.0,3.0,160974.0,1506951000000.0,4.0,1.0,17.0,1.0,21.0,2.0
75%,60569.0,1506950000000000.0,1506950000000.0,4.0,207731.0,1506952000000.0,4.0,1.0,17.0,1.0,25.0,2.0
max,62108.0,1506951000000000.0,1506951000000.0,28.0,363976.0,1507294000000.0,4.0,5.0,20.0,11.0,28.0,7.0


# Data cleaning

In [17]:
# Keep only articles with at least one word
mask = articles_metadata["words_count"] > 0
articles_clean = articles_metadata[mask].reset_index(drop=True)

# Keep the matching embedding rows
# (assumes row i of embeddings corresponds to row i of articles_metadata)
embeddings_clean = embeddings[mask.values]
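One consequence of this filtering is that `article_id` no longer equals the row position in the cleaned objects. The sketch below illustrates this on toy data and keeps an explicit ID-to-row lookup; the names `toy_metadata`, `toy_embeddings` and `row_of_article` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: articles 1 and 3 are empty (words_count == 0).
toy_metadata = pd.DataFrame({
    "article_id": [0, 1, 2, 3],
    "words_count": [120, 0, 95, 0],
})
toy_embeddings = np.arange(4 * 3, dtype=float).reshape(4, 3)

# Same boolean mask applied to the metadata and to the embedding rows.
mask = toy_metadata["words_count"] > 0
toy_clean = toy_metadata[mask].reset_index(drop=True)
toy_embeddings_clean = toy_embeddings[mask.to_numpy()]

# After filtering, article_id 2 sits at row 1, so keep an explicit lookup.
row_of_article = {aid: row for row, aid in enumerate(toy_clean["article_id"])}
vector_for_article_2 = toy_embeddings_clean[row_of_article[2]]

print(row_of_article)
print(vector_for_article_2)
```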


In [18]:
import os

In [None]:
click_csvs = os.listdir('news-portal-user-interactions-by-globocom/clicks/')

In [20]:
print(click_csvs)

['clicks_hour_367.csv', 'clicks_hour_314.csv', 'clicks_hour_184.csv', 'clicks_hour_277.csv', 'clicks_hour_055.csv', 'clicks_hour_298.csv', 'clicks_hour_268.csv', 'clicks_hour_378.csv', 'clicks_hour_197.csv', 'clicks_hour_083.csv', 'clicks_hour_095.csv', 'clicks_hour_015.csv', 'clicks_hour_037.csv', 'clicks_hour_012.csv', 'clicks_hour_071.csv', 'clicks_hour_163.csv', 'clicks_hour_005.csv', 'clicks_hour_090.csv', 'clicks_hour_014.csv', 'clicks_hour_025.csv', 'clicks_hour_266.csv', 'clicks_hour_302.csv', 'clicks_hour_376.csv', 'clicks_hour_354.csv', 'clicks_hour_195.csv', 'clicks_hour_153.csv', 'clicks_hour_143.csv', 'clicks_hour_187.csv', 'clicks_hour_342.csv', 'clicks_hour_165.csv', 'clicks_hour_267.csv', 'clicks_hour_270.csv', 'clicks_hour_340.csv', 'clicks_hour_046.csv', 'clicks_hour_060.csv', 'clicks_hour_166.csv', 'clicks_hour_081.csv', 'clicks_hour_096.csv', 'clicks_hour_307.csv', 'clicks_hour_154.csv', 'clicks_hour_006.csv', 'clicks_hour_074.csv', 'clicks_hour_213.csv', 'clicks_ho
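`os.listdir` returns entries in arbitrary order, as the listing above shows. Since the file names are zero-padded, sorting them by name should also put them in hour order. A minimal sketch, where `list_click_files` is a hypothetical helper and the demo uses a temporary folder rather than the real dataset:

```python
import os
import tempfile


def list_click_files(folder: str) -> list[str]:
    """Return the CSV file names found in folder, sorted by name."""
    return sorted(name for name in os.listdir(folder) if name.endswith(".csv"))


# Demo on a temporary folder; the file names only mimic the dataset's pattern.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["clicks_hour_010.csv", "clicks_hour_002.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    demo_files = list_click_files(tmp)

print(demo_files)
```

Filtering on the `.csv` extension also guards against stray files in the folder being passed to `pd.read_csv`.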

In [25]:
articles_vides = articles_metadata.loc[articles_metadata["words_count"] == 0, "article_id"]

In [26]:
print(articles_vides)

35491      35491
38472      38472
39043      39043
39054      39054
164414    164414
206233    206233
212323    212323
212324    212324
212327    212327
212526    212526
233934    233934
236119    236119
248114    248114
275183    275183
315386    315386
315387    315387
315388    315388
315389    315389
315390    315390
315391    315391
315392    315392
315393    315393
315394    315394
315395    315395
315396    315396
315397    315397
315398    315398
315788    315788
315789    315789
315790    315790
315791    315791
315792    315792
315793    315793
328318    328318
333154    333154
Name: article_id, dtype: int64
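These IDs can be used to flag clicks that point at empty articles. The sketch below shows the `isin` test on toy data, along with one possible way of dropping the flagged clicks; whether to drop them is a modelling choice, and all names in the block are hypothetical.

```python
import pandas as pd

# Toy stand-ins: ids of empty articles and a small clicks table.
toy_empty_ids = pd.Series([1, 3], name="article_id")
toy_clicks = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "click_article_id": [0, 1, 2, 3],
})

# True where the clicked article is one of the empty ones.
hits_empty = toy_clicks["click_article_id"].isin(toy_empty_ids)
n_hits = int(hits_empty.sum())

# One option: keep only clicks on non-empty articles.
toy_clicks_clean = toy_clicks[~hits_empty].reset_index(drop=True)

print(n_hits)
print(toy_clicks_clean)
```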


In [28]:
for p in click_csvs:
    f = os.path.join('news-portal-user-interactions-by-globocom/clicks/',p)
    df = pd.read_csv(f)
    # Flag clicks whose article is in the list of empty articles
    intersect = df["click_article_id"].isin(articles_vides)
    if intersect.any():
        print(f"{f} contains {intersect.sum()} click(s) on empty articles")


news-portal-user-interactions-by-globocom/clicks/clicks_hour_015.csv contains 2 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_037.csv contains 1 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_012.csv contains 2 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_014.csv contains 1 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_342.csv contains 1 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_006.csv contains 1 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_108.csv contains 1 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_129.csv contains 5 click(s) on empty articles
news-portal-user-interactions-by-globocom/clicks/clicks_hour_034.csv contains 2 click(s) on empty articles
n