## Preprocess datasets

This notebook preprocesses the datasets used in this paper. We preprocess the following datasets:


1. MovieLens 20M
2. KuaiRec
3. Goodreads


The results can be found in the datasets folder.

In [2]:
import pandas as pd


## 01 - Preprocessing MovieLens


This step is meant for comparison with the base paper, Steck (2018).

Steps:

1. Consider only interactions with rating >= 4
2. Remove movies with no genres
3. Remove movies with 10 or fewer interactions (keep movies with more than 10)
4. Remove movies without any user rating (this is done with an inner join between the movie and rating DataFrames)

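Step 2 (removing movies with no genres) could be sketched as follows. This is a minimal sketch, assuming that `movies.csv` marks movies without genre information with the literal string `(no genres listed)`; the helper name `drop_movies_without_genres` and the toy data are hypothetical.

```python
import pandas as pd

# Assumption: placeholder string used in movies.csv for movies without genres.
NO_GENRE = "(no genres listed)"


def drop_movies_without_genres(movies: pd.DataFrame) -> pd.DataFrame:
    """Keep only movies whose genres field is present and not the placeholder."""
    has_genre = movies["genres"].notna() & (movies["genres"] != NO_GENRE)
    return movies[has_genre]


# Toy data standing in for movies.csv.
toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "genres": ["Comedy|Drama", NO_GENRE, None],
    }
)
kept = drop_movies_without_genres(toy_movies)
print(kept)
```

Applying such a filter to the movies table before the merge would propagate it to the final dataset, since the inner join only keeps movies present on both sides.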

In [3]:
def preprocess_rating_df(ratings, sample_size=300_000):
    """
        ins:
            ratings: pd.DataFrame
            sample_size: int or None (None keeps every row)
        outs:
            preprocessed_ratings: pd.DataFrame

        -------------------------------------
        Preprocess the ratings df following the procedures suggested in Steck (2018).
        The following steps are applied:
        1. Consider only interactions with rating >= 4
        2. Remove movies with 10 or fewer interactions (we'll consider only movies
            with more than 10 reviews)
    """

    # Step 1: keep only positive interactions (rating >= 4).
    over_4_df = ratings[ratings["rating"] >= 4.0]
    # Step 2: count the remaining interactions per movie and drop rare movies.
    num_interactions = over_4_df.groupby("movieId").size().reset_index(name="size")
    to_remove = list(num_interactions[num_interactions["size"] <= 10].movieId.values)
    df = over_4_df[~over_4_df["movieId"].isin(to_remove)]

    if sample_size is not None:
        return df.sample(sample_size)
    return df
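The sampling step draws a different subset on every run because no seed is passed. A minimal sketch of a reproducible variant is shown below; the helper `sample_rows`, its `seed` parameter, and the choice to return the whole frame when `sample_size` exceeds the number of rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, sample_size=None, seed=42) -> pd.DataFrame:
    """Return a reproducible random sample of at most sample_size rows."""
    # Assumption: when the frame is smaller than the request, keep everything
    # instead of letting DataFrame.sample raise a ValueError.
    if sample_size is None or sample_size >= len(df):
        return df
    return df.sample(n=sample_size, random_state=seed)


toy = pd.DataFrame({"userId": range(100), "rating": [4.0] * 100})
first = sample_rows(toy, 10)
second = sample_rows(toy, 10)
print(first.index.equals(second.index))
```

Fixing `random_state` makes the sampled CSV reproducible across runs, which helps when results are compared against a published baseline.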


In [4]:
!wget https://files.grouplens.org/datasets/movielens/ml-20m.zip
!unzip ml-20m.zip

--2024-11-07 19:14:36--  https://files.grouplens.org/datasets/movielens/ml-20m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198702078 (189M) [application/zip]
Saving to: ‘ml-20m.zip’


2024-11-07 19:15:11 (6.05 MB/s) - ‘ml-20m.zip’ saved [198702078/198702078]

Archive:  ml-20m.zip
   creating: ml-20m/
  inflating: ml-20m/genome-scores.csv  
  inflating: ml-20m/genome-tags.csv  
  inflating: ml-20m/links.csv        
  inflating: ml-20m/movies.csv       
  inflating: ml-20m/ratings.csv      
  inflating: ml-20m/README.txt       
  inflating: ml-20m/tags.csv         


In [5]:
!rm ml-20m/links.csv
!rm ml-20m/tags.csv
!rm ml-20m/genome-scores.csv
!rm ml-20m/genome-tags.csv

In [6]:
movies = pd.read_csv("ml-20m/movies.csv")

In [7]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
ratings = pd.read_csv("ml-20m/ratings.csv")

In [9]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
...,...,...,...,...
20000258,138493,68954,4.5,1258126920
20000259,138493,69526,4.5,1259865108
20000260,138493,69644,3.0,1260209457
20000261,138493,70286,5.0,1258126944


In [10]:
filtered_ratings = preprocess_rating_df(ratings)

In [11]:
filtered_ratings

Unnamed: 0,userId,movieId,rating,timestamp
11053637,76414,704,5.0,843742726
12827145,88641,4022,4.0,1058636265
19175697,132652,778,5.0,1112551957
11192242,77297,908,4.0,1414671763
5589270,38429,7438,4.5,1352481230
...,...,...,...,...
1085726,7398,50,5.0,1417206076
17597631,121725,593,4.0,1223982340
6414965,44055,6552,4.5,1179697507
10500314,72652,3147,4.5,1115027905


In [12]:
df = movies.merge(filtered_ratings, on="movieId").drop(columns=["title"])
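The inner join keeps only movies that appear in both tables, which is how movies without any remaining rating are dropped. The toy example below illustrates this; the data is made up, and the `validate="one_to_many"` argument is an optional safety check added here as an assumption that `movieId` is unique in the movies table.

```python
import pandas as pd

toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["A", "B", "C"],
        "genres": ["Comedy", "Drama", "Action"],
    }
)
# Movie 2 has no ratings, so the inner join removes it.
toy_ratings = pd.DataFrame(
    {
        "userId": [10, 11, 12],
        "movieId": [1, 1, 3],
        "rating": [4.0, 5.0, 4.5],
    }
)

merged = toy_movies.merge(
    toy_ratings,
    on="movieId",
    how="inner",  # the default, spelled out for clarity
    validate="one_to_many",  # raises if movieId is duplicated in toy_movies
).drop(columns=["title"])
print(merged)
```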

In [13]:
df

Unnamed: 0,movieId,genres,userId,rating,timestamp
0,1,Adventure|Animation|Children|Comedy|Fantasy,136670,4.0,862652804
1,1,Adventure|Animation|Children|Comedy|Fantasy,83076,4.5,1234146518
2,1,Adventure|Animation|Children|Comedy|Fantasy,42831,4.0,979348139
3,1,Adventure|Animation|Children|Comedy|Fantasy,66181,5.0,939675916
4,1,Adventure|Animation|Children|Comedy|Fantasy,2758,4.0,833215908
...,...,...,...,...,...
299995,119145,Action|Adventure|Comedy|Crime,84136,4.5,1424865002
299996,125916,Drama,24646,4.0,1424274393
299997,127098,Comedy,69470,4.5,1422828040
299998,127098,Comedy,136684,4.5,1424133121


In [14]:
df.to_csv("ml20m_filtered_300k_sample.csv")
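By default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference with `index=False` on toy data, using in-memory buffers instead of files; whether the index is wanted in the saved datasets is a choice for the authors.

```python
import io

import pandas as pd

toy = pd.DataFrame({"movieId": [1, 2], "rating": [4.0, 5.0]}, index=[7, 9])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```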

## 02 - Preprocessing MovieLens for comparison with Sacilotti et al.


In [18]:
def preprocess_rating_for_sacilotti(ratings, sample_size=300_000):
    """
        ins:
            ratings: pd.DataFrame
            sample_size: int or None (None keeps every row)
        outs:
            preprocessed_ratings: pd.DataFrame

        -------------------------------------
        Preprocess the ratings df for the comparison with Sacilotti et al.
        The following steps are applied:

        1. Remove movies with 10 or fewer interactions (we'll consider only movies
            with more than 10 reviews)
        2. Consider only users who have interacted with over 190 movies.
    """
    # Both counts are computed on the unfiltered ratings.
    num_interactions = ratings.groupby("movieId").size().reset_index(name="size")
    num_movies = ratings.groupby("userId").size().reset_index(name="num_movies")
    m_to_remove = list(num_interactions[num_interactions["size"] <= 10].movieId.values)
    # Users with 190 or fewer rated movies are removed, so only users with over 190 remain.
    u_to_remove = list(num_movies[num_movies["num_movies"] <= 190].userId.values)

    df = ratings[~ratings["movieId"].isin(m_to_remove)]
    df = df[~df["userId"].isin(u_to_remove)]
    if sample_size is not None:
        return df.sample(sample_size)
    return df
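The same two count-based filters can be expressed with `groupby(...).transform("size")`, which avoids building explicit removal lists. This is a minimal sketch on toy data, following the docstring's reading that movies and users strictly above their thresholds are kept; the helper `filter_by_counts` is hypothetical, and the small thresholds stand in for 10 and 190.

```python
import pandas as pd


def filter_by_counts(ratings, min_movie_ratings, min_user_ratings):
    """Keep rows whose movie and user both exceed their rating-count thresholds.

    Both counts are computed on the unfiltered ratings.
    """
    movie_counts = ratings.groupby("movieId")["userId"].transform("size")
    user_counts = ratings.groupby("userId")["movieId"].transform("size")
    keep = (movie_counts > min_movie_ratings) & (user_counts > min_user_ratings)
    return ratings[keep]


toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2, 3],
        "movieId": [10, 20, 30, 10, 20, 10],
    }
)
# Movies need more than 1 rating, users need more than 2 ratings.
result = filter_by_counts(toy_ratings, min_movie_ratings=1, min_user_ratings=2)
print(result)
```

Here movie 30 is dropped (one rating) and users 2 and 3 are dropped (two and one ratings), leaving only user 1's ratings of movies 10 and 20.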


In [19]:
filtered_ratings_s = preprocess_rating_for_sacilotti(ratings)

In [20]:
filtered_ratings_s

Unnamed: 0,userId,movieId,rating,timestamp
7215026,49783,1591,3.0,919075246
2363579,15987,648,3.0,848432405
6509726,44724,2571,5.0,1301528455
1442246,9756,6370,4.5,1295733998
10643158,73641,586,3.0,838430934
...,...,...,...,...
19391246,134228,4128,4.0,982025349
19048307,131804,5401,4.0,1122202974
10289899,71173,5452,3.0,1081865905
4435118,30314,319,5.0,832206541


In [21]:
df_s = movies.merge(filtered_ratings_s, on="movieId").drop(columns=["title"])

In [22]:
df_s.to_csv("ml20m_sacilotti_filtered_300k_sample.csv")

## 03 - Preprocessing KuaiRec dataset


Downloaded from https://drive.google.com/file/d/1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE/view on 30/10/2024


Lin et al. (2024) appear to have used small_matrix.csv, so we use it as well.

In [4]:
!unzip KuaiRec.zip

Archive:  KuaiRec.zip
   creating: KuaiRec 2.0/
  inflating: KuaiRec 2.0/LICENSE     
  inflating: KuaiRec 2.0/Statistics_KuaiRec.ipynb  
   creating: KuaiRec 2.0/data/
  inflating: KuaiRec 2.0/data/big_matrix.csv  
  inflating: KuaiRec 2.0/data/item_categories.csv  
  inflating: KuaiRec 2.0/data/item_daily_features.csv  
  inflating: KuaiRec 2.0/data/kuairec_caption_category.csv  
  inflating: KuaiRec 2.0/data/small_matrix.csv  
  inflating: KuaiRec 2.0/data/social_network.csv  
  inflating: KuaiRec 2.0/data/user_features.csv  
   creating: KuaiRec 2.0/figs/
  inflating: KuaiRec 2.0/figs/KuaiRec.png  
  inflating: KuaiRec 2.0/figs/colab-badge.svg  
  inflating: KuaiRec 2.0/loaddata.py  


In [9]:
kuairec = pd.read_csv("KuaiRec 2.0/data/small_matrix.csv").dropna(subset=["time", "timestamp"])

In [10]:
kuairec

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio
0,14,148,4381,6067,2020-07-05 05:27:48.378,20200705.0,1.593898e+09,0.722103
1,14,183,11635,6100,2020-07-05 05:28:00.057,20200705.0,1.593898e+09,1.907377
2,14,3649,22422,10867,2020-07-05 05:29:09.479,20200705.0,1.593898e+09,2.063311
3,14,5262,4479,7908,2020-07-05 05:30:43.285,20200705.0,1.593898e+09,0.566388
4,14,8234,4602,11000,2020-07-05 05:35:43.459,20200705.0,1.593899e+09,0.418364
...,...,...,...,...,...,...,...,...
4676370,7162,9177,5315,37205,2020-09-01 20:06:35.984,20200901.0,1.598962e+09,0.142857
4676371,7162,4987,10085,8167,2020-09-02 14:44:51.342,20200902.0,1.599029e+09,1.234848
4676372,7162,7988,50523,49319,2020-09-03 08:45:01.474,20200903.0,1.599094e+09,1.024412
4676373,7162,6533,2190,8000,2020-09-04 22:56:32.021,20200904.0,1.599231e+09,0.273750


In [11]:
kuairec["watch_ratio"].describe()

count    4.494578e+06
mean     9.113757e-01
std      1.356487e+00
min      0.000000e+00
25%      4.735875e-01
50%      7.745146e-01
75%      1.124976e+00
max      5.715214e+02
Name: watch_ratio, dtype: float64

Notice that watch_ratio can be larger than one. This happens because watch_ratio is defined as play_duration / video_duration, so if a user watches a video repeatedly, the ratio can exceed 1.

To compare with Lin et al. (2024), we label watch_ratio >= 0.9 as positive examples and watch_ratio < 0.9 as negative examples.

In [12]:
# Use >= so that watch_ratio == 0.9 is labeled positive, matching the rule stated above.
kuairec["label"] = (kuairec["watch_ratio"] >= 0.9).astype(int)
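The labeling rule stated above (watch_ratio >= 0.9 is positive) can be checked on a toy frame, including the boundary value and a ratio above 1. This is a minimal sketch with made-up durations; the constant name `THRESHOLD` is an assumption for illustration.

```python
import pandas as pd

THRESHOLD = 0.9  # positive if watch_ratio >= THRESHOLD

toy = pd.DataFrame(
    {
        "play_duration": [450, 900, 2000],
        "video_duration": [1000, 1000, 1000],
    }
)
# watch_ratio = play_duration / video_duration; repeated viewing gives values above 1.
toy["watch_ratio"] = toy["play_duration"] / toy["video_duration"]
# Vectorized comparison instead of a row-wise apply.
toy["label"] = (toy["watch_ratio"] >= THRESHOLD).astype(int)
print(toy)
```

The second row sits exactly on the threshold and is labeled positive, which is the case where `>` and `>=` disagree.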

In [13]:
kuairec.to_csv("datasets/kuairec_filtered.csv")

## 04 - Preprocessing Goodreads

In [None]:
goodreads = pd.read_json('path_to_file.json.gz', lines=True, compression='gzip')
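Gzipped JSON Lines files can be large, so reading them in chunks may be preferable to loading everything at once. The sketch below writes a tiny toy file to a temporary directory and reads it back with `chunksize`; the file name, the column names (`user_id`, `book_id`, `rating`, `review_text`), and the column selection are all hypothetical and do not describe the actual Goodreads files or the authors' preprocessing.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for a Goodreads interactions file.
records = [
    {"user_id": 1, "book_id": 100, "rating": 5, "review_text": "great"},
    {"user_id": 2, "book_id": 100, "rating": 3, "review_text": "ok"},
    {"user_id": 2, "book_id": 200, "rating": 4, "review_text": "good"},
]
columns_to_keep = ["user_id", "book_id", "rating"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_interactions.json.gz")
    # One JSON object per line, gzip-compressed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # With chunksize, read_json returns an iterator of DataFrames.
    with pd.read_json(path, lines=True, compression="gzip", chunksize=2) as reader:
        chunks = [chunk[columns_to_keep] for chunk in reader]

toy_goodreads = pd.concat(chunks, ignore_index=True)
print(toy_goodreads)
```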