## Preprocess datasets

This notebook preprocesses the datasets used in this paper. We preprocess the following datasets:


1. MovieLens 20M
2. KuaiRec
3. Goodreads


The results can be found in the `datasets` folder.

In [None]:
import pandas as pd


## 01 - Preprocessing MovieLens


This step is meant for comparison with the base paper by Steck (2018).

Steps:

1. Consider only ratings >= 4
2. Remove movies with no genre
3. Remove movies with fewer than 10 interactions
4. Remove movies without any user rating (done with an inner join between the movie and rating DataFrames)
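The four steps above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual pipeline. The toy frames and the `MIN_INTERACTIONS` name are hypothetical, and the sketch assumes that movies without a genre carry the MovieLens placeholder string `(no genres listed)`.

```python
import pandas as pd

# Toy stand-ins for ml-20m/movies.csv and ml-20m/ratings.csv (hypothetical values).
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "genres": ["Comedy", "(no genres listed)", "Drama", "Action"],
})
ratings = pd.DataFrame({
    "userId": list(range(12)) + [0, 1, 2],
    "movieId": [1] * 12 + [2, 3, 3],
    "rating": [4.0] * 11 + [2.0] + [5.0, 4.5, 3.0],
})

MIN_INTERACTIONS = 10  # hypothetical name for the threshold in step 3

# Step 1: keep only ratings >= 4.
liked = ratings[ratings["rating"] >= 4.0]

# Step 2: drop movies without a genre (assumes the MovieLens placeholder string).
with_genre = movies[movies["genres"] != "(no genres listed)"]

# Step 3: drop movies with fewer than MIN_INTERACTIONS remaining ratings.
counts = liked.groupby("movieId")["userId"].transform("count")
popular = liked[counts >= MIN_INTERACTIONS]

# Step 4: the inner join drops every movie that has no surviving rating.
result = with_genre.merge(popular, on="movieId", how="inner")
print(result["movieId"].unique().tolist())
```

Only movie 1 survives here: movie 2 has no genre, movie 3 has too few positive ratings, and movie 4 has no ratings at all, so the inner join discards it.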

The variant for the comparison with Sacilotti et al. appears to target movies with fewer than ten interactions and users with fewer than 190 movies.

In [None]:
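def preprocess_rating_for_sacilotti(ratings, sample_size=3000):
    """
        ins:
            ratings: pd.DataFrame
            sample_size: int or None
        outs:
            preprocessed_ratings: pd.DataFrame

        -------------------------------------
        Preprocess the ratings df following the procedures suggested in Steck (2018).
        The following steps are applied:
        1. Consider only ratings >= 4
        2. Remove movies with 10 or fewer interactions (i.e. keep movies with more than
            10 reviews)
    """
    # Step 1: keep only ratings >= 4.
    over_4_df = ratings[ratings["rating"] >= 4.0]
    # Interaction counts per movie, computed on all ratings.
    num_interactions = ratings.groupby("movieId").size().reset_index(name="size")
    # Interaction counts per user; computed but not yet used for any filtering.
    num_movies = ratings.groupby("userId").size().reset_index(name="num_movies")
    to_remove = list(num_interactions[num_interactions["size"] <= 10].movieId.values)
    df = over_4_df[~over_4_df["movieId"].isin(to_remove)]

    # Assumption: sample and return the result the same way preprocess_rating_df does.
    if sample_size is not None:
        return df.sample(sample_size)
    return df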


In [None]:
def preprocess_rating_df(ratings, sample_size=300_000):
    """
        ins:
            ratings: pd.DataFrame
            sample_size: int or None
        outs:
            preprocessed_ratings: pd.DataFrame

        -------------------------------------
        Preprocess the ratings df following the procedures suggested in Steck (2018).
        The following steps are applied:
        1. Consider only ratings >= 4
        2. Remove movies with 10 or fewer interactions (i.e. keep movies with more than
            10 reviews)
    """

    # Step 1: keep only ratings >= 4.
    over_4_df = ratings[ratings["rating"] >= 4.0]
    # Step 2: count the remaining interactions per movie and drop the sparse movies.
    num_interactions = over_4_df.groupby("movieId").size().reset_index(name="size")
    to_remove = list(num_interactions[num_interactions["size"] <= 10].movieId.values)
    df = over_4_df[~over_4_df["movieId"].isin(to_remove)]

    # Optionally return a random sample of the filtered ratings.
    if sample_size is not None:
        return df.sample(sample_size)
    return df
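The sampling step in the function above draws a different sample on every run, and `DataFrame.sample` raises an error when asked for more rows than exist (without replacement). A minimal sketch of a reproducible, size-capped alternative follows. The helper name `sample_rows` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd


def sample_rows(df, sample_size=None, seed=42):
    """Hypothetical helper: return at most `sample_size` rows, reproducibly."""
    if sample_size is None:
        return df
    # Cap n so that sampling without replacement never asks for too many rows.
    n = min(sample_size, len(df))
    # A fixed random_state makes repeated runs return the same rows.
    return df.sample(n=n, random_state=seed)


toy = pd.DataFrame({"movieId": range(5), "rating": [4.0, 4.5, 5.0, 4.0, 4.5]})
first = sample_rows(toy, sample_size=3)
second = sample_rows(toy, sample_size=3)
print(first.equals(second))
```

Fixing the seed matters when the sampled file is meant to be shared or regenerated, since otherwise each run of the notebook would write a different 300k-row subset.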


In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-20m.zip
!unzip ml-20m.zip

In [None]:
!rm ml-20m/links.csv
!rm ml-20m/tags.csv
!rm ml-20m/genome-scores.csv
!rm ml-20m/genome-tags.csv

In [None]:
movies = pd.read_csv("ml-20m/movies.csv")

In [None]:
movies.head()

In [None]:
ratings = pd.read_csv("ml-20m/ratings.csv")

In [None]:
ratings

In [None]:
filtered_ratings = preprocess_rating_df(ratings)

In [None]:
filtered_ratings

In [None]:
df = movies.merge(filtered_ratings, on="movieId").drop(columns=["title"])

In [None]:
df

In [None]:
df.to_csv("ml20m_filtered_300k_sample.csv")

## 02 - Preprocessing MovieLens for comparison with Sacilotti et al.


## 03 - Preprocessing KuaiRec


Downloaded from https://drive.google.com/file/d/1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE/view on 30/10/2024


Apparently, the authors of Lin et al. (2024) used `small_matrix.csv`.

In [4]:
!unzip KuaiRec.zip

Archive:  KuaiRec.zip
   creating: KuaiRec 2.0/
  inflating: KuaiRec 2.0/LICENSE     
  inflating: KuaiRec 2.0/Statistics_KuaiRec.ipynb  
   creating: KuaiRec 2.0/data/
  inflating: KuaiRec 2.0/data/big_matrix.csv  
  inflating: KuaiRec 2.0/data/item_categories.csv  
  inflating: KuaiRec 2.0/data/item_daily_features.csv  
  inflating: KuaiRec 2.0/data/kuairec_caption_category.csv  
  inflating: KuaiRec 2.0/data/small_matrix.csv  
  inflating: KuaiRec 2.0/data/social_network.csv  
  inflating: KuaiRec 2.0/data/user_features.csv  
   creating: KuaiRec 2.0/figs/
  inflating: KuaiRec 2.0/figs/KuaiRec.png  
  inflating: KuaiRec 2.0/figs/colab-badge.svg  
  inflating: KuaiRec 2.0/loaddata.py  


In [9]:
kuairec = pd.read_csv("KuaiRec 2.0/data/small_matrix.csv").dropna(subset=["time", "timestamp"])

In [10]:
kuairec

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio
0,14,148,4381,6067,2020-07-05 05:27:48.378,20200705.0,1.593898e+09,0.722103
1,14,183,11635,6100,2020-07-05 05:28:00.057,20200705.0,1.593898e+09,1.907377
2,14,3649,22422,10867,2020-07-05 05:29:09.479,20200705.0,1.593898e+09,2.063311
3,14,5262,4479,7908,2020-07-05 05:30:43.285,20200705.0,1.593898e+09,0.566388
4,14,8234,4602,11000,2020-07-05 05:35:43.459,20200705.0,1.593899e+09,0.418364
...,...,...,...,...,...,...,...,...
4676370,7162,9177,5315,37205,2020-09-01 20:06:35.984,20200901.0,1.598962e+09,0.142857
4676371,7162,4987,10085,8167,2020-09-02 14:44:51.342,20200902.0,1.599029e+09,1.234848
4676372,7162,7988,50523,49319,2020-09-03 08:45:01.474,20200903.0,1.599094e+09,1.024412
4676373,7162,6533,2190,8000,2020-09-04 22:56:32.021,20200904.0,1.599231e+09,0.273750


In [11]:
kuairec["watch_ratio"].describe()

count    4.494578e+06
mean     9.113757e-01
std      1.356487e+00
min      0.000000e+00
25%      4.735875e-01
50%      7.745146e-01
75%      1.124976e+00
max      5.715214e+02
Name: watch_ratio, dtype: float64

Notice that watch_ratio can be larger than 1. This happens because watch_ratio is defined as play_duration / video_duration, so if a user watches a video repeatedly, the ratio exceeds 1.
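As a quick sanity check of that definition, the ratio can be recomputed from the two duration columns. The sketch below uses the first two rows of the table printed above; the variable names are illustrative only.

```python
import pandas as pd

# First two rows of the KuaiRec output shown above.
rows = pd.DataFrame({
    "play_duration": [4381, 11635],
    "video_duration": [6067, 6100],
    "watch_ratio": [0.722103, 1.907377],
})

# Recompute the ratio and compare it with the stored column (up to display rounding).
recomputed = rows["play_duration"] / rows["video_duration"]
matches = (recomputed - rows["watch_ratio"]).abs() < 1e-5
print(matches.all())
```

In the second row the play duration is almost twice the video duration, which is how a ratio above 1 arises.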

To compare with Lin et al. (2024), we label rows with watch_ratio >= 0.9 as positive examples and rows with watch_ratio < 0.9 as negative examples.

In [12]:
# Label is 1 when watch_ratio >= 0.9 and 0 otherwise.
kuairec['label'] = (kuairec['watch_ratio'] >= 0.9).astype(int)

In [13]:
kuairec.to_csv("datasets/kuairec_filtered.csv")

## 04 - Preprocessing Goodreads

In [None]:
goodreads = pd.read_json('path_to_file.json.gz', lines=True, compression='gzip')
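Since the Goodreads step above is only a placeholder, here is a minimal, self-contained sketch of reading a gzip-compressed JSON Lines file with pandas. The records and field names (`user_id`, `book_id`, `rating`) are hypothetical and may not match the schema of the real dump. The `chunksize` argument is one way to avoid loading a large file in a single call.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical records; the real Goodreads dump has its own schema.
records = [
    {"user_id": "u1", "book_id": "b1", "rating": 5},
    {"user_id": "u2", "book_id": "b1", "rating": 3},
    {"user_id": "u1", "book_id": "b2", "rating": 4},
]

# Write a tiny gzip-compressed JSON Lines file to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "toy_interactions.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# With chunksize set, read_json returns an iterator of DataFrames.
reader = pd.read_json(path, lines=True, compression="gzip", chunksize=2)
frames = [chunk for chunk in reader]
goodreads_toy = pd.concat(frames, ignore_index=True)
print(goodreads_toy.shape)
```

For a file that does not fit in memory, each chunk could be filtered before concatenation instead of collecting all chunks first.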