# Dataset Construction - KuaiRec2

This notebook converts a recommender-system dataset into a unified sequential format of plain-text strings, which serves as the input to the user-simulator LLM.

Recommender-system data usually consists of three components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are represented by the interactions between them.

The goal of this notebook is to transform the original data into a unified format made up of three types of JSON records:

- Item Feature JSON: This JSON contains a list of all items in the dataset; each element follows this format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be plain, human-readable text. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This JSON contains a list of all users in the dataset; each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be plain, human-readable text. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This JSON contains a list of all user behaviors in the dataset; each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When no timestamp is available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be plain, human-readable text. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
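
The three formats above can be checked mechanically before the files are handed to downstream code. The following is a minimal sketch, not part of the original pipeline: the helper name `validate_record` and the toy record are hypothetical, and the check only covers the required keys and the dictionary-valued fields described above.

```python
import json

# Required keys per record type, taken from the three formats described above.
REQUIRED_KEYS = {
    "item": {"item_id", "item_description", "item_features"},
    "user": {"user_id", "user_description", "user_features"},
    "interaction": {"user_id", "item_id", "timestamp", "behavior_features"},
}


def validate_record(record: dict, kind: str) -> list:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    missing = REQUIRED_KEYS[kind] - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Every key ending in "_description" or "_features" should hold a dictionary.
    for key, value in record.items():
        if key.endswith(("_description", "_features")) and not isinstance(value, dict):
            problems.append(f"{key} should be a dict, got {type(value).__name__}")
    return problems


# One line of a hypothetical interaction file (values are made up).
line = (
    '{"user_id": "14", "item_id": "148", "timestamp": "2020-07-05 00:08:23", '
    '"behavior_features": {"operation": "Skip (watched 5s)"}}'
)
print(validate_record(json.loads(line), "interaction"))  # prints []
```

Running such a check on a sample of lines from each output file is usually enough to catch a missing key or a field that was written as a string instead of a dictionary.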

## Download the raw data

You can download the raw data from the following links:
- [KuaiRec2](https://kuairec.com/)

Or via Google Drive:
- [KuaiRec2](https://drive.google.com/file/d/1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE/view?usp=sharing)

After downloading the raw data, unzip the archive and set `DATASET_PATH` in the first cell to the directory that contains the CSV files.
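
If you prefer to unzip from Python rather than from a file manager, the standard-library `zipfile` module is enough. This is a minimal sketch; the helper name `extract_archive` and the archive file name in the usage comment are assumptions, not something the dataset prescribes.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, target_dir: str) -> list:
    """Extract a zip archive into target_dir and return the archived names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Hypothetical usage; both paths are placeholders:
# names = extract_archive("KuaiRec.zip", "<SOURCE_PATH>")
# print(names[:5])
```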

In [1]:
# Basic settings for the filtered dataset
DATASET_NAME = "KuaiRec2"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5     # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.
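
The two thresholds above are defined here, but the cells in this notebook do not apply them. The following is a minimal sketch of how such a filter might look on a list of interaction records. It assumes, following the comments on the two constants, that users with fewer than `MIN_INTERACTION_CNT` or more than `MAX_INTERACTION_CNT` interactions are dropped entirely; the helper name `filter_by_interaction_count` and the toy records are hypothetical.

```python
from collections import Counter

MIN_INTERACTION_CNT = 5
MAX_INTERACTION_CNT = 20


def filter_by_interaction_count(interactions, min_cnt, max_cnt):
    """Keep interactions of users whose interaction count lies in [min_cnt, max_cnt]."""
    counts = Counter(record["user_id"] for record in interactions)
    return [
        record for record in interactions
        if min_cnt <= counts[record["user_id"]] <= max_cnt
    ]


# Toy records: user "a" has 6 interactions, user "b" has 2, user "c" has 25.
toy = (
    [{"user_id": "a", "item_id": str(i)} for i in range(6)]
    + [{"user_id": "b", "item_id": str(i)} for i in range(2)]
    + [{"user_id": "c", "item_id": str(i)} for i in range(25)]
)
kept = filter_by_interaction_count(toy, MIN_INTERACTION_CNT, MAX_INTERACTION_CNT)
print(sorted({record["user_id"] for record in kept}))  # prints ['a']
```

An alternative reading of the upper threshold would truncate long histories to the most recent `MAX_INTERACTION_CNT` interactions instead of dropping the user; the sketch does not do that.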


In [None]:
# Step 1: Identify the file structure of the dataset

import os
# List all files in the dataset directory
print("Files in dataset directory:")
for file in os.listdir(DATASET_PATH):
    print(f"- {file}")


In [None]:
# Step 1.1: Load README

README_FILE_NAME = "../README.md"
README_FILE_PATH = os.path.join(DATASET_PATH, README_FILE_NAME)

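# Read as UTF-8 explicitly so the result does not depend on the platform default encoding
with open(README_FILE_PATH, "r", encoding="utf-8") as file: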
    README_CONTENT = file.read()

print(README_CONTENT)


In [None]:
# Step 2: Load the dataset

import pandas as pd


# Load user-item interaction data
interaction_df = pd.read_csv(os.path.join(DATASET_PATH, "big_matrix.csv"))
print("Interaction data loaded")

# Load user features and attach the social-network columns (left join keeps every user)
user_features_df = pd.read_csv(os.path.join(DATASET_PATH, "user_features.csv"))
social_features_df = pd.read_csv(os.path.join(DATASET_PATH, "social_network.csv"))

user_features_df = user_features_df.merge(social_features_df, on="user_id", how="left")
print("User features loaded")

# Load item features and daily features
item_features_df = pd.read_csv(os.path.join(DATASET_PATH, "item_categories.csv"))
item_daily_features_df = pd.read_csv(os.path.join(DATASET_PATH, "item_daily_features.csv"))
print("Item features and daily features loaded")

# Load caption and category information
caption_category_df = pd.read_csv(os.path.join(DATASET_PATH, "kuairec_caption_category.csv"), lineterminator="\n")
print("Caption and category information loaded")

# Join all item features on video_id
item_features_df = item_features_df.merge(item_daily_features_df, on="video_id", how="left")
item_features_df = item_features_df.merge(caption_category_df, on="video_id", how="left")

print("Loaded dataframes shapes:")
print(f"- Interactions: {interaction_df.shape}")
print(f"- User features: {user_features_df.shape}")
print(f"- Item features: {item_features_df.shape}")

print("Data loaded")
# print head
print(interaction_df.head())
print(user_features_df.head())
print(item_features_df.head())


In [None]:
# Step 4: Convert the dataset to the unified format

from datetime import datetime

from tqdm import tqdm

# Register DataFrame.progress_apply so the row-wise conversions show a progress bar
tqdm.pandas()

def number_handle(number, suffix="") -> str:
    """Format a number as compact text (e.g. 12000 -> "12k").

    Non-numeric or missing values become "Not Available".
    """
    if not isinstance(number, (int, float)):
        return "Not Available"
    if pd.isna(number):
        return "Not Available"
    number = round(float(number))

    if number >= 1000000:
        return f"{number/1000000:.0f}M" + suffix
    elif number >= 1000:
        return f"{number/1000:.0f}k" + suffix
    else:
        return str(number) + suffix
    
def process_item_df(row):
    """Convert one merged item row into a unified item record."""
    return {
        "item_id": row["video_id"],
        # Human-readable attributes
        "item_description": {
            "video_type": row["video_type"],
            "upload_dt": row["upload_dt"],
            # video_duration is divided by 1000 to express it in seconds
            "video_duration": number_handle(row["video_duration"] / 1000, suffix=" sec"),
            "play_count": number_handle(row["play_cnt"]),
            "like_count": number_handle(row["like_cnt"]),
            "comment_count": number_handle(row["comment_cnt"]),
            "share_count": number_handle(row["share_cnt"]),
            "cover_text": row["manual_cover_text"] if row["manual_cover_text"] != "UNKNOWN" else "",
            # Fall back to "无标题" ("Untitled") when the caption is missing (NaN) or empty
            "caption": row["caption"] if pd.notna(row["caption"]) and row["caption"] else "无标题",
            "topic_tags": row["topic_tag"] if row["topic_tag"] != "[]" else "",
            # Join the three category levels and drop any UNKNOWN level
            "category": "/".join([row["first_level_category_name"],
                                  row["second_level_category_name"],
                                  row["third_level_category_name"]]).replace("/UNKNOWN", "")
        },
        # Raw values kept for downstream use
        "item_features": {
            "video_type": row["video_type"],
            "upload_dt": datetime.strptime(row["upload_dt"], "%Y-%m-%d").timestamp(),
            "video_duration": row["video_duration"] / 1000,
            "play_count": row["play_cnt"],
            "like_count": row["like_cnt"],
            "comment_count": row["comment_cnt"],
            "share_count": row["share_cnt"],
        }
    }

def user_role(row):
    if row["is_live_streamer"] and row["is_video_author"]:
        return "Video Author | Live Streamer"
    elif row["is_live_streamer"]:
        return "Live Streamer"
    elif row["is_video_author"]:
        return "Video Author"
    else:
        return ""

def process_user_df(row):
    return {
        "user_id": row["user_id"],
        "user_description": {
            "active_degree": f"{row['user_active_degree']}".replace("_active", ""),
            "low_active_period": "Yes" if row["is_lowactive_period"] else "No",
            "role": user_role(row),
            "follow_user_num": f"{row['follow_user_num']}",
            "register_days": f"{row['register_days']}",
            "friend_user_num": f"{row['friend_user_num']}",
            "fans_user_num": f"{row['fans_user_num']}",
        },
        "user_features": {
            "follow_user_num": row['follow_user_num'],
            "register_days": row['register_days'],
            "friend_user_num": row['friend_user_num'],
            "fans_user_num": row['fans_user_num'],
            "friend_list": row['friend_list'],
        }   
    }

def watch_operation(row):
    """Label a watch as "Skip" or "Complete" and append the seconds watched."""
    # Durations are divided by 1000 below to report seconds. A short play is
    # under 3000, or under the full video length for videos shorter than that.
    short_play = row["play_duration"] < min(3000, row["video_duration"])

    if short_play or row["watch_ratio"] < 1.0:
        op = "Skip"
    else:
        op = "Complete"
    return op + f" (watched {row['play_duration'] / 1000:.0f}s)"


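def process_interaction_df(row):
    """Convert one interaction row into a unified interaction record."""
    return {
        "user_id": row["user_id"],
        "item_id": row["video_id"],
        "timestamp": row["timestamp"],
        "behavior_features": {
            "watch_ratio": row["watch_ratio"],
            "operation": watch_operation(row),
            # Valid play: the whole video if its duration is at most 7000, otherwise more than 7000 played
            "valid_play": (row["play_duration"] >= row["video_duration"] if row["video_duration"] <= 7000 else row["play_duration"] > 7000),
            # Long play: the whole video if its duration is at most 18000, otherwise at least 18000 played
            "long_play": (row["play_duration"] >= row["video_duration"] if row["video_duration"] <= 18000 else row["play_duration"] >= 18000),
            # Short play: less than 3000 played, or less than the full video if it is shorter than 3000
            "short_play": row["play_duration"] < min(3000, row["video_duration"])
        }
    }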

user_df = user_features_df.progress_apply(process_user_df, axis=1).to_list()
item_df = item_features_df.progress_apply(process_item_df, axis=1).to_list()

interaction_df = interaction_df.progress_apply(process_interaction_df, axis=1).to_list()
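
To make the play-flag thresholds used in the conversion cell easier to follow, the sketch below restates them as a standalone function and evaluates two toy cases. The function name `play_flags` is hypothetical, and the comments assume that the duration columns are in milliseconds, which is what the division by 1000 in the conversion code suggests.

```python
def play_flags(play_ms: float, video_ms: float) -> dict:
    """Mirror the valid/long/short play rules used in the conversion cell."""
    return {
        # Valid: the whole video if it is at most 7 s long, otherwise more than 7 s.
        "valid_play": play_ms >= video_ms if video_ms <= 7000 else play_ms > 7000,
        # Long: the whole video if it is at most 18 s long, otherwise at least 18 s.
        "long_play": play_ms >= video_ms if video_ms <= 18000 else play_ms >= 18000,
        # Short: under 3 s, or under the full length for videos shorter than 3 s.
        "short_play": play_ms < min(3000, video_ms),
    }


# A 10 s video watched for 8 s: valid (over 7 s), not long, not short.
print(play_flags(8000, 10000))
# A 30 s video watched for 2 s: a short play only.
print(play_flags(2000, 30000))
```

Note that a play can be neither short nor valid, for example a 30 s video watched for 5 s, so the three flags are not mutually exhaustive.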


In [None]:
# Step 5: Save the dataset

item_df = pd.DataFrame(item_df)
user_df = pd.DataFrame(user_df)
interaction_df = pd.DataFrame(interaction_df)

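# The merge with item_daily_features can yield several rows per video;
# keep the first one for each item_id.
item_df = item_df.groupby("item_id").agg("first")
item_df = item_df.reset_index()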

print("\nFinal shapes:")
print(f"Items: {len(item_df)}")
print(f"Users: {len(user_df)}")
print(f"Interactions: {len(interaction_df)}")


print("\nSaving files...")

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)
interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")

print("\nProcessing complete.")
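
Because the files are written with `lines=True` and `orient="records"`, each line is one JSON object. The following is a minimal sketch of how the interaction file might be read back and grouped into per-user sequences, which is the sequential view this notebook is working toward. The helper name `build_user_sequences` and the toy lines are hypothetical, and sorting assumes that all timestamps in a file are mutually comparable (all numeric, or all strings in the same `YYYY-MM-DD HH:MM:SS` format).

```python
import json
from collections import defaultdict


def build_user_sequences(jsonl_lines):
    """Group interaction records by user and sort each group by timestamp."""
    sequences = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        sequences[record["user_id"]].append(record)
    for records in sequences.values():
        records.sort(key=lambda record: record["timestamp"])
    return dict(sequences)


# Toy lines standing in for the contents of an interaction file.
toy_lines = [
    '{"user_id": 0, "item_id": 7, "timestamp": 20.0, "behavior_features": {}}',
    '{"user_id": 1, "item_id": 3, "timestamp": 5.0, "behavior_features": {}}',
    '{"user_id": 0, "item_id": 2, "timestamp": 10.0, "behavior_features": {}}',
]
sequences = build_user_sequences(toy_lines)
print([record["item_id"] for record in sequences[0]])  # prints [2, 7]
```

With a real file, the same function can be given an open file handle, since iterating over a text file yields one line at a time.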