# Dataset Construction - MobileRec

This notebook converts a recommender-system dataset into a unified sequential format of plain text strings, which serves as the input to the user simulator LLM.

A recommender-system dataset usually consists of three critical components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are represented by the interactions between them.

The goal of this notebook is to transform the original data into a unified format consisting of three JSON Lines files:

- Item Feature JSON: This JSON contains a list of all items in the dataset; each element follows the format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be processed into plain text strings that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (other than those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This JSON contains a list of all users in the dataset; each element follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be processed into plain text strings that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (other than those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This JSON contains a list of all user behaviors in the dataset; each element follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be processed into plain text strings that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
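
The three record layouts above can be checked mechanically before anything is written to disk. The following is a minimal sketch rather than part of the original pipeline: the helper `validate_record` and the `REQUIRED_FIELDS` table are hypothetical names, and the check only covers field presence, basic types, string-valued descriptions, and the timestamp format.

```python
from datetime import datetime

# Hypothetical table of required keys and expected types per record kind,
# transcribed from the three formats described above.
REQUIRED_FIELDS = {
    "item": {"item_id": str, "item_description": dict, "item_features": dict},
    "user": {"user_id": str, "user_description": dict, "user_features": dict},
    "interaction": {
        "user_id": str,
        "item_id": str,
        "timestamp": str,
        "behavior_features": dict,
    },
}

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def validate_record(record, kind):
    """Return a list of problems found in `record`; an empty list means it passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS[kind].items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Description values are expected to be plain text strings.
    description = record.get(f"{kind}_description")
    if isinstance(description, dict):
        for key, value in description.items():
            if not isinstance(value, str):
                problems.append(f"{kind}_description[{key!r}] is not a string")
    # Interaction timestamps must follow YYYY-MM-DD HH:MM:SS.
    if kind == "interaction" and isinstance(record.get("timestamp"), str):
        try:
            datetime.strptime(record["timestamp"], TIMESTAMP_FORMAT)
        except ValueError:
            problems.append("timestamp is not in YYYY-MM-DD HH:MM:SS format")
    return problems
```

Calling `validate_record(rec, "interaction")` on each processed row and printing any non-empty result is one cheap way to catch, for example, a raw Unix timestamp that was never converted to the string format.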


## Download the raw data

You can download the raw data from the following links:
- [MobileRec](https://huggingface.co/datasets/recmeapp/mobilerec)


In [1]:
# Basic Setting of the Filtered Dataset
DATASET_NAME = "mobilerec"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>"
MIN_INTERACTION_CNT = 5     # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.
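
A rough sketch of how thresholds like `MIN_INTERACTION_CNT` and `MAX_INTERACTION_CNT` could be applied to an interaction table is shown here. It rests on assumptions: the user column is named `uid`, users outside the range are dropped entirely rather than truncated, and `filter_users_by_interaction_count` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd


def filter_users_by_interaction_count(df, min_cnt, max_cnt, user_col="uid"):
    """Keep only rows of users whose interaction count lies in [min_cnt, max_cnt]."""
    # Per-row count of how many interactions the row's user has in total.
    counts = df.groupby(user_col)[user_col].transform("size")
    # Series.between is inclusive on both ends by default.
    return df[counts.between(min_cnt, max_cnt)].reset_index(drop=True)


# Toy example: user "a" has 1 interaction, "b" has 2, "c" has 4.
toy = pd.DataFrame({"uid": list("abbcccc"), "rating": [5, 4, 3, 5, 1, 2, 4]})
filtered = filter_users_by_interaction_count(toy, min_cnt=2, max_cnt=3)
print(filtered["uid"].unique().tolist())
```

If the intent is instead to keep heavy users but cap their history, the filter would need to truncate each user's rows rather than drop them.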


In [None]:
# Step 1: Identify the file structure of the dataset

import os
# List all files and subfolders in the dataset directory, excluding hidden ones
print("Files and subfolders in dataset directory:")
for root, dirs, files in os.walk(DATASET_PATH):
    # Remove hidden directories
    dirs[:] = [d for d in dirs if not d.startswith('.')]
    
    level = root.replace(DATASET_PATH, '').count(os.sep)
    indent = ' ' * 4 * level
    folder_name = os.path.basename(root)
    if not folder_name.startswith('.'):
        print(f"{indent}{folder_name}/")
        subindent = ' ' * 4 * (level + 1)
        # Only show non-hidden files
        visible_files = [f for f in files if not f.startswith('.')]
        for f in visible_files:
            print(f"{subindent}{f}")

### File Structure
> Fill the file structure here so the LLM assistant can write the code to load the dataset.

```txt
Files and subfolders in dataset directory:
mobilerec/
    README.md
    interactions/
        mobilerec_final.csv
    app_meta/
        app_meta.csv
```

In [None]:
# Step 2: Load the dataset

# Load the dataset based on the file structure and the README (if applicable).

import pandas as pd
import os

mobilerec_df = pd.read_csv(os.path.join(DATASET_PATH, "interactions", "mobilerec_final.csv"), nrows=100000)

print(f'Loaded {len(mobilerec_df)} interactions from MobileRec dataset')

meta_df = pd.read_csv(os.path.join(DATASET_PATH, "app_meta", "app_meta.csv"))

print(f'Loaded {len(meta_df)} items from MobileRec dataset')


# Display column names and sample values for each dataframe
print("\nMobileRec Interactions DataFrame Columns:")
print("----------------------------------------")
for col in mobilerec_df.columns:
    print(f"{col}:")
    print(mobilerec_df[col].head())
    print()

print("\nMeta DataFrame Columns:") 
print("------------------------")
for col in meta_df.columns:
    print(f"{col}:")
    print(meta_df[col].head())
    print()


In [None]:
# Step 4: Convert the dataset to the unified format
# Use tqdm and pandas.progress_apply to process the dataset.
# Based on the data descriptions, convert the dataset to 3 dataframes: item_df, user_df, interaction_df.
# Write 3 row processing functions: process_item_df, process_user_df, process_interaction_df.

from tqdm import tqdm

tqdm.pandas()

def process_item_df(row):
    # Map one row of app_meta.csv to the unified item record.
    return {
        "item_id": row["app_package"],
        # Human-readable attributes, stored as text.
        "item_description": {
            "app_name": row["app_name"],
            "developer_name": row["developer_name"] if row["developer_name"] != "None" else "Unknown",
            "app_category": row["app_category"],
            "description": row["description"],
            "content_rating": row["content_rating"],
            "num_reviews": str(row["num_reviews"]),
            "price": row["price"] if row["price"] != "Install" else "Free",
            "avg_rating": str(row["avg_rating"]),
        },
        # Numeric copies of the same statistics.
        "item_features": {
            # Strip thousands separators such as "1,234" before casting.
            "num_reviews": int(str(row["num_reviews"]).replace(",", "")),
            "avg_rating": float(row["avg_rating"]),
        }
    }

def process_user_df(row):
    return {
        "user_id": row["uid"],
        "user_description": {

        },
        "user_features": {
            "avg_rating": row["rating"],
        }
    }

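def process_interaction_df(row):
    # The unified format expects "YYYY-MM-DD HH:MM:SS" strings.
    # Assumption: unix_timestamp is expressed in seconds since the epoch (UTC).
    timestamp = pd.to_datetime(row["unix_timestamp"], unit="s").strftime("%Y-%m-%d %H:%M:%S")
    return {
        "user_id": row["uid"],
        "item_id": row["app_package"],
        "timestamp": timestamp,
        "behavior_features": {
            "rating": row["rating"],
            "review": row["review"],
        }
    }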

# Apply the processing functions to the dataset
item_df = meta_df.progress_apply(process_item_df, axis=1).to_list()


user_features = mobilerec_df.groupby("uid").agg({
    "rating": "mean",
}).reset_index()
user_df = user_features.progress_apply(process_user_df, axis=1).to_list()


interaction_df = mobilerec_df.progress_apply(process_interaction_df, axis=1).to_list()


In [None]:
# Step 5: Save the dataset


item_df = pd.DataFrame(item_df)
user_df = pd.DataFrame(user_df)
interaction_df = pd.DataFrame(interaction_df)


print(f"\nFinal shapes:")
print(f"Items: {len(item_df)}")
print(f"Users: {len(user_df)}")
print(f"Interactions: {len(interaction_df)}")

# Save the processed data
os.makedirs(OUTPUT_PATH, exist_ok=True)

print("\nSaving files...")

interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")

print("\nProcessing complete.")
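
After saving, the three files can be read back to confirm that they are consistent with each other. The sketch below is one possible way to do that with the standard library only. `read_jsonl` and `check_references` are hypothetical helpers, and the check simply counts interactions whose `user_id` or `item_id` has no matching feature record.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_references(items, users, interactions):
    """Count interactions that point at unknown users or items."""
    item_ids = {rec["item_id"] for rec in items}
    user_ids = {rec["user_id"] for rec in users}
    return {
        "unknown_items": sum(1 for r in interactions if r["item_id"] not in item_ids),
        "unknown_users": sum(1 for r in interactions if r["user_id"] not in user_ids),
    }
```

With the notebook's variables, the inputs would come from `read_jsonl(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"))` and the corresponding user and interaction files. Non-zero counts suggest that some interactions refer to apps or users missing from the feature files.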