# Dataset Construction - Steam

This notebook converts a recommender-system dataset (here, Steam) into a unified, sequential, plain-text format that serves as the input of the user simulator LLM.

A recommender-system dataset usually consists of three critical components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are represented by the interactions between a user and an item.

The goal of this notebook is to transform the original data into a unified format made up of three types of JSON records:

- Item Feature JSON: This JSON contains a list of all items in the dataset. Each element follows this format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be processed into plain text that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This JSON contains a list of all users in the dataset. Each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be processed into plain text that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This JSON contains a list of all user behaviors in the dataset. Each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format of YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be processed into plain text that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
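
To make the three schemas concrete, here is a minimal sketch of one record of each type, plus a check on the interaction format. All field values are invented for illustration, and `is_valid_interaction` is a hypothetical helper, not part of the notebook's pipeline.

```python
import json
from datetime import datetime

# Illustrative records only; the values are invented for this example.
item = {
    "item_id": "10",
    "item_description": {"app_name": "Example Game", "genres": "Action/Indie"},
    "item_features": {"early_access": False},
}
user = {
    "user_id": "example_user",
    "user_description": {"products": "42"},
    "user_features": {"products": 42},
}
interaction = {
    "user_id": "example_user",
    "item_id": "10",
    "timestamp": "2017-12-17 00:00:00",
    "behavior_features": {"hours_played": 3.5, "review_text": "Fun."},
}

def is_valid_interaction(record):
    """Check required keys, string IDs, and the YYYY-MM-DD HH:MM:SS timestamp."""
    required = {"user_id", "item_id", "timestamp", "behavior_features"}
    if not required.issubset(record):
        return False
    if not isinstance(record["user_id"], str) or not isinstance(record["item_id"], str):
        return False
    try:
        datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    except (TypeError, ValueError):
        return False
    return isinstance(record["behavior_features"], dict)

# Each record becomes one line of the corresponding .jsonl file.
jsonl_line = json.dumps(interaction)
print(jsonl_line)
print(is_valid_interaction(json.loads(jsonl_line)))
```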


## Download the raw data

You can download the raw data from the repository shared by the authors of a previous work:
- [Steam](https://github.com/kang205/SASRec)

You should download the following files:
- `steam_games.jsonl.gz`
- `steam_reviews.jsonl.gz`

After downloading the raw data, decompress both files.
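
If you prefer to decompress from Python rather than the command line, the sketch below streams a `.gz` file to disk with the standard library. The helper name `gunzip` is hypothetical, and the usage comment assumes the files sit in `DATASET_PATH`.

```python
import gzip
import shutil
from pathlib import Path

def gunzip(src, dst=None):
    """Decompress a .gz file and return the path of the decompressed output."""
    src = Path(src)
    # By default, write next to the source and drop the trailing ".gz"
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream in chunks to keep memory use low
    return dst

# Usage, assuming DATASET_PATH points at the download folder:
# for name in ("steam_games.jsonl.gz", "steam_reviews.jsonl.gz"):
#     gunzip(Path(DATASET_PATH) / name)
```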

In [1]:
# Basic settings for the filtered dataset
DATASET_NAME = "steam"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5   # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.


In [None]:
# Step 1: Identify the file structure of the dataset

import os
# List all files in the dataset directory
print("Files in dataset directory:")
for file in os.listdir(DATASET_PATH):
    print(f"- {file}")


In [None]:
# Step 2: Load the dataset

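# The raw files are not strict JSON. The commented-out helper below evaluates each line
# as a Python literal and rewrites the data as standard JSON Lines.
# Uncomment and run it once if your download still needs converting.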

# import pandas as pd

# def load_as_python(file_path):
#     data = []
#     with open(file_path, 'r') as file:
#         for line in file:
#             data.append(eval(line))
#     return data

# data = pd.DataFrame(load_as_python(os.path.join(DATASET_PATH, 'steam_reviews.json')))
# data.to_json(os.path.join(DATASET_PATH, f"steam_reviews.jsonl"), lines=True, orient="records")

import pandas as pd
games_df = pd.read_json(os.path.join(DATASET_PATH, 'steam_games.jsonl'), lines=True).dropna(subset=['id'])
reviews_df = pd.read_json(os.path.join(DATASET_PATH, 'steam_reviews.jsonl'), lines=True)


print(games_df.head())

print(reviews_df.head())
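
The commented-out conversion in Step 2 calls `eval` on each line, which executes whatever the line contains. A more defensive sketch, assuming every line is a plain Python dict literal, uses `ast.literal_eval` from the standard library instead. The function name `python_lines_to_jsonl` and the file names in the usage comment are illustrative.

```python
import ast
import json

def python_lines_to_jsonl(src_path, dst_path):
    """Rewrite a file of Python dict literals (one per line) as JSON Lines."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = ast.literal_eval(line)  # parses literals only, runs no code
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count

# Usage (paths are placeholders):
# python_lines_to_jsonl("steam_reviews.json", "steam_reviews.jsonl")
```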

In [4]:
# Keep only the earliest review for each (user, game) pair
reviews_df = reviews_df.sort_values(by='date', ascending=True)
reviews_df = reviews_df.drop_duplicates(subset=['username', 'product_id'])
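
Because `drop_duplicates` keeps the first occurrence by default, sorting by `date` beforehand means the earliest review of each (user, game) pair is the one that survives. A toy illustration with invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["a", "a", "b"],
    "product_id": [10, 10, 10],
    "date": pd.to_datetime(["2018-01-02", "2017-05-01", "2017-09-09"]),
    "text": ["second review", "first review", "only review"],
})

# Sort chronologically, then keep the first row per (username, product_id)
deduped = (
    toy.sort_values(by="date", ascending=True)
       .drop_duplicates(subset=["username", "product_id"], keep="first")
)
print(deduped[["username", "text"]])
```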

In [None]:
# Step 3: Filter the dataset
# Filter by MIN_INTERACTION_CNT and MAX_INTERACTION_CNT

# Count interactions per user, then keep users whose count lies in [MIN_INTERACTION_CNT, MAX_INTERACTION_CNT]
user_index = reviews_df.groupby('username').size().reset_index(name='interaction_count')
in_range = user_index['interaction_count'].between(MIN_INTERACTION_CNT, MAX_INTERACTION_CNT)
reviews_df = reviews_df[reviews_df['username'].isin(user_index.loc[in_range, 'username'])]

# Display the shape of the filtered DataFrame
print(f"Filtered reviews DataFrame shape: {reviews_df.shape}")
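
An equivalent way to express the Step 3 filter is to attach each user's interaction count to every row with `groupby(...).transform("size")` and keep the rows whose count falls in the inclusive range. The sketch below uses invented data and local `min_cnt` and `max_cnt` values rather than the notebook's settings.

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["a"] * 1 + ["b"] * 3 + ["c"] * 6,
    "product_id": list(range(10)),
})
min_cnt, max_cnt = 2, 5  # stand-ins for MIN_INTERACTION_CNT / MAX_INTERACTION_CNT

# For every row, the number of interactions of the user who owns that row
counts = toy.groupby("username")["product_id"].transform("size")
filtered = toy[counts.between(min_cnt, max_cnt)]  # both bounds are inclusive

print(filtered["username"].unique())
```

User `a` falls below the minimum and user `c` exceeds the maximum, so only the rows of user `b` remain.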



In [None]:
reviews_df

In [None]:
# Step 4: Convert to item, interaction, and user records with tqdm's progress_apply

from tqdm import tqdm

tqdm.pandas()

def join_list(value, default=""):
    # Missing list fields load as NaN, which is truthy, so test the type explicitly
    return "/".join(map(str, value)) if isinstance(value, list) and value else default

def format_price(value):
    # The price column may also hold text (e.g. a free-to-play label); only prefix numbers with "$"
    if pd.isna(value):
        return "NA"
    return str(value) if isinstance(value, str) else "$" + str(value)

def process_game_to_item(row):
    return {
        "item_id": str(int(row["id"])),
        "item_description": {
            "app_name": str(row["app_name"]) if pd.notna(row["app_name"]) else "",
            "genres": join_list(row["genres"]),
            "tags": join_list(row["tags"]),
            "developer": str(row["developer"]) if pd.notna(row["developer"]) else "Unknown",
            "publisher": str(row["publisher"]) if pd.notna(row["publisher"]) else "Unknown",
            "specs": join_list(row["specs"], default="NA"),  # store feature labels (e.g. Single-player)
            "sentiment": str(row["sentiment"]) if pd.notna(row["sentiment"]) else "NA",
            "price": format_price(row["price"]),
            "release_date": str(row["release_date"]) if pd.notna(row["release_date"]) else "",
        },
        "item_features": {
            "release_date": row["release_date"] if pd.notna(row["release_date"]) else "",
            "early_access": bool(row["early_access"]) if pd.notna(row["early_access"]) else False,
        }
    }

# Convert games to item features
item_features = games_df.progress_apply(process_game_to_item, axis=1).to_list()

def process_review_to_interaction(row):
    return {
        "user_id": str(row["username"]),
        "item_id": str(row["product_id"]),
        "timestamp": str(row["date"]),  # Assuming 'date' is already in YYYY-MM-DD HH:MM:SS format
        "behavior_features": {
            "hours_played": float(row["hours"]) if pd.notna(row["hours"]) else 0.0,
            "review_text": str(row["text"]) if pd.notna(row["text"]) else "",
        }
    }

# Convert reviews to interactions
interactions = reviews_df.progress_apply(process_review_to_interaction, axis=1).to_list()

# Create user features
unique_users = reviews_df.groupby('username')[['products']].first().reset_index()

def process_user_to_user(row):
    return {
        "user_id": str(row["username"]),
        "user_description": {
            "products": str(int(row["products"])) if pd.notna(row["products"]) else "",
        },
        "user_features": {
            "products": int(row["products"]) if pd.notna(row["products"]) else 0,
        }
    }

user_features = unique_users.progress_apply(process_user_to_user, axis=1).to_list()
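
The interaction schema expects timestamps as `YYYY-MM-DD HH:MM:SS`, while Step 4 relies on `str(row["date"])` already having that shape. A defensive sketch, using the hypothetical helper `normalize_timestamp`, accepts either an ISO date string or a `datetime` object (a pandas `Timestamp` is a `datetime` subclass) and always returns the target format:

```python
from datetime import date, datetime

def normalize_timestamp(value):
    """Return value formatted as 'YYYY-MM-DD HH:MM:SS'."""
    if isinstance(value, datetime):
        dt = value
    elif isinstance(value, date):
        # A bare date has no time part, so use midnight
        dt = datetime(value.year, value.month, value.day)
    else:
        # Accepts ISO strings such as '2017-12-17' or '2017-12-17 08:30:00'
        dt = datetime.fromisoformat(str(value).strip())
    return dt.strftime("%Y-%m-%d %H:%M:%S")

print(normalize_timestamp("2017-12-17"))
print(normalize_timestamp(datetime(2017, 12, 17, 8, 30)))
```

Missing dates are not handled here; they would need a separate rule, such as the random timestamp mentioned in the schema.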




In [None]:
# Step 5: Save the processed data

item_df = pd.DataFrame(item_features).drop_duplicates(subset=['item_id'], keep='first')
user_df = pd.DataFrame(user_features).drop_duplicates(subset=['user_id'], keep='first')
interaction_df = pd.DataFrame(interactions)


print("\nFinal shapes:")
print(f"Items: {len(item_df)}")
print(f"Users: {len(user_df)}")
print(f"Interactions: {len(interaction_df)}")

# Save the processed data
# OUTPUT_PATH already ends in "raws/", so write directly into it
os.makedirs(OUTPUT_PATH, exist_ok=True)

print("\nSaving files...")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), orient="records", lines=True)
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), orient="records", lines=True)
interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), orient="records", lines=True)

print("Done!")
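
After saving, a quick consistency check can confirm that the files load line by line and that interactions only reference known IDs. The sketch below is an assumption about what is worth checking, not part of the original pipeline. `read_jsonl` and `dangling_ids` are hypothetical helpers, and the demo uses in-memory records in place of the saved files.

```python
import json

def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def dangling_ids(interactions, items, users):
    """Return item and user IDs used in interactions but absent from the feature records."""
    item_ids = {rec["item_id"] for rec in items}
    user_ids = {rec["user_id"] for rec in users}
    missing_items = {rec["item_id"] for rec in interactions} - item_ids
    missing_users = {rec["user_id"] for rec in interactions} - user_ids
    return missing_items, missing_users

# Tiny in-memory demo with invented records
items = [{"item_id": "10"}]
users = [{"user_id": "a"}]
interactions = [
    {"user_id": "a", "item_id": "10"},
    {"user_id": "a", "item_id": "99"},  # references an item with no feature record
]
missing_items, missing_users = dangling_ids(interactions, items, users)
print(missing_items, missing_users)
```

With the real outputs, the three lists would come from `read_jsonl` applied to the files written in Step 5.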