# Dataset Construction - LastFM

This notebook converts a recommender system dataset into a unified sequential format of plain text strings, which serves as the input to the user simulator LLM.

Recommender system data usually consists of three critical components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are represented by the interactions between them.

The goal of this notebook is to transform the original data into a unified format made up of three types of JSON records:

- Item Feature JSON: a list of all items in the dataset. Each element follows this format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be processed into plain text strings that a human can understand. Any feature that contains unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible, or removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: a list of all users in the dataset. Each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be processed into plain text strings that a human can understand. Any feature that contains unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible, or removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: a list of all user behaviors in the dataset. Each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be processed into plain text strings that a human can understand. Any feature that contains unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible, or removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
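
The three schemas above can be checked mechanically before the files are written. The sketch below is one possible validator and is not part of the original pipeline; `REQUIRED_KEYS` and `check_record` are hypothetical names introduced here for illustration.

```python
from datetime import datetime

# Required top-level keys per record type, copied from the schemas above.
REQUIRED_KEYS = {
    "item": ("item_id", "item_description", "item_features"),
    "user": ("user_id", "user_description", "user_features"),
    "interaction": ("user_id", "item_id", "timestamp", "behavior_features"),
}


def check_record(kind, record):
    """Return a list of problems found in one record (an empty list means OK)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS[kind] if key not in record]
    # IDs are specified as strings in all three schemas.
    for key in ("user_id", "item_id"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} should be a str")
    # Interactions carry a timestamp in the YYYY-MM-DD HH:MM:SS format.
    if kind == "interaction" and "timestamp" in record:
        try:
            datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
        except (TypeError, ValueError):
            problems.append("timestamp should look like YYYY-MM-DD HH:MM:SS")
    return problems
```

Calling `check_record("interaction", row)` on each row before saving would surface integer IDs or unconverted timestamps early.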

## Download the raw data

You can download the raw data from the following links:
- [LastFM](https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip)

After downloading the raw data, unzip the archive and set `DATASET_PATH` below to the extracted folder.
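
For a scripted setup, the download and unzip steps might look like the following sketch. It uses only the Python standard library; `extract_archive` and `download_and_extract` are illustrative helper names, and the target folder is whichever directory you later assign to `DATASET_PATH`.

```python
import os
import urllib.request
import zipfile

# URL taken from the link above.
LASTFM_URL = "https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip"


def extract_archive(zip_path, target_dir):
    """Unzip zip_path into target_dir and return the sorted member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


def download_and_extract(target_dir, url=LASTFM_URL):
    """Download the archive into target_dir (skipped if already present), then unzip it."""
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, target_dir)
```

Nothing is downloaded until `download_and_extract("<SOURCE_PATH>")` is called explicitly.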

In [1]:
# Step 1: Base settings
DATASET_NAME = "LastFM"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5 
MAX_INTERACTION_CNT = 20  

In [None]:
# Step 2: Load the dataset
# This step loads the dataset as a pandas dataframe and displays the shape and head of the dataframe
# Since the structure of each dataset is different, the loading needs to be done according to the actual situation of the dataset
import os
import pandas as pd

# DATASET_PATH (set in the first cell) should point to the directory containing the unzipped .dat files

# Define the paths of each file
artists_file = os.path.join(DATASET_PATH, "artists.dat")
tags_file = os.path.join(DATASET_PATH, "tags.dat")
user_artists_file = os.path.join(DATASET_PATH, "user_artists.dat")
user_friends_file = os.path.join(DATASET_PATH, "user_friends.dat")
user_taggedartists_ts_file = os.path.join(DATASET_PATH, "user_taggedartists-timestamps.dat")
user_taggedartists_file = os.path.join(DATASET_PATH, "user_taggedartists.dat")

# ---------------------------
# Load the data (specify the correct encoding)
# ---------------------------
# Load the artists.dat file, keeping only the id and name columns, used for item features and subsequent artist_name lookup
artists_df = pd.read_csv(
    artists_file,
    sep="\t",
    usecols=["id", "name"],
    encoding="utf-8"      # Set to UTF-8 according to the detection result
)

# Load the tags.dat file, used to map tagID to tagValue
tags_df = pd.read_csv(
    tags_file,
    sep="\t",
    encoding="ISO-8859-1"  # Set to ISO-8859-1 according to the detection result
)

# Load the user_artists.dat file, containing userID, artistID, weight
user_artists_df = pd.read_csv(
    user_artists_file,
    sep="\t",
    encoding="ascii"      # Set to ASCII according to the detection result
)

# Load the user_friends.dat file (no additional processing is done this time, but still loaded to display the data)
user_friends_df = pd.read_csv(
    user_friends_file,
    sep="\t",
    encoding="ascii"
)

# Load the user_taggedartists-timestamps.dat file, containing userID, artistID, tagID, timestamp
user_taggedartists_ts_df = pd.read_csv(
    user_taggedartists_ts_file,
    sep="\t",
    encoding="ascii"
)

# Load the user_taggedartists.dat file, containing userID, artistID, tagID, day, month, year
user_taggedartists_df = pd.read_csv(
    user_taggedartists_file,
    sep="\t",
    encoding="ascii"
)

# ---------------------------
# Display the shape after the data is loaded
# ---------------------------
print("Data loading completed:")
print(f"- artists_df (item features): {artists_df.shape}")
print(f"- tags_df: {tags_df.shape}")
print(f"- user_artists_df (user features): {user_artists_df.shape}")
print(f"- user_friends_df: {user_friends_df.shape}")
print(f"- user_taggedartists_ts_df (interaction data part): {user_taggedartists_ts_df.shape}")
print(f"- user_taggedartists_df (interaction data part): {user_taggedartists_df.shape}")
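
The encodings passed to `pd.read_csv` above are described as coming from a detection step that the notebook does not show. One simple way to run a comparable check, assuming each file fits in memory, is to try a few candidate encodings in order. `guess_encoding` is a hypothetical helper, not the detector the authors used.

```python
def guess_encoding(path, candidates=("ascii", "utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file without error."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    return None
```

Because ISO-8859-1 accepts any byte sequence, it acts as a last-resort fallback here rather than as positive evidence, so it should stay at the end of the candidate list.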


In [None]:
from collections import Counter
# ---------------------------
# Construct item_features_df
# ---------------------------
# Item features only use id and name from artists.dat
item_features_df = artists_df.copy()
item_features_df.rename(columns={"id": "item_id", "name": "artist_name"}, inplace=True)

# ---------------------------
# Construct user_features_df
# ---------------------------
# Users are taken from user_artists.dat (columns: userID, artistID, weight)
# Build an artistID -> artist_name lookup; it is kept for reference and is not used further in this notebook
artist_name_map = artists_df.set_index("id")["name"].to_dict()

# Extract unique userID from user_artists_df to build user_features_df
user_features_df = pd.DataFrame({'user_id': user_artists_df['userID'].unique()})

# ---------------------------
# Aggregate user_friends.dat data
# ---------------------------
# Collect all friendIDs of each userID in user_friends.dat into one list
aggregated_friends = user_friends_df.groupby("userID")["friendID"].apply(list).reset_index()
aggregated_friends.rename(columns={"userID": "user_id"}, inplace=True)

# Merge the aggregated friend information into user_features_df (using user_id as the key)
user_features_df = pd.merge(user_features_df, aggregated_friends, on="user_id", how="left")

# ---------------------------
# Construct interaction_df
# ---------------------------
# Merge user_taggedartists-timestamps.dat with user_taggedartists.dat
interaction_df = pd.merge(
    user_taggedartists_ts_df,
    user_taggedartists_df[["userID", "artistID", "tagID", "day", "month", "year"]],
    on=["userID", "artistID", "tagID"],
    how="inner"
)

# Use tags_df to build a tagID -> tagValue mapping
tag_value_map = tags_df.set_index("tagID")["tagValue"].to_dict()
interaction_df["tagValue"] = interaction_df["tagID"].map(lambda x: tag_value_map.get(x, ""))

# Rename columns to conform to a unified format
interaction_df.rename(columns={"userID": "user_id", "artistID": "item_id"}, inplace=True)

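# Attach each artist's five most frequent tags as (tag, count) pairs.
# Map on item_id: assigning the grouped Series directly would align it with the row index, not the artist ID.
top_tags = interaction_df.groupby("item_id")["tagValue"].apply(list).apply(lambda x: Counter(x).most_common(5))
item_features_df["tags"] = item_features_df["item_id"].map(top_tags)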

# Collapse the data to one row per (user_id, item_id) pair
interaction_df = interaction_df.groupby(['user_id', 'item_id']).agg({
    'timestamp': 'first',  # Keep the first timestamp seen for the pair
    'day': 'first',        # Likewise keep the first day, month, and year
    'month': 'first',
    'year': 'first',
    'tagID': lambda x: list(set(x)),     # Unique tag IDs the user gave this artist
    'tagValue': lambda x: list(set(x))   # Unique tag texts the user gave this artist
}).reset_index()


# ---------------------------
# Display the shape and header data of the final dataframes
# ---------------------------
print("\nShape of the final dataframes:")
print(f"item_features_df: {item_features_df.shape}")
print(f"user_features_df: {user_features_df.shape}")
print(f"interaction_df: {interaction_df.shape}")

print("\nitem_features_df header data:")
display(item_features_df.head())

print("\nuser_features_df header data:")
display(user_features_df.head())

print("\ninteraction_df header data:")
display(interaction_df.head())
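
The per-artist tag summary is the least obvious part of the cell above, so here is a minimal sketch of the idea on a toy frame. The values are made up for illustration and are not taken from LastFM; the sketch joins the tag counts to the items by `item_id` rather than by row position.

```python
from collections import Counter

import pandas as pd

# Toy data (invented): three items, tags recorded for two of them.
toy_items = pd.DataFrame({"item_id": [10, 20, 30], "artist_name": ["A", "B", "C"]})
toy_tags = pd.DataFrame({
    "item_id": [10, 10, 10, 30],
    "tagValue": ["rock", "rock", "indie", "jazz"],
})

# For each item, count its tags and keep the five most common (tag, count) pairs.
top_tags = toy_tags.groupby("item_id")["tagValue"].apply(
    lambda tags: Counter(tags).most_common(5)
)

# Look the counts up by item_id; items without any tag end up as NaN.
toy_items["tags"] = toy_items["item_id"].map(top_tags)
print(toy_items)
```

Items that were never tagged keep a missing value in `tags`, which is why the later conversion step has to check that the field is a list before joining it.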

In [None]:
# Step 3: Convert datasets to a unified format

# Convert item, user, and interaction data

from tqdm import tqdm
from datetime import datetime

tqdm.pandas()

def process_item_df(row):
    """
    Process each row in the item features dataframe
    """
    return {
        "item_id": str(row["item_id"]),  # IDs are stored as strings, as the unified format requires
        "item_description": {
            "name": row["artist_name"],
            # Join the top tag names into one readable string; empty if the artist has no tags
            "tags": "/".join([tag for tag, _ in row["tags"]]) if isinstance(row["tags"], list) else ""
        },
        "item_features": {
            "tags": row["tags"]
        }
    }

def process_user_df(row):
    """
    Process each row in the user features dataframe, leaving empty if information is missing in the dataset
    """
    return {
        "user_id": str(row["user_id"]),
        "user_description": {
        },
        "user_features": {
            "friends": row["friendID"]
        }
    }

def process_interaction_df(row):
    """
    Process each row in the interaction dataframe
    """
    return {
        "user_id": str(row["user_id"]),
        "item_id": str(row["item_id"]),
        "timestamp": row["timestamp"],  # Still the raw value here; formatted in a later cell
        "behavior_features": {
            "tagID": row["tagID"],
            "tagValue": row["tagValue"]
        }
    }


# Apply the above functions to each row of the respective dataframes and convert to List
item_df = item_features_df.progress_apply(process_item_df, axis=1).to_list()
user_df = user_features_df.progress_apply(process_user_df, axis=1).to_list()
interaction_df = interaction_df.progress_apply(process_interaction_df, axis=1).to_list()


In [None]:
# Step 4: Save the data

# Convert the processed data into pandas dataframes
item_df = pd.DataFrame(item_df)
user_df = pd.DataFrame(user_df)
interaction_df = pd.DataFrame(interaction_df)

print("\nFinal shape of the dataframes:")
print(f"Items: {len(item_df)}")
print(f"Users: {len(user_df)}")
print(f"Interactions: {len(interaction_df)}")


In [6]:
# Convert the millisecond epoch timestamps to 'YYYY-MM-DD HH:MM:SS' strings
interaction_df['timestamp'] = pd.to_datetime(interaction_df['timestamp'], unit='ms').dt.strftime('%Y-%m-%d %H:%M:%S')
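
The conversion above assumes that the raw `timestamp` column holds Unix epoch values in milliseconds. A minimal check on hand-picked values (not taken from the dataset) shows what `unit="ms"` does:

```python
import pandas as pd

# Hand-picked epoch values in milliseconds, for illustration only.
raw = pd.Series([0, 86_400_000, 1_230_768_000_000])
formatted = pd.to_datetime(raw, unit="ms").dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted.tolist())
# ['1970-01-01 00:00:00', '1970-01-02 00:00:00', '2009-01-01 00:00:00']
```

If the converted dates of a dataset cluster around 1970, the raw values are probably in seconds and `unit="s"` would be the better fit.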


In [None]:
# Keep only interactions whose user and item both appear in the feature tables
interaction_df = interaction_df[
    interaction_df['user_id'].isin(user_df['user_id']) & 
    interaction_df['item_id'].isin(item_df['item_id'])
]
interaction_df
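
`MIN_INTERACTION_CNT` and `MAX_INTERACTION_CNT` are defined in the first cell but are not applied anywhere in this notebook. If they are meant to bound the number of interactions per user (an assumption; the source does not say), a filter along the following lines could be applied at this point. `limit_interactions` is a hypothetical helper, and keeping the most recent rows is only one possible choice.

```python
import pandas as pd


def limit_interactions(df, min_cnt=5, max_cnt=20):
    """Drop users with fewer than min_cnt rows; keep each user's max_cnt most recent rows."""
    # Number of interactions per user, broadcast back onto every row.
    per_user = df["user_id"].map(df["user_id"].value_counts())
    kept = df[per_user >= min_cnt]
    # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically, so a plain sort is enough.
    kept = kept.sort_values(["user_id", "timestamp"])
    return kept.groupby("user_id").tail(max_cnt).reset_index(drop=True)
```

With the notebook's constants this would be called as `limit_interactions(interaction_df, MIN_INTERACTION_CNT, MAX_INTERACTION_CNT)` after the timestamps have been formatted.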

In [None]:
item_df

In [None]:
user_df

In [None]:
# Save the processed data
os.makedirs(OUTPUT_PATH, exist_ok=True)

print("\nSaving files...")

interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")

print("\nProcessing complete.")
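
To confirm that the saved files are valid JSON Lines, they can be read back one line at a time. This is a small sketch using only the standard library; `read_jsonl` is an illustrative helper name, not something defined elsewhere in the project.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `read_jsonl(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"))[:3]` would show the first three interaction records as they will be seen by downstream code.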