# Dataset Construction - Amazon

This notebook converts a recommender system dataset into a unified sequential format of plain-text strings, which serves as the input to the user simulator LLM.

Recommender system data usually consists of three critical components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are represented by the interactions between them.

The goal of this notebook is to transform the original data into a unified format made up of three types of JSON records:

- Item Feature JSON: This JSON contains a list of all items in the dataset. Each element follows the format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be processed into plain-text strings that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This JSON contains a list of all users in the dataset. Each element follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be processed into plain-text strings that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This JSON contains a list of all user behaviors in the dataset. Each element follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be processed into plain-text strings that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
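
The three record types above can be illustrated with a small sketch. The record values, the `REQUIRED_KEYS` table, and the helpers `missing_keys` and `to_jsonl_line` are hypothetical names introduced here for illustration; they are not part of the notebook's pipeline.

```python
import json

# Hypothetical example records; the IDs and values are made up for illustration.
item_record = {
    "item_id": "ITEM_001",
    "item_description": {"title": "Example scarf", "price": "12.99"},
    "item_features": {},
}
user_record = {
    "user_id": "USER_001",
    "user_description": {},
    "user_features": {},
}
interaction_record = {
    "user_id": "USER_001",
    "item_id": "ITEM_001",
    "timestamp": "2020-03-04 15:40:08",
    "behavior_features": {"rating": 5, "review_text": "Soft and warm."},
}

# Required top-level keys for each record type, taken from the schemas above.
REQUIRED_KEYS = {
    "item": {"item_id", "item_description", "item_features"},
    "user": {"user_id", "user_description", "user_features"},
    "interaction": {"user_id", "item_id", "timestamp", "behavior_features"},
}

def missing_keys(record, kind):
    """Return the sorted list of required keys that `record` lacks."""
    return sorted(REQUIRED_KEYS[kind] - set(record))

def to_jsonl_line(record):
    """Serialize one record as a single JSONL line (without the trailing newline)."""
    return json.dumps(record, ensure_ascii=False)

print(missing_keys(item_record, "item"))
print(to_jsonl_line(interaction_record))
```

Each `.jsonl` file then holds one such serialized record per line.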


## Download the raw data

You can download the raw data from the following links:
- [Amazon](https://amazon-reviews-2023.github.io/)

You should download the following files:
- `meta_<SUBCATEGORY>.jsonl.gz`
- `<SUBCATEGORY>.jsonl.gz`
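
Before loading a whole file with pandas, it can help to peek at the first few records. The sketch below reads a gzip-compressed JSONL file line by line using only the standard library. The helper `iter_jsonl_gz` and the demo file are hypothetical; the demo data is made up and is not real Amazon data.

```python
import gzip
import json
import os
import tempfile

def iter_jsonl_gz(path, limit=None):
    """Yield parsed JSON objects from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            line = line.strip()
            if line:
                yield json.loads(line)

# Self-contained demo on a tiny, made-up file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"parent_asin": "X1", "title": "Demo item"}) + "\n")
    f.write(json.dumps({"parent_asin": "X2", "title": "Another item"}) + "\n")

first_rows = list(iter_jsonl_gz(demo_path, limit=1))
print(first_rows)
```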


In [1]:
# Basic Setting of the Filtered Dataset
DATASET_NAME = "Amazon-Fashion"
SUBCATEGORY = "Amazon_Fashion"

DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5     # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.


In [2]:
# Step 1: Identify the file structure of the dataset

import os
# List all files and subfolders in the dataset directory, excluding hidden ones
print("Files and subfolders in dataset directory:")
for root, dirs, files in os.walk(DATASET_PATH):
    # Remove hidden directories
    dirs[:] = [d for d in dirs if not d.startswith('.')]
    
    level = root.replace(DATASET_PATH, '').count(os.sep)
    indent = ' ' * 4 * level
    folder_name = os.path.basename(root)
    if not folder_name.startswith('.'):
        print(f"{indent}{folder_name}/")
        subindent = ' ' * 4 * (level + 1)
        # Only show non-hidden files
        visible_files = [f for f in files if not f.startswith('.')]
        for f in visible_files:
            print(f"{subindent}{f}")

Files and subfolders in dataset directory:
Amazon/
    Industrial_and_Scientific.jsonl.gz
    Baby_Products.jsonl.gz
    meta_All_Beauty.jsonl.gz
    Musical_Instruments.jsonl.gz
    All_Beauty.json.gz
    Appliances.jsonl.gz
    meta_Books.jsonl.gz
    meta_Office_Products.json
    reviews_Office_Products.json
    meta_Beauty_and_Personal_Care.jsonl.gz
    meta_Health_and_Personal_Care.jsonl.gz
    Video_Games.jsonl.gz
    reviews_Video_Games.json
    meta_Digital_Music.jsonl.gz
    Beauty_and_Personal_Care.jsonl.gz
    Software.jsonl.gz
    meta_Musical_Instruments.jsonl.gz
    meta_Baby_Products.jsonl.gz
    meta_Amazon_Fashion.jsonl.gz
    All_Beauty.jsonl.gz
    Office_Products.jsonl.gz
    CDs_and_Vinyl.jsonl.gz
    Books.jsonl.gz
    meta_Musical_Instruments.json
    meta_Grocery_and_Gourmet_Food.jsonl.gz
    reviews_Musical_Instruments.json
    meta_Industrial_and_Scientific.jsonl.gz
    Health_and_Personal_Care.jsonl.gz
    meta_All_Beauty.json.gz
    Grocery_and_Gourmet_Food.

**File Structure**
> Fill the file structure here so the LLM assistant can write the code to load the dataset.

```txt

```

## Step 1: Identify the file structure of the dataset

**Readme Content**
> Fill the readme content here so the LLM assistant can write the code to load the dataset.

```txt

```


## Step 2: Load the dataset

In [3]:
# Step 2: Load the dataset

import os
import pandas as pd


# Load the item features, keeping one row per parent ASIN
item_features_df = pd.read_json(os.path.join(DATASET_PATH, f'meta_{SUBCATEGORY}.jsonl.gz'), lines=True, compression='gzip')
item_features_df = item_features_df.drop_duplicates(subset='parent_asin')
print("Item features and daily features loaded")

# Load the interactions
interaction_df = pd.read_json(os.path.join(DATASET_PATH, f'{SUBCATEGORY}.jsonl.gz'), lines=True, compression='gzip')
print("Interactions loaded")

# Generate user features: one row per unique user ID found in the interactions
user_df = interaction_df[['user_id']].drop_duplicates(subset='user_id')
print("User features loaded")

# Display the shape of each dataframe
print("Loaded dataframes shapes:")
print(f"- Interactions: {interaction_df.shape}")
print(f"- User features: {user_df.shape}")
print(f"- Item features: {item_features_df.shape}")

print("Data loaded")
# print head
print("Interaction dataframe head:")
print(interaction_df.head())
print("User features dataframe head:")
print(user_df.head())
print("Item features dataframe head:")
print(item_features_df.head())


Item features and daily features loaded
Interactions loaded
User features loaded
Loaded dataframes shapes:
- Interactions: (2500939, 10)
- User features: (2035490, 1)
- Item features: (826108, 14)
Data loaded
Interaction dataframe head:
   rating                 title  \
0       5         Pretty locket   
1       5                     A   
2       2             Two Stars   
3       1       Won’t buy again   
4       5  I LOVE these glasses   

                                                text images        asin  \
0  I think this locket is really pretty. The insi...     []  B00LOPVX74   
1                                              Great     []  B07B4JXK8D   
2  One of the stones fell out within the first 2 ...     []  B007ZSEQ4Q   
3  Crappy socks. Money wasted. Bought to wear wit...     []  B07F2BTFS9   
4  I LOVE these glasses!  They fit perfectly over...     []  B00PKRFU4O   

  parent_asin                       user_id               timestamp  \
0  B00LOPVX74  AGBFYI2DDIKXC5Y

**Data Descriptions**
> Fill in the descriptions of the dataframes loaded from the dataset here so the LLM assistant can write the code to process the dataset.

```txt

```


## Step 3: Filter the dataset

In [4]:
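# Thresholds for the Beauty and Fashion subcategories
MIN_INTERACTION_CNT = 5   # Minimum number of interactions per user and per item
MAX_INTERACTION_CNT = 20  # Maximum number of interactions per user
MIN_RATING_NUMBER = 10    # Minimum `rating_number` for an item to be kept
MIN_DETAILS_LENGTH = 1    # An item's `details` must have more than this many entries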


In [5]:
# Downsampling: sort the interactions chronologically and keep only the most recent 30%

interaction_df = interaction_df.sort_values(by="timestamp").reset_index(drop=True)
interaction_df = interaction_df.iloc[int(len(interaction_df) * 0.7):]
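
The downsampling step above drops the oldest 70% of interactions and keeps the rest. A minimal sketch of the same idea on toy data follows. The function `keep_most_recent`, its `drop_fraction` parameter, and the toy dataframe are hypothetical names used only for illustration.

```python
import pandas as pd

def keep_most_recent(df, drop_fraction=0.7, time_col="timestamp"):
    """Sort by time and drop the oldest `drop_fraction` of the rows."""
    ordered = df.sort_values(by=time_col).reset_index(drop=True)
    start = int(len(ordered) * drop_fraction)
    return ordered.iloc[start:]

# Toy data: ten interactions on ten consecutive days, in shuffled order.
toy = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(10)],
    "timestamp": pd.to_datetime("2020-01-01") + pd.to_timedelta(list(range(10)), unit="D"),
})
shuffled = toy.sample(frac=1, random_state=0)

recent = keep_most_recent(shuffled)
print(recent)
```

With ten rows and `drop_fraction=0.7`, the three latest interactions remain.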

In [6]:
# Step 3: Filter the dataset
# If there are constraints in the settings, apply them to filter the dataset. Otherwise, skip this step.
# This cell applies the minimum thresholds only; MAX_INTERACTION_CNT is defined but not applied here.

# Interaction counts per user and per item, computed before any item filtering
user_count = interaction_df.groupby('user_id').size()
user_count = user_count[user_count >= MIN_INTERACTION_CNT]

item_count = interaction_df.groupby('parent_asin').size()
item_count = item_count[item_count >= MIN_INTERACTION_CNT]

# Keep items with enough ratings, enough `details` entries, and enough interactions
item_features_df = item_features_df[item_features_df['rating_number'] >= MIN_RATING_NUMBER]
item_features_df = item_features_df[item_features_df["details"].apply(len) > MIN_DETAILS_LENGTH]
item_features_df = item_features_df[item_features_df['parent_asin'].isin(item_count.index)]
# Keep interactions whose item survived the item filter
interaction_df = interaction_df[interaction_df['parent_asin'].isin(item_features_df['parent_asin'])]

# Keep interactions from users who met the minimum count before the item filter
interaction_df = interaction_df[interaction_df['user_id'].isin(user_count.index)]
# Final consistency check against the item and user tables
interaction_df = interaction_df[interaction_df['parent_asin'].isin(item_features_df['parent_asin']) & interaction_df['user_id'].isin(user_df['user_id'])]

print(f"- Interactions: {interaction_df.shape}")

- Interactions: (5025, 10)
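
The filter above counts interactions per user before the item filter is applied, so a user who passed the threshold can end up with fewer than `MIN_INTERACTION_CNT` interactions afterwards. One common remedy is to repeat the user and item filters until nothing changes. The following is a minimal sketch of that idea, not the procedure used in this notebook; `iterative_filter`, its parameters, and the toy data are hypothetical.

```python
import pandas as pd

def iterative_filter(df, min_user=5, min_item=5, max_user=None,
                     user_col="user_id", item_col="parent_asin"):
    """Repeat the user and item count filters until no more rows are removed."""
    while True:
        n_before = len(df)
        # Number of remaining interactions of each row's user
        user_count = df[user_col].map(df[user_col].value_counts())
        keep = user_count >= min_user
        if max_user is not None:
            keep &= user_count <= max_user
        df = df[keep]
        # Number of remaining interactions of each row's item
        item_count = df[item_col].map(df[item_col].value_counts())
        df = df[item_count >= min_item]
        if len(df) == n_before:
            return df

# Toy data: u3 has a single interaction, and i3 loses support once u3 is removed.
toy = pd.DataFrame({
    "user_id":     ["u1", "u1", "u1", "u2", "u2", "u3"],
    "parent_asin": ["i1", "i2", "i3", "i1", "i2", "i3"],
})
result = iterative_filter(toy, min_user=2, min_item=2)
print(result)
```

In the toy data, the first pass removes `u3` and then `i3`, and the second pass confirms that every remaining user and item still meets its threshold.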


In [7]:
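# Items that appear in at least one remaining interaction
inter_asin = interaction_df["parent_asin"].unique()
inter_interaction_df = item_features_df[item_features_df['parent_asin'].isin(inter_asin)]

# Items that passed the item filter but have no remaining interaction
outer_interaction_df = item_features_df[~item_features_df['parent_asin'].isin(inter_asin)]

# Cap the number of non-interacted items at 65536 with a fixed random seed
if len(outer_interaction_df) > 65536:
    outer_interaction_df = outer_interaction_df.sample(65536, random_state=0)

# Final item table: interacted items plus the (possibly sampled) non-interacted items
concat_item_features_df = pd.concat([
    inter_interaction_df,
    outer_interaction_df
]).drop_duplicates(subset='parent_asin')

item_features_df = concat_item_features_df

print(item_features_df.shape)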

(18578, 14)


## Step 4: Convert the dataset to the unified format

In [8]:
# Step 4: Convert the dataset to the unified format

# Use tqdm and pandas.progress_apply to process the dataset.
# Based on the data descriptions, convert the dataset to 3 dataframes: item_df, user_df, interaction_df.
# Write 3 row processing functions: process_item_df, process_user_df, process_interaction_df.

from tqdm import tqdm
from datetime import datetime

def convert_list_to_str(description_list):
    # Join the list elements with ", " and end with a newline; return "" for an empty list
    if len(description_list) == 0:
        return ""
    return ', '.join(description_list).strip() + "\n"

def convert_dict_to_str(description_dict):
    # Join "key: value" pairs with ";" and end with a newline; return "" for an empty dict
    if len(description_dict) == 0:
        return ""
    return ';'.join([f"{k}: {v}" for k, v in description_dict.items()]).strip() + "\n"

def process_item_df(row):
    return {
        "item_id": row['parent_asin'],
        "item_description": {
            "title": row['title'],
            # A missing price may be loaded as NaN, which is truthy, so test it with pd.notna
            "price": row['price'] if pd.notna(row['price']) else "NA",
            "store": row['store'],
            "average_rating": row['average_rating'],
            "rating_number": row['rating_number'],
            "category": "-".join(row['categories']) if len(row['categories']) > 0 else str(row['main_category']),
            "description": convert_dict_to_str(row['details']),
            "features": convert_list_to_str(row['features'])
        },
        "item_features": {

        }
    }

def process_user_df(row):
    return {
        "user_id": row['user_id'],
        "user_description": {
        },
        "user_features": {
        }
    }

def process_interaction_df(row):
    return {
        "user_id": row['user_id'],
        "item_id": row['parent_asin'],
        # Note: "%f" appends microseconds, which goes beyond the
        # YYYY-MM-DD HH:MM:SS format described at the top of this notebook.
        "timestamp": datetime.strftime(row['timestamp'], '%Y-%m-%d %H:%M:%S.%f'),
        "behavior_features": {
            "rating": row['rating'],
            "review_text": row['text'],
            "review_summary": row['title']
        }
    }

# Apply the processing functions to the dataset

# Process items
tqdm.pandas(desc="Processing items")
item_records = item_features_df.progress_apply(process_item_df, axis=1).tolist()

# Process users
tqdm.pandas(desc="Processing users")
user_df = user_df[user_df['user_id'].isin(interaction_df['user_id'].unique())]
user_records = user_df.progress_apply(process_user_df, axis=1).tolist()

# Process interactions
tqdm.pandas(desc="Processing interactions")
interaction_records = interaction_df.progress_apply(process_interaction_df, axis=1).dropna().tolist()

Processing items: 100%|██████████| 18578/18578 [00:00<00:00, 51257.45it/s]
Processing users: 100%|██████████| 1797/1797 [00:00<00:00, 213880.94it/s]
Processing interactions: 100%|██████████| 5025/5025 [00:00<00:00, 71163.11it/s]
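
To make the text conversion concrete, the sketch below shows what the two helper functions return for small inputs, and how a timestamp could be formatted at second resolution by leaving out the `.%f` suffix. The helpers are copied from the cell above so the block runs on its own; the input values are made up.

```python
from datetime import datetime

# Copies of the two helpers from the cell above, repeated so the sketch is self-contained.
def convert_list_to_str(description_list):
    if len(description_list) == 0:
        return ""
    return ', '.join(description_list).strip() + "\n"

def convert_dict_to_str(description_dict):
    if len(description_dict) == 0:
        return ""
    return ';'.join([f"{k}: {v}" for k, v in description_dict.items()]).strip() + "\n"

# Made-up inputs for illustration.
features_text = convert_list_to_str(["100% cotton", "Machine wash"])
details_text = convert_dict_to_str({"Department": "Womens", "Color": "Blue"})

# Second-resolution timestamp: the same format string without the ".%f" suffix.
ts = datetime(2020, 3, 4, 15, 40, 8, 116000)
ts_text = ts.strftime("%Y-%m-%d %H:%M:%S")

print(repr(features_text))  # '100% cotton, Machine wash\n'
print(repr(details_text))   # 'Department: Womens;Color: Blue\n'
print(ts_text)              # 2020-03-04 15:40:08
```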


## Step 5: Save the dataset

In [9]:
# Step 5: Save the dataset
import os
# Convert records to pandas DataFrames
interaction_df = pd.DataFrame.from_records(interaction_records)
user_df = pd.DataFrame.from_records(user_records) 
item_df = pd.DataFrame.from_records(item_records)

print("Interaction dataframe head:")
print(interaction_df.head())
print("User features dataframe head:")
print(user_df.head())
print("Item features dataframe head:")
print(item_df.head())

Interaction dataframe head:
                        user_id     item_id                   timestamp  \
0  AHX325QNRR25RGJG44MKHYCMTA7Q  B083CT6ZZ2  2020-03-04 15:40:08.116000   
1  AEPEDW5FBBJ2XYR2BIJAKUPHCMHA  B083CZSGM3  2020-03-04 20:08:27.604000   
2  AH3BXW7KLIS2VAE56UXJS2NS7I5A  B083CT6ZZ2  2020-03-04 22:35:23.110000   
3  AHV6QCNBJNSGLATP56JAWJ3C4G2A  B07W8FK4CX  2020-03-05 03:53:51.592000   
4  AGK7OPKYWYPCFR52AZHDCBWPVCPQ  B083CT6ZZ2  2020-03-05 14:02:33.331000   

                                   behavior_features  
0  {'rating': 3, 'review_text': 'This is a really...  
1  {'rating': 5, 'review_text': 'This is a very n...  
2  {'rating': 4, 'review_text': 'I didn't have hi...  
3  {'rating': 4, 'review_text': 'My daughter part...  
4  {'rating': 3, 'review_text': 'First, let me st...  
User features dataframe head:
                        user_id user_description user_features
0  AFZUK3MTBIBEDQOPAK3OATUOUKLA               {}            {}
1  AFFZVSTUS3U2ZD22A2NPZSKOCPGQ    

In [10]:
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)
interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")
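
After saving, a quick sanity check is to read the files back and confirm that every interaction refers to a user and an item that exist in the feature files. The sketch below does this with the standard library on made-up records in a temporary directory. The helpers `read_jsonl`, `write_jsonl`, and `find_dangling`, and the `Demo_` file names, are hypothetical and not part of the notebook.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(path, records):
    """Write a list of dicts as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def find_dangling(interactions, users, items):
    """Return interactions whose user_id or item_id is missing from the feature files."""
    user_ids = {u["user_id"] for u in users}
    item_ids = {i["item_id"] for i in items}
    return [r for r in interactions
            if r["user_id"] not in user_ids or r["item_id"] not in item_ids]

# Made-up records standing in for the three output files.
out_dir = tempfile.mkdtemp()
paths = {name: os.path.join(out_dir, f"Demo_{name}.jsonl")
         for name in ("user_feature", "item_feature", "interaction")}
write_jsonl(paths["user_feature"], [{"user_id": "U1", "user_description": {}, "user_features": {}}])
write_jsonl(paths["item_feature"], [{"item_id": "I1", "item_description": {}, "item_features": {}}])
write_jsonl(paths["interaction"], [
    {"user_id": "U1", "item_id": "I1", "timestamp": "2020-03-04 15:40:08", "behavior_features": {}},
    {"user_id": "U1", "item_id": "I9", "timestamp": "2020-03-05 10:00:00", "behavior_features": {}},
])

dangling = find_dangling(read_jsonl(paths["interaction"]),
                         read_jsonl(paths["user_feature"]),
                         read_jsonl(paths["item_feature"]))
print(len(dangling))  # 1: the interaction with the unknown item "I9"
```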