# Dataset Construction - Goodreads

This notebook converts a recommender-system dataset (here, Goodreads) into a unified sequential format of plain-text strings, which serves as the input to the user-simulator LLM.

A recommender-system dataset usually consists of three components: users, items, and behaviors. Users and items are usually identified by their IDs, while behaviors are recorded as interactions between them.

The goal of this notebook is to transform the original data into a unified format made up of three JSON Lines files:

- Item Feature JSON: This file contains one record per item in the dataset. Each record follows the format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be processed into plain-text strings that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible, or removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This file contains one record per user in the dataset. Each record follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be processed into plain-text strings that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible, or removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This file contains one record per user behavior in the dataset. Each record follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be processed into plain-text strings that a human can read. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible, or removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
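
As a quick sanity check on the schemas above, the following sketch validates a single interaction record. The helper name `validate_interaction` and the example record are hypothetical and not part of the notebook; the check only covers the key names, the string IDs, and the timestamp layout described above.

```python
from datetime import datetime

REQUIRED_KEYS = {"user_id", "item_id", "timestamp", "behavior_features"}


def validate_interaction(record):
    """Return a list of problems found in one interaction record (empty if none)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    for key in ("user_id", "item_id"):
        if not isinstance(record[key], str):
            problems.append(f"{key} must be a str")
    try:
        # The unified format expects YYYY-MM-DD HH:MM:SS
        datetime.strptime(str(record["timestamp"]), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        problems.append("timestamp must look like YYYY-MM-DD HH:MM:SS")
    if not isinstance(record["behavior_features"], dict):
        problems.append("behavior_features must be a dict")
    return problems


# Made-up record used only for illustration
example = {
    "user_id": "u1",
    "item_id": "b42",
    "timestamp": "2015-11-04 10:05:07",
    "behavior_features": {"rating": 4.0, "is_read": True},
}
print(validate_interaction(example))  # prints []
```

The same pattern could be extended to the item and user records by swapping the set of required keys.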


## Download the raw data

You can download the raw data from the following links:
- [Goodreads](https://mengtingwan.github.io/data/goodreads.html#datasets)

You should download the following files:
- `goodreads_interactions_comics_graphic.json.gz`
- `goodreads_books_comics_graphic.json.gz`
- `goodreads_book_genres_initial.json.gz`
- `goodreads_reviews_comics_graphic.json.gz`
- `goodreads_book_authors.json.gz`
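
Before running the rest of the notebook, it can help to confirm that all five files are in place. The sketch below is one way to do that; `find_missing_files` is a hypothetical helper, and the demo runs against a temporary directory rather than the real dataset path.

```python
import os
import tempfile

REQUIRED_FILES = [
    "goodreads_interactions_comics_graphic.json.gz",
    "goodreads_books_comics_graphic.json.gz",
    "goodreads_book_genres_initial.json.gz",
    "goodreads_reviews_comics_graphic.json.gz",
    "goodreads_book_authors.json.gz",
]


def find_missing_files(dataset_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in dataset_dir."""
    present = set(os.listdir(dataset_dir))
    return [name for name in required if name not in present]


# Demo on a temporary directory that holds only the first of the five files
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, REQUIRED_FILES[0]), "w").close()
    missing = find_missing_files(demo_dir)

print(missing)  # the four file names that were not created
```

In the notebook itself, the call would be `find_missing_files(DATASET_PATH)` once `DATASET_PATH` has been set.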


In [1]:
# Basic Setting of the Filtered Dataset
DATASET_NAME = "goodreads"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5     # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.


In [None]:
# Step 1: Identify the file structure of the dataset

import os
# List all files in the dataset directory
print("Files in dataset directory:")
for file in os.listdir(DATASET_PATH):
    print(f"- {file}")


In [3]:
# Step 1.1: Load README

# README_FILE_NAME = "README"
# README_FILE_PATH = os.path.join(DATASET_PATH, README_FILE_NAME)

# with open(README_FILE_PATH, "r") as file:
#     README_CONTENT = file.read()

# print(README_CONTENT)


In [None]:
# Step 2: Load the dataset following the previous dataset structure

import pandas as pd

# Load interactions data
interactions_path = os.path.join(DATASET_PATH, "goodreads_interactions_comics_graphic.json.gz")
interactions_df = pd.read_json(interactions_path, lines=True, compression='gzip')

# Load books (items) data
books_path = os.path.join(DATASET_PATH, "goodreads_books_comics_graphic.json.gz")
books_df = pd.read_json(books_path, lines=True, compression='gzip')

# Load book genres data
genres_path = os.path.join(DATASET_PATH, "goodreads_book_genres_initial.json.gz")
genres_df = pd.read_json(genres_path, lines=True, compression='gzip')

# Join books with genres data
books_df = books_df.merge(genres_df, on='book_id', how='left')


# Load reviews data
reviews_path = os.path.join(DATASET_PATH, "goodreads_reviews_comics_graphic.json.gz")
reviews_df = pd.read_json(reviews_path, lines=True, compression='gzip')

# Attach review text and review metadata to each interaction.
# Overlapping review columns (e.g. rating, date_added) get a '_review' suffix.
interactions_df = interactions_df.merge(reviews_df, on='review_id', how='left', suffixes=(None, '_review'))

# Load authors data, indexed by author_id (as str) for name lookups
authors_path = os.path.join(DATASET_PATH, "goodreads_book_authors.json.gz")
authors_df = pd.read_json(authors_path, lines=True, compression='gzip')
authors_df['author_id'] = authors_df['author_id'].astype(str)
authors_df = authors_df.set_index('author_id')

# Create basic users DataFrame from interactions
users_df = pd.DataFrame({'user_id': interactions_df['user_id'].unique()})

# Display basic information about the loaded data
print("Dataset Overview:")
print(f"Number of interactions: {len(interactions_df):,}")
print(f"Number of unique users: {interactions_df['user_id'].nunique():,}")
print(f"Number of unique books: {books_df['book_id'].nunique():,}")
print(f"Number of reviews: {len(reviews_df):,}")
print("\nDataFrame columns:")
print("Interactions columns:", interactions_df.columns.tolist())
print("Books columns:", books_df.columns.tolist())
print("Genres columns:", genres_df.columns.tolist())
print("Reviews columns:", reviews_df.columns.tolist())
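
Loading each file fully with pandas can be slow and memory-hungry for the larger files. When the goal is only to inspect the record structure, a few lines can be read directly from the gzipped JSON Lines file. This is a minimal sketch using only the standard library; `peek_json_lines` and the demo file are hypothetical.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def peek_json_lines(path, n=3):
    """Return the first n records of a gzipped JSON Lines file without reading the rest."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n)]


# Demo: write a small gzipped JSON Lines file, then read back its first two records
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        for i in range(10):
            handle.write(json.dumps({"book_id": str(i), "rating": i % 5}) + "\n")
    first_rows = peek_json_lines(demo_path, n=2)

print(first_rows)  # prints [{'book_id': '0', 'rating': 0}, {'book_id': '1', 'rating': 1}]
```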

In [None]:
print(interactions_df.head())
print(books_df.head())
print(reviews_df.head())


In [6]:
# Context

# Dataset Overview:
# Number of interactions: 7,347,630
# Number of unique users: 342,415
# Number of unique books: 89,411
# Number of reviews: 542,338

# DataFrame columns:
# Interactions columns: ['user_id', 'book_id', 'review_id', 'is_read', 'rating', 'review_text_incomplete', 'date_added', 'date_updated', 'read_at', 'started_at']
# Books columns: ['isbn', 'text_reviews_count', 'series', 'country_code', 'language_code', 'popular_shelves', 'asin', 'is_ebook', 'average_rating', 'kindle_asin', 'similar_books', 'description', 'format', 'link', 'authors', 'publisher', 'num_pages', 'publication_day', 'isbn13', 'publication_month', 'edition_information', 'publication_year', 'url', 'image_url', 'book_id', 'ratings_count', 'work_id', 'title', 'title_without_series', 'genres']
# Reviews columns: ['user_id', 'book_id', 'review_id', 'rating', 'review_text', 'date_added', 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments']



In [7]:
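# Step 3: Filter users by their number of interactions.
# MIN_INTERACTION_CNT and MAX_INTERACTION_CNT come from the settings cell above.
user_count = interactions_df.groupby('user_id').size()
user_count = user_count[(user_count >= MIN_INTERACTION_CNT) & (user_count <= MAX_INTERACTION_CNT)]

interactions_df = interactions_df[interactions_df['user_id'].isin(user_count.index)]

# Keep the users table in sync with the filtered interactions
users_df = pd.DataFrame({'user_id': interactions_df['user_id'].unique()})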
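
The filtering rule described by `MIN_INTERACTION_CNT` and `MAX_INTERACTION_CNT` can be illustrated on toy data without pandas. This is a minimal sketch; `filter_by_interaction_count` and the toy tuples are hypothetical, and it assumes both bounds are inclusive.

```python
from collections import Counter


def filter_by_interaction_count(interactions, min_cnt, max_cnt):
    """Keep interactions of users whose total interaction count lies in [min_cnt, max_cnt]."""
    counts = Counter(user_id for user_id, _ in interactions)
    return [(u, i) for u, i in interactions if min_cnt <= counts[u] <= max_cnt]


# (user_id, item_id) pairs: user "a" has 2 interactions, "b" has 1, "c" has 3
toy = [("a", 1), ("a", 2), ("b", 1), ("c", 1), ("c", 2), ("c", 3)]
kept = filter_by_interaction_count(toy, min_cnt=2, max_cnt=2)
print(kept)  # prints [('a', 1), ('a', 2)]
```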


In [None]:
# Step 4: Convert the dataset to the unified format
import json
from datetime import datetime
from tqdm import tqdm
import ast

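def parse_date(date_str):
    """Convert a Goodreads date to YYYY-MM-DD HH:MM:SS format (None if missing or unparseable)"""
    # pandas may already have parsed date-like columns into timestamps while loading
    if isinstance(date_str, datetime):
        return None if pd.isna(date_str) else date_str.strftime('%Y-%m-%d %H:%M:%S')
    # NaN and other non-string values have no .strip(), so reject them up front
    if not isinstance(date_str, str) or date_str.strip() == '':
        return None
    try:
        # Goodreads date format: weekday, month, day, time, UTC offset, year
        dt = datetime.strptime(date_str.strip(), '%a %b %d %H:%M:%S %z %Y')
        return dt.strftime('%Y-%m-%d %H:%M:%S')
    except ValueError:
        return None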

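def process_author(authors):
    """Convert a book's author entries (dicts with 'author_id' and 'role') to author names"""
    if not isinstance(authors, list):
        return []
    author_list = []
    for author in authors:
        author_id = str(author['author_id'])
        if author_id in authors_df.index:
            name = authors_df.loc[author_id, 'name']
            role = author.get('role')
            # Append the role in parentheses when one is given
            author_list.append(f"{name} ({role})" if role else name)
        else:
            author_list.append("Unknown Author")
    return author_list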

def process_genres(genres):
    """Convert a genres mapping (genre name -> count), or its string form, to a list of genre names"""
    if isinstance(genres, str):
        try:
            genres = ast.literal_eval(genres)
        except (ValueError, SyntaxError):
            return []
    # Missing values (None or NaN after the left join) are not dicts
    return list(genres.keys()) if isinstance(genres, dict) else []

def process_user(user):
    return {
        "user_id": str(user['user_id']),
        "user_description": {

        },
        "user_features": {

        }
    }

def to_text(value, default="NA"):
    """Render a scalar as plain text, using `default` for missing or empty values"""
    if value is None or (pd.api.types.is_scalar(value) and pd.isna(value)):
        return default
    text = str(value).strip()
    return text if text else default

def process_book(book):
    # Resolve author names once; they are used in both dictionaries below
    author_names = process_author(book['authors'])

    # item_description is dict[str, str], so every value is rendered as text
    item_description = {
        "title": to_text(book['title'], ""),
        "authors": "/".join(author_names),
        "description": to_text(book['description'], ""),
        "genres": ", ".join(process_genres(book['genres'])),
        "publication_year": to_text(book['publication_year'], ""),
        "publisher": to_text(book['publisher']),
        "average_rating": to_text(book['average_rating']),
        "ratings_count": to_text(book['ratings_count']),
        "num_pages": to_text(book['num_pages']),
    }

    item_features = {
        "authors": author_names,
        "text_reviews_count": int(book['text_reviews_count']) if pd.notna(book['text_reviews_count']) else 0,
        "language_code": book['language_code'] if pd.notna(book['language_code']) else "",
        # bool("false") is True, so compare the text form instead of calling bool()
        "is_ebook": str(book['is_ebook']).strip().lower() == "true",
    }

    return {
        "item_id": str(book['book_id']),
        "item_description": item_description,
        "item_features": item_features
    }


def process_interaction(interaction):
    behavior_features = {
        "rating": float(interaction['rating']) if pd.notna(interaction['rating']) else "Not Rated",
        "is_read": bool(interaction['is_read']) if pd.notna(interaction['is_read']) else False,
    }
    

    # Review fields come from the left join with reviews and are NaN when no review exists
    behavior_features["review_text"] = interaction['review_text'] if pd.notna(interaction['review_text']) else ""
    behavior_features["n_votes"] = int(interaction['n_votes']) if pd.notna(interaction['n_votes']) else 0
    behavior_features["n_comments"] = int(interaction['n_comments']) if pd.notna(interaction['n_comments']) else 0
    
    timestamp = parse_date(interaction['date_added'])
    if not timestamp:
        return None
        
    return {
        "user_id": str(interaction['user_id']),
        "item_id": str(interaction['book_id']),
        "timestamp": timestamp,
        "behavior_features": behavior_features
    }

# Process items (books)
tqdm.pandas(desc="Processing books")
item_records = books_df.progress_apply(process_book, axis=1).tolist()

# Process users 
tqdm.pandas(desc="Processing users")
user_records = users_df.progress_apply(process_user, axis=1).tolist()

# Process interactions
tqdm.pandas(desc="Processing interactions")
interaction_records = interactions_df.progress_apply(process_interaction, axis=1).dropna().tolist()
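
Note that `parse_date` formats the parsed time exactly as written in the raw string, so the UTC offset is dropped rather than applied. If timestamps recorded with different offsets need to be directly comparable, one option is to convert to UTC before formatting. The sketch below contrasts the two; `to_local_string`, `to_utc_string`, and the sample date string are hypothetical, with the sample made up to match the format string used above.

```python
from datetime import datetime, timezone

GOODREADS_FORMAT = "%a %b %d %H:%M:%S %z %Y"
OUTPUT_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_local_string(date_str):
    """Format the wall-clock time as written, ignoring the UTC offset."""
    return datetime.strptime(date_str, GOODREADS_FORMAT).strftime(OUTPUT_FORMAT)


def to_utc_string(date_str):
    """Apply the UTC offset first, then format the resulting UTC time."""
    dt = datetime.strptime(date_str, GOODREADS_FORMAT)
    return dt.astimezone(timezone.utc).strftime(OUTPUT_FORMAT)


sample = "Wed Nov 04 10:05:07 -0800 2015"  # made-up value in the expected layout
print(to_local_string(sample))  # prints 2015-11-04 10:05:07
print(to_utc_string(sample))    # prints 2015-11-04 18:05:07
```

Whether UTC normalization is desirable depends on how the downstream sequences are ordered; the notebook keeps the wall-clock time.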


In [10]:
# Step 5: Save the dataset
import os
# Convert records to pandas DataFrames
interaction_df = pd.DataFrame.from_records(interaction_records)
user_df = pd.DataFrame.from_records(user_records) 
item_df = pd.DataFrame.from_records(item_records)

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)
interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")
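
After saving, the output can be checked by reading a file back one line at a time, since each line of a `.jsonl` file is an independent JSON object. The following is a minimal round-trip sketch with the standard library; `write_jsonl`, `read_jsonl`, and the demo file name are hypothetical and stand in for the files written above.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up records used only to demonstrate the round trip
demo_records = [
    {"user_id": "u1", "item_id": "b1", "timestamp": "2015-11-04 10:05:07", "behavior_features": {"rating": 4.0}},
    {"user_id": "u1", "item_id": "b2", "timestamp": "2015-11-05 09:00:00", "behavior_features": {"rating": 5.0}},
]

with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo_interaction.jsonl")
    write_jsonl(demo_records, demo_path)
    loaded = read_jsonl(demo_path)

print(loaded == demo_records)  # prints True
```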