# Dataset Construction - MovieLens-1M

This notebook converts a recommender system dataset into a unified sequential format made of plain text strings, which serves as the input to the user simulator LLM.

A recommender system dataset usually consists of three critical components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are usually represented by the interactions between them.

The goal of this notebook is to transform the original data into a unified format consisting of three types of JSON records:

- Item Feature JSON: This JSON contains a list of all items in the dataset. Each element follows the format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be processed into plain text that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This JSON contains a list of all users in the dataset. Each element follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be processed into plain text that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This JSON contains a list of all user behaviors in the dataset. Each element follows the format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be processed into plain text that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.

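As a minimal sketch of the three record types described above, the snippet below builds one record of each kind and serializes it as a single JSON Lines row. The field values are illustrative placeholders rather than rows from the dataset, and `to_jsonl_line` is a hypothetical helper introduced only for this example.

```python
import json

# Illustrative records only; the values are placeholders, not real dataset rows.
item_record = {
    "item_id": "1",
    "item_description": {"title": "Example Movie (1999)", "genres": "Comedy|Drama"},
    "item_features": {"year": "1999"},
}
user_record = {
    "user_id": "42",
    "user_description": {"gender": "female", "age": "25-34", "occupation": "scientist"},
    "user_features": {},
}
interaction_record = {
    "user_id": "42",
    "item_id": "1",
    "timestamp": "2000-01-01 12:00:00",
    "behavior_features": {"rating": 4.0},
}


def to_jsonl_line(record):
    """Serialize one record as a single JSON Lines row (hypothetical helper)."""
    return json.dumps(record, ensure_ascii=False)


for record in (item_record, user_record, interaction_record):
    print(to_jsonl_line(record))
```

Each printed line is one self-contained JSON object, which is the layout a `.jsonl` file expects.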

## Download the raw data

You can download the raw data from the following link:
- [MovieLens-1M](https://files.grouplens.org/datasets/movielens/ml-1m.zip)

After downloading the raw data, unzip the archive and point `DATASET_PATH` at the folder that contains the extracted `.dat` files.

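The download and unzip step can also be scripted. The following is a rough sketch using only the standard library; `download_and_extract`, the archive file name `dataset.zip`, and the target folder in the example call are assumptions made for illustration, not part of the original workflow.

```python
import urllib.request
import zipfile
from pathlib import Path

ML_1M_URL = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"


def download_and_extract(url, dest_dir):
    """Download a zip archive and extract it into dest_dir (hypothetical helper)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / "dataset.zip"  # assumed local file name
    urllib.request.urlretrieve(url, str(archive_path))
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
    return dest_dir


# Example call (the target folder name is an assumption):
# download_and_extract(ML_1M_URL, "data")
```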
In [1]:
# Basic Setting of the Filtered Dataset
DATASET_NAME = "ml-1m"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5     # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.


In [None]:
# Step 1: Identify the file structure of the dataset

import os
# List all files in the dataset directory
print("Files in dataset directory:")
for file in os.listdir(DATASET_PATH):
    print(f"- {file}")

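Before loading anything, it can help to confirm that the directory actually contains the files this notebook reads. The sketch below assumes the three `.dat` file names used by the loading cell; `find_missing_files` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the loading code in this notebook
EXPECTED_FILES = ("ratings.dat", "users.dat", "movies.dat")


def find_missing_files(dataset_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dataset_path."""
    root = Path(dataset_path)
    return [name for name in expected if not (root / name).is_file()]


# Example call, assuming DATASET_PATH is set as in the first cell:
# print(find_missing_files(DATASET_PATH))
```

An empty list means every expected file was found.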

In [3]:
# Step 1.1: Load README

# README_FILE_NAME = "README"
# README_FILE_PATH = os.path.join(DATASET_PATH, README_FILE_NAME)

# with open(README_FILE_PATH, "r") as file:
#     README_CONTENT = file.read()

# print(README_CONTENT)


In [None]:
# Step 2: Load the raw data files
import pandas as pd

# Load the ratings data
ratings_df = pd.read_csv(os.path.join(DATASET_PATH, "ratings.dat"), 
                        sep="::", 
                        header=None, 
                        names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                        engine='python')

# Load the users data
users_df = pd.read_csv(os.path.join(DATASET_PATH, "users.dat"),
                      sep="::",
                      header=None,
                      names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'],
                      engine='python')

# Load the movies data
movies_df = pd.read_csv(os.path.join(DATASET_PATH, "movies.dat"),
                       sep="::",
                       header=None,
                       names=['MovieID', 'Title', 'Genres'],
                       encoding='latin-1',
                       engine='python')

# Display the first few rows of each dataframe
print("Ratings DataFrame:")
print(ratings_df.head())
print("\nUsers DataFrame:")
print(users_df.head())
print("\nMovies DataFrame:")
print(movies_df.head())


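The loading cell relies on `sep="::"` together with `engine='python'`. As a small sketch of why, the snippet below parses two sample rows in the `UserID::MovieID::Rating::Timestamp` layout from an in-memory buffer; the rows are illustrative and the variable names are assumptions.

```python
import io

import pandas as pd

# Two illustrative rows in the "UserID::MovieID::Rating::Timestamp" layout
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

sample_df = pd.read_csv(
    sample,
    sep="::",         # multi-character separator
    header=None,      # the .dat files have no header row
    names=["UserID", "MovieID", "Rating", "Timestamp"],
    engine="python",  # the C parser only supports single-character separators
)
print(sample_df.dtypes)
```

All four columns are parsed as integers here, which is why the timestamp later needs an explicit conversion to a date string.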

In [None]:
# Step 3: Convert each item as a full-text description.

ITEM_DESCRIPTION_TEMPLATE = "'{title}' [{genres}]"

def create_movie_description(row):
    # The title is used as-is and already carries the release year, e.g. "Toy Story (1995)"
    title = row['Title']

    # Genres are kept as the original '|'-separated string
    genres = row['Genres']

    # Create the description using a single format string
    description = ITEM_DESCRIPTION_TEMPLATE.format(title=title, genres=genres)

    return description

# Create text descriptions for movies
movies_df['Description'] = movies_df.apply(create_movie_description, axis=1)

print("Example movie descriptions:")
print(movies_df[['MovieID', 'Description']].head())

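The cell above keeps the title and genre string unchanged. If a more readable description is wanted, one possible variant separates the year from the title and turns the `|`-separated genres into a comma-separated list. This is only a sketch: `split_title_and_year` and `describe_movie` are hypothetical helpers, and the output format is an assumption rather than the format used by this notebook.

```python
import re


def split_title_and_year(title):
    """Split 'Name (YYYY)' into (name, year); year is None if absent."""
    match = re.search(r"\s*\((\d{4})\)\s*$", title)
    if match is None:
        return title, None
    return title[:match.start()], match.group(1)


def describe_movie(title, genres):
    """Build a readable description from a title and a '|'-separated genre string."""
    name, year = split_title_and_year(title)
    genre_text = ", ".join(genres.split("|"))
    if year is None:
        return f"'{name}' [{genre_text}]"
    return f"'{name}' ({year}) [{genre_text}]"


print(describe_movie("Toy Story (1995)", "Animation|Children's|Comedy"))
# 'Toy Story' (1995) [Animation, Children's, Comedy]
```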

In [None]:
# Step Extra: Map user feature ids to text descriptions, based on README

Gender_map = {
    "F": "female",
    "M": "male"
}

# Age groups as listed in the MovieLens-1M README
Age_map = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+"
}

Occupation_map = {
    0: "other",
    1: "academic/educator", 
    2: "artist",
    3: "clerical/admin",
    4: "college/grad student",
    5: "customer service",
    6: "doctor/health care",
    7: "executive/managerial",
    8: "farmer",
    9: "homemaker",
    10: "K-12 student",
    11: "lawyer",
    12: "programmer",
    13: "retired",
    14: "sales/marketing",
    15: "scientist",
    16: "self-employed",
    17: "technician/engineer",
    18: "tradesman/craftsman",
    19: "unemployed",
    20: "writer"
}

# Map Zip codes to locations using uszipcode package
# import sqlalchemy_mate
# import sys
# from sqlalchemy.ext.declarative import declarative_base

# class ExtendedBase:
#     __abstract__ = True

# if not hasattr(sqlalchemy_mate, 'ExtendedBase'):
#     sqlalchemy_mate.ExtendedBase = ExtendedBase
    
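from uszipcode import SearchEngine

# Create the search engine once and reuse it for every lookup
zip_search = SearchEngine()

def get_zipcode_info(zipcode):
    """Return "City, State" for a zip code, or None if the lookup fails."""
    try:
        result = zip_search.by_zipcode(str(zipcode))
        if result:
            return result.city + ", " + result.state
        return None
    except Exception:
        # Treat any lookup failure (e.g. missing city or state) as unknown
        return None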

Zipcode_map = {}
unique_zipcodes = users_df['Zip-code'].unique()

for zipcode in unique_zipcodes:
    if pd.notna(zipcode):  # Skip NaN values
        info = get_zipcode_info(zipcode)
        if info:
            Zipcode_map[zipcode] = info

users_df['Zip-code'] = users_df['Zip-code'].apply(lambda x: Zipcode_map.get(x, "Unknown"))
users_df['Gender'] = users_df['Gender'].apply(lambda x: Gender_map.get(x, "Unknown"))
users_df['Age'] = users_df['Age'].apply(lambda x: Age_map.get(x, "Unknown"))
users_df['Occupation'] = users_df['Occupation'].apply(lambda x: Occupation_map.get(x, "Unknown"))

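Zip codes may appear in the extended `ZIP+4` form (for example `12345-6789`), which a plain 5-digit lookup may not resolve. The sketch below shows one way to normalize a value before the lookup; `normalize_zip` is a hypothetical helper and is not part of the `uszipcode` package.

```python
import re


def normalize_zip(zipcode):
    """Return the leading 5-digit ZIP code as a string, or None if there is none."""
    match = re.match(r"\s*(\d{5})(?:-\d{4})?\s*$", str(zipcode))
    return match.group(1) if match else None


print(normalize_zip("12345-6789"))  # 12345
print(normalize_zip("48067"))       # 48067
print(normalize_zip("unknown"))     # None
```

The normalized value could then be passed to the lookup function instead of the raw string.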

In [None]:
USER_PROFILE_TEMPLATE = """Gender: {gender}, Age: {age}, Occupation: {occupation}, Location: {zipcode}
"""

users_df['Profile'] = users_df.apply(
    lambda row: USER_PROFILE_TEMPLATE.format(
        gender=row['Gender'],
        age=row['Age'],
        occupation=row['Occupation'],
        zipcode=row['Zip-code'],
    ),
    axis=1,
)

users_df

In [8]:
# Rename columns to the unified snake_case schema
user_df = users_df.rename(columns={'UserID': 'user_id', 'Gender': 'gender', 'Age': 'age', 'Occupation': 'occupation', 'Zip-code': 'zipcode'})
item_df = movies_df.rename(columns={'MovieID': 'item_id', 'Title': 'title', 'Genres': 'genres'})
interactions_df = ratings_df.rename(columns={'UserID': 'user_id', 'MovieID': 'item_id', 'Rating': 'rating', 'Timestamp': 'timestamp'})


In [None]:
interactions_df

In [10]:
# Filter out users with fewer than MIN_INTERACTION_CNT interactions.
# Note: MAX_INTERACTION_CNT is defined here but is not applied in this cell.
MIN_INTERACTION_CNT = 5
MAX_INTERACTION_CNT = 20

user_count = interactions_df.groupby('user_id').size()
user_count = user_count[user_count >= MIN_INTERACTION_CNT]

interactions_df = interactions_df[interactions_df['user_id'].isin(user_count.index)]

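The filter above enforces only the lower bound. If the upper bound should be enforced as well, one possible reading is to keep only users whose interaction count lies within both limits. The sketch below demonstrates this on a toy frame; `filter_by_interaction_count` is a hypothetical helper, and this interpretation of the upper bound is an assumption.

```python
import pandas as pd


def filter_by_interaction_count(df, min_cnt, max_cnt=None, user_col="user_id"):
    """Keep rows of users whose interaction count lies in [min_cnt, max_cnt]."""
    # Map each row to the total number of rows of its user
    counts = df[user_col].map(df[user_col].value_counts())
    mask = counts >= min_cnt
    if max_cnt is not None:
        mask &= counts <= max_cnt
    return df[mask]


# Toy data: user 1 has 3 interactions, user 2 has 1, user 3 has 2
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "item_id": [10, 11, 12, 10, 11, 12],
})
print(filter_by_interaction_count(toy, min_cnt=2, max_cnt=2))
```

Note that the MovieLens-1M README states that each user has at least 20 ratings, so a strict upper bound of 20 would keep only users with exactly 20 ratings. Truncating each user's history to a fixed length may be a better fit, depending on the intended use.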

In [None]:
interactions_df

In [None]:
# Step 4: Convert the dataset to the unified format
import json
import re
from datetime import datetime
from tqdm import tqdm
import numpy as np

def extract_year_from_title(title):
    """Extract year from title if present"""
    # Regular expression to find year in parentheses
    year_match = re.search(r'\((\d{4})\)', title)
    if year_match:
        return year_match.group(1)
    return None

def parse_date(date_str):
    """Convert various date formats to YYYY-MM-DD HH:MM:SS format"""
    if isinstance(date_str, (int, np.integer)):
        # If the input is a Unix timestamp (integer), convert it to a datetime.
        # Note: fromtimestamp() uses the local time zone of the machine.
        try:
            dt = datetime.fromtimestamp(int(date_str))
            return dt.strftime('%Y-%m-%d %H:%M:%S')
        except (OverflowError, OSError, ValueError):
            return None

    if not date_str or (isinstance(date_str, str) and date_str.strip() == ''):
        return None

    try:
        # Try to parse a Goodreads-style date string
        dt = datetime.strptime(date_str, '%a %b %d %H:%M:%S %z %Y')
        return dt.strftime('%Y-%m-%d %H:%M:%S')
    except (ValueError, TypeError):
        return None


def process_movie(movie):
    
    item_description = {
        "title": movie['title'],
        "genres": movie['genres'],
        "year": extract_year_from_title(movie['title'])
    }
    
    item_features = {
        "year": extract_year_from_title(movie['title'])
    }
    
    return {
        "item_id": str(movie['item_id']),
        "item_description": item_description,
        "item_features": item_features
    }

def process_user(user):
    user_description = {
        "gender": user['gender'],
        "age": user['age'],
        "occupation": user['occupation'],
        "location": user['zipcode']
    }
    
    user_features = {
    }
    return {
        "user_id": str(user['user_id']),
        "user_description": user_description,
        "user_features": user_features
    }

def process_interaction(interaction):
    behavior_features = {
        "rating": float(interaction['rating']) if pd.notna(interaction['rating']) else "Not Rated"
    }
    
    timestamp = parse_date(interaction['timestamp'])
    if not timestamp:
        return None
        
    return {
        "user_id": str(interaction['user_id']),
        "item_id": str(interaction['item_id']),
        "timestamp": timestamp,
        "behavior_features": behavior_features
    }

# Process items (movies)
tqdm.pandas(desc="Processing movies")
item_records = item_df.progress_apply(process_movie, axis=1).tolist()

# Process users 
tqdm.pandas(desc="Processing users")
user_records = user_df.progress_apply(process_user, axis=1).tolist()

# Process interactions
tqdm.pandas(desc="Processing interactions")
interaction_records = interactions_df.progress_apply(process_interaction, axis=1).dropna().tolist()

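Because `datetime.fromtimestamp` without a time zone argument uses the local time zone, the formatted timestamps can differ between machines. If reproducible output matters, a UTC-based conversion is one option. The sketch below is an assumption about what such a variant could look like; `format_unix_timestamp_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_unix_timestamp_utc(seconds):
    """Format Unix seconds as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    dt = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")


print(format_unix_timestamp_utc(978300760))  # 2000-12-31 22:12:40
```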

In [13]:
# Step 5: Save the dataset
import os
# Convert records to pandas DataFrames
interaction_df = pd.DataFrame.from_records(interaction_records)
user_df = pd.DataFrame.from_records(user_records) 
item_df = pd.DataFrame.from_records(item_records)

In [14]:
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)
interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")
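
To sanity-check the saved files, they can be read back line by line. The sketch below uses the standard `json` module so that IDs stay strings and timestamps stay plain text; `read_jsonl` is a hypothetical helper. Reading with `pd.read_json(..., lines=True)` is also possible, but its type inference may turn numeric-looking IDs into integers and date-like columns into datetimes.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Example call, assuming OUTPUT_PATH and DATASET_NAME are set as in the first cell:
# records = read_jsonl(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"))
# print(records[0])
```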