# Stage 1: Personalized Information Extraction

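This notebook runs Stage 1 of CoT-Rec: it uses GPT (`gpt-4o-mini` through the OpenAI Batch API) to extract user preferences and item perceptions from each user's interaction history and SASRec candidate list.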

## Prerequisites
1. Run `preprocess_amazon.py` to generate:
   - `datasets/processed/Grocery_and_Gourmet_Food.csv`
   - `datasets/processed/Grocery_and_Gourmet_Food.json`
2. Train SASRec to generate:
   - `SASRec/checkpoint/Grocery_and_Gourmet_Food_rec_list_valid.pkl`
   - `SASRec/checkpoint/Grocery_and_Gourmet_Food_rec_list_test.pkl`
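3. Upload these files to Colab or mount Google Drive. The loading cells in Step 1.1 open `Grocery_and_Gourmet_Food.csv` and `Grocery_and_Gourmet_Food.json` relative to the working directory, so copy them there or adjust the paths.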


## Step 0: Setup and Installation


In [None]:
# Install required packages
!pip install openai pandas tqdm -q


In [None]:
# Mount Google Drive (if files are stored there)
from google.colab import drive
drive.mount('/content/drive')

# Or upload files directly in Colab
# Set your working directory
import os
WORK_DIR = '/content/drive/MyDrive/CoT-Rec'  # Change this to your directory
os.chdir(WORK_DIR)


Mounted at /content/drive


In [5]:
# Set OpenAI API Key
import os
os.environ['OPENAI_API_KEY'] = 'yyyapi'  # Replace with your API key

# Or use Colab secrets
# from google.colab import userdata
# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')


## Step 1.1: Load Data and Prepare Inputs


In [9]:
import os
import json
import pickle
import random
import pandas as pd
import numpy as np
from tqdm import tqdm

# Configuration
DATASET_NAME = 'Grocery_and_Gourmet_Food'
MODE = 'random'
K = 10  # Top-k candidates

random.seed(2025)

print("="*60)
print("Step 1.1: Loading Data")
print("="*60)


Step 1.1: Loading Data


In [None]:
# Load SASRec recommendation lists
print("\n[1/3] Loading SASRec recommendations...")
with open(f'SASRec/checkpoint/{DATASET_NAME}_rec_list_valid.pkl', 'rb') as f:
    rec_list_valid = pickle.load(f)
with open(f'SASRec/checkpoint/{DATASET_NAME}_rec_list_test.pkl', 'rb') as f:
    rec_list_test = pickle.load(f)

print(f"   Loaded {len(rec_list_valid)} validation entries")
print(f"   Loaded {len(rec_list_test)} test entries")


[1/3] Loading SASRec recommendations...
   Loaded 9392 validation entries
   Loaded 7661 test entries


In [None]:
# Filter: Only keep cases where target is in top-k
print("\n[2/3] Filtering data...")
data_valid = []
for u, rec_list, i in rec_list_valid:
    if i in rec_list[:K]:
        data_valid.append((u, rec_list[:K], i))

data_test = []
for u, rec_list, i in rec_list_test:
    if i in rec_list[:K]:
        data_test.append((u, rec_list[:K], i))

print(f"   Filtered to {len(data_valid)} validation entries")
print(f"   Filtered to {len(data_test)} test entries")



[2/3] Filtering data...
   Filtered to 9392 validation entries
   Filtered to 7661 test entries


In [None]:
# Load item names and interaction data
print("\n[3/3] Loading item names and interactions...")
with open(f'{DATASET_NAME}.json', 'r') as file:
    id2name = json.load(file)
    id2name = {int(key): value for key, value in id2name.items()}

df = pd.read_csv(f'{DATASET_NAME}.csv', names=['user_id', 'item_id'], usecols=[0, 1])

print(f"   Loaded {len(id2name)} items")
print(f"   Loaded {len(df)} interactions")
print(f"   Number of users: {df['user_id'].nunique()}")
print("\n✅ Step 1.1 Complete!")



[3/3] Loading item names and interactions...
   Loaded 135194 items
   Loaded 4125640 interactions
   Number of users: 419876

✅ Step 1.1 Complete!


## Step 1.2: Build GPT Prompts


In [None]:
def build_request(user, rec_list, target, phase, id2name, df, k=10):
    """
    Build GPT prompt for extracting user preferences and item perceptions.

    Args:
        user: User ID
        rec_list: List of candidate item IDs
        target: Target item ID
        phase: 'valid' or 'test'
        id2name: Dictionary mapping item ID to name
        df: DataFrame with user-item interactions
        k: Number of items in history

    Returns:
        Prompt string for GPT
    """
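    # Number of most recent interactions to leave out of the history.
    # Assumes a leave-one-out split: the last item is the test target and the
    # second-to-last item is the validation target.
    delta = 2 if phase == 'valid' else 1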

    # Example Interaction History and Candidate Pool
    example_history = (
        "Frontier Co-op Ground Chipotle, 1-Pound Bulk\n"
        "SunButter No Sugar Added Sunflower Butter\n"
        "SweetLeaf Stevia Sweet Drops Lemon Drop\n"
        "Frontier Co-op Cinnamon Powder, Ceylon\n"
        "SweetLeaf Sweet Drops Stevia Clear\n"
        "ALTOIDS Arctic Peppermint Mints\n"
        "Organic Cacao Powder, 1lb\n"
        "RX Nut Butter, 6 Flavor Variety Pack\n"
        "Watkins Pure Almond Extract\n"
        "NuNaturals Stevia Syrup\n"
    )
    example_candidates = (
        "A. Shrewd Food Protein Puffs\n"
        "B. Carbquik Biscuit & Baking Mix\n"
        "C. ChocZero's Strawberry Sugar-Free Syrup\n"
        "D. Lakanto Sugar Free Maple Syrup\n"
        "E. 4th & Heart Himalayan Pink Salt Grass-Fed Ghee\n"
        "F. Amazon Brand - Solimo Medium Roast Coffee Pods\n"
        "G. ChocZero's Keto Bark\n"
        "H. Swerve Sweetener, Confectioners\n"
        "I. Victor Allen's Coffee Caramel Macchiato\n"
        "J. Lakanto Golden Monk Fruit Sweetener\n"
    )

    example_output = (
        "{\n"
        "  \"user_history_perception\": {\n"
        "    \"Frontier Co-op Ground Chipotle, 1-Pound Bulk\": \"Smoked dried chili powder with a rich smoky and earthy aroma, suitable for Southwest and Mexican cuisine.\",\n"
        "    \"SunButter No Sugar Added Sunflower Butter\": \"Sugar-free sunflower butter with natural flavor, nutritious and suitable as a healthy snack or spread.\",\n"
        "    \"SweetLeaf Stevia Sweet Drops Lemon Drop\": \"Liquid stevia drops with zero calories, sugar-free, and a hint of lemon, ideal as a healthy alternative for beverages or baking.\",\n"
        "    \"Frontier Co-op Cinnamon Powder, Ceylon\": \"Organic Ceylon cinnamon powder with a fresh and sweet aroma, certified natural, commonly used in baking, beverages, and desserts.\",\n"
        "    \"SweetLeaf Sweet Drops Stevia Clear\": \"Liquid stevia drops with zero calories and sugar-free, suitable for low-carb or sugar-free diets.\",\n"
        "    \"ALTOIDS Arctic Peppermint Mints\": \"Portable peppermint mints with a cooling flavor, useful as a snack or breath freshener.\",\n"
        "    \"Organic Cacao Powder, 1lb\": \"Unsweetened cacao powder with a rich dark chocolate flavor, certified natural, ideal for baking and beverages.\",\n"
        "    \"RX Nut Butter, 6 Flavor Variety Pack\": \"Nut butter in small packages, high protein, low sugar, and available in various flavors, convenient for healthy snacking.\",\n"
        "    \"Watkins Pure Almond Extract\": \"High-quality almond extract with a rich aroma, suitable for baking or beverage flavoring.\",\n"
        "    \"NuNaturals Stevia Syrup\": \"Plant-based zero-calorie syrup, sugar-free, suitable as a healthy substitute for desserts and beverages.\"\n"
        "  },\n"
        "  \"user_preferences\": \"The user prefers sugar-free, natural foods, focusing on healthy sweeteners, seasonings, and snacks. They are possibly pursuing weight loss or a low-carb diet, emphasizing portability and variety.\",\n"
        "  \"candidate_temp_perception\": {\n"
        "    \"Shrewd Food Protein Puffs\": \"High-protein, low-carb, gluten-free healthy snack. [Comment:] As a user, I find this snack very convenient and nutritious, perfectly fitting my dietary habits.\",\n"
        "    \"Carbquik Biscuit & Baking Mix\": \"Low-carb baking mix suitable for making various low-sugar pastries. [Comment:] I think this product is ideal for creating healthy, low-sugar baked goods and perfectly aligns with my needs.\",\n"
        "    \"ChocZero's Strawberry Sugar-Free Syrup\": \"Sugar-free strawberry-flavored syrup. [Comment:] This syrup is an excellent addition to my low-sugar diet and is highly practical.\",\n"
        "    \"Lakanto Sugar Free Maple Syrup\": \"Sugar-free maple syrup, low-carb and natural sweetener. [Comment:] I feel this maple syrup works wonderfully in beverages or baking and aligns well with my healthy eating goals.\",\n"
        "    \"4th & Heart Himalayan Pink Salt Grass-Fed Ghee\": \"Natural lactose-free grass-fed ghee. [Comment:] This ghee makes me feel connected to natural and healthy cooking, a perfect choice for wholesome meals.\",\n"
        "    \"Amazon Brand - Solimo Medium Roast Coffee Pods\": \"Medium roast coffee pods convenient for quick coffee preparation. [Comment:] While convenient, this product does not meet my low-sugar dietary focus, so I might not prioritize it.\",\n"
        "    \"ChocZero's Keto Bark\": \"Sugar-free dark chocolate snack, low-carb with natural ingredients. [Comment:] I love this healthy sugar-free snack; it tastes amazing!\",\n"
        "    \"Swerve Sweetener, Confectioners\": \"Sugar-free sweetener powder suitable for low-carb and sugar-free baking. [Comment:] As a user, I think this is a perfect sugar substitute and highly practical.\",\n"
        "    \"Victor Allen's Coffee Caramel Macchiato\": \"Caramel macchiato coffee pods convenient for consumption. [Comment:] This product might not fit my dietary preferences due to its sugar content.\",\n"
        "    \"Lakanto Golden Monk Fruit Sweetener\": \"Sugar-free monk fruit sweetener, low-carb and zero-calorie. [Comment:] This is one of my favorite healthy sweeteners, ideal for baking or beverages.\"\n"
        "  },\n"
        "  \"candidate_perception\": {\n"
        "    \"Shrewd Food Protein Puffs\": \"Convenient and nutritious snacks\",\n"
        "    \"Carbquik Biscuit & Baking Mix\": \"Low-carb baking mix\",\n"
        "    \"ChocZero's Strawberry Sugar-Free Syrup\": \"Low-sugar alternative sweetener\",\n"
        "    \"Lakanto Sugar Free Maple Syrup\": \"Natural and low-carb sweetener\",\n"
        "    \"4th & Heart Himalayan Pink Salt Grass-Fed Ghee\": \"Natural and wholesome cooking ingredient\",\n"
        "    \"Amazon Brand - Solimo Medium Roast Coffee Pods\": \"Convenient but lacks health focus\",\n"
        "    \"ChocZero's Keto Bark\": \"Healthy sugar-free snack\",\n"
        "    \"Swerve Sweetener, Confectioners\": \"Excellent sugar substitute\",\n"
        "    \"Victor Allen's Coffee Caramel Macchiato\": \"Convenient but contains sugar\",\n"
        "    \"Lakanto Golden Monk Fruit Sweetener\": \"Ideal for low-carb and healthy baking\"\n"
        "  }\n"
        "}"
    )

    # Current Interaction History and Candidate Pool
    candidates = [id2name[i] for i in rec_list]
    candidates = '\n'.join(candidates)

    history = df[df['user_id'] == user]['item_id'].values[-(k + delta):-delta]
    history_ = []
    for item_id in history:
        item_name = id2name[item_id]
        history_.append(f"{item_name}")
    history = '\n'.join(history_)

    # Construct prompt
    prompt = (
        f"### Instruction\n"
        f"This is a sequential recommendation task involving grocery and gourmet food preferences. Given a user's grocery interaction history and a set of candidate items for the next interaction, your task is as follows:\n\n"
        f"1. Provide an objective description of each item in the user's interaction history, focusing on factual features such as ingredients, health benefits, or notable qualities of each item.\n"
        f"2. Based on these descriptions, predict the user's overall preferences and describe their likely personality and tastes in detail in no more than 80 words.\n"
        f"   - The summarized user preferences should be based on the frequency and regularity of user behavior rather than occasional occurrences.\n"
        f"   - Avoid using generic or vague terms; be specific and relevant.\n"
        f"3. Use the predicted preferences to evaluate each candidate item. Each evaluation must include:\n"
        f"   - Objective features of the item (factual description).\n"
        f"   - User-specific comments based on their preferences, preceded by the tag `[Comment:]` to distinguish them from the factual description.\n"
        f"4. Output the result in JSON format with the following fields:\n"
        f"   - `user_history_perception`: Objective descriptions for items in the user's interaction history.\n"
        f"   - `user_preferences`: A summary of the user's preferences.\n"
        f"   - `candidate_temp_perception`: Evaluations for items in the candidate set, including both factual descriptions and user-specific comments (prefixed with `[Comment:]`).\n"
        f"   - `candidate_perception`: Summarized user-relevant aspects from `candidate_temp_perception` comments, highlighting the most significant point of interest or concern for each item.\n"
        f"5. Ensure the JSON format is strictly correct and complete.\n"
        f"   - Every item in the interaction history and candidate set must be included.\n"
        f"   - Do not omit any items or use ellipses (...).\n"
        f"6. Directly output the JSON format without additional explanations or comments.\n"
        f"7. Strictly follow the format and style in the example provided below. Ensure all required fields are present and formatted correctly.\n\n"
        f"### Example\n"
        f"**User Item Interaction History:**\n{example_history}\n"
        f"**Candidate Items:**\n{example_candidates}\n\n"
        f"**Expected Output:**\n{example_output}\n\n"
        f"### Input\n"
        f"**User Item Interaction History:**\n{history}\n"
        f"**Candidate Items:**\n{candidates}\n\n"
        f"### Output\n"
    )

    return prompt

print("✅ Prompt building function ready!")


✅ Prompt building function ready!
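
The history slice in `build_request`, `values[-(k + delta):-delta]`, can be checked on a toy example. The sketch below is illustrative only: `history_window`, the toy interaction log, and its item IDs are hypothetical, and it assumes the interactions are already in chronological order. It also groups interactions by user once, which avoids filtering the full DataFrame for every request (with pandas, `df.groupby('user_id')['item_id'].apply(list)` gives a comparable mapping).

```python
from collections import defaultdict

def history_window(item_ids, k, delta):
    """Return up to k items that precede the last `delta` held-out items."""
    return list(item_ids[-(k + delta):-delta])

# Toy interaction log as (user_id, item_id) pairs in chronological order.
# All IDs here are made up for illustration.
toy_log = [
    (1, 10), (1, 11), (1, 12), (1, 13), (1, 14),
    (2, 20), (2, 21), (2, 22),
]

# Group once so each user's history is a single dictionary lookup.
user_histories = defaultdict(list)
for user_id, item_id in toy_log:
    user_histories[user_id].append(item_id)

valid_history = history_window(user_histories[1], k=3, delta=2)
test_history = history_window(user_histories[1], k=3, delta=1)
short_history = history_window(user_histories[2], k=3, delta=2)

print(valid_history)  # [10, 11, 12]
print(test_history)   # [11, 12, 13]
print(short_history)  # [20]
```

Users with fewer than `k + delta` interactions simply get a shorter history, because a negative slice start that reaches past the beginning of the list is clamped to the first element.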


## Step 1.3: Create JSONL Files for GPT Batch API


In [None]:
import os

# Create output directory
os.makedirs('gpt_sft_data', exist_ok=True)

print("="*60)
print("Step 1.3: Creating JSONL Files")
print("="*60)

MAX_ENTRIES_PER_FILE = 800  # Keeps each file under the ~2M enqueued-token limit (about 1.9M estimated tokens per file)

for phase in ['valid', 'test']:
    print(f"\nProcessing {phase} data...")
    data = data_valid if phase == 'valid' else data_test
    file_index = 1
    entries = []

    for idx, (user, rec_list, target) in tqdm(enumerate(data), desc=f"Processing {phase}"):
        random.shuffle(rec_list)  # Shuffle for robustness

        if MODE == 'random':
            prompt = build_request(user, rec_list, target, phase, id2name, df, K)

            data_entry = {
                "custom_id": str(user),
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": "You are a helpful assistant."},
                        {"role": "user", "content": prompt}
                    ]
                }
            }
            entries.append(data_entry)

            # Save when reaching max entries
            if len(entries) == MAX_ENTRIES_PER_FILE:
                output_file = f'gpt_sft_data/{DATASET_NAME}_{MODE}_{phase}_part{file_index}.jsonl'
                with open(output_file, 'w', encoding='utf-8') as file:
                    for entry in entries:
                        file.write(json.dumps(entry, ensure_ascii=False) + '\n')
                print(f"   Saved {output_file} ({len(entries)} entries)")
                entries = []
                file_index += 1

    # Save remaining entries
    if entries:
        output_file = f'gpt_sft_data/{DATASET_NAME}_{MODE}_{phase}_part{file_index}.jsonl'
        with open(output_file, 'w', encoding='utf-8') as file:
            for entry in entries:
                file.write(json.dumps(entry, ensure_ascii=False) + '\n')
        print(f"   Saved {output_file} ({len(entries)} entries)")

print("\n✅ Step 1.3 Complete!")


Step 1.3: Creating JSONL Files

Processing valid data...


Processing valid: 817it [00:06, 101.43it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part1.jsonl (800 entries)


Processing valid: 1607it [00:15, 41.58it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part2.jsonl (800 entries)


Processing valid: 2410it [00:23, 54.51it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part3.jsonl (800 entries)


Processing valid: 3219it [00:30, 89.74it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part4.jsonl (800 entries)


Processing valid: 4025it [00:37, 100.76it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part5.jsonl (800 entries)


Processing valid: 4822it [00:44, 88.56it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part6.jsonl (800 entries)


Processing valid: 5626it [00:50, 99.37it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part7.jsonl (800 entries)


Processing valid: 6416it [00:57, 60.00it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part8.jsonl (800 entries)


Processing valid: 7221it [01:04, 99.33it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part9.jsonl (800 entries)


Processing valid: 8014it [01:11, 62.90it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part10.jsonl (800 entries)


Processing valid: 8823it [01:19, 82.45it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part11.jsonl (800 entries)


Processing valid: 9392it [01:24, 111.77it/s]


   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part12.jsonl (592 entries)

Processing test data...


Processing test: 818it [00:07, 80.50it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part1.jsonl (800 entries)


Processing test: 1615it [00:13, 67.16it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part2.jsonl (800 entries)


Processing test: 2420it [00:20, 94.97it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part3.jsonl (800 entries)


Processing test: 3210it [00:27, 80.02it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part4.jsonl (800 entries)


Processing test: 4024it [00:34, 108.99it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part5.jsonl (800 entries)


Processing test: 4822it [00:40, 105.28it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part6.jsonl (800 entries)


Processing test: 5613it [00:47, 98.00it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part7.jsonl (800 entries)


Processing test: 6414it [00:53, 98.89it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part8.jsonl (800 entries)


Processing test: 7216it [01:01, 94.63it/s]

   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part9.jsonl (800 entries)


Processing test: 7661it [01:04, 118.51it/s]


   Saved gpt_sft_data/Grocery_and_Gourmet_Food_random_test_part10.jsonl (461 entries)

✅ Step 1.3 Complete!
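
The file counts above follow from the chunk size: 9392 validation entries at 800 per file give 11 full files plus one with 592 entries, and 7661 test entries give 9 full files plus one with 461. As a minimal sketch, the two save branches in the loop could be folded into one helper. `write_jsonl_chunks` and the demo entries are hypothetical names used for illustration, not part of the notebook.

```python
import json
import os
import tempfile

def write_jsonl_chunks(entries, out_dir, prefix, max_entries):
    """Write entries to <prefix>_part1.jsonl, <prefix>_part2.jsonl, ... and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(entries), max_entries):
        chunk = entries[start:start + max_entries]
        path = os.path.join(out_dir, f"{prefix}_part{len(paths) + 1}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for entry in chunk:
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths

# Hypothetical mini example: 5 requests, at most 2 per file.
demo_entries = [{"custom_id": str(i), "method": "POST"} for i in range(5)]
demo_dir = tempfile.mkdtemp()
demo_paths = write_jsonl_chunks(demo_entries, demo_dir, "demo_valid", max_entries=2)

print([os.path.basename(p) for p in demo_paths])
```

Slicing by `range(0, len(entries), max_entries)` handles the final partial file without a separate "save remaining entries" branch.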


## Step 1.4-1.6: Upload to OpenAI Batch API and Submit Job


In [6]:
from openai import OpenAI
from pathlib import Path
import time
import glob

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

print("="*60)
print("Step 1.4-1.6: OpenAI Batch API Processing")
print("="*60)


Step 1.4-1.6: OpenAI Batch API Processing


In [None]:
# Submit the JSONL files one at a time so the organization's enqueued-token
# limit for the model is not exceeded (2M tokens for this run; the actual
# limit depends on the account's usage tier).
batch_info = {}  # Store batch IDs for later retrieval


def estimate_tokens_from_file(jsonl_file):
    """Rough estimate of tokens in a JSONL file"""
    with open(jsonl_file, 'r', encoding='utf-8') as f:
        total_chars = sum(len(line) for line in f)
    # Rough estimate: 1 token ≈ 4 characters
    return int(total_chars / 4)

def wait_for_batch_space(client, model='gpt-4o-mini', max_wait_minutes=60):
    """Wait until no batch is still validating, running, or finalizing."""
    print(f"\n  ⏳ Checking batch queue status...")
    start_time = time.time()

    while True:
        # List recent batches
        batches = client.batches.list(limit=100)

        # Collect batches that have not finished yet
        in_progress_batches = [b for b in batches.data if b.status in ['validating', 'in_progress', 'finalizing']]

        if len(in_progress_batches) == 0:
            print(f"  ✅ No batches in queue, proceeding...")
            return True

        print(f"  ⏳ {len(in_progress_batches)} batch(es) still processing...")

        # Check if we've waited too long
        elapsed_minutes = (time.time() - start_time) / 60
        if elapsed_minutes > max_wait_minutes:
            print(f"  ⚠️  Waited {elapsed_minutes:.1f} minutes. Proceeding anyway...")
            return True

        # Wait before checking again
        time.sleep(30)  # Check every 30 seconds

for phase in ['valid', 'test']:
    print(f"\n{'='*60}")
    print(f"Processing {phase.upper()} phase")
    print(f"{'='*60}")
    batch_info[phase] = {}

    # Find all part files for this phase
    jsonl_files = sorted(glob.glob(f'gpt_sft_data/{DATASET_NAME}_{MODE}_{phase}_part*.jsonl'))

    print(f"Found {len(jsonl_files)} file(s) to process")

    for idx, jsonl_file in enumerate(jsonl_files, 1):
        part_num = jsonl_file.split('_part')[1].split('.')[0]
        print(f"\n[{idx}/{len(jsonl_files)}] Processing {jsonl_file}...")

        # Estimate tokens
        estimated_tokens = estimate_tokens_from_file(jsonl_file)
        print(f"  Estimated tokens: ~{estimated_tokens:,}")

        # Wait for queue space if not first file
        if idx > 1:
            wait_for_batch_space(client)

        try:
            # Step 1.4: Upload file
            print(f"  📤 Uploading file...")
            file_object = client.files.create(
                file=Path(jsonl_file),
                purpose="batch"
            )
            file_id = file_object.id
            print(f"    ✅ File ID: {file_id}")

            # Step 1.5: Submit batch job
            print(f"  📤 Submitting batch job...")
            batch = client.batches.create(
                input_file_id=file_id,
                endpoint="/v1/chat/completions",
                completion_window="24h"
            )
            batch_id = batch.id
            print(f"    ✅ Batch ID: {batch_id}")
            print(f"    Status: {batch.status}")

            batch_info[phase][part_num] = {
                'file_id': file_id,
                'batch_id': batch_id,
                'jsonl_file': jsonl_file
            }

            # If batch failed immediately, check why
            if batch.status == 'failed':
                if hasattr(batch, 'errors') and batch.errors:
                    first_error = batch.errors.data[0] if batch.errors.data else None
                    error_msg = (first_error.message if first_error else None) or "Unknown error"
                    error_code = (first_error.code if first_error else None) or ""
                    print(f"    ❌ Batch failed: {error_msg}")
                    # The limit error may be reported in the error code rather than the message
                    if 'token_limit_exceeded' in (error_code + ' ' + error_msg):
                        print(f"    ⏳ Waiting 2 minutes before retrying...")
                        time.sleep(120)
                        # Retry once
                        batch = client.batches.create(
                            input_file_id=file_id,
                            endpoint="/v1/chat/completions",
                            completion_window="24h"
                        )
                        batch_id = batch.id
                        batch_info[phase][part_num]['batch_id'] = batch_id
                        print(f"    ✅ Retry Batch ID: {batch_id}")

        except Exception as e:
            print(f"    ❌ Error: {e}")
            if 'token_limit_exceeded' in str(e):
                print(f"    ⏳ Token limit reached. Waiting 2 minutes...")
                time.sleep(120)
                # Retry
                try:
                    batch = client.batches.create(
                        input_file_id=file_id,
                        endpoint="/v1/chat/completions",
                        completion_window="24h"
                    )
                    batch_id = batch.id
                    batch_info[phase][part_num] = {
                        'file_id': file_id,
                        'batch_id': batch_id,
                        'jsonl_file': jsonl_file
                    }
                    print(f"    ✅ Retry successful. Batch ID: {batch_id}")
                except Exception as retry_e:
                    print(f"    ❌ Retry also failed: {retry_e}")

print("\n" + "="*60)
print("✅ Batch submission complete!")
print("="*60)
print("\n⚠️  Note: Batch processing can take hours.")
print("   Use Step 1.6 to check status periodically.")
print(f"\n📊 Submitted batches:")
for phase in ['valid', 'test']:
    if phase in batch_info:
        print(f"  {phase.upper()}: {len(batch_info[phase])} batch(es)")


Processing VALID phase
Found 12 file(s) to process

[1/12] Processing gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part1.jsonl...
  Estimated tokens: ~1,932,369
  📤 Uploading file...
    ✅ File ID: file-W1UJYoHFi9DPiDy9vdWx8W
  📤 Submitting batch job...
    ✅ Batch ID: batch_692116c4964c8190a208ed97ee5d9f38
    Status: validating

[2/12] Processing gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part10.jsonl...
  Estimated tokens: ~1,933,694

  ⏳ Checking batch queue status...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  ⏳ 1 batch(es) still processing...
  

In [None]:
# Step 1.6: Check batch status
print("Checking batch status...")
for phase in ['valid', 'test']:
    print(f"\n{phase.upper()} phase:")
    for part_num, info in batch_info[phase].items():
        batch = client.batches.retrieve(info['batch_id'])
        print(f"  Part {part_num}: Status = {batch.status}")
        if batch.status == 'completed':
            print(f"    Output file ID: {batch.output_file_id}")
            print(f"    Error file ID: {batch.error_file_id}")
            info['output_file_id'] = batch.output_file_id
            info['error_file_id'] = batch.error_file_id
        elif batch.status == 'failed':
            print(f"    ❌ Batch failed!")
            if batch.error_file_id:
                print(f"    Error file ID: {batch.error_file_id}")


Checking batch status...

VALID phase:
  Part 1: Status = completed
    Output file ID: file-S5t7VVgFTXxRr74QMzWppC
    Error file ID: None
  Part 10: Status = completed
    Output file ID: file-5bxmZssU8C3BBMjAV6JeQg
    Error file ID: None
  Part 11: Status = completed
    Output file ID: file-ECC3nHwvYuymdjhf7r86tQ
    Error file ID: None
  Part 12: Status = completed
    Output file ID: file-QdHuA27NSBjUXPRccGSkBx
    Error file ID: None
  Part 2: Status = completed
    Output file ID: file-DW13stgC7YCS1rk21Sp3Kt
    Error file ID: None
  Part 3: Status = completed
    Output file ID: file-DT2kBWRSAkGeE5dCC4t1VC
    Error file ID: None
  Part 4: Status = completed
    Output file ID: file-2YMc2eh9yS5A2196aFprf6
    Error file ID: None
  Part 5: Status = completed
    Output file ID: file-W9xW9Ftt5ST55STMhxjNWU
    Error file ID: None
  Part 6: Status = completed
    Output file ID: file-8Uee39X9J9ZbrywAVr5Bu1
    Error file ID: None
  Part 7: Status = completed
    Output file ID: 
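
The parts are listed as 1, 10, 11, 12, 2, ... because `sorted(glob.glob(...))` compares filenames as strings. This does not affect correctness, since `batch_info` is keyed by part number, but if numeric order is preferred, a sort key along the following lines is one option. `part_number` is a hypothetical helper, and the sketch assumes filenames end in `_part<N>.jsonl` as in Step 1.3.

```python
import re

def part_number(path):
    """Return the integer after '_part' in the filename, or -1 if absent."""
    match = re.search(r"_part(\d+)\.jsonl$", path)
    return int(match.group(1)) if match else -1

files = [
    "gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part1.jsonl",
    "gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part10.jsonl",
    "gpt_sft_data/Grocery_and_Gourmet_Food_random_valid_part2.jsonl",
]

lexicographic = [part_number(p) for p in sorted(files)]
numeric = [part_number(p) for p in sorted(files, key=part_number)]

print(lexicographic)  # [1, 10, 2]
print(numeric)        # [1, 2, 10]
```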

## Step 1.7: Download GPT Results


In [None]:
print("="*60)
print("Step 1.7: Downloading Results")
print("="*60)

for phase in ['valid', 'test']:
    print(f"\nDownloading {phase} results...")
    for part_num, info in batch_info[phase].items():
        if 'output_file_id' in info:
            print(f"  Downloading part {part_num}...")
            content = client.files.content(file_id=info['output_file_id'])
            output_file = f"{DATASET_NAME}_{MODE}_{phase}_part{part_num}_result.jsonl"
            content.write_to_file(output_file)
            print(f"    Saved: {output_file}")
            info['result_file'] = output_file
        else:
            print(f"  ⚠️  Part {part_num} not completed yet")

print("\n✅ Step 1.7 Complete!")


Step 1.7: Downloading Results

Downloading valid results...
  Downloading part 1...
    Saved: Grocery_and_Gourmet_Food_random_valid_part1_result.jsonl
  Downloading part 10...
    Saved: Grocery_and_Gourmet_Food_random_valid_part10_result.jsonl
  Downloading part 11...
    Saved: Grocery_and_Gourmet_Food_random_valid_part11_result.jsonl
  Downloading part 12...
    Saved: Grocery_and_Gourmet_Food_random_valid_part12_result.jsonl
  Downloading part 2...
    Saved: Grocery_and_Gourmet_Food_random_valid_part2_result.jsonl
  Downloading part 3...
    Saved: Grocery_and_Gourmet_Food_random_valid_part3_result.jsonl
  Downloading part 4...
    Saved: Grocery_and_Gourmet_Food_random_valid_part4_result.jsonl
  Downloading part 5...
    Saved: Grocery_and_Gourmet_Food_random_valid_part5_result.jsonl
  Downloading part 6...
    Saved: Grocery_and_Gourmet_Food_random_valid_part6_result.jsonl
  Downloading part 7...
    Saved: Grocery_and_Gourmet_Food_random_valid_part7_result.jsonl
  Downloading 
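
Each line of a downloaded result file is one JSON record, and Step 1.8 reads the model output from `response.body.choices[0].message.content`. The sketch below shows one defensive way to walk that path so that a record with a missing or null `response` yields `None` instead of raising. `extract_content` and both sample records are hypothetical; the second record, including its error code, is made up to illustrate a failed request.

```python
import json

def extract_content(line):
    """Return (custom_id, content) for one result line; content is None if unavailable."""
    record = json.loads(line)
    custom_id = record.get("custom_id")
    # `or {}` also covers keys that are present but set to null.
    body = (record.get("response") or {}).get("body") or {}
    choices = body.get("choices") or []
    if not choices:
        return custom_id, None
    message = choices[0].get("message") or {}
    return custom_id, message.get("content")

ok_line = json.dumps({
    "custom_id": "42",
    "response": {"body": {"choices": [{"message": {"content": "{}"}}]}},
})
failed_line = json.dumps({
    "custom_id": "43",
    "response": None,
    "error": {"code": "hypothetical_error"},
})

ok_result = extract_content(ok_line)
failed_result = extract_content(failed_line)
print(ok_result)      # ('42', '{}')
print(failed_result)  # ('43', None)
```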

## Step 1.8: Parse and Extract Information


In [1]:
# Mount Google Drive (if files are stored there)
from google.colab import drive
drive.mount('/content/drive')

# Or upload files directly in Colab
# Set your working directory
import os
WORK_DIR = '/content/drive/MyDrive/CoT-Rec'  # Change this to your directory
os.chdir(WORK_DIR)

Mounted at /content/drive


In [7]:
import json
import re

def clean_json_content(content):
    """Clean JSON-like content by removing trailing commas."""
    cleaned_content = re.sub(r',\s*$', '', content.strip())
    return cleaned_content

def parse_dict_content(content, custom_id, field_name):
    """Parse and clean dictionary-like content."""
    cleaned_content = clean_json_content(content)
    try:
        return json.loads("{" + cleaned_content + "}")
    except json.JSONDecodeError as e:
        print(f"Error parsing {field_name} (custom_id: {custom_id}): {content[:100]}...")
        return {}

def is_valid_extraction(content_dict):
    """Validate if all required fields are present."""
    required_fields = ["user_preferences", "candidate_perception"]
    return all(field in content_dict and content_dict[field] for field in required_fields)

def process_jsonl_file(file_path):
    """
    Process JSONL result file and extract user preferences and candidate perceptions.

    Returns:
        custom_id_to_content: Dictionary mapping user_id to extracted content
        failed_custom_ids: List of failed user IDs
    """
    custom_id_to_content = {}
    failed_custom_ids = []

    # Regex patterns to extract required fields.
    # The user_preferences pattern accepts escaped quotes (\") inside the string,
    # so the match does not stop at the first embedded quote.
    patterns = {
        "user_history_perception": r'"user_history_perception"\s*:\s*\{(.*?)\}',
        "user_preferences": r'"user_preferences"\s*:\s*"((?:[^"\\]|\\.)*)"',
        "candidate_temp_perception": r'"candidate_temp_perception"\s*:\s*\{(.*?)\}',
        "candidate_perception": r'"candidate_perception"\s*:\s*\{(.*?)\}'
    }

    with open(file_path, 'r', encoding='utf-8') as file:
        for line_number, line in enumerate(file, start=1):
            try:
                data = json.loads(line)
                custom_id = data.get('custom_id')
                content = data.get('response', {}).get('body', {}).get('choices', [])[0].get('message', {}).get('content')

                if not custom_id:
                    print(f"Line {line_number}: Missing custom_id.")
                    failed_custom_ids.append(None)
                    continue

                if not content:
                    print(f"Line {line_number}: Missing content for custom_id {custom_id}.")
                    failed_custom_ids.append(custom_id)
                    continue

                try:
                    # Try parsing content as JSON directly
                    content_json = json.loads(content)
                    if is_valid_extraction(content_json):
                        custom_id_to_content[custom_id] = {
                            "user_history_perception": content_json.get("user_history_perception", {}),
                            "user_preferences": content_json.get("user_preferences", ""),
                            "candidate_temp_perception": content_json.get("candidate_temp_perception", {}),
                            "candidate_perception": content_json.get("candidate_perception", {})
                        }
                    else:
                        print(f"Missing fields for custom_id {custom_id}.")
                        failed_custom_ids.append(custom_id)

                except json.JSONDecodeError:
                    # Fall back to regex extraction
                    extracted_content = {}
                    for field, pattern in patterns.items():
                        match = re.search(pattern, content, re.DOTALL)
                        if match:
                            if field == "user_preferences":
                                extracted_content[field] = match.group(1)
                            else:
                                extracted_content[field] = parse_dict_content(match.group(1), custom_id, field)
                        # A missing field is not recorded as a failure here. Only
                        # user_preferences and candidate_perception are required (checked by
                        # is_valid_extraction below); user_history_perception and
                        # candidate_temp_perception are optional.

                    if is_valid_extraction(extracted_content):
                        custom_id_to_content[custom_id] = {
                            "user_history_perception": extracted_content.get("user_history_perception", {}),
                            "user_preferences": extracted_content.get("user_preferences", ""),
                            "candidate_temp_perception": extracted_content.get("candidate_temp_perception", {}),
                            "candidate_perception": extracted_content.get("candidate_perception", {})
                        }
                    else:
                        failed_custom_ids.append(custom_id)

            except json.JSONDecodeError as e:
                print(f"Line {line_number}: Error decoding JSON: {e}")
                failed_custom_ids.append(custom_id if 'custom_id' in locals() else None)

    unique_failed_custom_ids = list(set(failed_custom_ids))
    return custom_id_to_content, unique_failed_custom_ids

print("✅ Parsing functions ready!")


✅ Parsing functions ready!
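
The regex fallback captures each object with a non-greedy `\{(.*?)\}`, which ends at the first `}` it meets. An item title or description that contains a brace is therefore cut short and then fails in `parse_dict_content`. As a minimal sketch of one alternative (not part of the original pipeline), `json.JSONDecoder.raw_decode` can parse a complete object starting at its opening brace. The helper name `extract_object_field` and the sample string are hypothetical and used only for illustration.

```python
import json
import re

def extract_object_field(text, field):
    """Return the JSON object stored under `field` in `text`, or {} if none parses."""
    # Locate `"field":` and stop right before the opening brace of its value.
    match = re.search(r'"%s"\s*:\s*(?=\{)' % re.escape(field), text)
    if not match:
        return {}
    try:
        # raw_decode parses one complete JSON value starting at the given index
        # and ignores whatever follows it.
        obj, _end = json.JSONDecoder().raw_decode(text, match.end())
    except json.JSONDecodeError:
        return {}
    return obj if isinstance(obj, dict) else {}

# Made-up model output: prose around the JSON makes json.loads fail on the whole
# string, and the brace inside the item title defeats the non-greedy regex.
raw = (
    'Here is the result:\n'
    '{"user_preferences": "likes tea", '
    '"candidate_perception": {"Tea {12 ct}": "Matches taste."}}\n'
    'Done.'
)
print(extract_object_field(raw, "candidate_perception"))
```

Truncated responses still return `{}` with this approach, so they would continue to count as failures for the required fields.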


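Simplified variant: the cell below extracts only the two required fields, `user_preferences` and `candidate_perception`. If run, it redefines `process_jsonl_file` and replaces the four-field version above.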

In [None]:
import re

def clean_json_content(content):
    """Clean JSON-like content by removing trailing commas."""
    cleaned_content = re.sub(r',\s*$', '', content.strip())
    return cleaned_content

def parse_dict_content(content, custom_id, field_name):
    """Parse and clean dictionary-like content."""
    cleaned_content = clean_json_content(content)
    try:
        return json.loads("{" + cleaned_content + "}")
    except json.JSONDecodeError as e:
        print(f"Error parsing {field_name} (custom_id: {custom_id}): {content[:100]}...")
        return {}

def is_valid_extraction(content_dict):
    """Validate if all required fields are present."""
    required_fields = ["user_preferences", "candidate_perception"]
    return all(field in content_dict and content_dict[field] for field in required_fields)

def process_jsonl_file(file_path):
    """
    Process JSONL result file and extract user preferences and candidate perceptions.

    Returns:
        custom_id_to_content: Dictionary mapping user_id to extracted content
        failed_custom_ids: List of failed user IDs
    """
    custom_id_to_content = {}
    failed_custom_ids = []

    # Regex patterns to extract required fields
    patterns = {
        "user_preferences": r'"user_preferences"\s*:\s*"(.*?)"',
        "candidate_perception": r'"candidate_perception"\s*:\s*\{(.*?)\}'
    }

    with open(file_path, 'r', encoding='utf-8') as file:
        for line_number, line in enumerate(file, start=1):
            try:
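                custom_id = None  # reset so a bad line never reports the previous line's ID
                data = json.loads(line)
                custom_id = data.get('custom_id')
                # Tolerate a null response/body or an empty choices list instead of raising
                body = (data.get('response') or {}).get('body') or {}
                choices = body.get('choices') or []
                content = (choices[0].get('message') or {}).get('content') if choices else None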

                if not custom_id:
                    print(f"Line {line_number}: Missing custom_id.")
                    failed_custom_ids.append(None)
                    continue

                if not content:
                    print(f"Line {line_number}: Missing content for custom_id {custom_id}.")
                    failed_custom_ids.append(custom_id)
                    continue

                try:
                    # Try parsing content as JSON directly
                    content_json = json.loads(content)
                    if is_valid_extraction(content_json):
                        custom_id_to_content[custom_id] = {
                            "user_preferences": content_json.get("user_preferences", ""),
                            "candidate_perception": content_json.get("candidate_perception", {})
                        }
                    else:
                        print(f"Missing fields for custom_id {custom_id}.")
                        failed_custom_ids.append(custom_id)

                except json.JSONDecodeError:
                    # Fall back to regex extraction
                    extracted_content = {}
                    for field, pattern in patterns.items():
                        match = re.search(pattern, content, re.DOTALL)
                        if match:
                            if field == "user_preferences":
                                extracted_content[field] = match.group(1)
                            else:
                                extracted_content[field] = parse_dict_content(match.group(1), custom_id, field)
                        else:
                            print(f"Missing {field} for custom_id {custom_id}.")
                            failed_custom_ids.append(custom_id)

                    if is_valid_extraction(extracted_content):
                        custom_id_to_content[custom_id] = extracted_content
                    else:
                        failed_custom_ids.append(custom_id)

            except json.JSONDecodeError as e:
                print(f"Line {line_number}: Error decoding JSON: {e}")
                failed_custom_ids.append(custom_id if 'custom_id' in locals() else None)

    unique_failed_custom_ids = list(set(failed_custom_ids))
    return custom_id_to_content, unique_failed_custom_ids

print("✅ Parsing functions ready!")


✅ Parsing functions ready!
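
`is_valid_extraction` requires both fields to be present and non-empty, so an empty string or an empty dict counts as a failure. The short sketch below repeats the function from the cell above and runs it on made-up records to show which cases pass.

```python
def is_valid_extraction(content_dict):
    """Same check as in the cell above: both required fields present and non-empty."""
    required_fields = ["user_preferences", "candidate_perception"]
    return all(field in content_dict and content_dict[field] for field in required_fields)

# Made-up records for illustration only
complete = {"user_preferences": "likes tea", "candidate_perception": {"Green Tea": "Good match."}}
empty_perception = {"user_preferences": "likes tea", "candidate_perception": {}}
missing_field = {"candidate_perception": {"Green Tea": "Good match."}}

print(is_valid_extraction(complete))          # True
print(is_valid_extraction(empty_perception))  # False: empty dict is falsy
print(is_valid_extraction(missing_field))     # False: user_preferences is absent
```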


In [10]:
import glob  # needed for glob.glob below; harmless if already imported earlier

print("="*60)
print("Step 1.8: Parsing GPT Results")
print("="*60)

all_results = {'valid': {}, 'test': {}}

for phase in ['valid', 'test']:
    print(f"\nProcessing {phase} phase...")
    phase_results = {}

    # Process all result files for this phase
    result_files = sorted(glob.glob(f"{DATASET_NAME}_{MODE}_{phase}_part*_result.jsonl"))

    for result_file in result_files:
        print(f"  Processing {result_file}...")
        result, failed_ids = process_jsonl_file(result_file)

        # Merge results
        for user_id, content in result.items():
            phase_results[user_id] = content

        print(f"    Extracted {len(result)} entries")
        if failed_ids:
            print(f"    Failed: {len(failed_ids)} entries")

    all_results[phase] = phase_results
    print(f"\n  Total extracted for {phase}: {len(phase_results)} users")

print("\n✅ Step 1.8 Complete!")


Step 1.8: Parsing GPT Results

Processing valid phase...
  Processing Grocery_and_Gourmet_Food_random_valid_part10_result.jsonl...
Error parsing candidate_temp_perception (custom_id: 326514): 
    "Kiss My Keto Bread Zero Carb (0g-Net) – Wheat Bread Loaf, Low Calorie Bread – Sugar Free Bread...
Error parsing candidate_temp_perception (custom_id: 326680): 
    "FONDX Fondant, Vanilla Flavor, Blue, 5 lb": "Vanilla-flavored blue fondant, ideal for covering...
Error parsing candidate_temp_perception (custom_id: 335925): 
    "NUTPODS Toasted Marshmallow Unsweetened Dairy Free Creamer, 11.2 FZ": "Unsweetened, dairy-free...
Error parsing candidate_temp_perception (custom_id: 341773): 
    "Hint Water Peach, Pure Water Infused with Peach, Zero Sugar, Zero Calories, Zero Sweeteners, Z...
Error parsing candidate_temp_perception (custom_id: 345622): 
    "Sparkling Ice +Caffeine Tropical Punch Sparkling Water with Caffeine, Zero Sugar, with Antioxi...
Error parsing candidate_temp_perception 
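
The log starts with `part10` because `sorted` orders the file names as strings. Results are merged into a dict keyed by user ID, so the order only matters if the same ID appears in more than one part (later files overwrite earlier ones). If numeric order is preferred, a sort key along these lines would work; `part_number` is a hypothetical helper, and the file names are examples that follow the notebook's naming pattern.

```python
import re

def part_number(path):
    """Return the integer after 'part' in a result file name, or -1 if absent."""
    match = re.search(r'part(\d+)_result\.jsonl$', path)
    return int(match.group(1)) if match else -1

# Example file names following the notebook's naming pattern
files = [
    "Grocery_and_Gourmet_Food_random_valid_part10_result.jsonl",
    "Grocery_and_Gourmet_Food_random_valid_part2_result.jsonl",
    "Grocery_and_Gourmet_Food_random_valid_part1_result.jsonl",
]

string_order = sorted(files)                    # part10, part1, part2
numeric_order = sorted(files, key=part_number)  # part1, part2, part10
print(string_order)
print(numeric_order)
```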

In [11]:
print("="*60)
print("Step 1.9: Saving Final Results")
print("="*60)

# Save validation results
valid_output_file = f'{DATASET_NAME}_valid.pkl'
with open(valid_output_file, 'wb') as file:
    pickle.dump(all_results['valid'], file)
print(f"\n✅ Saved validation results: {valid_output_file}")
print(f"   Users: {len(all_results['valid'])}")

# Save test results
test_output_file = f'{DATASET_NAME}_test.pkl'
with open(test_output_file, 'wb') as file:
    pickle.dump(all_results['test'], file)
print(f"\n✅ Saved test results: {test_output_file}")
print(f"   Users: {len(all_results['test'])}")

# Display sample result
if all_results['valid']:
    sample_user = list(all_results['valid'].keys())[0]
    print(f"\nüìã Sample result (user {sample_user}):")
    sample = all_results['valid'][sample_user]
    print(f"   Preferences: {sample['user_preferences'][:100]}...")
    print(f"   Perceptions: {len(sample['candidate_perception'])} items")
    if sample['candidate_perception']:
        first_item = list(sample['candidate_perception'].keys())[0]
        print(f"   Example: {first_item} -> {sample['candidate_perception'][first_item]}")

print("\n" + "="*60)
print("üéâ Stage 1 Complete!")
print("="*60)
print("\nNext steps:")
print("1. Use the pickle files in Stage 2 (0_Grocery_and_Gourmet_Food_sft1.py)")
print("2. Train LLM model with personalized information")
print("3. Run inference (2_inference.py)")


Step 1.9: Saving Final Results

✅ Saved validation results: Grocery_and_Gourmet_Food_valid.pkl
   Users: 9373

✅ Saved test results: Grocery_and_Gourmet_Food_test.pkl
   Users: 6849

📋 Sample result (user 321648):
   Preferences: The user enjoys sweet beverages and flavors, particularly hot cocoa and flavored drink mixes, with a...
   Perceptions: 10 items
   Example: Crush, Grape – Powder Drink Mix - (12 boxes, 72 sticks) – Sugar Free & Delicious, Makes 72 flavored water beverages -> Appealing sugar-free grape flavor.

🎉 Stage 1 Complete!

Next steps:
1. Use the pickle files in Stage 2 (0_Grocery_and_Gourmet_Food_sft1.py)
2. Train LLM model with personalized information
3. Run inference (2_inference.py)
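
Before moving on to Stage 2, it can help to reload a saved pickle and confirm that every user carries both required fields. The sketch below is one way to do that, assuming the pickle holds the `{user_id: {"user_preferences": ..., "candidate_perception": ...}}` mapping saved above. `summarize_results` is a hypothetical helper, and the demo writes made-up data to a temporary file rather than touching the real outputs.

```python
import os
import pickle
import tempfile

def summarize_results(path):
    """Load a Stage 1 pickle and return (user count, users missing a required field)."""
    with open(path, 'rb') as f:
        results = pickle.load(f)
    incomplete = [
        user_id for user_id, content in results.items()
        if not content.get("user_preferences") or not content.get("candidate_perception")
    ]
    return len(results), incomplete

# Demo on a throwaway file with made-up data; pass the path of
# Grocery_and_Gourmet_Food_valid.pkl instead to check the real output.
demo = {
    "u1": {"user_preferences": "likes tea", "candidate_perception": {"Green Tea": "Good match."}},
    "u2": {"user_preferences": "", "candidate_perception": {}},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_valid.pkl")
    with open(path, 'wb') as f:
        pickle.dump(demo, f)
    n_users, incomplete = summarize_results(path)

print(n_users, incomplete)  # 2 ['u2']
```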
