# WebShop LATs Experiment Notebook

This notebook replicates the functionality of `run.py` for running WebShop experiments using Language Agent Tree Search (LATs).

## Overview
- **Task**: WebShop shopping environment
- **Method**: Language Agent Tree Search (LATs)
- **Models**: any backend supported by LiteLLM (the default configuration uses a Qwen model served through Together AI)
- **Features**: Self-reflection, trajectory learning, and iterative improvement
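
The tree-search idea behind LATs can be illustrated on a toy tree. The sketch below is not the code in `lats.py`: it assumes the standard UCT selection rule from Monte Carlo tree search, and `ToyNode`, `uct`, `select`, and `backpropagate` are hypothetical names chosen for illustration. It only shows how selection and reward backpropagation steer visits toward the more promising branch.

```python
import math

class ToyNode:
    """Hypothetical stand-in for a search-tree node (not lats.Node)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # running mean of backed-up rewards
        if parent is not None:
            parent.children.append(self)

def uct(node, c=1.0):
    """Standard UCT score: mean value plus an exploration bonus."""
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def select(root):
    """Walk down from the root, following the child with the highest UCT score."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node

def backpropagate(node, reward):
    """Update visit counts and running-mean values from a leaf up to the root."""
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent

root = ToyNode("root")
a = ToyNode("a", root)
b = ToyNode("b", root)

# Assumption for the demo: rollouts from 'a' always score 0.2, from 'b' 0.9.
rewards = {"a": 0.2, "b": 0.9}
for _ in range(20):
    leaf = select(root)
    backpropagate(leaf, rewards[leaf.name])

print({n.name: n.visits for n in (a, b)})
```

In the real search, the fixed rewards would be replaced by expansion, LLM-based evaluation, and a rollout in the WebShop environment.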


In [6]:
# Import required libraries
import os
import json
import logging
import sys
import copy
import itertools
import numpy as np
from functools import partial
import requests
import random
from bs4 import BeautifulSoup
from bs4.element import Comment
import backoff
import openai
from transformers import GPT2Tokenizer
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Add the current directory to Python path to import local modules
webshop_path = os.path.abspath(os.path.join(os.getcwd(), '../LanguageAgentTreeSearch/webshop'))
sys.path.append(webshop_path)

# Import local modules
from models import LiteLLMModel
from webshop import WebShopTask

# Import the specific functions we need from lats.py
from lats import (
    select_node, expand_node, rollout, backpropagate, 
    collect_all_nodes, Node, env, evaluate_node
)

print("All imports successful!")


All imports successful!


## Configuration and Setup

Configure your experiment parameters here. You can modify these values to run different experiments.

**Note**: This notebook accesses models through LiteLLM. The default Together AI backend requires the `TOGETHER_API_KEY` environment variable to be set.
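
If you switch backends, the API key you need changes too. The sketch below shows one way to check for the right variable before a run. The prefix-to-variable mapping is an assumption for illustration (extend it for the providers you use), and `required_key` and `missing_key` are hypothetical helpers, not part of LiteLLM or this repository.

```python
import os

# Assumed mapping from model-name prefix to the API-key variable the provider
# expects; adjust it for the backends you actually use.
PROVIDER_KEYS = {
    "together_ai/": "TOGETHER_API_KEY",
    "gpt-": "OPENAI_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
}

def required_key(backend):
    """Return the env var name assumed to be needed for a backend, or None."""
    for prefix, var in PROVIDER_KEYS.items():
        if backend.startswith(prefix):
            return var
    return None

def missing_key(backend, environ=os.environ):
    """Return the name of the unset variable, or None if nothing is missing."""
    var = required_key(backend)
    if var is not None and not environ.get(var):
        return var
    return None

# With an empty environment the Together AI key is reported as missing.
print(missing_key("together_ai/Qwen/Qwen3-Next-80B-A3B-Instruct", environ={}))  # TOGETHER_API_KEY
```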


In [14]:
from dotenv import load_dotenv
load_dotenv(override=True)

# Check environment variables
if not os.getenv("TOGETHER_API_KEY"):
    print("WARNING: TOGETHER_API_KEY environment variable is not set!")
    print("Please set it before running the experiment:")
    print("export TOGETHER_API_KEY='your_api_key_here'")
else:
    print("✓ TOGETHER_API_KEY is set")

# Configuration parameters - modify these as needed
class ExperimentConfig:
    def __init__(self):
        # Model configuration - using LiteLLM model names
        self.backend = "together_ai/Qwen/Qwen3-Next-80B-A3B-Instruct"
        self.temperature = 1.0
        self.max_tokens = 4096
        
        # Task configuration
        self.task_start_index = 900
        self.task_end_index = 1000
        
        # Sampling configuration
        self.prompt_sample = 'standard'  # Options: 'standard', 'cot'
        self.n_generate_sample = 1
        self.n_evaluate_sample = 1
        
        # Search configuration
        self.iterations = 30
        
        # Logging configuration
        self.log_file = f'webshop_experiment_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
        
        # Display configuration
        self.print_progress = True

# Create configuration instance
config = ExperimentConfig()

print("\nExperiment Configuration:")
print(f"Backend: {config.backend}")
print(f"Temperature: {config.temperature}")
print(f"Max Tokens: {config.max_tokens}")
print(f"Task range: {config.task_start_index} to {config.task_end_index}")
print(f"Prompt sample: {config.prompt_sample}")
print(f"Iterations: {config.iterations}")
print(f"Log file: {config.log_file}")


✓ TOGETHER_API_KEY is set

Experiment Configuration:
Backend: together_ai/Qwen/Qwen3-Next-80B-A3B-Instruct
Temperature: 1.0
Max Tokens: 4096
Task range: 900 to 1000
Prompt sample: standard
Iterations: 30
Log file: webshop_experiment_20251012_160826.log


In [15]:
# Initialize the LiteLLM model
print("Initializing LiteLLM model...")
model = LiteLLMModel(
    model=config.backend,
    temperature=config.temperature,
    max_tokens=config.max_tokens
)
print(f"LiteLLM model initialized: {config.backend}")

# Initialize the WebShop task
task = WebShopTask()
print("WebShop Task initialized:")
print(task)


Initializing LiteLLM model...
LiteLLM model initialized: together_ai/Qwen/Qwen3-Next-80B-A3B-Instruct
WebShop Task initialized:
<webshop.WebShopTask object at 0x15b88c690>


In [None]:
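# Create wrapper functions to make LiteLLM compatible with existing code
def gpt(prompt, model=None, temperature=None, max_tokens=None, n=1, stop=None):
    """
    Wrapper that exposes the LiteLLM model through the existing gpt() interface.

    temperature, max_tokens, and stop are accepted for interface compatibility
    but are not forwarded; the model object keeps the settings it was created with.
    Always returns a list of responses.
    """
    if model is None:
        raise ValueError("gpt() requires a LiteLLMModel instance via the 'model' argument")
    if isinstance(prompt, str):
        # Single prompt
        if n == 1:
            return [model.send_request(prompt)]
        # Multiple samples of the same prompt
        return model.send_requests([prompt] * n)
    # List of prompts: one response per prompt (n is ignored here)
    return model.send_requests(prompt)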

def gpt_usage(backend=None):
    """
    Placeholder for usage tracking. This notebook does not collect token
    counts or cost from the model wrapper, so every field is zero.
    """
    return {
        'completion_tokens': 0,
        'prompt_tokens': 0,
        'cost': 0.0
    }

# Test the wrapper
print("Testing LiteLLM wrapper...")
test_prompt = "Hello, this is a test prompt."
test_response = gpt(test_prompt, model=model, n=1)
print(f"Test response: {test_response[0][:100]}...")
print("LiteLLM wrapper functions created successfully!")

Testing LiteLLM wrapper...
Test response: Hello! It looks like you're testing the waters—welcome! 😊  
How can I assist you today? Whether it’s...
LiteLLM wrapper functions created successfully!


In [None]:
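# Redefine the lats_search function for use alongside the LiteLLM model
def lats_search_with_litellm(args, task, idx, model, iterations=50, to_print=True):
    """
    LATs search for one WebShop task. Search state is kept in local variables,
    but the environment is the shared `env` object imported from lats.py.
    """
    # Convenience closure around the LiteLLM model.
    # NOTE: it is not passed to expand_node, evaluate_node, or rollout below,
    # so those helpers use whatever LLM call lats.py itself defines.
    def gpt_with_litellm(prompt, n=1, stop=None):
        return gpt(prompt, model=model, n=n, stop=stop)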
    
    # Set up the environment
    action = 'reset'
    logging.basicConfig(filename=args.log, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', filemode='a')
    
    x = env.step(idx, action)[0]
    if to_print:
        print(idx, x)
    
    root = Node(state=None, question=x)
    root.env_state = copy.deepcopy(env.sessions)
    all_nodes = []
    failed_trajectories = []  # Local variable, not global
    reflection_map = []        # Local variable, not global
    terminal_nodes = []

    for i in range(iterations):
        logging.info(f"Iteration {i + 1}...")
        node = select_node(root)

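        # Reselect while the chosen node is missing or a failed terminal node.
        # The attempt cap (an arbitrary safeguard) keeps this loop from spinning
        # forever and makes the "no viable node" exit below reachable.
        reselect_attempts = 0
        while node is None or (node.is_terminal and node.reward != 1):
            if reselect_attempts >= 100:
                node = None
                break
            logging.info(f"Need to backtrack or terminal node with reward 0 found at iteration {i + 1}, reselecting...")
            node = select_node(root)
            reselect_attempts += 1
        
        if node is None:
            logging.info("All paths lead to terminal nodes with reward 0. Ending search.")
            break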

        if node.is_terminal and node.reward == 1:
            logging.info(f"Terminal node with reward 1 found at iteration {i + 1}")
            # Return the same 4-tuple as every other exit so the caller can unpack it
            return node.state, node.value, node.reward, node.em
        
        expand_node(node, args, task, idx)

        while node.is_terminal:
            logging.info(f"Depth limit node found at iteration {i + 1}, reselecting...")
            node = select_node(root)
            expand_node(node, args, task, idx)

        val = evaluate_node(node, args, task, idx)
        # Simulation or rollout
        terminal_node = rollout(max(node.children, key=lambda child: child.value), args, task, idx, max_depth=15)
        terminal_nodes.append(terminal_node)

        if terminal_node.reward == 1:
            logging.info("Successful trajectory found")
            logging.info(f"Terminal node with reward 1 found at iteration {i + 1}")
            return terminal_node.state, terminal_node.value, terminal_node.reward, terminal_node.em
        # Backpropagate reward
        backpropagate(terminal_node, terminal_node.reward)
        
        all_nodes = [(node, node.reward) for node in collect_all_nodes(root)]
        if to_print:
            print("searching all nodes...")
        # Check for terminal nodes with a reward of 1
        terminal_nodes_with_reward_1 = [node for node, reward in all_nodes if node.is_terminal and node.reward == 1]

        if terminal_nodes_with_reward_1:
            logging.info("Successful trajectory found")
            logging.info(f"Terminal node with reward 1 found at iteration {i + 1}")
            best_node = max(terminal_nodes_with_reward_1, key=lambda x: x.reward)
            return best_node.state, best_node.value, best_node.reward, best_node.em
    
        for j, (node, value) in enumerate(all_nodes):
            logging.info(f"Node {j+1}: {str(node)}")

        node_strings = '\n'.join(str(node[0]) for node in all_nodes)
        logging.info(f"State of all_nodes after iteration {i + 1}:\n{node_strings}")

    #best_child = max(root.children, key=lambda x: x.reward)
    all_nodes_list = collect_all_nodes(root)
    all_nodes_list.extend(terminal_nodes)
    best_child = max(all_nodes_list, key=lambda x: x.reward)
    if to_print:
        print("best reward found", best_child.reward)
    if best_child.reward == 1:
        logging.info("Successful trajectory found")
    else:
        logging.info("Unsuccessful/Partially Successful trajectory found")
    return best_child.state, best_child.value, best_child.reward, best_child.em

print("Clean lats_search function defined with LiteLLM support - no globals!")


In [None]:
# Setup logging
logging.basicConfig(
    filename=config.log_file, 
    level=logging.INFO, 
    format='%(asctime)s - %(levelname)s - %(message)s', 
    filemode='a'
)

print(f"Logging configured. Log file: {config.log_file}")


In [None]:
# Main experiment function - replicates the functionality of run.py
def run_experiment(config):
    """
    Run the WebShop experiment with LATs search.
    This function replicates the main logic from run.py
    """
    # Initialize tracking variables
    logs = []
    task_accs = []
    info = []
    count = 0
    n = config.task_end_index - config.task_start_index
    
    print(f"Starting experiment with {n} tasks...")
    print(f"Task range: {config.task_start_index} to {config.task_end_index}")
    print("=" * 50)
    
    # Create a simple args object for compatibility with lats_search
    class Args:
        def __init__(self, config):
            self.backend = config.backend
            self.temperature = config.temperature
            self.prompt_sample = config.prompt_sample
            self.n_generate_sample = config.n_generate_sample
            self.n_evaluate_sample = config.n_evaluate_sample
            self.iterations = config.iterations
            self.log = config.log_file
    
    args = Args(config)
    
    # Main experiment loop
    for i in range(config.task_start_index, config.task_end_index):
        print(f"\n--- Task {i+1} (index {i}) ---")
        
        try:
            # Run LATs search for this task using our custom function
            state, value, reward, em = lats_search_with_litellm(args, task, f'fixed_{i}', model, config.iterations, config.print_progress)
            
            # Track results
            task_accs.append(reward)
            print(f"Task {i+1} - Best reward: {reward}")
            
            # Print progress every task
            if (i+1) % 1 == 0:
                r = sum(task_accs) / len(task_accs)  # Average reward
                sr = len([_ for _ in task_accs if _ == 1]) / len(task_accs)  # Success rate
                fr = count / len(task_accs)  # Failure rate
                print(f"Progress - Task {i+1}: Avg Reward: {r:.3f}, Success Rate: {sr:.3f}, Failure Rate: {fr:.3f}")
                print('-' * 30)
            
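            # Log running results over the tasks completed so far
            # (same denominators as the progress line above)
            done = len(task_accs)
            r, sr, fr = sum(task_accs) / done, sum(1 for a in task_accs if a == 1) / done, count / done
            logging.info(f"RESULTS: {r}, {sr}, {fr}")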
            
        except Exception as e:
            print(f"Error in task {i+1}: {str(e)}")
            logging.error(f"Error in task {i+1}: {str(e)}")
            task_accs.append(0)  # Add 0 reward for failed task
            count += 1
    
    # Final results
    n = config.task_end_index - config.task_start_index
    final_r = sum(task_accs) / len(task_accs) if task_accs else 0
    final_sr = len([_ for _ in task_accs if _ == 1]) / n if n > 0 else 0
    final_fr = count / n if n > 0 else 0
    
    print("\n" + "=" * 50)
    print("FINAL RESULTS:")
    print(f"Average Reward: {final_r:.3f}")
    print(f"Success Rate: {final_sr:.3f}")
    print(f"Failure Rate: {final_fr:.3f}")
    print(f"Total Tasks: {n}")
    # Note: usage is a zero-valued placeholder (see gpt_usage)
    usage_info = gpt_usage(config.backend)
    print(f"Usage tracking: {usage_info}")
    
    return {
        'task_accs': task_accs,
        'final_r': final_r,
        'final_sr': final_sr,
        'final_fr': final_fr,
        'usage': usage_info
    }

print("Experiment function defined. Ready to run!")

In [None]:
# Run the experiment
print("Starting WebShop LATs Experiment...")
print("=" * 60)

# Run the experiment and capture results
results = run_experiment(config)

print("\nExperiment completed!")
# Nothing is written to disk here; this only prints the in-memory results
print("Results:", results)


In [None]:
# Analyze and visualize results
def analyze_results(results):
    """
    Analyze the experiment results and create visualizations
    """
    task_accs = results['task_accs']
    
    # Create a DataFrame for analysis
    df = pd.DataFrame({
        'Task': range(len(task_accs)),
        'Reward': task_accs,
        'Success': [1 if r == 1.0 else 0 for r in task_accs]
    })
    
    print("=== EXPERIMENT ANALYSIS ===")
    print(f"Total Tasks: {len(task_accs)}")
    print(f"Average Reward: {results['final_r']:.3f}")
    print(f"Success Rate: {results['final_sr']:.3f}")
    print(f"Failure Rate: {results['final_fr']:.3f}")
    print(f"Max Reward: {max(task_accs):.3f}")
    print(f"Min Reward: {min(task_accs):.3f}")
    
    # Usage information
    usage = results['usage']
    print(f"\n=== USAGE STATISTICS ===")
    print(f"Completion Tokens: {usage['completion_tokens']:,}")
    print(f"Prompt Tokens: {usage['prompt_tokens']:,}")
    print(f"Total Cost: ${usage['cost']:.4f}")
    
    return df

# Analyze results
if 'results' in locals():
    df = analyze_results(results)
else:
    print("No results available yet. Run the experiment first.")


In [None]:
# Create visualizations
def create_visualizations(df):
    """
    Create visualizations of the experiment results
    """
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # 1. Reward over tasks
    axes[0, 0].plot(df['Task'], df['Reward'], 'b-', alpha=0.7, linewidth=1)
    axes[0, 0].set_title('Reward per Task')
    axes[0, 0].set_xlabel('Task Number')
    axes[0, 0].set_ylabel('Reward')
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Success rate over time (cumulative)
    df['Cumulative_Success'] = df['Success'].cumsum()
    df['Cumulative_Success_Rate'] = df['Cumulative_Success'] / (df['Task'] + 1)
    axes[0, 1].plot(df['Task'], df['Cumulative_Success_Rate'], 'g-', linewidth=2)
    axes[0, 1].set_title('Cumulative Success Rate')
    axes[0, 1].set_xlabel('Task Number')
    axes[0, 1].set_ylabel('Success Rate')
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Reward distribution
    axes[1, 0].hist(df['Reward'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[1, 0].set_title('Reward Distribution')
    axes[1, 0].set_xlabel('Reward')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. Success vs Failure
    success_count = df['Success'].sum()
    failure_count = len(df) - success_count
    axes[1, 1].pie([success_count, failure_count], 
                   labels=['Success', 'Failure'], 
                   autopct='%1.1f%%',
                   colors=['lightgreen', 'lightcoral'])
    axes[1, 1].set_title('Success vs Failure')
    
    plt.tight_layout()
    plt.show()
    
    return df

# Create visualizations if results are available
if 'results' in locals() and 'df' in locals():
    df_with_viz = create_visualizations(df)
else:
    print("No results available for visualization. Run the experiment first.")


In [None]:
# Save results to file
def save_results(results, filename=None):
    """
    Save experiment results to a JSON file
    """
    if filename is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"webshop_results_{timestamp}.json"
    
    # Prepare results for JSON serialization
    save_data = {
        'config': {
            'backend': config.backend,
            'temperature': config.temperature,
            'task_start_index': config.task_start_index,
            'task_end_index': config.task_end_index,
            'prompt_sample': config.prompt_sample,
            'iterations': config.iterations
        },
        'results': {
            'task_accs': results['task_accs'],
            'final_r': results['final_r'],
            'final_sr': results['final_sr'],
            'final_fr': results['final_fr'],
            'usage': results['usage']
        },
        'timestamp': datetime.now().isoformat()
    }
    
    with open(filename, 'w') as f:
        json.dump(save_data, f, indent=2)
    
    print(f"Results saved to: {filename}")
    return filename

# Save results if available
if 'results' in locals():
    results_file = save_results(results)
else:
    print("No results to save. Run the experiment first.")


## Quick Start Guide

### Prerequisites:
1. Set the `TOGETHER_API_KEY` environment variable
2. Make sure you have the required dependencies installed

### To run a single task experiment:
```python
# Modify config for a single task
config.task_start_index = 900
config.task_end_index = 901
config.iterations = 10  # Reduce iterations for faster testing

# Run the experiment
results = run_experiment(config)
```

### To run a larger experiment:
```python
# Modify config for multiple tasks
config.task_start_index = 900
config.task_end_index = 910  # 10 tasks
config.iterations = 30

# Run the experiment
results = run_experiment(config)
```
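
The summary metrics that `run_experiment` reports can also be recomputed from a plain list of per-task rewards. A minimal sketch, assuming a reward of exactly 1 counts as a success; `summarize_rewards` is a hypothetical helper, not part of the notebook or `run.py`:

```python
def summarize_rewards(rewards):
    """Return task count, average reward, and success rate for per-task rewards."""
    if not rewards:
        return {"n": 0, "avg_reward": 0.0, "success_rate": 0.0}
    n = len(rewards)
    successes = sum(1 for r in rewards if r == 1)
    return {"n": n, "avg_reward": sum(rewards) / n, "success_rate": successes / n}

summary = summarize_rewards([1, 0.5, 0, 1])
print(summary)  # {'n': 4, 'avg_reward': 0.625, 'success_rate': 0.5}
```

With real results, pass `results['task_accs']` instead of the example list.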

### To change the model:
```python
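# Change the backend model (any model name LiteLLM supports)
config.backend = 'gpt-4'  # or 'gpt-3.5-turbo', 'gpt-3.5-turbo-16k', 'claude-3-sonnet', etc.
config.temperature = 0.7  # Adjust temperature
config.max_tokens = 4096  # Adjust max tokens

# Re-create the model: the existing `model` object was built from the old
# config. A different provider also needs its own API key to be set.
model = LiteLLMModel(model=config.backend, temperature=config.temperature, max_tokens=config.max_tokens)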
```

### To use Chain-of-Thought prompting:
```python
# Enable CoT prompting
config.prompt_sample = 'cot'
```

### Note about LiteLLM:
This notebook uses LiteLLM for model access, which provides:
- Support for multiple model providers
- Automatic retry logic
- Concurrent request handling
- Unified interface across different models
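
To make the retry and concurrency ideas concrete, here is a standard-library sketch. It does not show LiteLLM's internals or the `LiteLLMModel` class; `with_retries`, `send_all`, and the `flaky_echo` stand-in for a model call are hypothetical names used only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, max_attempts=3, base_delay=0.01):
    """Wrap fn so failures are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** attempt)
    return wrapped

def send_all(fn, prompts, max_workers=4):
    """Run fn over prompts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, prompts))

# Stand-in for a model call that fails once per prompt before succeeding.
_seen = set()
def flaky_echo(prompt):
    if prompt not in _seen:
        _seen.add(prompt)
        raise RuntimeError("transient error")
    return f"echo: {prompt}"

responses = send_all(with_retries(flaky_echo), ["a", "b", "c"])
print(responses)  # ['echo: a', 'echo: b', 'echo: c']
```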