## Introduction

Processing large volumes of text with Large Language Models (LLMs) is slow when requests are sent one at a time, and error-prone when provider rate limits are ignored. This post introduces `BatchProcessor`, a utility that combines asynchronous requests, bounded concurrency, and token-based rate limiting to process lists of texts or entire Pandas DataFrames.

A key feature is its integration with Pydantic, which allows you to define a structured output format for the LLM. This ensures that the model's responses are automatically validated and parsed into clean, type-safe Python objects, which is essential for building reliable data extraction pipelines.

The `BatchProcessor` is built upon two other key components:

-   `llm_config.py`: Manages LLM API configurations, making it easy to switch between different models and providers (like Azure OpenAI, OpenRouter, etc.).
-   `rate_limiters.py`: Handles the complexities of API rate limiting to prevent errors and maximize throughput.

Let's dive in and see how it works.

## Setup and Initialization

First, let's import the necessary libraries and initialize the `BatchProcessor`.

In [1]:
from utils.llm.batch_llm import BatchProcessor
from utils.llm.llm_config import LLMConfigs
from pydantic import BaseModel
import pandas as pd
from pprint import pprint

# Set up the configuration for the desired LLM
# Here, we use OpenRouter to access Google's Gemini 2.5 Flash-Lite model
config = LLMConfigs.openrouter(model="google/gemini-2.5-flash-lite")

# Initialize the processor with the config
processor = BatchProcessor(llm_config=config)

### Configuring the Rate Limiter

When you initialize the `BatchProcessor`, you can also control its rate-limiting behavior. This is crucial for managing API costs and staying within the limits of your LLM provider. The constructor accepts three rate-limiting arguments:

-   `max_tokens_per_minute`: Sets the overall throughput limit for the processor.
-   `safety_margin`: A buffer to prevent exceeding the rate limit due to slight variations in token counting. A value of `0.9` means the processor will target 90% of the specified limit.
-   `max_concurrent`: Controls the number of asynchronous requests that can be active at any given time. Increasing this can improve speed, but may lead to rate-limiting errors if set too high.

In [2]:
# Example of custom rate-limiting settings
custom_processor = BatchProcessor(
    llm_config=config,
    max_tokens_per_minute=100000, # Max tokens to process per minute
    safety_margin=0.9,             # Use 90% of the token limit to be safe
    max_concurrent=100             # Max number of parallel API calls
)
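
To make the interplay of these three settings concrete, here is a minimal sketch of a token-based limiter. It is not the code in `rate_limiters.py`; the `TokenBudget` class, its methods, and the `call_llm` helper are hypothetical names introduced for illustration. The only behavior taken from the description above is that the effective budget is the configured limit multiplied by the safety margin, and that a separate cap bounds the number of in-flight requests.

```python
import asyncio
import time


class TokenBudget:
    """Hypothetical sliding-window token limiter, for illustration only."""

    def __init__(self, max_tokens_per_minute, safety_margin=0.9,
                 max_concurrent=10, window_seconds=60.0):
        # Target a fraction of the provider limit to absorb token-counting errors.
        self.effective_limit = int(max_tokens_per_minute * safety_margin)
        self.window_seconds = window_seconds
        # Cap the number of requests that may be in flight at once.
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self._events = []  # (timestamp, tokens) pairs inside the current window

    def used(self, now):
        # Forget usage older than the window, then sum what remains.
        self._events = [(t, n) for t, n in self._events
                        if now - t < self.window_seconds]
        return sum(n for _, n in self._events)

    async def acquire(self, tokens):
        if tokens > self.effective_limit:
            raise ValueError("request is larger than the per-minute budget")
        # Sleep until the request fits into the remaining budget.
        while self.used(time.monotonic()) + tokens > self.effective_limit:
            await asyncio.sleep(0.01)
        self._events.append((time.monotonic(), tokens))


async def call_llm(budget, tokens):
    async with budget.semaphore:       # concurrency cap
        await budget.acquire(tokens)   # token budget
        await asyncio.sleep(0)         # placeholder for the real API request
        return tokens


async def main():
    budget = TokenBudget(max_tokens_per_minute=100000, safety_margin=0.9,
                         max_concurrent=100)
    print(budget.effective_limit)
    done = await asyncio.gather(*(call_llm(budget, 1000) for _ in range(5)))
    print(sum(done))


asyncio.run(main())
```

With the settings from the cell above, the sketch targets 90,000 tokens per minute rather than the full 100,000. In a notebook, replace `asyncio.run(main())` with `await main()`, since an event loop is already running there.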

## Core Functionality

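The processor offers three methods, one per input type: `process_single` for a single string, `process_batch` for a list of strings, and `process_dataframe` for a DataFrame column.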

### Processing a Single Text

The simplest use case is processing a single string. You can get a raw string response or a structured Pydantic model.

In [3]:
# Define a Pydantic model for structured output
class Book(BaseModel):
    title: str
    author: str

# Process with a response model to get a Pydantic object
structured_result = await processor.process_single(
    "The Name of the Wind by Patrick Rothfuss", 
    response_model=Book
)

print("--- Structured Output ---")
pprint(structured_result)
print(f"Type: {type(structured_result)}")

# Process without a response model to get a raw string
raw_result = await processor.process_single(
    "Give the title and author of this book in json format: The Name of the Wind by Patrick Rothfuss"
)

print("\n--- Raw String Output ---")
print(raw_result)
print(f"Type: {type(raw_result)}")

--- Structured Output ---
Book(title='The Name of the Wind', author='Patrick Rothfuss')
Type: <class '__main__.Book'>

--- Raw String Output ---
```json
{
  "title": "The Name of the Wind",
  "author": "Patrick Rothfuss"
}
```
Type: <class 'str'>
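
The raw response above arrives wrapped in a Markdown code fence, so it cannot be passed to `json.loads` as-is. The sketch below shows the kind of manual cleanup that passing a `response_model` spares you. The helper name `parse_fenced_json` is hypothetical and not part of `BatchProcessor`.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_fenced_json(raw):
    """Strip an optional Markdown code fence and parse the JSON inside."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, raw, flags=re.DOTALL)
    # Fall back to the whole string when the model did not add a fence.
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)


# The raw string from the cell above, reproduced by hand.
raw_result = (
    FENCE + "json\n"
    '{\n  "title": "The Name of the Wind",\n  "author": "Patrick Rothfuss"\n}\n'
    + FENCE
)

book = parse_fenced_json(raw_result)
print(book["title"], "|", book["author"])
```

Even after this cleanup the result is a plain `dict` with no type checking, which is the gap the Pydantic path closes.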


### Processing a Batch of Texts

For multiple texts, `process_batch` sends requests in parallel, significantly speeding up the workflow.

In [4]:
test_texts = [
    "To Kill a Mockingbird by Harper Lee (1960)",
    "The Great Gatsby Author: F. Scott Fitzgerald Publication Year: 1925",
    "Book: 1984 Writer: George Orwell Year: 1949",
    "Austen, Jane. Pride and Prejudice. 1813.",
    "One Hundred Years of Solitude - Gabriel García Márquez, published in 1967"
]

class BookWithYear(BaseModel):
    title: str
    author: str
    year: int

# Process the batch with a response model
results = await processor.process_batch(test_texts, response_model=BookWithYear)

print("--- Batch Results (Structured) ---")
pprint(results)

--- Batch Results (Structured) ---
[BookWithYear(title='To Kill a Mockingbird', author='Harper Lee', year=1960),
 BookWithYear(title='The Great Gatsby', author='F. Scott Fitzgerald', year=1925),
 BookWithYear(title='1984', author='George Orwell', year=1949),
 BookWithYear(title='Pride and Prejudice', author='Jane Austen', year=1813),
 BookWithYear(title='One Hundred Years of Solitude', author='Gabriel García Márquez', year=1967)]
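
The batch output above comes back in the same order as the input list. A common way to get both concurrency and stable ordering in Python is `asyncio.gather` combined with a semaphore. The sketch below shows that general pattern; whether `process_batch` is implemented this way is an assumption, and `fake_llm_call` is a stand-in for a real API request.

```python
import asyncio


async def fake_llm_call(text):
    """Stand-in for a real API request (hypothetical); longer texts take longer."""
    await asyncio.sleep(0.001 * len(text))
    return text.upper()


async def process_all(texts, max_concurrent=3):
    semaphore = asyncio.Semaphore(max_concurrent)  # at most N requests in flight

    async def bounded(text):
        async with semaphore:
            return await fake_llm_call(text)

    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*(bounded(t) for t in texts))


results = asyncio.run(process_all(["ccc", "a", "bb"]))
print(results)
```

The shortest input finishes first here, yet its result still lands in the second position. In a notebook, use `await process_all(...)` instead of `asyncio.run(...)`.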


### Processing a Pandas DataFrame

The `process_dataframe` method applies LLM processing to a DataFrame column. It accepts a Pydantic model through the `response_model` parameter. By default it returns a DataFrame with the parsed fields appended as new columns; setting `output_format='objects'` instead returns a list of objects that also carry the input DataFrame's columns as fields.

In [5]:
# Create a sample DataFrame
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'book_info': test_texts})

print("--- Original DataFrame ---")
display(df)

# Process the 'book_info' column and append results as new columns
results_df = await processor.process_dataframe(
    df, 
    prompt_column='book_info', 
    response_model=BookWithYear
)

print("\n--- Processed DataFrame ---")
display(results_df)

# You can also get a list of enriched objects instead of a DataFrame
object_list = await processor.process_dataframe(
    df, 
    prompt_column='book_info', 
    response_model=BookWithYear, 
    output_format='objects'
)

print("\n--- Output as Objects ---")
pprint(object_list[0])

--- Original DataFrame ---


| | id | book_info |
|---|---|---|
| 0 | 1 | To Kill a Mockingbird by Harper Lee (1960) |
| 1 | 2 | The Great Gatsby Author: F. Scott Fitzgerald P... |
| 2 | 3 | Book: 1984 Writer: George Orwell Year: 1949 |
| 3 | 4 | Austen, Jane. Pride and Prejudice. 1813. |
| 4 | 5 | One Hundred Years of Solitude - Gabriel García... |



--- Processed DataFrame ---


| | id | book_info | title | author | year |
|---|---|---|---|---|---|
| 0 | 1 | To Kill a Mockingbird by Harper Lee (1960) | To Kill a Mockingbird | Harper Lee | 1960 |
| 1 | 2 | The Great Gatsby Author: F. Scott Fitzgerald P... | The Great Gatsby | F. Scott Fitzgerald | 1925 |
| 2 | 3 | Book: 1984 Writer: George Orwell Year: 1949 | 1984 | George Orwell | 1949 |
| 3 | 4 | Austen, Jane. Pride and Prejudice. 1813. | Pride and Prejudice | Jane Austen | 1813 |
| 4 | 5 | One Hundred Years of Solitude - Gabriel García... | One Hundred Years of Solitude | Gabriel García Márquez | 1967 |



--- Output as Objects ---
EnrichedBookWithYear(title='To Kill a Mockingbird', author='Harper Lee', year=1960, id=1, book_info='To Kill a Mockingbird by Harper Lee (1960)')
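
The `EnrichedBookWithYear` class in this output is generated by the processor; how it does so is not shown in this post. One plausible approach is to derive a new type from the response model and add one field per DataFrame column. The sketch below illustrates that idea with standard-library dataclasses as a stand-in for Pydantic (with Pydantic, `create_model` with a `__base__` argument could play the same role). The `enrich` helper is a hypothetical name.

```python
from dataclasses import asdict, dataclass, make_dataclass


@dataclass
class BookWithYear:  # dataclass stand-in for the Pydantic model above
    title: str
    author: str
    year: int


def enrich(base_cls, row, parsed):
    """Build an 'Enriched<Base>' class with extra fields taken from a row."""
    extra_fields = [(name, type(value)) for name, value in row.items()]
    enriched_cls = make_dataclass(
        f"Enriched{base_cls.__name__}", extra_fields, bases=(base_cls,)
    )
    # Parsed LLM fields come first, followed by the original row values.
    return enriched_cls(**asdict(parsed), **row)


row = {"id": 1, "book_info": "To Kill a Mockingbird by Harper Lee (1960)"}
parsed = BookWithYear(title="To Kill a Mockingbird", author="Harper Lee", year=1960)

enriched = enrich(BookWithYear, row, parsed)
print(enriched)
```

Because the enriched class inherits from the response model, code written against `BookWithYear` keeps working on the enriched objects.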


### Without Using Pydantic

If no `response_model` is provided, the raw LLM output is added to a new `llm_output` column. You can change the name of the output column with the `output_column_name` parameter.

In [6]:

df['prompts'] = 'Give the title, author and published year of this book separated by a comma: ' + df['book_info']
df_default = await processor.process_dataframe(
    df, 
    prompt_column='prompts',
    output_column_name='formatted_book')

print("\n--- Output Without response_model ---")
display(df_default)


--- Output Without response_model ---


| | id | book_info | prompts | formatted_book |
|---|---|---|---|---|
| 0 | 1 | To Kill a Mockingbird by Harper Lee (1960) | Give the title, author and published year of this book separated by a comma: To Kill a Mockingbird by Harper Lee (1960) | To Kill a Mockingbird, Harper Lee, 1960 |
| 1 | 2 | The Great Gatsby Author: F. Scott Fitzgerald Publication Year: 1925 | Give the title, author and published year of this book separated by a comma: The Great Gatsby Author: F. Scott Fitzgerald Publication Year: 1925 | The Great Gatsby, F. Scott Fitzgerald, 1925 |
| 2 | 3 | Book: 1984 Writer: George Orwell Year: 1949 | Give the title, author and published year of this book separated by a comma: Book: 1984 Writer: George Orwell Year: 1949 | 1984, George Orwell, 1949 |
| 3 | 4 | Austen, Jane. Pride and Prejudice. 1813. | Give the title, author and published year of this book separated by a comma: Austen, Jane. Pride and Prejudice. 1813. | Pride and Prejudice, Jane Austen, 1813 |
| 4 | 5 | One Hundred Years of Solitude - Gabriel García Márquez, published in 1967 | Give the title, author and published year of this book separated by a comma: One Hundred Years of Solitude - Gabriel García Márquez, published in ... | One Hundred Years of Solitude, Gabriel García Márquez, 1967 |
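
Raw comma-separated strings like those in the `formatted_book` column still have to be parsed before they are useful. The sketch below shows one way to do that by hand; `parse_book_line` is a hypothetical helper, and the second input is an illustrative title chosen because it contains commas.

```python
def parse_book_line(line):
    """Parse 'title, author, year'; split from the right so commas in the title survive."""
    title, author, year = (part.strip() for part in line.rsplit(",", 2))
    return {"title": title, "author": author, "year": int(year)}


print(parse_book_line("1984, George Orwell, 1949"))
print(parse_book_line("The Lion, the Witch and the Wardrobe, C. S. Lewis, 1950"))
```

This raises `ValueError` as soon as the model deviates from the requested format, which is one reason the `response_model` route is usually more robust.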


## Advanced Use Case: Comparing LLM Outputs

The `BatchProcessor` makes it easy to compare the performance of different models on the same task. Here, we'll use it to solve math problems from the `gsm8k` benchmark using three different models and compare their answers side-by-side.

In [7]:
# 1. Load the gsm8k math benchmark from Hugging Face
gsm8k_df = pd.read_parquet("hf://datasets/openai/gsm8k/main/test-00000-of-00001.parquet")[:5]

# 2. Define a detailed prompt for the models
instructions = """
    You are a math problem solver. For each problem, show your work step-by-step and use the following format:
    1. Write each calculation step clearly.
    2. For any arithmetic operation, show it in double angle brackets: <<calculation=result>>.
    3. End your response with "#### [final_answer]" where final_answer is just the numerical result.
    
    Solve this problem:
"""

gsm8k_df['prompt'] = instructions + gsm8k_df['question']

# 3. Set up configs and processors for each model we want to compare
models_to_compare = {
    "gemini_flash": "google/gemini-2.5-flash-lite",
    "claude_sonnet": "anthropic/claude-3.5-sonnet",
    "qwen235b": "qwen/qwen3-235b-a22b-2507"
}

processors = {
    name: BatchProcessor(llm_config=LLMConfigs.openrouter(model=model_id))
    for name, model_id in models_to_compare.items()
}

# Set pandas options for better display
pd.set_option('display.max_colwidth', 150) # Adjust column width

print("--- Models to Compare ---")
print(list(models_to_compare.keys()))

print("\n--- Sample Prompt ---")
print(gsm8k_df['prompt'].iloc[0])

--- Models to Compare ---
['gemini_flash', 'claude_sonnet', 'qwen235b']

--- Sample Prompt ---

    You are a math problem solver. For each problem, show your work step-by-step and use the following format:
    1. Write each calculation step clearly.
    2. For any arithmetic operation, show it in double angle brackets: <<calculation=result>>.
    3. End your response with "#### [final_answer]" where final_answer is just the numerical result.

    Solve this problem:
Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?


Now, we'll run the DataFrame through each processor sequentially. Each run will add a new column containing that model's response.

In [8]:
results_df = gsm8k_df.copy()

# Use a distinct loop variable so the `processor` created earlier is not overwritten
for name, model_processor in processors.items():
    print(f"Processing with {name}...")
    results_df = await model_processor.process_dataframe(
        results_df,
        prompt_column='prompt',
        output_column_name=name
    )

# Display the final comparison
print("\n--- Model Comparison Results ---")
display(results_df[['question', 'answer'] + list(models_to_compare.keys())])

Processing with gemini_flash...
Processing with claude_sonnet...
Processing with qwen235b...

--- Model Comparison Results ---


Unnamed: 0,question,answer,gemini_flash,claude_sonnet,qwen235b
0,Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the...,Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18,Here's how to solve the problem step-by-step:\n\n1. **Calculate the total number of eggs eaten and used for baking.**\n * Eggs eaten for bre...,Let me solve this step by step:\n\n1. Start with total eggs per day:\n * Daily eggs = 16\n\n2. Subtract eggs Janet eats for breakfast:\n * <<1...,1. Janet’s ducks lay 16 eggs per day. \n2. She eats 3 eggs for breakfast: <<16 - 3 = 13>> \n3. She uses 4 eggs to bake muffins: <<13 - 4 = 9>> ...
1,A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?,It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3,Here's how to solve the problem step-by-step:\n\n1. **Determine the amount of blue fiber:**\n The problem states that a robe takes 2 bolts of ...,Let me solve this step by step:\n\n1. Blue fiber needed = 2 bolts\n2. White fiber needed = half of blue fiber\n * <<2 ÷ 2 = 1>> bolt of white fi...,"1. The robe takes 2 bolts of blue fiber. \n2. It takes half as much white fiber as blue fiber. \n <<2 / 2 = 1>> \n So, it takes 1 bolt of w..."
2,"Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by ...","The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80...",Here's how to solve this problem step-by-step:\n\n1. **Calculate the total cost of the house:**\n The cost of the house and the repairs are ad...,Let me solve this step by step:\n\n1. Calculate total investment:\n * Purchase price + Repairs\n * <<80000 + 50000 = 130000>>\n\n2. Original v...,"1. First, calculate the total cost of the house including repairs. \n Total cost = Purchase price + Repair cost \n <<80000 + 50000 = 130000>..."
3,James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?,He sprints 3*3=<<3*3=9>>9 times\nSo he runs 9*60=<<9*60=540>>540 meters\n#### 540,"1. Calculate the total distance James runs in one session.\n He runs 3 sprints per session, and each sprint is 60 meters.\n <<calculation=3...","Let me solve this step by step:\n\n1. First, find meters per session:\n * Each sprint is 60 meters\n * He does 3 sprints per session\n * <<6...","1. James runs 3 sprints each time he trains. \n2. Each sprint is 60 meters. \n3. So, the distance per training session is: \n<<3 * 60 = 180>> m..."
4,"Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy...","If each chicken eats 3 cups of feed per day, then for 20 chickens they would need 3*20=<<3*20=60>>60 cups of feed per day.\nIf she feeds the flock...",Here's how to solve this problem step-by-step:\n\n1. **Calculate the total feed needed per chicken:**\n Each chicken is fed 3 cups of feed per...,"Let me solve this step by step.\n\n1. First, let's calculate total daily feed needed\n * Number of chickens: 20\n * Cups per chicken: 3\n * ...","1. First, determine how many cups of feed each chicken gets per day. \n Each chicken gets 3 cups of feed per day. \n Wendi has 20 chickens. ..."


This side-by-side view is a useful starting point for evaluating the models and choosing one for your specific task.
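
To go from reading responses to scoring them, you can extract the number after the `####` marker that both the prompt and the `gsm8k` reference answers use. The following is a minimal sketch; `extract_final_answer` is a hypothetical helper, and it assumes the model followed the requested format.

```python
import re


def extract_final_answer(response):
    """Return the number following the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*\$?(-?[\d,]*\.?\d+)", response)
    if not matches:
        return None
    # Remove thousands separators before converting.
    return float(matches[-1].replace(",", ""))


# Reference answer for the first problem, copied from the table above.
reference = (
    "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\n"
    "She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n"
    "#### 18"
)

print(extract_final_answer(reference))
print(extract_final_answer("The answer is eighteen."))
```

Applied to the comparison DataFrame, `results_df[name].map(extract_final_answer)` can be compared against `results_df['answer'].map(extract_final_answer)` to get a per-model accuracy. Responses that return `None` indicate the model ignored the output format.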