## Introduction — Stop Copying Receipts by Hand

Manual data entry from receipts, invoices, and contracts wastes hours and introduces errors. What if you could automatically extract structured data from these documents in minutes?

In this article, you'll learn how to transform receipt images into structured data using LlamaIndex, then export the results to a spreadsheet for analysis.


## What You Will Learn

- Convert scanned receipts to structured data with LlamaParse and Pydantic models
- Validate extraction accuracy by comparing results against ground truth annotations
- Fix parsing errors by preprocessing low-quality images
- Export clean receipt data to spreadsheet format

## Introduction to LlamaIndex

[LlamaIndex](https://www.llamaindex.ai/) is a framework that connects LLMs with your data through three core capabilities:

1. **Data ingestion**: Built-in readers for PDFs, images, web pages, and databases that automatically parse content into processable nodes.
2. **Structured extraction**: LLM-powered conversion of unstructured text into Pydantic models with automatic validation.
3. **Retrieval and indexing**: Vector stores and semantic search that enable context-augmented queries over your documents.

It cuts down on boilerplate code for loading, parsing, and querying data, letting you focus on building LLM applications.

The table below compares LlamaIndex with two other popular frameworks for LLM applications:

| Framework | Purpose | Best For |
| --------- | ------- | -------- |
| **LlamaIndex** | Document ingestion and structured extraction | Converting unstructured documents into query-ready data |
| **LangChain** | LLM orchestration and tool integration | Building conversational agents with multiple LLM calls |
| **LangGraph** | Stateful workflow management | Coordinating long-running, multi-agent processes |

### Installation

Start by installing the packages this tutorial requires:

- `llama-index`: Core LlamaIndex framework with base indexing and retrieval functionality
- `llama-parse`: Document parsing service for PDFs, images, and complex layouts
- `llama-index-program-openai`: OpenAI integration for structured data extraction with Pydantic
- `python-dotenv`: Loads environment variables from `.env` files
- `rapidfuzz`: Fuzzy string matching library for comparing company names with minor variations

```bash
pip install llama-index llama-parse llama-index-program-openai python-dotenv rapidfuzz
```

### Environment Setup

Create a `.env` file to store your API keys:

```text
# .env
LLAMA_CLOUD_API_KEY="your-llama-parse-key"
OPENAI_API_KEY="your-openai-key"
```

Get your API keys from:
- **LlamaParse API**: [cloud.llamaindex.ai](https://cloud.llamaindex.ai)
- **OpenAI API**: [platform.openai.com/api-keys](https://platform.openai.com/api-keys)

Load the environment variables from the `.env` file with `load_dotenv`:

In [None]:
from dotenv import load_dotenv
import os

load_dotenv()
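
Before calling any API, it can help to confirm that both keys were actually loaded. The sketch below is one way to do that. `missing_keys` is a hypothetical helper, not part of `python-dotenv`, and the key names are the two from the `.env` file above.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(env: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return the required variable names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Hypothetical usage: check the two keys this tutorial relies on
required = ["LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY"]
absent = missing_keys(os.environ, required)
if absent:
    print(f"Missing environment variables: {', '.join(absent)}")
```

Failing early here gives a clearer message than a `KeyError` raised later when the parser is created.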

Configure the default LLM with `Settings`:

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.context_window = 8000

`Settings` stores global defaults so every query engine and program reuses the same LLM configuration. Setting the temperature to 0 makes the model's output more repeatable, which suits structured extraction.

## Basic Image Processing with LlamaParse

In this tutorial, we will use the [SROIE Dataset v2](https://www.kaggle.com/datasets/urbikn/sroie-datasetv2) from Kaggle. This dataset contains real-world receipt scans from the ICDAR 2019 competition.

You can download the dataset directly from Kaggle's website or use the Kaggle CLI: 

```bash
# Install the Kaggle CLI once
uv pip install kaggle

# Configure Kaggle credentials (run once per environment)
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key

# Create a workspace folder and download the full archive (~1 GB)
mkdir -p data
kaggle datasets download urbikn/sroie-datasetv2 -p data

# Extract everything and inspect a few image files
unzip -q -o data/sroie-datasetv2.zip -d data
find data/SROIE2019 -maxdepth 3 -type f -name "*.jpg" | head
```

This tutorial uses data from the `data/SROIE2019/train/` directory, which contains:

- `img`: Original receipt images
- `entities`: Ground truth annotations for validation

Load the first 10 receipts into a list of paths:

In [None]:
from pathlib import Path

receipt_dir = Path("data/SROIE2019/train/img")
num_receipts = 10
receipt_paths = sorted(receipt_dir.glob("*.jpg"))[:num_receipts]

In [None]:
receipt_paths

Take a look at the first receipt:

In [None]:
from IPython.display import Image

first_receipt_path = receipt_paths[0]
Image(filename=first_receipt_path)

Next, use `LlamaParse` to convert the first receipt into markdown.

In [None]:
from llama_parse import LlamaParse


# Parse receipts with LlamaParse
parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # Output format
    num_workers=4,  # Number of parallel workers for faster processing
    language="en",  # Language hint for OCR accuracy
    skip_diagonal_text=True,  # Ignore rotated or diagonal text
)
first_receipt = parser.load_data(first_receipt_path)[0]

Preview the markdown for the first receipt:

In [None]:
# Preview the first receipt
preview = "\n".join(first_receipt.text.splitlines()[:10])
print(preview)

LlamaParse successfully converts receipt images to text, but there is no structure: vendor names, dates, and totals are all mixed together in plain text. This format is not ideal for exporting to spreadsheets or analytics tools for further analysis.

The next section uses Pydantic models to extract structured fields like `company`, `total`, and `purchase_date` automatically.

## Structured Data Extraction with Pydantic

[Pydantic](https://docs.pydantic.dev/) is a Python library that uses type hints for data validation and automatic type conversion. By defining a receipt schema once, you can extract consistent structured data from receipts regardless of their format or layout.

Start by defining two Pydantic models that represent receipt structure:

In [None]:
from datetime import date
from typing import List, Optional
from pydantic import BaseModel, Field


class ReceiptItem(BaseModel):
    """Represents a single line item extracted from a receipt."""

    description: str = Field(description="Item name exactly as shown on the receipt")
    quantity: int = Field(default=1, ge=1, description="Integer quantity of the item")
    unit_price: Optional[float] = Field(
        default=None, ge=0, description="Price per unit in the receipt currency"
    )
    discount_amount: float = Field(
        default=0.0, ge=0, description="Discount applied to this line item"
    )


class Receipt(BaseModel):
    """Structured fields extracted from a retail receipt."""

    company: str = Field(description="Business or merchant name")
    purchase_date: Optional[date] = Field(
        default=None, description="Date in YYYY-MM-DD format"
    )
    address: Optional[str] = Field(default=None, description="Address of the business")
    total: float = Field(description="Final charged amount")
    items: List[ReceiptItem] = Field(default_factory=list)
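
To see what validation and type conversion mean in practice, here is a small sketch using Pydantic v2. `MiniReceipt` is a trimmed-down, illustrative stand-in for the `Receipt` model, the sample values are made up, and the `ge=0` constraint on `total` is an assumption added only for this demonstration.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class MiniReceipt(BaseModel):
    """Illustrative stand-in for the Receipt model (not the tutorial's schema)."""

    company: str
    purchase_date: Optional[date] = None
    total: float = Field(ge=0)  # assumption: totals are never negative


# Strings are coerced to the annotated types during validation
receipt = MiniReceipt.model_validate(
    {"company": "ACME STORE", "purchase_date": "2018-12-25", "total": "9.00"}
)
print(receipt.purchase_date, receipt.total)

# A value that violates a constraint raises ValidationError
try:
    MiniReceipt.model_validate({"company": "ACME STORE", "total": -1})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The same mechanism applies to the LLM's output: values that cannot be coerced into the schema fail validation instead of silently ending up in your data.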

Create an `OpenAIPydanticProgram` that instructs the LLM to extract data according to our `Receipt` model:

In [None]:
from llama_index.program.openai import OpenAIPydanticProgram

prompt = """
You are extracting structured data from a receipt.
Use the provided text to populate the Receipt model.
Interpret every receipt date as day-first.
If a field is missing, return null.

{context_str}
"""

receipt_program = OpenAIPydanticProgram.from_defaults(
    output_cls=Receipt,
    llm=Settings.llm,
    prompt_template_str=prompt,
)

Process the first parsed document to make sure everything works before scaling to the full batch:

In [None]:
# Process the first receipt
structured_first_receipt = receipt_program(context_str=first_receipt.text)

# Print the receipt as a JSON string for better readability
print(structured_first_receipt.model_dump_json(indent=2))

LlamaIndex populates the Pydantic schema with extracted values:

- `company` - Vendor name from the receipt header
- `purchase_date` - Parsed date (2018-12-25)
- `total` - Final amount (9.0)
- `items` - Line items with description, quantity, and price

Now that the extraction works, let's scale it to process all receipts in a batch. The function uses each receipt's filename without the extension as a unique identifier:

In [None]:
def extract_documents(paths: List[str], prompt: str, id_column: str = "receipt_id") -> List[dict]:
    """Extract structured data from documents using LlamaParse and LLM."""
    results: List[dict] = []

    # Initialize parser with OCR settings
    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=4,
        language="en",
        skip_diagonal_text=True,
    )

    # Convert images to markdown text
    documents = parser.load_data(paths)

    # Create structured extraction program
    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    # Extract structured data from each document
    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append(
            {
                id_column: document_id,
                "data": parsed_document,
            }
        )
    return results

# Extract structured data from all receipts
structured_receipts = extract_documents(receipt_paths, prompt)
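
A single failed extraction inside the loop would abort the whole batch. One possible safeguard is to record failures per document and keep going. The sketch below shows the idea with a hypothetical `extract_each` helper and a stand-in extraction function in place of the real LLM program, so it runs without any API key.

```python
from typing import Any, Callable, Dict, List, Tuple


def extract_each(
    texts: Dict[str, str], extract_fn: Callable[[str], Any]
) -> Tuple[List[dict], Dict[str, str]]:
    """Run extract_fn on every document, collecting failures instead of raising."""
    results: List[dict] = []
    failures: Dict[str, str] = {}
    for doc_id, text in texts.items():
        try:
            results.append({"receipt_id": doc_id, "data": extract_fn(text)})
        except Exception as exc:  # keep the batch going, remember what failed
            failures[doc_id] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(text: str) -> dict:
    """Stand-in for the LLM program: fails on empty text."""
    if not text.strip():
        raise ValueError("empty document")
    return {"length": len(text)}


results, failures = extract_each({"r1": "TOTAL 9.00", "r2": "  "}, fake_extract)
print(results)
print(failures)
```

In the real pipeline, `extract_fn` would wrap the call to the Pydantic program, and the `failures` dictionary tells you which receipts to retry or inspect by hand.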

Convert the extracted receipts into a DataFrame for easier inspection:

In [None]:
import pandas as pd


def transform_receipt_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Apply standard transformations to receipt DataFrame columns."""
    df = df.copy()
    df["company"] = df["company"].str.upper()
    df["total"] = pd.to_numeric(df["total"], errors="coerce")
    df["purchase_date"] = pd.to_datetime(
        df["purchase_date"], errors="coerce", dayfirst=True
    ).dt.date
    return df


def create_extracted_df(records: List[dict], id_column: str = "receipt_id") -> pd.DataFrame:
    df = pd.DataFrame(
        [
            {
                id_column: record[id_column],
                "company": record["data"].company,
                "total": record["data"].total,
                "purchase_date": record["data"].purchase_date,
            }
            for record in records
        ]
    )
    return transform_receipt_columns(df)


extracted_df = create_extracted_df(structured_receipts)
extracted_df

Most receipts are extracted correctly, but receipt X51005200938 shows issues:

- The company name is incomplete ("TH MNAN")
- Total is 0 instead of the actual amount
- Date (2023-10-11) appears incorrect

## Compare Extraction with Ground Truth

To verify the extraction accuracy, load the ground-truth annotations from `data/SROIE2019/train/entities`. Each annotation is a JSON document stored with a `.txt` extension:


In [None]:
def normalize_date(value: str) -> str:
    """Normalize date strings to consistent format."""
    value = (value or "").strip()
    if not value:
        return value
    # Convert hyphens to slashes
    value = value.replace("-", "/")
    parts = value.split("/")
    # Convert 2-digit years to 4-digit (e.g., 18 -> 2018)
    if len(parts[-1]) == 2:
        parts[-1] = f"20{parts[-1]}"
    return "/".join(parts)


def create_ground_truth_df(
    label_paths: List[str], id_column: str = "receipt_id"
) -> pd.DataFrame:
    """Create ground truth DataFrame from label JSON files."""
    records = []
    # Load each JSON file and extract key fields
    for path in label_paths:
        payload = pd.read_json(Path(path), typ="series").to_dict()
        records.append(
            {
                id_column: Path(path).stem,
                "company": payload.get("company"),
                "total": payload.get("total"),
                "purchase_date": normalize_date(payload.get("date")),
            }
        )

    df = pd.DataFrame(records)
    # Apply same transformations as extracted data
    return transform_receipt_columns(df)


# Load ground truth annotations
label_dir = Path("data/SROIE2019/train/entities")
label_paths = sorted(label_dir.glob("*.txt"))[:num_receipts]

ground_truth_df = create_ground_truth_df(label_paths)
ground_truth_df
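
The date handling above relies on pandas to interpret day-first strings. As a cross-check of what that should produce, here is a sketch that uses only the standard library. `parse_day_first` is a hypothetical helper, and the accepted formats are assumptions about how dates appear in the annotations.

```python
from datetime import date, datetime
from typing import Optional


def parse_day_first(value: str) -> Optional[date]:
    """Parse day-first dates with 4- or 2-digit years; return None if unparseable."""
    cleaned = (value or "").strip().replace("-", "/")
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


print(parse_day_first("25/12/2018"))
print(parse_day_first("25-12-18"))
print(parse_day_first("not a date"))
```

One difference to keep in mind: `%y` maps two-digit years 69 to 99 onto the 1900s, whereas `normalize_date` always prefixes `20`.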

Let's validate extraction accuracy by comparing results against ground truth.

Company names often have minor variations (spacing, punctuation, extra characters), so we'll use [fuzzy matching](https://codecut.ai/text-similarity-fuzzy-matching-guide/) to tolerate these formatting differences.

In [None]:
from rapidfuzz import fuzz


def fuzzy_match_score(text1: str, text2: str) -> float:
    """Calculate a fuzzy match score between two strings (0 to 100)."""
    return fuzz.token_set_ratio(str(text1), str(text2))

Test the fuzzy matching with sample company names:

In [None]:
# Nearly identical strings score high
print(f"Score: {fuzzy_match_score('BOOK TA K SDN BHD', 'BOOK TA .K SDN BHD'):.2f}")

# Different punctuation still matches well
print(f"Score: {fuzzy_match_score('MR D.I.Y. JOHOR', 'MR DIY JOHOR'):.2f}")

# Completely different strings score low
print(f"Score: {fuzzy_match_score('ABC TRADING', 'XYZ COMPANY'):.2f}")

Now build a comparison function that merges extracted and ground truth data, then applies fuzzy matching for company names and exact matching for totals and purchase dates:

In [None]:
def compare_receipts(
    extracted_df: pd.DataFrame,
    ground_truth_df: pd.DataFrame,
    id_column: str,
    fuzzy_match_cols: List[str],
    exact_match_cols: List[str],
    fuzzy_threshold: int = 80,
) -> pd.DataFrame:
    """Compare extracted and ground truth data with explicit column specifications."""
    comparison_df = extracted_df.merge(
        ground_truth_df,
        on=id_column,
        how="inner",
        suffixes=("_extracted", "_truth"),
    )

    # Fuzzy matching
    for col in fuzzy_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_score"] = comparison_df.apply(
            lambda row: fuzzy_match_score(row[extracted_col], row[truth_col]),
            axis=1,
        )
        comparison_df[f"{col}_match"] = comparison_df[f"{col}_score"] >= fuzzy_threshold

    # Exact matching
    for col in exact_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_match"] = (
            comparison_df[extracted_col] == comparison_df[truth_col]
        )

    return comparison_df


comparison_df = compare_receipts(
    extracted_df,
    ground_truth_df,
    id_column="receipt_id",
    fuzzy_match_cols=["company"],
    exact_match_cols=["total", "purchase_date"],
)
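
Exact equality on floating-point totals can be brittle, because two amounts that print the same may differ in their last binary digits. A possible alternative, sketched here with hypothetical helpers `to_cents` and `totals_match` and illustrative amounts, is to compare whole cents and optionally allow a small tolerance.

```python
def to_cents(amount: float) -> int:
    """Convert a currency amount to whole cents to avoid float comparison issues."""
    return round(amount * 100)


def totals_match(extracted: float, truth: float, tolerance_cents: int = 0) -> bool:
    """Compare two totals in cents, allowing an optional tolerance."""
    return abs(to_cents(extracted) - to_cents(truth)) <= tolerance_cents


print(totals_match(0.1 + 0.2, 0.3))
print(totals_match(112.46, 112.45))
print(totals_match(112.46, 112.45, tolerance_cents=1))
```

A tolerance of zero keeps the check strict while still ignoring float noise. A tolerance of one cent accepts small OCR misreads, at the cost of hiding real one-cent errors.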

Inspect any rows where the company, total, or purchase-date checks fail:

In [None]:
def get_mismatch_rows(comparison_df: pd.DataFrame) -> pd.DataFrame:
    """Get mismatched rows, excluding match indicator columns."""
    # Separate the match indicator columns from the data columns
    match_columns = [col for col in comparison_df.columns if col.endswith("_match")]
    data_columns = sorted(
        col
        for col in comparison_df.columns
        if col.endswith(("_extracted", "_truth"))
    )

    # Keep rows where at least one match indicator is False
    has_mismatch = ~comparison_df[match_columns].all(axis=1)

    return comparison_df.loc[has_mismatch, data_columns]


mismatch_df = get_mismatch_rows(comparison_df)


mismatch_df
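
Besides listing the mismatched rows, a per-field accuracy figure gives a quick summary of how the extraction is doing. The sketch below uses a small made-up DataFrame in place of `comparison_df`. The column names follow the `_match` convention from `compare_receipts`, but the values are purely illustrative.

```python
import pandas as pd

# Illustrative comparison results; real values come from compare_receipts
demo_comparison = pd.DataFrame(
    {
        "receipt_id": ["r1", "r2", "r3", "r4"],
        "company_match": [True, True, True, False],
        "total_match": [True, True, False, False],
        "purchase_date_match": [True, True, True, True],
    }
)

demo_match_columns = [
    col for col in demo_comparison.columns if col.endswith("_match")
]

# The mean of a boolean column is the share of True values
field_accuracy = demo_comparison[demo_match_columns].mean()
print(field_accuracy)
```

Tracking these numbers before and after a preprocessing change makes it easier to tell whether the change helped.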

This confirms the earlier observation: every receipt matches the ground truth annotations except X51005200938, which differs in all three fields:

- Company name
- Total
- Purchase date

Let's take a closer look at this receipt to see if we can identify the issue.

In [None]:
import IPython.display as display

file_to_inspect = receipt_dir / "X51005200938.jpg"

display.Image(filename=file_to_inspect)

This receipt appears smaller than the others in the dataset, which may affect OCR readability. In the next section, we will scale up the receipt to improve the extraction.


## Process the Images for Better Extraction

Create a function to scale up the receipt:

In [None]:
from PIL import Image


def scale_image(image_path: Path, output_dir: Path, scale_factor: int = 3) -> Path:
    """Scale up an image using high-quality resampling.

    Args:
        image_path: Path to the original image
        output_dir: Directory to save the scaled image
        scale_factor: Factor to scale up the image (default: 3x)

    Returns:
        Path to the scaled image
    """
    # Load the image and scale it up using high-quality resampling
    with Image.open(image_path) as img:
        new_size = (img.width * scale_factor, img.height * scale_factor)
        img_resized = img.resize(new_size, Image.Resampling.LANCZOS)

    # Save to output directory with same filename
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / image_path.name
    img_resized.save(output_path, quality=95)

    return output_path

Apply the function to the problematic receipt:


In [None]:
problematic_receipt_path = receipt_dir / "X51005200938.jpg"
adjusted_receipt_dir = Path("data/SROIE2019/train/img_adjusted")

scaled_image_path = scale_image(problematic_receipt_path, adjusted_receipt_dir, scale_factor=3)
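
The call above uses a fixed factor of 3. To choose a factor per image, one could derive it from the image width instead. The sketch below assumes a target width of 1,000 pixels, which is an illustrative guess and not a documented LlamaParse requirement. `choose_scale_factor` is a hypothetical helper.

```python
import math


def choose_scale_factor(width: int, target_width: int = 1000, max_factor: int = 4) -> int:
    """Return the integer factor needed to reach target_width, capped at max_factor."""
    if width <= 0:
        raise ValueError("width must be positive")
    factor = math.ceil(target_width / width)
    # Never shrink (factor >= 1) and cap the factor to bound file size
    return max(1, min(factor, max_factor))


print(choose_scale_factor(350))
print(choose_scale_factor(1200))
print(choose_scale_factor(100))
```

With Pillow, the width is available as `Image.open(path).width`, so the result can be passed as `scale_factor` to `scale_image`.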

Let's extract the structured data from the scaled image:

In [None]:
problematic_structured_receipts = extract_documents([scaled_image_path], prompt)
problematic_extracted_df = create_extracted_df(problematic_structured_receipts)

problematic_extracted_df

Nice! Scaling fixes the extraction: the company name and purchase date are now correct. The extracted total is 112.46 against an annotated 112.45, which is acceptable because the amount printed on the receipt is easy to read as 112.46.

## Export Clean Data to CSV or Excel

Build the final image set. The problematic receipt is already scaled, so copy the remaining images unchanged into the same directory:

In [None]:
import shutil

clean_receipt_paths = [scaled_image_path]
# Copy all receipts except the already processed one
for receipt_path in receipt_paths:
    if receipt_path != problematic_receipt_path:  # Skip the already scaled image
        output_path = adjusted_receipt_dir / receipt_path.name
        shutil.copy2(receipt_path, output_path)
        clean_receipt_paths.append(output_path)
        print(f"Copied {receipt_path.name}")

Let's run the pipeline again with the processed images:

In [None]:
clean_structured_receipts = extract_documents(clean_receipt_paths, prompt)
clean_extracted_df = create_extracted_df(clean_structured_receipts)
clean_extracted_df

Awesome! The extracted fields now agree with the ground truth annotations, apart from the one-cent difference in the total for X51005200938 noted above.

Now we can export the dataset to a CSV file, which any spreadsheet application can open, with just a few lines of code:

In [None]:
import pandas as pd

# Export to CSV
output_path = Path("reports/receipts.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
clean_extracted_df.to_csv(output_path, index=False)
print(f"Exported {len(clean_extracted_df)} receipts to {output_path}")

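The exported data can now be imported into spreadsheet applications, analytics tools, or business intelligence platforms. For a native Excel file, `DataFrame.to_excel` works the same way but needs an Excel engine such as `openpyxl` installed.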

## Try It Yourself

The concepts from this tutorial are available as a reusable pipeline in [this GitHub repository](https://github.com/CodeCutTech/Data-science/tree/master/llm/smart_data_extraction_llamaindex). The code includes:

- **Generic pipeline** ([`document_extraction_pipeline.py`](https://github.com/CodeCutTech/Data-science/blob/master/llm/smart_data_extraction_llamaindex/document_extraction_pipeline.py)): Reusable extraction function that works with any Pydantic schema
- **Receipt pipeline** ([`extract_receipts_pipeline.py`](https://github.com/CodeCutTech/Data-science/blob/master/llm/smart_data_extraction_llamaindex/extract_receipts_pipeline.py)): Complete example with Receipt schema, image scaling, and data transformations

Run the receipt extraction example:

```bash
uv run extract_receipts_pipeline.py
```

Or create your own extractor by importing `extract_structured_data()` and providing your custom Pydantic schema, extraction prompt, and optional preprocessing functions.