# LlamaExtract Evaluation

**Introduction**:

This notebook evaluates the cloud-based LlamaExtract document extraction service on the same W2 dataset. Overall accuracy was 71% (70.75% before rounding), well below the 97% accuracy of the fine-tuned Qwen-2.5-VL model.

Steps:
  - Bootstrap the environment
  - Load and prepare test data
  - Create an extraction agent
  - Queue extraction jobs
  - Monitor job status and collect results
  - Evaluate and compare extracted data with ground truth
  - Save and report the results

See the README for a detailed discussion of project setup steps, background, and measured performance. The full results are available in the `results` directory.




# Bootstrap Environment

In [None]:
# Set working directory
import os
os.environ["APP_PROJECT_DIR"] = "/content/ai-document-extraction"  # override with project directory
os.chdir(os.environ["APP_PROJECT_DIR"])

# Install packages and bootstrap environment
%pip install -q python-dotenv
from src.utils.env_setup import setup_environment
env = setup_environment()
%pip install -q -r requirements-{env}.txt

Note: you may need to restart the kernel to use updated packages.
Loaded application properties from: /Users/admin/workspace/ai-document-extraction/.env.local
Working directory: /Users/admin/workspace/ai-document-extraction
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from collections import OrderedDict

import pandas as pd
from llama_cloud_services import LlamaExtract
from pydantic import BaseModel, Field

from src.model import evaluator, reporting
from src.utils import data_loader, w2_dataset


# file and directory paths
base_dir = os.environ["APP_PROJECT_DIR"]
datasets_dir = os.environ["APP_DATA_DIR"]
output_dir = os.environ["APP_OUTPUT_DIR"]
dataset_w2s_dir = f"{datasets_dir}/w2s"
dataset_raw_dir = f"{dataset_w2s_dir}/raw"
dataset_raw_pdfs_dir = f"{dataset_raw_dir}/pdfs"
dataset_processed_dir = f"{dataset_w2s_dir}/processed"
dataset_processed_final_dir = f"{dataset_processed_dir}/final"
output_results_dir = f"{output_dir}/llama_extract"
output_results_file = f"{output_results_dir}/results.csv"
output_report_file = f"{output_results_dir}/results_report.txt"
output_report_ADP1_file = f"{output_results_dir}/results_report_ADP1.txt"
output_report_ADP2_file = f"{output_results_dir}/results_report_ADP2.txt"
output_report_IRS1_file = f"{output_results_dir}/results_report_IRS1.txt"
output_report_IRS2_file = f"{output_results_dir}/results_report_IRS2.txt"

Create helper functions for creating an extraction agent and processing LlamaExtract responses.

In [None]:
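def flatten_dict(record):
    # Expand the nested "states" list into numbered top-level keys,
    # e.g. state_1, state_1_wages_and_tips, state_2, ...
    flattened = {}
    for key, value in record.items():
        if key == "states":
            # number each state by its position in the list (1-based)
            for i, state in enumerate(value):
                for state_key in state.keys():
                    skey = state_key.replace("state", "state_" + str(i + 1))
                    flattened[skey] = state[state_key]
        else:
            flattened[key] = value
    return flattened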


def normalize_keys(record):
    # standardize key format: lower case, underscores instead of spaces
    d_tmp = {}
    for key in record.keys():
        new_key = key.lower().replace(" ", "_")
        d_tmp[new_key] = record[key]
    # return an ordered dict sorted by key
    return OrderedDict(sorted(d_tmp.items()))


def create_agent():

    extractor = LlamaExtract(api_key=os.getenv("APP_LI_TOKEN"))

    class State(BaseModel):
        state: str = Field(description="State (Box 15)")
        state_wages_and_tips: str = Field(description="State Wages and Tips (Box 16)")
        state_income_tax_withheld: str = Field(
            description="State Income Tax Withheld (Box 17)"
        )

    class W2(BaseModel):
        employee_name: str = Field(description="Employee Name (Box e)")
        employer_name: str = Field(description="Employer Name (Box c)")
        wages_and_tips: str = Field(description="Wages and Tips (Box 1)")
        federal_income_tax_withheld: str = Field(
            description="Federal Income Tax Withheld (Box 2)"
        )
        social_security_wages: str = Field(description="Social Security Wages (Box 3)")
        medicare_wages_and_tips: str = Field(
            description="Medicare Wages and Tips (Box 5)"
        )
        states: list[State] = Field(description="One or more states")

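    # Delete any existing agents in the project, then create a fresh
    # extraction agent (note: this removes ALL agents, not only W2 ones)
    agents = extractor.list_agents()
    if agents:
        for agent in agents:
            extractor.delete_agent(agent.id)
    agent = extractor.create_agent(name="w2-extractor", data_schema=W2)
    return agent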
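
To make the post-processing concrete, the sketch below applies a standalone version of the two helper steps to a made-up response. The sample record, its values, and the names `flatten_states` and `sort_keys` are hypothetical; the sketch assumes each state should be numbered by its position in the `states` list.

```python
from collections import OrderedDict

# Hypothetical response shaped like the W2 schema above (values are made up).
sample = {
    "employee_name": "Jane Doe",
    "wages_and_tips": "50000.00",
    "states": [
        {
            "state": "CA",
            "state_wages_and_tips": "50000.00",
            "state_income_tax_withheld": "2500.00",
        },
        {
            "state": "NY",
            "state_wages_and_tips": "1000.00",
            "state_income_tax_withheld": "50.00",
        },
    ],
}


def flatten_states(record):
    """Copy top-level fields and expand each state entry into numbered keys."""
    flat = {}
    for key, value in record.items():
        if key != "states":
            flat[key] = value
            continue
        for n, state in enumerate(value, start=1):
            for state_key, state_value in state.items():
                # "state" -> "state_1", "state_wages_and_tips" -> "state_1_wages_and_tips"
                flat[state_key.replace("state", f"state_{n}", 1)] = state_value
    return flat


def sort_keys(record):
    """Lower-case keys, replace spaces with underscores, and sort by key."""
    items = ((k.lower().replace(" ", "_"), v) for k, v in record.items())
    return OrderedDict(sorted(items, key=lambda kv: kv[0]))


flat = sort_keys(flatten_states(sample))
for key, value in flat.items():
    print(f"{key}: {value}")
```

The flattened record has one top-level key per state field (`state_1`, `state_1_wages_and_tips`, `state_2`, and so on), which is the shape a field-by-field comparison against a flat ground truth needs.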

# Load and prepare data

In [24]:
# Load test data
metadata = data_loader.get_metadata(
    f"{dataset_processed_final_dir}/test/metadata.jsonl",
    f"{dataset_processed_final_dir}/test",
)

# collect ground truths, pdf paths, and form types
gt_results = []
pdf_paths = []
form_types = []
for idx, data in enumerate(metadata):

    image_path = data[0]
    ground_truth_json = data[1]

    # get ground truth
    norm_gt = normalize_keys(ground_truth_json)

    gt_results.append(norm_gt)

    # get pdf path (we don't need images)
    file_name = os.path.basename(image_path)
    base, _ = os.path.splitext(file_name)
    pdf_path = f"{dataset_raw_pdfs_dir}/{base}.pdf"
    pdf_paths.append(pdf_path)

    # get form type
    form_type = w2_dataset.get_w2_form_type(file_name)
    form_types.append(form_type)
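
As a small illustration of the path handling in the loop above, the sketch below maps an image path to the matching PDF path. The function name `image_to_pdf_path` and the example paths are hypothetical.

```python
import os


def image_to_pdf_path(image_path, pdf_dir):
    """Swap the image's directory and extension for the PDF directory and .pdf."""
    base, _ = os.path.splitext(os.path.basename(image_path))
    return f"{pdf_dir}/{base}.pdf"


# Hypothetical paths, for illustration only.
pdf_path_example = image_to_pdf_path(
    "/data/w2s/processed/final/test/W2_sample_001.jpg",
    "/data/w2s/raw/pdfs",
)
print(pdf_path_example)  # /data/w2s/raw/pdfs/W2_sample_001.pdf
```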

# Create agent and queue extraction jobs

In [5]:
# create agent
agent = create_agent()

# queue jobs
jobs = await agent.queue_extraction(pdf_paths)

No project_id provided, fetching default project.


Uploading files: 100%|██████████| 100/100 [00:10<00:00,  9.15it/s]
Creating extraction jobs: 100%|██████████| 100/100 [00:10<00:00,  9.27it/s]


The jobs run asynchronously, so re-run the following cell until it reports no pending jobs.

In [10]:
num_jobs = len(jobs)
completed = 0
failed = 0
pending = num_jobs

for job in jobs:

    status = agent.get_extraction_job(job_id=job.id).status

    if status == "SUCCESS":
        completed += 1
        pending -= 1
    elif status == "ERROR":
        failed += 1
        pending -= 1

print(f"Jobs completed: {completed}, failed: {failed}, pending: {pending}")

Jobs completed: 100, failed: 0, pending: 0


Post-process the LlamaExtract responses to match the format of the ground truth.

In [19]:
results = []
for job in jobs:
    extract_run = agent.get_extraction_run_for_job(job.id)
    if extract_run.status == "SUCCESS":
        results.append(extract_run.data)
    else:
        print(f"Extraction status for job {job.id}: {extract_run.status}")

pred_results = []
for result in results:
    flattened = flatten_dict(result)
    norm_result = normalize_keys(flattened)
    pred_results.append(norm_result)
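
Because the loop above skips failed jobs, `results` can end up shorter than `gt_results`, and a later positional `zip` would then pair predictions with the wrong ground truth. All 100 jobs succeeded in this run, so the reported results are unaffected. The sketch below shows one way to keep the two lists aligned by reserving a slot for every job. `align_predictions` and the sample data are hypothetical, and whether `evaluator.compare_fields` accepts an empty prediction is an assumption not confirmed by this notebook.

```python
def align_predictions(runs):
    """Return one prediction per job, using an empty dict where extraction failed.

    `runs` is a list of (status, data) pairs in the same order as the jobs.
    """
    aligned = []
    for status, data in runs:
        if status == "SUCCESS" and data is not None:
            aligned.append(data)
        else:
            aligned.append({})  # placeholder keeps list positions in sync
    return aligned


# Hypothetical runs: the second job failed.
runs = [
    ("SUCCESS", {"employee_name": "Jane Doe"}),
    ("ERROR", None),
    ("SUCCESS", {"employee_name": "John Roe"}),
]
ground_truth = [
    {"employee_name": "Jane Doe"},
    {"employee_name": "Ann Lee"},
    {"employee_name": "John Roe"},
]

preds = align_predictions(runs)
pairs = list(zip(ground_truth, preds))
for gt, pred in pairs:
    print(gt["employee_name"], "<->", pred.get("employee_name"))
```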

Compare the predictions against the ground truth and save the comparison to `results.csv`.

In [25]:
all_rows = []
for i, (norm_gt, norm_pred) in enumerate(zip(gt_results, pred_results)):

    # compare fields
    rows = evaluator.compare_fields(norm_gt, norm_pred, i)

    # add form type
    for row in rows:
        row.append(form_types[i])

    # collect results
    all_rows.extend(rows)

# create dataframe
df = pd.DataFrame(
    all_rows,
    columns=[
        "Comparison ID",
        "Field",
        "Predicted Value",
        "Ground Truth Value",
        "Match",
        "Form Type",
    ],
)

# Save comparison results to CSV
os.makedirs(output_results_dir, exist_ok=True)
df.to_csv(output_results_file, index=False)
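
The per-field figures in the report come from `reporting.output_results`, whose implementation is not shown in this notebook. As a rough sketch of how such figures could be derived from the `Field` and `Match` columns, the standard-library example below computes per-field accuracy and each field's share of all mismatches, which is one plausible reading of the report's `mismatch_percentage` column. `summarize_by_field` and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarize_by_field(rows):
    """Summarize (field, is_match) pairs into per-field accuracy figures."""
    counts = defaultdict(lambda: {"total": 0, "matches": 0})
    for field, is_match in rows:
        counts[field]["total"] += 1
        counts[field]["matches"] += int(bool(is_match))

    total_mismatches = sum(c["total"] - c["matches"] for c in counts.values())
    summary = {}
    for field, c in counts.items():
        mismatches = c["total"] - c["matches"]
        summary[field] = {
            "accuracy": c["matches"] / c["total"],
            # share of all mismatches attributable to this field, in percent
            "mismatch_share_pct": (
                100 * mismatches / total_mismatches if total_mismatches else 0.0
            ),
        }
    return summary


# Hypothetical comparison rows: (field, match flag).
rows = [
    ("employee_name", True),
    ("employee_name", False),
    ("state_1", False),
    ("state_1", False),
    ("wages_and_tips", True),
]

summary = summarize_by_field(rows)
overall = sum(bool(m) for _, m in rows) / len(rows)
for field, stats in summary.items():
    print(field, stats)
print(f"overall accuracy: {overall:.2%}")
```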

Generate detailed reports of the results.

In [26]:
# Read from persisted CSV
df = pd.read_csv(output_results_file)

# output main report - to file and std out
reporting.output_results(df, output_report_file)

# output report for each form type - to file only
report_files = [
    ("ADP1", output_report_ADP1_file),
    ("ADP2", output_report_ADP2_file),
    ("IRS1", output_report_IRS1_file),
    ("IRS2", output_report_IRS2_file),
]
for form_type, report_file_path in report_files:
    reporting.output_results_by_form_type(df, report_file_path, form_type)

**Overall Accuracy**: 70.75%

**Field Summary**:
| Field                       |   total_comparisons |   matches |   mismatches |   accuracy |   mismatch_percentage |
|:----------------------------|--------------------:|----------:|-------------:|-----------:|----------------------:|
| employee_name               |                 100 |        92 |            8 |       0.92 |              2.2792   |
| employer_name               |                 100 |        83 |           17 |       0.83 |              4.8433   |
| federal_income_tax_withheld |                 100 |        99 |            1 |       0.99 |              0.2849   |
| medicare_wages_and_tips     |                 100 |        91 |            9 |       0.91 |              2.5641   |
| social_security_wages       |                 100 |        92 |            8 |       0.92 |              2.2792   |
| state_1                     |                 100 |        45 |           55 |       0.45 |             15.6695   |
| state