## Setup

First, install the SDK and set up your API key. You can get an API key from [app.docudevs.ai](https://app.docudevs.ai).

In [None]:
# Install the SDK if needed
# %pip install docu-devs-api-client pydantic

In [None]:
import os
from docudevs import DocuDevsClient

# Get your API key from environment or set it directly
API_KEY = os.getenv("DOCUDEVS_API_KEY", "your-api-key-here")

client = DocuDevsClient(token=API_KEY)
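
With the placeholder fallback above, a missing environment variable would presumably only show up later as a failed request. A minimal sketch of failing fast instead is shown here; `require_api_key` is a hypothetical helper for illustration, not part of the SDK.

```python
import os


def require_api_key(env=None, name="DOCUDEVS_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or blank.

    Hypothetical helper: `env` defaults to os.environ and exists only so the
    function can be exercised with a plain dict.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this notebook.")
    return key
```

You could then create the client with `DocuDevsClient(token=require_api_key())`.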

## Simple Prompt-Based Extraction

The simplest way to extract data is to describe what you want in plain English.
Let's extract a few basic fields from an invoice.

In [None]:
# Load a sample invoice (replace with your own file)
with open("docs/invoice.pdf", "rb") as f:
    invoice_bytes = f.read()

print(f"Loaded invoice: {len(invoice_bytes)} bytes")
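
The cells in this notebook pass `document_mime_type` by hand, so when you swap in your own file the type needs to match it. As a rough sketch, the standard library can guess the type from the file extension; `load_document` is a hypothetical helper, and which MIME types the service accepts is not covered here.

```python
import mimetypes
from pathlib import Path


def load_document(path):
    """Read a file and guess its MIME type from the file extension.

    Hypothetical helper: raises ValueError when the extension is unknown,
    so the caller can supply the type explicitly instead.
    """
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        raise ValueError(f"Cannot guess MIME type for {path.name!r}; pass it explicitly.")
    return path.read_bytes(), mime_type
```

For example, `invoice_bytes, mime_type = load_document("docs/invoice.pdf")` would give you both values to pass to the submit call.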

In [None]:
# Submit the document with a simple prompt
job_id = await client.submit_and_process_document(
    document=invoice_bytes,
    document_mime_type="application/pdf",
    prompt="Extract the invoice number, date, total amount, and vendor name."
)

print(f"Job submitted: {job_id}")

In [None]:
# Wait for the result
result = await client.wait_until_ready(job_id, result_format="json")
print(result)
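
The cells above use top-level `await`, which works in Jupyter but not in a plain Python script. Below is a minimal sketch of the same two calls wrapped in `asyncio.run`. `StubClient` is a hypothetical stand-in that mirrors only the two method names used in this notebook and returns canned data, so the sketch runs without network access; `extract` and `run_extraction` are illustrative names as well.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for DocuDevsClient; returns canned data, no network."""

    async def submit_and_process_document(self, document, document_mime_type, prompt):
        return "job-123"

    async def wait_until_ready(self, job_id, result_format="json"):
        return {"job_id": job_id, "format": result_format}


async def extract(client, document, prompt):
    """Submit a PDF and wait for the JSON result, mirroring the cells above."""
    job_id = await client.submit_and_process_document(
        document=document,
        document_mime_type="application/pdf",
        prompt=prompt,
    )
    return await client.wait_until_ready(job_id, result_format="json")


def run_extraction(client, document, prompt):
    """Synchronous entry point for scripts."""
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(extract(client, document, prompt))
```

In a script you would pass the real client instead of the stub. Inside a notebook, keep using `await` directly, because `asyncio.run` raises an error when an event loop is already running.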

## Type-Safe Extraction with Pydantic

For production use, you want **consistent, validated data**. Pydantic models let you:
- Define exactly what fields you need
- Specify data types (numbers are numbers, not strings)
- Add validation rules
- Get autocomplete in your IDE

Let's define a proper invoice schema:

In [None]:
from pydantic import BaseModel, Field
from typing import Optional
import json


class LineItem(BaseModel):
    """A single line item on the invoice."""
    description: str = Field(description="Description of the item or service")
    quantity: int = Field(default=1, description="Quantity ordered")
    unit_price: float = Field(description="Price per unit")
    total: float = Field(description="Line total (quantity * unit_price)")


class Invoice(BaseModel):
    """Extracted invoice data."""
    invoice_number: str = Field(description="Unique invoice identifier")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    vendor_name: str = Field(description="Name of the vendor/seller")
    vendor_address: Optional[str] = Field(default=None, description="Vendor's address")
    line_items: list[LineItem] = Field(default_factory=list, description="List of items")
    subtotal: Optional[float] = Field(default=None, description="Subtotal before tax")
    tax: Optional[float] = Field(default=None, description="Tax amount")
    total: float = Field(description="Total amount due")


# Generate the JSON schema from the Pydantic model
schema = Invoice.model_json_schema()
schema_json = json.dumps(schema)
print("Generated schema:")
print(json.dumps(schema, indent=2))
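
The `Invoice` model above declares types and descriptions but no validation rules yet. As a sketch of what the "validation rules" bullet could look like with Pydantic v2, the hypothetical `StrictInvoice` below adds a non-negative constraint on `total` and a format check on `date`. These checks run on the client when the model is constructed or validated; how the service treats the extra constraints in the generated schema is not covered here.

```python
import datetime

from pydantic import BaseModel, Field, ValidationError, field_validator


class StrictInvoice(BaseModel):
    """Hypothetical trimmed-down variant of Invoice with validation rules added."""
    invoice_number: str = Field(min_length=1, description="Unique invoice identifier")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    total: float = Field(ge=0, description="Total amount due")

    @field_validator("date")
    @classmethod
    def date_must_be_iso(cls, value: str) -> str:
        # strptime raises ValueError for anything that is not a real calendar date
        parsed = datetime.datetime.strptime(value, "%Y-%m-%d").date()
        # strptime tolerates missing zero padding, so compare against the canonical form
        if parsed.isoformat() != value:
            raise ValueError("date must be zero-padded YYYY-MM-DD")
        return value


def is_valid_invoice(data: dict) -> bool:
    """Return True if the dict passes StrictInvoice validation."""
    try:
        StrictInvoice.model_validate(data)
    except ValidationError:
        return False
    return True
```

A negative total or a date such as `31/01/2024` is rejected here, while the original `Invoice` model would accept both.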

In [None]:
# Process the document with our schema
job_id = await client.submit_and_process_document(
    document=invoice_bytes,
    document_mime_type="application/pdf",
    schema=schema_json,
    prompt="Extract all invoice details according to the schema."
)

print(f"Job submitted: {job_id}")

In [None]:
# Get the result and parse it back into our Pydantic model
result = await client.wait_until_ready(job_id, result_format="json")

# Validate and parse with Pydantic
invoice = Invoice.model_validate(result)

print(f"Invoice #{invoice.invoice_number}")
print(f"Date: {invoice.date}")
print(f"Vendor: {invoice.vendor_name}")
print(f"Total: ${invoice.total:.2f}")
print(f"\nLine items ({len(invoice.line_items)}):")
for item in invoice.line_items:
    print(f"  - {item.description}: {item.quantity} x ${item.unit_price:.2f} = ${item.total:.2f}")
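
Type validation confirms that `total` is a number, but not that the arithmetic on each line adds up. A minimal sketch of a consistency check follows; `mismatched_line_items` is a hypothetical helper that works on any objects with `quantity`, `unit_price`, and `total` attributes, and the one-cent tolerance is an assumption.

```python
import math


def mismatched_line_items(line_items, tolerance=0.01):
    """Return the items whose total differs from quantity * unit_price.

    Hypothetical helper: `tolerance` is an absolute difference in currency
    units and absorbs small rounding errors.
    """
    return [
        item
        for item in line_items
        if not math.isclose(item.quantity * item.unit_price, item.total, abs_tol=tolerance)
    ]
```

Calling `mismatched_line_items(invoice.line_items)` returns an empty list when every line is consistent. A non-empty result is a prompt to review the document, not proof of an extraction error, since discounts or per-line taxes can legitimately break the equality.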

## Extracting Barcodes and QR Codes

Many documents contain barcodes or QR codes. DocuDevs can extract these too: just enable the `barcodes` option.

In [None]:
# Load a document with a QR code, such as a Swiss QR-bill invoice.
# This example reuses docs/invoice.pdf; replace the path with your own file.
try:
    with open("docs/invoice.pdf", "rb") as f:
        qr_doc = f.read()

    job_id = await client.submit_and_process_document(
        document=qr_doc,
        document_mime_type="application/pdf",
        prompt="Extract all information including any QR code data.",
        barcodes=True  # Enable barcode/QR scanning
    )

    result = await client.wait_until_ready(job_id, result_format="json")
    print(json.dumps(result, indent=2))
except FileNotFoundError:
    print("Sample document not found. Try with your own document!")

## Tips for Better Extraction

1. **Be specific in your prompts**: Instead of "extract data", say "extract the invoice number, date, and line items"

2. **Use Pydantic models**: They enforce structure and make your code more maintainable

3. **Add field descriptions**: They end up in the generated JSON schema, and the AI uses them to understand what each field should contain

4. **Start simple**: Get basic extraction working, then add complexity

5. **Test with representative documents**: Use real samples that represent your actual workload

## Next Steps

- **[Map-Reduce Extraction](02-map-reduce.ipynb)**: Handle very long documents (50+ pages)
- **[Knowledge Search](03-knowledge-search.ipynb)**: Enrich extractions with your own knowledge base
- **[Operations](04-operations.ipynb)**: Error analysis and follow-up questions on documents