## Make sure you set your API key

The Landing.AI system is proprietary, so you need an API key to use the Python library. You can obtain one on [the visual playground page](https://va.landing.ai/my/settings/api-key); make sure the key is set in your environment before creating the client.

In [None]:
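# `!export` runs in a temporary subshell and would not reach the notebook kernel; `%env` sets the variable for this session
%env VISION_AGENT_API_KEY=<your-api-key>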
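
A quick check that the key is actually visible to the Python process can save a confusing authentication error later. The sketch below assumes the client reads `VISION_AGENT_API_KEY` from the environment, as the cell above implies; `require_api_key` is a hypothetical helper and not part of `landingai_ade`.

```python
import os


def require_api_key(env=os.environ, name="VISION_AGENT_API_KEY"):
    """Return the API key found in `env`, or raise if it is missing or blank."""
    # Hypothetical helper: it only inspects the environment mapping it is given.
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before creating the client.")
    return key
```

Calling `require_api_key()` right before `LandingAIADE()` fails early with a readable message if the variable never reached the kernel.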

## Parse File

In this introductory notebook, we only explore the Parse and Extract API functions. That is all we need for this demo, but you can also [classify or split a large document into subdocuments](https://docs.landing.ai/ade/ade-python#split:-getting-started).

In [None]:
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Replace with your file path
response = client.parse(
    document=Path("/path/to/file/document"), # clone this repo and access ./test-docs/example-fr.png
    model="dpt-2-latest"
)
print(response.chunks)

# Save Markdown output (useful if you plan to run extract on the Markdown)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(response.markdown)
    
# If you look at the parsed document, you'll see that the table was handled very well
# and that even hard-to-read details, like the FasoTax percentages, are clearly extractable.

The same code can also parse online documents: pass the `document_url=` parameter instead of `document=` (e.g. `document_url="https://example.com/document.pdf"`).
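
If your inputs mix local files and URLs, a small dispatcher can pick the right keyword for you. This is only a sketch: `build_parse_kwargs` is a hypothetical helper, and it assumes that `client.parse` takes either `document=` with a `Path` or `document_url=` with a string, as described above.

```python
from pathlib import Path
from urllib.parse import urlparse


def build_parse_kwargs(source, model="dpt-2-latest"):
    """Choose `document_url=` for http(s) sources and `document=` for local paths."""
    # Anything without an http/https scheme is treated as a local file path.
    if urlparse(str(source)).scheme in ("http", "https"):
        return {"document_url": str(source), "model": model}
    return {"document": Path(source), "model": model}
```

You would then call `client.parse(**build_parse_kwargs("https://example.com/document.pdf"))` or pass a local path such as `./test-docs/example-fr.png` in the same way.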

## Information Extraction

You can call the extract function with the Markdown string returned by the parse function, or pass the path or URL of an already parsed document.

```
# Parse the document
parse_response = client.parse(
    document=Path("/path/to/document.pdf"),
    model="dpt-2-latest"
)

# Extract data using the markdown string from parse response
extract_response = client.extract(
    schema=schema_json,
    markdown=parse_response.markdown,  # Pass markdown string directly
    model="extract-latest"
)

# Or extract from a local markdown file
extract_response = client.extract(
    schema=schema_json,
    markdown=Path("/path/to/output.md"),
    model="extract-latest"
)

# Or extract from a remote markdown file
extract_response = client.extract(
    schema=schema_json,
    markdown_url="https://example.com/document.md",
    model="extract-latest"
)
```

The most important part of the extraction workflow, however, is defining a schema that accurately describes the information you want to extract from your document. This is, in a sense, the prompt engineering part of the workflow: the system already performs very well at extracting visible, labeled information, but in general the more detailed your field descriptions are, the better the extraction.

Schemas passed to the API are in JSON format, but I recommend defining your extraction schema with Pydantic models: they add type information and structured descriptions, so the extracted information is type-aware.

In [None]:
from __future__ import annotations

import os
from io import BytesIO
from pathlib import Path

from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema

from typing import List
from pydantic import BaseModel, Field

class CommercialContact(BaseModel):
    name: str = Field(
        ...,
        description='The name of the commercial contact person.',
        title='Contact Name',
    )
    phone_number: str = Field(
        ...,
        description='The phone number of the commercial contact.',
        title='Phone Number',
    )


class PricingPlan(BaseModel):
    plan_name: str = Field(
        ...,
        description='The name of the fiber optic plan (e.g., ESSENTIEL, CONFORT, PREMIUM).',
        title='Plan Name',
    )
    speed: str = Field(
        ...,
        description='The advertised internet speed for the plan in Mb/s.',
        title='Internet Speed',
    )


# With this schema we want to extract:
# - the language of the document (inferred by the agent)
# - the contact details of the commercial person for subscriptions
# - the pricing plan details for the different fiber optic offers
# - the total price for a 6-month prepaid "Confort" plan, including FMS, subscription, and FasoTax
class FiberOpticServiceInformation(BaseModel):
    language: str = Field(
        ...,
        description='The inferred language of the document.',
        title='Document Language',
    )
    commercial_contact: CommercialContact = Field(
        ...,
        description='Details of the commercial contact person for subscriptions.',
        title='Commercial Contact Information',
    )
    offers_summary: str = Field(
        ...,
        description='A brief summary of the main offers mentioned at the beginning of the document.',
        title='Initial Offers Summary',
    )
    pricing_plans: List[PricingPlan] = Field(
        ...,
        description='An array of detailed pricing plans for different fiber optic offers.',
        title='Detailed Pricing Plans',
    )
    confort_6_month_total_price: int = Field(
        ...,
        description="The total price for a 6-month prepaid 'Confort' plan, including FMS, subscription, and FasoTax.",
        title='Confort 6-Month Prepaid Total Price',
    )

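# Convert the Pydantic model to a JSON schema
schema = pydantic_to_json_schema(FiberOpticServiceInformation)

# Extract structured data using the schema
extract_response = client.extract(
    schema=schema,
    markdown=Path("/path/to/output.md"),
)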

print(extract_response)


Think of the Pydantic model definition as if you were defining a class and its attributes while designing a database: you can represent complex class relationships with nested structures (see [Nested subfields in the docs](https://docs.landing.ai/ade/ade-python#extract-nested-subfields)). You can also [Extract Variable-Length Data with List Objects](https://docs.landing.ai/ade/ade-python#extract-variable-length-data-with-list-objects).
You can also use pre-created schemas saved as JSON.
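
For a schema stored on disk, it helps to confirm the file is valid JSON before sending it. The sketch below assumes that `client.extract` accepts the schema as a JSON string; `load_schema` and the file name in the usage note are hypothetical.

```python
import json
from pathlib import Path


def load_schema(path):
    """Read a JSON schema file, check that it parses, and return it as a JSON string."""
    text = Path(path).read_text(encoding="utf-8")
    schema = json.loads(text)  # raises json.JSONDecodeError on malformed files
    if not isinstance(schema, dict):
        raise ValueError("the top-level schema must be a JSON object")
    return json.dumps(schema)
```

You could then pass the result as `schema=load_schema("schema.json")` in the extract call.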

## Linking Extracted Data to Document Locations

Use the reference IDs from extraction_metadata to find the exact location where data was extracted in the source document. This is useful for visual validation, quality assurance, or building confidence scores.

In [None]:
# Link extracted field to its source location
chunk_id = extract_response.extraction_metadata["commercial_contact"]["name"]["references"][0]

print(f"Commercial name (chunk_id {chunk_id}): {extract_response.extraction['commercial_contact']['name']}")

phone = extract_response.extraction['commercial_contact']['phone_number']
print(f"Commercial phone number: {phone}")
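
To review every field at once rather than one lookup at a time, you can walk the metadata recursively. This sketch assumes the metadata behaves like nested dictionaries in which each leaf field carries a `references` list, as in the indexing above. `collect_references` is a hypothetical helper, the sample data is made up, and list-valued fields are skipped for brevity.

```python
def collect_references(metadata, prefix=""):
    """Walk nested extraction metadata and map each dotted field path to its chunk IDs."""
    found = {}
    for key, value in metadata.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and "references" in value:
            found[path] = list(value["references"])
        elif isinstance(value, dict):
            # Nested object: recurse and keep the parent name in the path.
            found.update(collect_references(value, path))
    return found


# Made-up metadata in the shape used above; the chunk IDs are placeholders.
sample = {
    "language": {"references": ["chunk-1"]},
    "commercial_contact": {
        "name": {"references": ["chunk-7"]},
        "phone_number": {"references": ["chunk-7", "chunk-8"]},
    },
}
for field, chunk_ids in collect_references(sample).items():
    print(field, chunk_ids)
```

Each returned path can be matched against the chunks from the parse response to highlight the source region of a field.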

You can use [va.landing.ai](https://va.landing.ai/) to have an LLM generate your Pydantic schemas (a feature that we will implement in our demo app).
I also recommend checking their [documentation](https://docs.landing.ai/) for further information and the API reference; it is a great AI-augmented doc.