# **Extract structured information from uploaded invoices**

# 1. Environment and Authentication Setup
* I typically use [poetry](https://python-poetry.org/docs/dependency-specification/) for managing dependencies, but since we’re working in Google Colab and time is limited, I installed the required packages directly with `%pip`.
* Use the google-genai SDK to interact with Gemini models and fetch the API key securely from Colab's userdata.
* To get started with the Gemini API:
    1. ✅ You only need a Google account (Google Cloud setup is not required)
    2. 🔗 Visit [AI Studio](https://aistudio.google.com/prompts/new_chat)
    3. 🔐 Click “Get API Key” (top-left)
    4. ➕ Create a new key and store it in Colab secrets under the name "GOOGLE_API_KEY"
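
Outside Colab, `google.colab.userdata` cannot be imported, so the key lookup used in this notebook fails there. The sketch below is one way to cover both cases. The helper name `load_api_key` and the environment-variable fallback are assumptions for illustration, not part of the original setup.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside Colab; the lookup can also fail if the secret is missing
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        key = None
    # Assumed fallback: an environment variable with the same name
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

The returned string can then be passed to `genai.Client(api_key=...)` exactly as in the cell below.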




In [22]:
%pip install "google-genai>=1"
%pip install gradio



In [23]:
from google import genai
from google.colab import userdata

# Create a client
api_key = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=api_key)

# Define the model you are going to use
model_id = "gemini-2.0-flash"  # or "gemini-2.0-flash-lite-preview-02-05", "gemini-2.0-pro-exp-02-05"



# 2. Work with PDFs and other files
Gemini models can process not only text but also images, PDFs, and even videos. These files can be passed to the model either as base64-encoded strings or, more efficiently, through the Files API.
The google-genai Python SDK exposes file management on the client, with methods such as `client.files.upload()` and `client.files.delete()`.

Here’s how it works:

* First, upload your file with `client.files.upload()`.
* After uploading, you’ll receive a file reference (a `File` object) that can be passed directly in your prompt.

ℹ️ Important Notes:
* The Files API allows you to store up to 20 GB of data per project.
* Individual files can be up to 2 GB in size.
* Files are automatically deleted after 48 hours.
* Files are not downloadable, but they remain accessible through your API key during that time.
* Uploading files via the API is currently free of charge.
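
Given the per-file limit listed above, it can be worth checking a file locally before attempting an upload. This is a minimal sketch using only the standard library. The helper name `validate_for_upload` is hypothetical, and reading "2 GB" as 2 × 1024³ bytes is an assumption.

```python
import os

# Assumption: the 2 GB per-file limit is interpreted as 2 * 1024**3 bytes
MAX_FILE_BYTES = 2 * 1024**3


def validate_for_upload(path: str, max_bytes: int = MAX_FILE_BYTES) -> int:
    """Return the file size in bytes, or raise if the file is missing or too large."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes}-byte limit")
    return size
```

Calling it right before `client.files.upload(...)` turns an oversized file into a clear local error rather than a failed request.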

In [24]:
resume_pdf = client.files.upload(file="/content/A.Pcv (1).pdf", config={'display_name': 'invoice'})


In [25]:
file_size = client.models.count_tokens(model=model_id,contents=resume_pdf)
print(f'File: {resume_pdf.display_name} equals to {file_size.total_tokens} tokens')

File: invoice equals to 1549 tokens


# 3. Structured Outputs with Gemini 2.0 and Pydantic
Gemini 2.0 supports Structured Outputs, a powerful feature that ensures the model returns responses in a predefined and strictly validated format, such as a JSON schema.

This allows you to:
* ✅ Enforce consistency in model outputs
* ✅ Integrate results seamlessly into your application logic
* ✅ Avoid fragile post-processing or unreliable parsing

To achieve this, we use **Pydantic**, a Python library for data validation and schema enforcement. By defining a schema with **BaseModel**, we tell Gemini exactly how the output should be structured.



In [26]:
from pydantic import BaseModel, Field

# Define a Pydantic model
# Use the Field class to add a description and default value to provide more context to the model
class Topic(BaseModel):
    name: str = Field(description="The name of the topic")

class Person(BaseModel):
    first_name: str = Field(description="The first name of the person")
    last_name: str = Field(description="The last name of the person")
    email: str = Field(description="The email address of a person")
    work_experience: list[Topic] = Field(description="The work experience, if not provided please return an empty list")
    education: list[Topic] = Field(description="The education, if not provided please return an empty list")

# Define the prompt
prompt = (
    "Adelajda Papa is an accomplished Albanian professional currently based in Berlin, Germany. "
    "She works as a Senior Data Scientist at the Federal Ministry of Finance (BDR), where she leads projects at the intersection of artificial intelligence and public sector innovation. "
    "She holds a Master’s degree in Finance and has extensive experience in data science, machine learning, and applied AI for impact-driven use cases."
)
# Generate a response using the Person model
response = client.models.generate_content(model=model_id, contents=prompt, config={'response_mime_type': 'application/json', 'response_schema': Person})

# print the response as a json string
print(response.text)

{
  "first_name": "Adelajda",
  "last_name": "Papa",
  "email": "unknown",
  "work_experience": [
    {
      "name": "Senior Data Scientist at the Federal Ministry of Finance (BDR)"
    }
  ],
  "education": [
    {
      "name": "Master’s degree in Finance"
    }
  ]
}


In [27]:
# The SDK automatically converts the response to the Pydantic model
Adelajda: Person = response.parsed

# Print the parsed object; individual fields are available as attributes, e.g. Adelajda.first_name
print(f"{Adelajda}")


first_name='Adelajda' last_name='Papa' email='unknown' work_experience=[Topic(name='Senior Data Scientist at the Federal Ministry of Finance (BDR)')] education=[Topic(name='Master’s degree in Finance')]
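
In the output above, `email` came back as `'unknown'`: the field is a required `str`, so the model had to put some string there. One possible adjustment is to declare such fields as `Optional` so a missing value can be returned as `null`. The class name `PersonWithOptionalEmail` is hypothetical, and the JSON strings are hand-written stand-ins for `response.text`, not real model output.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class PersonWithOptionalEmail(BaseModel):
    first_name: str = Field(description="The first name of the person")
    last_name: str = Field(description="The last name of the person")
    # Optional with a None default lets the model answer null instead of a placeholder
    email: Optional[str] = Field(default=None, description="The email address, or null if not stated")


# Hand-written stand-in for response.text
raw = '{"first_name": "Adelajda", "last_name": "Papa", "email": null}'
person = PersonWithOptionalEmail.model_validate_json(raw)
print(person.email)  # None

# Required fields are still enforced when the JSON is validated locally
try:
    PersonWithOptionalEmail.model_validate_json('{"first_name": "Adelajda"}')
except ValidationError as exc:
    missing = [err["loc"][0] for err in exc.errors()]
    print(missing)  # ['last_name']
```

Downstream code can then test `person.email is None` instead of comparing against placeholder strings.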


# 4. Extract Structured Data from PDFs Using Gemini 2.0
To extract structured information from invoice files, I combined the Files API with the structured output feature. I created a reusable function that accepts a file path and a Pydantic model.

The function works in three steps:
* It uploads the file to Gemini using the Files API.
* It sends a prompt along with the uploaded file and the defined schema to the Gemini model.
* It returns the structured response, parsed directly into a Python object that matches the schema.

This approach allows me to process PDFs or images of invoices and receive clean, structured JSON output without the need for post-processing.

In [28]:
def extract_structured_data(file_path: str, model: type[BaseModel]):
    # Upload the file to the Files API; the display name is the file name up to its first dot
    file = client.files.upload(file=file_path, config={'display_name': file_path.split('/')[-1].split('.')[0]})
    # Generate a structured response using the Gemini API
    prompt = "Extract the structured data from the following PDF file"
    response = client.models.generate_content(model=model_id, contents=[prompt, file], config={'response_mime_type': 'application/json', 'response_schema': model})
    # Return the response parsed into an instance of the given Pydantic model
    return response.parsed
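
The function above leaves each uploaded file in the Files API until it expires. A possible variation, sketched below, deletes the upload once the response is back. The name `extract_and_cleanup` and passing `client` and `model_id` as parameters are assumptions made so the sketch is self-contained; the SDK calls are the same ones used in this notebook plus `client.files.delete()`.

```python
def extract_and_cleanup(client, model_id: str, file_path: str, schema):
    """Upload a file, request structured output, then delete the upload."""
    uploaded = client.files.upload(file=file_path)
    try:
        response = client.models.generate_content(
            model=model_id,
            contents=["Extract the structured data from the following file", uploaded],
            config={"response_mime_type": "application/json", "response_schema": schema},
        )
        return response.parsed
    finally:
        # Runs even if generation raises, so no upload is left behind
        client.files.delete(name=uploaded.name)
```

The `try`/`finally` is the design point here: cleanup happens on both the success and the failure path.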

In [29]:
from pydantic import BaseModel, Field

class Item(BaseModel):
    description: str = Field(description="The description of the item")
    quantity: float = Field(description="The Qty of the item")
    gross_worth: float = Field(description="The gross worth of the item")

class Invoice(BaseModel):
    """Extract the invoice number, date and all list items with description, quantity and gross worth and the total gross worth."""
    invoice_number: str = Field(description="The invoice number e.g. 1234567890")
    date: str = Field(description="The date of the invoice e.g. 2024-01-01")
    items: list[Item] = Field(description="The list of items with description, quantity and gross worth")
    total_gross_worth: float = Field(description="The total gross worth of the invoice")


result = extract_structured_data("/content/sample-invoice.pdf", Invoice)
print(type(result))
print(f"Extracted Invoice: {result.invoice_number} on {result.date} with total gross worth {result.total_gross_worth}")
for item in result.items:
    print(f"Item: {item.description} with quantity {item.quantity} and gross worth {item.gross_worth}")


<class '__main__.Invoice'>
Extracted Invoice: 123100401 on 1. März 2024 with total gross worth 453.53
Item: Basic Fee wmView with quantity 1.0 and gross worth 130.0
Item: Basis fee for additional user accounts with quantity 0.0 and gross worth 0.0
Item: Basic Fee wmPos with quantity 0.0 and gross worth 0.0
Item: Basic Fee wmGuide with quantity 0.0 and gross worth 0.0
Item: Change of user accounts with quantity 0.0 and gross worth 0.0
Item: Transaction Fee T1 with quantity 14.0 and gross worth 8.12
Item: Transaction Fee T2 with quantity 0.0 and gross worth 0.0
Item: Transaction Fee T3 with quantity 162.0 and gross worth 243.0
Item: Transaction Fee T4 with quantity 0.0 and gross worth 0.0
Item: Transaction Fee T5 with quantity 0.0 and gross worth 0.0
Item: Transaction Fee T6 with quantity 0.0 and gross worth 0.0
Item: Transaction Fee G1 with quantity 0.0 and gross worth 0.0
Item: Transaction Fee G2 with quantity 0.0 and gross worth 0.0
Item: Transaction Fee G3 with quantity 0.0 and gross
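
Structured output guarantees the shape of the result, not that the numbers agree with each other. A small consistency check can catch some extraction mistakes. This is a sketch with a hypothetical helper name, `total_matches_items`. The sample amounts are the three non-zero item values visible in the truncated output above, so the result is not a verdict on the full invoice.

```python
import math


def total_matches_items(item_amounts, total, tolerance=0.01):
    """Return True if the item amounts add up to the reported total (within tolerance)."""
    return math.isclose(math.fsum(item_amounts), total, abs_tol=tolerance)


# Non-zero item amounts visible above: 130.0 + 8.12 + 243.0 = 381.12
visible_amounts = [130.0, 8.12, 243.0]
print(total_matches_items(visible_amounts, 453.53))  # False
print(total_matches_items(visible_amounts, 381.12))  # True
```

A `False` here is a prompt to look closer, for example at whether item amounts and the total were read from different columns.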

# 5. 🚀 Moving Forward
To make the parser more accurate and production-ready, I extended the project with the following improvements:


* ✅ Updated the Invoice schema with more precise and realistic fields, such as:
  * Full customer and business contact information
  * Detailed line items (description, quantity, unit price, subtotal)
  * Tax and total calculation
  * Optional fields like payment method or due date
* ✍️ Refined the system prompt to be clearer and task-specific. This helps guide the model toward exactly what should be extracted, improving output quality and reliability.

* 🖥 Built an interactive Gradio app, allowing users to:
  * Upload PDF or image invoices
  * Preview the file
  * Instantly see structured JSON output
* 📊 Tested the system with real-world invoice layouts, using examples from [Roboflow Universe](https://universe.roboflow.com), which hosts datasets covering a variety of document styles for more robust evaluation.
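
The refined prompt mentioned in the list above is not shown in the notebook cells that follow. As an illustration only, a task-specific prompt could be assembled from the field names the schema expects. The function name `build_invoice_prompt` and the wording of the prompt are hypothetical.

```python
def build_invoice_prompt(field_names):
    """Compose a task-specific extraction prompt from the fields the schema expects."""
    fields = ", ".join(field_names)
    return (
        "You are extracting data from an invoice. "
        f"Return these fields: {fields}. "
        "Copy values exactly as printed, and use null for anything that is not on the document."
    )


print(build_invoice_prompt(["invoice_number", "items", "tax", "total"]))
```

The resulting string would take the place of the generic `prompt` inside the extraction function.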

In [36]:
# Create a client
api_key = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=api_key)

# Define the model you are going to use
model_id = "gemini-2.0-flash-lite-preview-02-05"  # or "gemini-2.0-pro-exp-02-05"


In [37]:
from pydantic import BaseModel, Field
from typing import List, Optional

class ContactInformation(BaseModel):
    name: str = Field(description="Name of the person or business")
    address: str = Field(description="Address details")
    phone: Optional[str] = Field(default=None, description="Phone number")
    email: Optional[str] = Field(default=None, description="Email address")

class Item(BaseModel):
    description: str = Field(description="Description of the item")
    quantity: float = Field(description="Quantity of the item")
    unit_price: float = Field(description="Price per unit of the item")
    subtotal: float = Field(description="Subtotal for this item")

class Invoice_2(BaseModel):
    invoice_number: str = Field(description="Unique identifier for the invoice")
    business_info: ContactInformation = Field(description="Your business details")
    customer_info: ContactInformation = Field(description="Client details")
    items: List[Item] = Field(description="List of itemized goods or services")
    tax: float = Field(description="Applicable tax amount")
    total: float = Field(description="Total amount after tax")
    payment_methods: Optional[str] = Field(default=None, description="Payment methods (e.g., Bank details, PayPal)")



In [38]:
result = extract_structured_data("/content/sample-invoice.pdf", Invoice_2)
print(result)

invoice_number='123100401' business_info=ContactInformation(name='CPB Software (Germany) GmbH', address='Im Bruch 3 - 63897 Miltenberg/Main', phone=None, email='info@cpb-software.com') customer_info=ContactInformation(name='Mr. John Doe', address='Musterstr. 23\n12345 Musterstadt', phone=None, email=None) items=[Item(description='Basic Fee wmView', quantity=1.0, unit_price=130.0, subtotal=130.0), Item(description='Basis fee for additional user accounts', quantity=0.0, unit_price=10.0, subtotal=0.0), Item(description='Basic Fee wmGuide', quantity=0.0, unit_price=50.0, subtotal=0.0), Item(description='Change of user accounts', quantity=0.0, unit_price=1000.0, subtotal=0.0), Item(description='Transaction Fee T1', quantity=0.0, unit_price=10.0, subtotal=0.0), Item(description='Transaction Fee T1', quantity=14.0, unit_price=0.58, subtotal=8.12), Item(description='Transaction Fee T2', quantity=0.0, unit_price=0.7, subtotal=0.0), Item(description='Transaction Fee T3', quantity=162.0, unit_pri
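
With `unit_price` and `subtotal` in the schema, each line item can be checked against its own arithmetic. The sketch below accepts any objects exposing the `Item` attributes. The helper name `inconsistent_items` is hypothetical. The first sample row copies the `Transaction Fee T1` values from the output above, and the second is a made-up bad row included to show a failure.

```python
import math
from types import SimpleNamespace


def inconsistent_items(items, tolerance=0.01):
    """Return descriptions of items where quantity * unit_price differs from subtotal."""
    return [
        item.description
        for item in items
        if not math.isclose(item.quantity * item.unit_price, item.subtotal, abs_tol=tolerance)
    ]


# SimpleNamespace stands in for the Pydantic Item objects
rows = [
    SimpleNamespace(description="Transaction Fee T1", quantity=14.0, unit_price=0.58, subtotal=8.12),
    SimpleNamespace(description="Hypothetical bad row", quantity=2.0, unit_price=10.0, subtotal=25.0),
]
print(inconsistent_items(rows))  # ['Hypothetical bad row']
```

In the notebook this could be called as `inconsistent_items(result.items)` on the parsed `Invoice_2` object.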

In [40]:
def extract_structured_data_json(file_path: str, model: type[BaseModel]):
    # Upload the file to the Files API; the display name is the file name up to its first dot
    file = client.files.upload(file=file_path, config={'display_name': file_path.split('/')[-1].split('.')[0]})
    # Generate a structured response using the Gemini API
    prompt = "Extract the structured data from the following PDF file"
    response = client.models.generate_content(model=model_id, contents=[prompt, file], config={'response_mime_type': 'application/json', 'response_schema': model})
    # Return the raw JSON string instead of the parsed Pydantic object
    return response.text

In [41]:
result = extract_structured_data_json("/content/sample-invoice.pdf", Invoice_2)
print(result)

{
  "invoice_number": "123100401",
  "business_info": {
    "name": "CPB SOFTWARE (GERMANY) GMBH",
    "address": "Im Bruch 3, 63897 Miltenberg",
    "phone": "+49 9371 9786 0",
    "email": "germany@cpb-software.com"
  },
  "customer_info": {
    "name": "Mr. John Doe",
    "address": "Musterstr. 23\n12345 Musterstadt",
    "phone": null,
    "email": null
  },
  "items": [
    {
      "description": "Basic Fee wmView",
      "quantity": 1.0,
      "unit_price": 130.0,
      "subtotal": 130.0
    },
    {
      "description": "Basis fee for additional user accounts",
      "quantity": 0.0,
      "unit_price": 10.0,
      "subtotal": 0.0
    },
    {
      "description": "Basic Fee wmPos",
      "quantity": 0.0,
      "unit_price": 50.0,
      "subtotal": 0.0
    },
    {
      "description": "Basic Fee wmGuide",
      "quantity": 0.0,
      "unit_price": 1000.0,
      "subtotal": 0.0
    },
    {
      "description": "Change of user accounts",
      "quantity": 0.0,
      "unit_price"

In [42]:
import gradio as gr
import json

In [43]:
# Gradio interface with a themed layout
with gr.Blocks(theme=gr.themes.Soft(), css="body {background-color: #1e1e2f; color: white;} .container {padding: 20px;} .gr-block {overflow: visible; max-height: none;}") as app:
    gr.Markdown("## 🧾 Invoice Parser - Structured JSON Extractor", elem_id="title")

    with gr.Row():
        with gr.Column():
            file_input = gr.File(label="Upload Your Document or Image", file_types=[".pdf", ".docx", ".txt", ".png", ".jpg", ".jpeg"], interactive=True)
            file_viewer = gr.Image(label="Uploaded File Preview", elem_classes=["file-preview"])
            process_button = gr.Button("Process File")

        with gr.Column():
            output_text = gr.JSON(label="Parsed JSON Output", elem_classes=["output-area"])

    # Handle the file processing logic
    def process_file(file_obj):
        if file_obj:
            return json.loads(extract_structured_data_json(file_obj.name, Invoice_2))
        return {"error": "Please upload a file."}

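    def show_uploaded_file(file_obj):
        # gr.Image can only render images, so skip the preview for PDFs and other file types
        if file_obj and file_obj.name.lower().endswith((".png", ".jpg", ".jpeg")):
            return file_obj.name
        return None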

    file_input.change(show_uploaded_file, inputs=file_input, outputs=file_viewer)
    process_button.click(process_file, inputs=file_input, outputs=output_text)

    gr.Markdown("---")
    gr.Markdown("Built By Adela with ❤️ using Gradio")

app.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7916ece22b223e5a73.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
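
`process_file` in the app above passes the model's text straight to `json.loads`, so an empty or malformed response would surface as an exception in the UI. A defensive wrapper is sketched below. The name `parse_model_json` and the shape of the error dictionary are assumptions.

```python
import json


def parse_model_json(text):
    """Return the decoded JSON value, or an error dict the UI can display."""
    if not text:
        return {"error": "The model returned an empty response."}
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Keep the raw text so the problem can be inspected in the JSON panel
        return {"error": f"Could not parse model output: {exc.msg}", "raw": text}
```

Inside `process_file`, the `json.loads(...)` call could be swapped for `parse_model_json(...)` with no other changes, since `gr.JSON` displays either dictionary.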


