[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-repo/your-notebook.ipynb)

# Getting started with ExtractThinker

This notebook shows how to get started with ExtractThinker through a simple example that uses PyPDF as the document loader and GPT-4o-mini as the LLM.

## Setup

First, let's install the required libraries:

In [9]:
!pip install extract-thinker

Then set up the OpenAI API key:

In [10]:
# My OpenAI Key
import os

os.environ["OPENAI_API_KEY"] = ""
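Hardcoding the key in a cell makes it easy to leak when the notebook is shared. As a minimal sketch (the helper name `resolve_api_key` is hypothetical and not part of ExtractThinker), the key can be taken from the environment when it is already set and requested interactively otherwise:

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from `env`, prompting only when it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key never appears in cell output
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("OPENAI_API_KEY is empty")
    return key


# In the notebook (assumed usage):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

The `env` and `prompt` parameters exist only so the helper can be exercised without a real environment or keyboard input.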

## Create a DocumentLoader

To keep things simple, we'll use PyPDF.


In [11]:
from extract_thinker import DocumentLoaderPyPdf

document_loader = DocumentLoaderPyPdf()


## Create an Extractor 
An extractor is the main class that coordinates the extraction process. It needs a DocumentLoader and an LLM.

In [12]:
from extract_thinker import Extractor

extractor = Extractor()

Load the document loader and the LLM into the extractor:

In [13]:
extractor.load_document_loader(document_loader)

extractor.load_llm("gpt-4o-mini")

## Define the contract and extract the data

In [14]:
from extract_thinker import Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
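Because the contract declares `invoice_date` as a plain `str`, the extracted value arrives in whatever format the model returns. The following is a sketch of one possible post-processing step, not something ExtractThinker provides: the helper `parse_invoice_date` and the list of formats are assumptions to adapt to your own invoices.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical candidate formats; extend the tuple to match your documents.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_invoice_date(raw: str) -> Optional[date]:
    """Try each candidate format in order and return None if none matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising keeps an unexpected date format from aborting the whole extraction; the caller decides how to handle it.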

In [3]:
!pip install ipywidgets
!pip install IPython

## Extract data from an uploaded file according to the contract

In [10]:
import ipywidgets as widgets
from IPython.display import display
import tempfile

# Create file upload widget and output widget
file_upload = widgets.FileUpload(accept='.pdf', description='Upload PDF')
output = widgets.Output()

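def on_file_uploaded(change):
    # Only proceed if there's a new file uploaded
    new_value = change['new']
    if not new_value:
        return

    # ipywidgets 7 passes a dict keyed by file name;
    # ipywidgets 8 passes a tuple of dicts with 'name' and 'content' keys
    if isinstance(new_value, dict):
        uploaded_file = next(iter(new_value.values()))
        file_name = uploaded_file['metadata']['name']
    else:
        uploaded_file = new_value[0]
        file_name = uploaded_file['name']
    content = bytes(uploaded_file['content'])

    # Display information within the output widget
    with output:
        print(f"File uploaded: {file_name}")
        print(f"File size: {len(content)} bytes")

        # Write the upload to disk and close the file so the data is flushed
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file_path = temp_file.name

        # Extract only after the temporary file has been closed
        result = extractor.extract(temp_file_path, InvoiceContract)
        print(result)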

# Observe changes to the file upload widget's value
file_upload.observe(on_file_uploaded, names='value')

# Display both widgets
display(file_upload, output)
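The upload handler creates its temporary file with `delete=False`, so the PDF stays on disk after extraction. One way to clean up, sketched here with a hypothetical helper named `temporary_pdf` that uses only the standard library, is a context manager that deletes the file once the caller is done with it:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_pdf(content: bytes):
    """Write `content` to a closed temporary .pdf file, yield its path, then delete it."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
        temp_file.write(content)
        path = temp_file.name
    try:
        # The file is closed here, so other code can open it by path
        yield path
    finally:
        os.remove(path)


# Assumed usage inside the upload handler:
# with temporary_pdf(content) as path:
#     result = extractor.extract(path, InvoiceContract)
```

Closing the file before yielding also matters on Windows, where a file that is still open cannot be opened a second time by path.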
