<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mTNZbL1G5BFjzE2JpQOdexUoZbZ-u_RO?usp=sharing)
## Master Generative AI in 6 Weeks
**What You'll Learn:**
- Build with Latest LLMs
- Create Custom AI Apps
- Learn from Industry Experts
- Join Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

*Empowering the Next Generation of AI Innovators*

##**🧠 ExtractThinker: Intelligent Document Processing for LLMs**

 ExtractThinker is an open-source **Document Intelligence** framework designed for seamless
 integration with **Large Language Models (LLMs)**. It provides an ORM-style interface for
 **efficient document processing**, **customizable data extraction**, and **flexible classification**. 🚀  

✨ **Key Features**:  
 - 📄 **Multi-format Document Support** – Works with PDFs, images, and spreadsheets.  
 - 🔍 **Advanced Data Extraction** – Define precise extraction contracts using Pydantic models.  
 - ⚡ **Asynchronous Processing** – Optimized for handling large documents efficiently.  
 - 🏗 **Flexible Document Loaders** – Supports Tesseract OCR, Azure Form Recognizer, AWS Textract, and more.  
 - 🤖 **Seamless LLM Integration** – Compatible with OpenAI, Anthropic, Cohere, and other AI models.  
 - 🔧 **ORM-style Interface** – Intuitive, developer-friendly API for document processing.  

 ExtractThinker is a powerful tool for **automating document understanding, classification,
 and knowledge extraction** in AI-driven applications. 📚  
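
The asynchronous-processing feature listed above is not demonstrated in this notebook. As a rough sketch of the general idea only, several documents can be processed concurrently by running a blocking extraction call in worker threads with the standard library's `asyncio.to_thread` (Python 3.9+). The `fake_extract` function is a hypothetical stand-in, not ExtractThinker's API; in practice it would wrap a call such as `extractor.extract(path, SomeContract)`.

```python
import asyncio
import time

def fake_extract(path):
    """Hypothetical stand-in for a blocking extraction call."""
    time.sleep(0.1)  # simulate I/O-bound work such as an LLM request
    return {"source": path}

async def extract_many(paths, extract_fn=fake_extract):
    """Run one blocking extraction per path in worker threads, preserving input order."""
    tasks = [asyncio.to_thread(extract_fn, path) for path in paths]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"]))
    print(results)
```

Inside a notebook, where an event loop is already running, call `await extract_many([...])` directly instead of `asyncio.run(...)`.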



###**Setup and Installation**

In [None]:
!pip install extract-thinker pypdf

In [None]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
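
`google.colab.userdata` exists only inside Colab. A minimal sketch of a fallback for other environments, assuming the key may already be set as an environment variable; `resolve_api_key` is a hypothetical helper, not part of ExtractThinker:

```python
import os

def resolve_api_key(name="OPENAI_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to Colab secrets or the environment")
    return key
```

It could then replace the direct lookup: `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.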

###**Create a DocumentLoader**

In [None]:
from extract_thinker import DocumentLoaderPyPdf

document_loader = DocumentLoaderPyPdf()

###**Create an Extractor**

In [None]:
from extract_thinker import Extractor

extractor = Extractor()

In [None]:
extractor.load_document_loader(document_loader)

extractor.load_llm("gpt-4o-mini")

###**Define the contract and extract the data**

In [None]:
from extract_thinker import Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
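
Both fields are declared as plain strings, so the date comes back in whatever format the model returns. A small sketch of normalizing it afterwards with the standard library; the `parse_invoice_date` helper and its list of formats are illustrative assumptions, not part of ExtractThinker:

```python
from datetime import date, datetime

# Candidate formats to try, in order; extend this list to match your documents
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")

def parse_invoice_date(raw: str) -> date:
    """Parse a date string by trying each known format; raise ValueError if none match."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")
```

Because day-first is tried before month-first, an ambiguous value such as `01/02/2024` is read as 1 February; reorder `DATE_FORMATS` if your invoices use the other convention.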

In [None]:
!pip install ipywidgets
!pip install IPython

###**Download invoice.pdf from GitHub**


In [None]:
!wget -O invoice.pdf "https://github.com/enoch3712/ExtractThinker/raw/main/examples/invoice.pdf"
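
`wget` saves whatever the server returns, including an HTML error page. A quick sanity check, sketched here with a hypothetical `looks_like_pdf` helper, is to confirm that the file starts with the `%PDF-` signature before passing it to the extractor:

```python
from pathlib import Path

def looks_like_pdf(path) -> bool:
    """Return True if the file exists and begins with the PDF signature '%PDF-'."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(5) == b"%PDF-"
```

For example, `looks_like_pdf("invoice.pdf")` should be true after a successful download. This is only a heuristic; it does not prove the file is a valid, parseable PDF.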

###**Extract Data from Uploaded File according to the contract**

In [None]:
import ipywidgets as widgets
from IPython.display import display
import tempfile

file_upload = widgets.FileUpload(accept='.pdf', description='Upload PDF')

file_path_input = widgets.Text(
    placeholder="Enter the downloaded PDF file path",
    description="File Path:",
    layout=widgets.Layout(width='80%')
)

output = widgets.Output()

def process_pdf(file_path):
    """Extract data from the provided PDF file."""
    with output:
        print(f"Processing File: {file_path}")
        result = extractor.extract(file_path, InvoiceContract)
        print(result)

def on_file_uploaded(change):
    """Handle uploaded file processing."""
    if change['new']:
        uploaded_file = next(iter(change['new'].values()))

        with output:
            print(f"File uploaded: {uploaded_file['metadata']['name']}")
            print(f"File size: {len(uploaded_file['content'])} bytes")

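            # Write the upload to disk; leaving the block closes the file and flushes its contents
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
                temp_file.write(uploaded_file['content'])
                temp_file_path = temp_file.name

            # Process only after the file is closed so the loader reads the complete PDF
            process_pdf(temp_file_path)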

def on_path_submit(widget):
    """Handle file processing via specified path (called with the Text widget on Enter)."""
    path = widget.value.strip()
    if path:
        process_pdf(path)

file_upload.observe(on_file_uploaded, names='value')
file_path_input.on_submit(on_path_submit)

display(file_upload, file_path_input, output)
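
The upload handler above assumes the ipywidgets 7 shape of `FileUpload.value`, a dict keyed by file name. In ipywidgets 8 the value is instead a tuple of dicts with `name` and `content` keys, where `content` is a `memoryview`. A sketch of a hypothetical `normalize_uploads` helper that accepts either shape:

```python
def normalize_uploads(value):
    """Return a list of (file_name, content_bytes) pairs from a FileUpload value."""
    if isinstance(value, dict):
        # ipywidgets 7: {file_name: {"metadata": {...}, "content": bytes}}
        return [(name, bytes(item["content"])) for name, item in value.items()]
    # ipywidgets 8: ({"name": ..., "content": memoryview, ...}, ...)
    return [(item["name"], bytes(item["content"])) for item in value]
```

A handler could then iterate over `normalize_uploads(change['new'])` and write each `content_bytes` to a temporary file, independent of the installed ipywidgets version.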



###**Classification Example**

###**Document Classification and Extraction with ExtractThinker**

In [None]:
import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor, Classification, Process, ClassificationStrategy,
    DocumentLoaderPyPdf, Contract
)


###**Define Contracts for Document Extraction**

In [None]:
# Load OPENAI_API_KEY and other settings from a local .env file, if one exists
load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

class DriverLicenseContract(Contract):
    name: str
    license_number: str

###**Initialize Extractor with Document Loader and LLM**

In [None]:
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")


###**Classify Document with Extractor and Contracts**

In [None]:
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

result = extractor.classify(
    "invoice.pdf",
    classifications,
)

print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")

Document classified as: Invoice
Confidence level: 10
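
The classification result can drive the extraction step. The sketch below wires the two calls used in this notebook, `extractor.classify` and `extractor.extract`, into one hypothetical helper and declines to extract when confidence is low. The threshold of 8, the assumption that confidence is reported on a 1 to 10 scale (guessed from the output above), and reading `name` and `contract` back as attributes of each `Classification` are assumptions, not documented behavior.

```python
def classify_then_extract(extractor, path, classifications, min_confidence=8):
    """Classify a document, then extract it with the matching contract.

    Returns (name, data); data is None when confidence is below the threshold
    or no classification matches the returned name.
    """
    result = extractor.classify(path, classifications)
    # Look up the contract by the classification name returned by the model
    contracts = {c.name: c.contract for c in classifications}
    contract = contracts.get(result.name)
    if contract is None or result.confidence is None or result.confidence < min_confidence:
        return result.name, None  # leave for manual review
    return result.name, extractor.extract(path, contract)
```

With the objects defined above, usage would look like `name, data = classify_then_extract(extractor, "invoice.pdf", classifications)`.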
