DocIntelligence is a powerful document processing engine that transforms PDFs into structured, searchable knowledge bases. It combines advanced layout analysis with intelligent content extraction to make your documents truly AI-ready for RAG (Retrieval-Augmented Generation) applications.
RAG (Retrieval-Augmented Generation) has become essential for building AI applications that can access and reason over specific document collections. Many excellent open-source PDF processors rely solely on OCR (Optical Character Recognition) and often struggle with:
- Complex document layouts
- Mixed content types (text, tables, images, formulas)
- Maintaining document structure and hierarchy
- Generating high-quality embeddings for semantic search
DocIntelligence addresses these challenges by combining advanced layout analysis with intelligent content extraction, making your documents truly AI-ready.
| Feature | Description | Benefit |
|---|---|---|
| Smart Layout Analysis | Automatically detects and preserves document structure | Maintains context and relationships between elements |
| Multi-Modal Extraction | Handles text, tables, images, and formulas | Complete document understanding |
| Semantic Embeddings | Generates vector embeddings for all content | Enables powerful similarity search |
| Flexible Storage | Store locally or in Google Cloud (BigQuery/Storage) | Scales with your needs |
| Knowledge Base Ready | Structured output perfect for RAG applications | Build smarter AI applications |
- Clone the repository:
git clone https://github.com/Zereo0317/DocIntelligence.git
cd DocIntelligence
- Install dependencies:
pip install -e .  # Install DocIntelligence in development mode

DocIntelligence uses Google Cloud Platform (GCP) for OCR and optionally for enhanced storage features. You can set up your environment using tools like python-dotenv or direnv.
- Set up GCP Authentication:
  - Create a Google Cloud Project
  - Enable the Cloud Vision API
  - Create a Service Account with these roles:
    - Cloud Vision API User
    - Storage Object Viewer (if using Cloud Storage)
    - BigQuery Data Editor (if using BigQuery)
  - Download the JSON key file
- Create a .env file in your project root:
# Required: Google Cloud Vision API for OCR
GCP_PROJECT_ID="" # Your GCP Project ID
GCP_LOCATION="" # Service location (defaults to us-central1)
# Optional: Cloud Storage for visual & tabular content
GCP_BUCKET_NAME="" # Cloud Storage bucket name
GOOGLE_APPLICATION_CREDENTIALS="" # Path to service account JSON file
# Optional: BigQuery for embeddings & search
GCP_DATASET_ID="" # BigQuery dataset name
GCP_CONNECTION_ID="" # BigQuery connection ID
GCP_BIGQUERY_INSERT_BATCH=500 # Batch size for insertions

DocIntelligence offers two main processing modes:
Local processing is suited to smaller projects and testing. It stores results locally:
from dotenv import load_dotenv
load_dotenv()  # Load variables from .env file
from DocIntelligence import DocIntelligence
engine = DocIntelligence()
# Process documents and get results directly
elements, documents, embeddings = engine.process_documents(
    input_dir="./Documents/",
    output_dir="./Output/",
    store_to_db=False,   # False => Store locally, True => Store in BigQuery
    cloud_storage=False  # False => Store locally, True => Store in Google Cloud Storage
)

Cloud processing is suited to large-scale applications. It stores results in GCP:
from DocIntelligence import DocIntelligence
engine = DocIntelligence()
# Process and store in GCP
engine.process_documents(
    input_dir="./Documents/",
    output_dir="./Output/",
    store_to_db=True,   # Store in BigQuery
    cloud_storage=True  # Use Google Cloud Storage
)

- store_to_db=False: Results are stored in the local output directory
  - Extracted text stored as JSON
  - Images and tables saved as files
  - Embeddings returned as Python objects
- store_to_db=True: Results are stored in BigQuery
  - Structured data stored in tables
  - Enables fast querying and search
  - Perfect for production deployments
- cloud_storage=False: Visual elements stored locally
- cloud_storage=True: Visual elements stored in Google Cloud Storage
  - Better for sharing and scalability
  - Enables cloud-based processing
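One way to choose the two flags is to derive them from the environment, so the same script can run locally or against GCP. The sketch below is only an illustration: `choose_storage_flags` is a hypothetical helper that is not part of DocIntelligence, and it assumes that a non-empty `GCP_DATASET_ID` means BigQuery should be used and a non-empty `GCP_BUCKET_NAME` means Cloud Storage should be used.

```python
import os
from typing import Mapping, Optional


def choose_storage_flags(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: derive process_documents flags from env variables."""
    env = os.environ if env is None else env
    # Assumption: a configured dataset means results should go to BigQuery.
    store_to_db = bool(env.get("GCP_DATASET_ID", "").strip())
    # Assumption: a configured bucket means visual elements go to Cloud Storage.
    cloud_storage = bool(env.get("GCP_BUCKET_NAME", "").strip())
    return {"store_to_db": store_to_db, "cloud_storage": cloud_storage}


# Example with an explicit mapping instead of the real environment.
flags = choose_storage_flags({"GCP_DATASET_ID": "", "GCP_BUCKET_NAME": "my-bucket"})
print(flags)
```

The returned dictionary could then be unpacked into the call, for example `engine.process_documents(input_dir="./Documents/", output_dir="./Output/", **choose_storage_flags())`.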
When store_to_db=False, process_documents returns three lists of dictionaries (elements, documents, and embeddings, in that order):
List of detected elements from the document:
{
    'doc_id': str,                # Document ID (Required)
    'page_num': int,              # Page number (Required)
    'element_type': str,          # Element type (Text, Table, Picture, etc.)
    'element_id': str,            # Unique identifier
    'storage_path': str,          # Storage path for the element
    'embedding_id': str,          # Embedding identifier
    'content': str,               # Element content
    'metadata': dict,             # Element metadata in JSON format
    'mapped_to_element_id': str,  # Related element ID if applicable
    'store_in_bigquery': bool,    # Whether to store in BigQuery
    'section': str,               # Section information
    'title': str                  # Title information
}

List of document metadata:
{
    'doc_id': str,        # Document ID (Required)
    'title': str,         # Document title
    'total_pages': int,   # Total number of pages
    'storage_path': str   # Storage path
}

List of vector embeddings:
{
    'embedding_id': str,            # Embedding identifier
    'vector': "" | List[float],     # Empty string when store_to_db=False,
                                    # list of floats when store_to_db=True
    'original_text': str,           # Original text content
    'content_type': str,            # Content type
    'doc_id': str,                  # Document ID
    'page_num': int,                # Page number
    'element_id': str,              # Element identifier
    'mapped_to_element_id': str,    # Related element ID
    'coordinates': dict             # Position coordinates in JSON
}

Output/
└── {document_name}/
    ├── images/           # Page images
    ├── labeled/          # Extracted elements
    │   ├── text/         # Text blocks
    │   ├── table/        # Table images
    │   ├── formula/      # Mathematical formulas
    │   └── picture/      # Figures and diagrams
    └── ocr/              # Processed content
        ├── text/         # OCR results
        ├── table/        # Table data
        └── caption/      # Image captions
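To show how the returned lists fit together, the sketch below joins embedding records to their source elements through `element_id` and groups the result by document and page. It relies only on the dictionary keys listed above. The function name `build_chunks` and the sample records are made up for illustration and are not part of DocIntelligence.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_chunks(elements: List[dict],
                 embeddings: List[dict]) -> Dict[Tuple[str, int], List[dict]]:
    """Group embedding records by (doc_id, page_num), enriched with element info."""
    # Index elements by their unique identifier for constant-time lookup.
    by_id = {el["element_id"]: el for el in elements}
    pages: Dict[Tuple[str, int], List[dict]] = defaultdict(list)
    for emb in embeddings:
        element = by_id.get(emb["element_id"])
        if element is None:
            continue  # Skip embeddings whose source element is missing.
        pages[(emb["doc_id"], emb["page_num"])].append({
            "text": emb["original_text"],
            "element_type": element["element_type"],
            "section": element.get("section", ""),
            "title": element.get("title", ""),
        })
    return dict(pages)


# Made-up sample records that follow the key names documented above.
sample_elements = [
    {"doc_id": "d1", "page_num": 1, "element_type": "Text",
     "element_id": "e1", "section": "Intro", "title": "Overview"},
    {"doc_id": "d1", "page_num": 2, "element_type": "Table",
     "element_id": "e2", "section": "Results", "title": "Overview"},
]
sample_embeddings = [
    {"embedding_id": "v1", "vector": "", "original_text": "Hello",
     "doc_id": "d1", "page_num": 1, "element_id": "e1"},
    {"embedding_id": "v2", "vector": "", "original_text": "A | B",
     "doc_id": "d1", "page_num": 2, "element_id": "e2"},
]

chunks = build_chunks(sample_elements, sample_embeddings)
for (doc_id, page_num), items in sorted(chunks.items()):
    print(doc_id, page_num, [item["text"] for item in items])
```

Keeping the section and title next to each text chunk is one way to let a RAG application cite where a retrieved passage came from.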
We welcome contributions! Here's how you can help:
- Check our Issues for tasks
- Fork the repository and create a new branch
- Submit a Pull Request with your changes
- Join our community discussions
Need technical support or want to connect? Feel free to email Zereo at zereo@zereo-ai.com or connect on LinkedIn. We're always happy to help!
We're actively developing new features to make DocIntelligence even more powerful:
- Knowledge Graph Integration
  - Automatically build knowledge graphs from documents
  - Discover relationships between concepts
  - Enable graph-based querying
- AgentBasis Integration
  - Seamless connection to our upcoming AgentBasis framework
  - Build intelligent agents that can reason over your documents
  - Create powerful RAG applications with minimal code
- Python ≥ 3.10
- Google Cloud Vision API (for OCR)
- Optional: Google Cloud Storage and BigQuery (for cloud features)
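The requirements above can be checked before the engine is created. The preflight sketch below is an illustration under stated assumptions: `check_environment` and `missing_optional` are hypothetical helpers, not part of DocIntelligence, and the split between required and optional variables follows the comments in the .env template rather than any documented behavior of the library.

```python
import os
import sys
from typing import List, Mapping, Optional, Tuple

MIN_PYTHON = (3, 10)
# Variable names come from the .env template; treating exactly these as
# required or optional is an assumption of this sketch.
REQUIRED_VARS = ("GCP_PROJECT_ID",)
OPTIONAL_VARS = ("GCP_BUCKET_NAME", "GCP_DATASET_ID", "GCP_CONNECTION_ID")


def check_environment(env: Optional[Mapping[str, str]] = None,
                      version: Tuple[int, int] = tuple(sys.version_info[:2])) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {version[0]}.{version[1]}"
        )
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")
    return problems


def missing_optional(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the optional cloud variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in OPTIONAL_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
    for name in missing_optional():
        print("Optional variable not set:", name)
```

Calling `load_dotenv()` before these checks would let them see the values from the .env file.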
DocIntelligence is released under the MIT License. You are free to use, modify, and distribute the code for both commercial and non-commercial purposes.