
In [None]:
Article URL: https://medium.com/data-science-collective/building-a-generic-knowledge-extraction-ai-framework-for-organization-specific-use-cases-cbb52ce93e48
Source URL: https://github.com/umairalipathan1980/Knowledge-Extraction-Using-Dynamic-Schema-Generation

In [None]:
# Clone the repository
!git clone https://github.com/umairalipathan1980/Knowledge-Extraction-Using-Dynamic-Schema-Generation.git
%cd Knowledge-Extraction-Using-Dynamic-Schema-Generation

# Install the package from source (run from the cloned repository root)
!pip install .

# Configure environment
!cp .env.example .env
# Edit .env with your API keys
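
Colab has no obvious editing step for `.env`, so one option is to write the key into the file from Python. The following is a minimal sketch, not part of the repository: `set_env_value` is a hypothetical helper, and `OPENAI_API_KEY` is an assumed variable name (check `.env.example` for the names the package actually reads).

```python
from pathlib import Path


def set_env_value(env_path, key, value):
    """Insert or replace KEY=value in a dotenv-style file."""
    path = Path(env_path)
    lines = path.read_text().splitlines() if path.exists() else []
    # Drop any existing assignment for this key and keep every other line.
    kept = [ln for ln in lines if not ln.strip().startswith(f"{key}=")]
    kept.append(f"{key}={value}")
    path.write_text("\n".join(kept) + "\n")


# Hypothetical usage; the variable name is an assumption:
# set_env_value(".env", "OPENAI_API_KEY", "sk-...")
```

Calling the helper twice with the same key leaves a single assignment in the file, so re-running the cell does not accumulate duplicate entries.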

In [None]:
from extractors import get_openai_config, SchemaGenerator, DataExtractor

# Step 1: Configure OpenAI client (choose Azure or standard OpenAI)
config = get_openai_config(use_azure=False)  # False selects standard OpenAI; set to True for Azure OpenAI

# Step 2: Generate schema from requirements
generator = SchemaGenerator(config=config)

requirements = """
Extract project information:
- Project title (string)
- Budget in EUR (decimal)
- Start date (date)
- Status (enum: active or completed)
"""

schema = generator.generate_schema(user_requirements=requirements)

Generating schema from requirements...
Analyzing structure type...
[detect_structure_type] tokens=566
✓ Structure type: flat
  Reasoning: The requirements describe a single project entity per document (one project title, one budget, one start date, one status). There is no indication of multiple projects or a list/table of items within the document, so a flat structure is appropriate.
  → Using flat structure
[parse_user_requirements] tokens=1593

GENERATED PYDANTIC MODEL

Extraction Schema

class project_information_extraction_Extraction(BaseModel):
    """Extraction model for project_information_extraction"""
    model_config = ConfigDict(extra='forbid')

    project_title: str | None = Field(None, description="Title of the project")
    budget_in_eur: decimal.Decimal | None = Field(None, description="Project budget in EUR")
    start_date: str | None = Field(None, description="Project start date")
    status: Optional[Literal['active', 'completed']] = Field(None, description="Projec
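
To see what a schema of this shape enforces, here is a hand-written stand-in that can be validated against a sample payload. It is a sketch under assumptions, not the generator's output: the class name `ProjectExtraction`, the `status` description, the `is_valid` helper, and the sample data are all illustrative, and it assumes pydantic v2 is installed.

```python
from decimal import Decimal
from typing import Literal, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ProjectExtraction(BaseModel):
    """Hand-written stand-in for the generated model (illustrative only)."""
    model_config = ConfigDict(extra="forbid")  # reject keys not declared below

    project_title: Optional[str] = Field(None, description="Title of the project")
    budget_in_eur: Optional[Decimal] = Field(None, description="Project budget in EUR")
    start_date: Optional[str] = Field(None, description="Project start date")
    # The description below is a placeholder, not the generated text.
    status: Optional[Literal["active", "completed"]] = Field(
        None, description="Illustrative placeholder description"
    )


def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies the schema."""
    try:
        ProjectExtraction.model_validate(payload)
        return True
    except ValidationError:
        return False


# Made-up sample payload: missing fields fall back to None.
record = ProjectExtraction.model_validate(
    {"project_title": "Demo", "budget_in_eur": "1250.50", "status": "active"}
)
print(record.budget_in_eur, record.start_date)

print(is_valid({"status": "paused"}))   # value outside the Literal choices
print(is_valid({"owner": "someone"}))   # undeclared key with extra="forbid"
```

Every field is optional with a `None` default, so a document that omits a value still validates, while `extra="forbid"` and the `Literal` type reject anything outside the declared schema.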

In [None]:
# Step 3: Extract data using the generated schema
from extractors.parsers import VisionParser

extractor = DataExtractor(config=config)

# Configure the parser (reuses the `config` created in Step 1)
parser = VisionParser(
    openai_config=config,
    use_context=True,      # Use previous page context for better continuity
    dpi=300,               # Image resolution (200-300 recommended)
    clean_output=True      # Merge and clean tables across pages
)

# Parse PDF to markdown
markdown_pages = parser.convert_pdf("input/PO.pdf")
parser.save_markdown(markdown_pages, "output/document.md")

# Use markdown as extraction input
results = extractor.extract(
    extraction_model=schema,
    requirements=generator.item_requirements,
    user_requirements=requirements,
    documents=markdown_pages  # Use parsed markdown
)

Converting PDF to images using PyMuPDF (DPI: 300)...
Converted 1 pages to images
Parsing page 1 with vision model...
Markdown saved to: output/document.md

EXTRACTION PHASE

Processing document 1/1...
  [extraction] tokens=813

✓ Completed extraction from 1 document(s)
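
The logs report token usage per step as `[step] tokens=N`. To total usage for a run, the captured output can be parsed with a regular expression. This is a minimal sketch that assumes the log format stays exactly as printed above; `tally_tokens` is a hypothetical helper, not part of the package.

```python
import re

# Matches lines such as "[extraction] tokens=813"
TOKEN_LINE = re.compile(r"\[(\w+)\]\s+tokens=(\d+)")


def tally_tokens(log_text):
    """Return {step_name: total_tokens} for every '[step] tokens=N' entry."""
    totals = {}
    for step, count in TOKEN_LINE.findall(log_text):
        totals[step] = totals.get(step, 0) + int(count)
    return totals


# Token lines copied from the output of the cells above.
log = """
[detect_structure_type] tokens=566
[parse_user_requirements] tokens=1593
  [extraction] tokens=813
"""

usage = tally_tokens(log)
print(usage)
print(sum(usage.values()))  # 2972
```

For this single-page run, schema generation accounts for 2,159 of the 2,972 tokens, and the extraction call itself for 813.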
