Automated Data Extraction Pipeline for "State of the Cyber Security Sector in Ireland 2022"
This project implements an automated document intelligence pipeline that:
- Parses the target PDF
- Extracts all quantitative metrics (tables, charts, textual values)
- Assigns source-of-truth metadata
- Normalizes data for longitudinal economic analysis
- Exposes structured outputs via REST APIs
- Supports async processing with Celery
The system is designed to handle:
- High-fidelity tables
- Vector-based charts
- Narrative numeric statements
- Hierarchical document structures
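As a minimal sketch of the normalization step, the function below converts a narrative money phrase such as "€1.2 billion" into base EUR. The function name and scale table are illustrative assumptions, not the project's actual API:

```python
import re

# Illustrative scale table for narrative money phrases.
_SCALE = {"million": 1e6, "billion": 1e9}

def normalize_eur(text: str):
    """Parse strings like 'Exports reached €1.2 billion' into base EUR (1.2e9)."""
    m = re.search(r"€?\s*([\d.]+)\s*(million|billion)?", text, re.IGNORECASE)
    if not m:
        return None
    value = float(m.group(1))
    scale = _SCALE.get((m.group(2) or "").lower(), 1.0)
    return value * scale

print(normalize_eur("Exports reached €1.2 billion in 2022"))
```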
```
PDF Upload
    ↓
Recursive Document Parser
    ↓
Extraction Engine
    ├── Table Extractor
    ├── Textual Metric Extractor
    ├── Chart Stub Extractor
    └── LLM-Assisted Interpretation
    ↓
Normalizer
    ↓
Confidence Scoring Engine
    ↓
Database Storage
    ↓
API Export (JSON / CSV)
```
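The stage chain above can be sketched as plain functions wired together. Every name below is an illustrative stand-in, not the project's actual code in `app/services/`:

```python
# Illustrative pipeline wiring: each stage is a plain function so the
# composition mirrors the diagram above.

def parse(pdf_bytes: bytes) -> dict:
    # Stand-in for the recursive document parser.
    return {"pages": [{"page": 1, "text": "Exports reached 1.2 billion"}]}

def extract(doc: dict) -> list:
    # Stand-in for the table/text/chart extractors.
    return [{"raw": p["text"], "page": p["page"]} for p in doc["pages"]]

def normalize(metrics: list) -> list:
    for m in metrics:
        m["value"] = 1.2  # the real normalizer parses units from the raw text
    return metrics

def score(metrics: list) -> list:
    for m in metrics:
        m["confidence"] = 0.9  # the real engine weighs method, structure, etc.
    return metrics

def run_pipeline(pdf_bytes: bytes) -> list:
    return score(normalize(extract(parse(pdf_bytes))))

print(run_pipeline(b"%PDF-..."))
```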
```
app/
├── api/
│   ├── upload.py          # Upload endpoint
│   ├── metrics.py         # Fetch structured metrics
│   └── export.py          # Export CSV
│
├── services/
│   ├── pdf_parser.py
│   ├── recursive_parser.py
│   ├── extractor.py
│   ├── tokenizer.py
│   ├── normalizer.py
│   ├── confidence.py
│   ├── chart_stub.py
│   └── llm_client.py
│
├── workers/
│   ├── celery_app.py
│   └── tasks.py
│
├── utils/
│   ├── file_utils.py
│   └── logger.py
│
├── database.py
├── models.py
├── schemas.py
└── main.py
```
The recursive parser preserves document hierarchy for contextual metric extraction. Supported extraction methods:
- Structured table parsing
- Numeric token extraction from text
- Vector chart parsing (stub support)
- LLM-based semantic interpretation
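Numeric token extraction from text can be sketched with a single regex pass. The pattern and function name are assumptions for illustration, not the project's `tokenizer.py`:

```python
import re

# Match phrases like "€1.2 billion" or bare numbers like "2022".
NUM_PATTERN = re.compile(
    r"(?P<currency>€|EUR)?\s*(?P<number>\d[\d,]*(?:\.\d+)?)\s*(?P<scale>million|billion)?",
    re.IGNORECASE,
)

def extract_numeric_tokens(text: str):
    """Return structured tokens for every numeric mention in the text."""
    tokens = []
    for m in NUM_PATTERN.finditer(text):
        tokens.append({
            "number": float(m.group("number").replace(",", "")),
            "scale": (m.group("scale") or "").lower() or None,
            "currency": "EUR" if m.group("currency") else None,
        })
    return tokens

print(extract_numeric_tokens("Exports reached €1.2 billion in 2022"))
```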
Each metric includes:
- Page number
- Source text snippet
- Extraction method
- Confidence score
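That per-metric metadata can be modeled as a small schema. The dataclasses below are an illustrative sketch, not the project's actual `schemas.py`:

```python
from dataclasses import dataclass, asdict

@dataclass
class SourceRef:
    page: int
    text_snippet: str
    confidence: float
    method: str

@dataclass
class Metric:
    metric_name: str
    value: float
    unit: str
    year: int
    source: SourceRef

m = Metric(
    metric_name="Total Cyber Security Exports",
    value=1.2,
    unit="Billion EUR",
    year=2022,
    source=SourceRef(page=14,
                     text_snippet="Exports reached €1.2 billion in 2022",
                     confidence=0.91,
                     method="table_parser"),
)
print(asdict(m))  # nested dataclasses serialize recursively
```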
Scores are computed based on:
- Structural reliability
- Extraction method
- Pattern matching certainty
- LLM response consistency
Optional Celery-based background processing.
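The `USE_CELERY` toggle can be sketched as a dispatcher that falls back to inline execution when the flag is off. Function names here are hypothetical, not the actual `workers/tasks.py` API:

```python
import os

USE_CELERY = os.getenv("USE_CELERY", "false").lower() == "true"

def process_pdf(path: str) -> str:
    # Stand-in for the real extraction task.
    return f"processed:{path}"

def dispatch(path: str) -> str:
    """Route to Celery when enabled, otherwise run synchronously inline."""
    if USE_CELERY:
        # Hypothetical async path; in the real app this would enqueue a task
        # such as app.workers.tasks.process_pdf.delay(path).
        raise RuntimeError("Celery path not wired in this sketch")
    return process_pdf(path)

print(dispatch("uploads/report.pdf"))
```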
Example extracted metric:

```json
{
  "metric_name": "Total Cyber Security Exports",
  "value": 1.2,
  "unit": "Billion EUR",
  "year": 2022,
  "source": {
    "page": 14,
    "text_snippet": "Exports reached €1.2 billion in 2022",
    "confidence": 0.91,
    "method": "table_parser"
  }
}
```

Create a virtual environment:

```shell
python -m venv venv
source venv/bin/activate   # mac/linux
# venv\Scripts\activate    # windows
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Create a `.env` file:

```
UPLOAD_DIR=uploads
ENVIRONMENT=development
REDIS_URL=
DATABASE_URL=sqlite:///./test.db
USE_CELERY=false
CEREBRAS_API_KEY=your_key_here
CEREBRAS_MODEL=llama-3.1-8b
```
Start the API:

```shell
uvicorn app.main:app --reload
```

Access API docs:

http://localhost:8000/docs

To run with async processing, start the worker and the API separately:

```shell
celery -A app.workers.celery_app worker --pool=threads --concurrency=2 --loglevel=info
uvicorn app.main:app --reload
```

Endpoints:

- `POST /upload`: uploads and processes the PDF.
- `GET /metrics`: returns the structured JSON dataset.
- `GET /export/csv`: downloads the extracted dataset in CSV format.
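The CSV export can be sketched as a flattening of the nested metric JSON into one row per metric, using only the standard library. Field names follow the example metric above; the function itself is illustrative, not the project's `export.py`:

```python
import csv
import io

def metrics_to_csv(metrics: list) -> str:
    """Flatten nested source metadata into one CSV row per metric."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[
        "metric_name", "value", "unit", "year", "page", "confidence", "method"])
    writer.writeheader()
    for m in metrics:
        src = m.get("source", {})
        writer.writerow({
            "metric_name": m["metric_name"], "value": m["value"],
            "unit": m["unit"], "year": m["year"],
            "page": src.get("page"), "confidence": src.get("confidence"),
            "method": src.get("method"),
        })
    return buf.getvalue()

sample = [{"metric_name": "Total Cyber Security Exports", "value": 1.2,
           "unit": "Billion EUR", "year": 2022,
           "source": {"page": 14, "confidence": 0.91, "method": "table_parser"}}]
print(metrics_to_csv(sample))
```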
PDF documents contain nested structures (sections, subsections, tables). Recursive traversal preserves context and improves semantic accuracy.
Certain complex charts and ambiguous text require semantic interpretation beyond regex-based extraction.
Downstream economic analysis requires traceability and reliability scoring.
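Recursive traversal of nested sections can be sketched as a depth-first generator. The node shape (`title`/`text`/`children` keys) is an assumption about the parser's output, not the actual `recursive_parser.py` contract:

```python
def walk(node: dict, path=()):
    """Yield (section_path, text) pairs depth-first, preserving hierarchy."""
    here = path + (node.get("title", ""),)
    if node.get("text"):
        yield here, node["text"]
    for child in node.get("children", []):
        yield from walk(child, here)

doc = {"title": "Report", "children": [
    {"title": "Economy", "text": "Exports reached €1.2 billion",
     "children": [{"title": "Detail", "text": "..."}]}]}

for section_path, text in walk(doc):
    print(" > ".join(section_path), "|", text)
```

Because each yielded text carries its full section path, a metric found deep in the tree keeps the context needed for source-of-truth metadata.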
Scalability features:
- Celery for background processing
- Redis-backed task queue
- Modular extraction services
- Toggle-based async support
- Database abstraction layer

Planned enhancements:
- True vector chart numeric extraction
- File hashing for idempotency
- Extraction versioning
- Model performance logging
- Structured economic taxonomy mapping
Author: Rajendra Bisoi
Backend Engineer | Document Intelligence | Data Systems