coderRaj07/cyber_intel

🛡 Cyber Metric Extractor

Automated Data Extraction Pipeline for "State of the Cyber Security Sector in Ireland 2022"


📌 Overview

This project implements an automated document intelligence pipeline that:

  • Parses the target PDF
  • Extracts all quantitative metrics (tables, charts, textual values)
  • Assigns source-of-truth metadata
  • Normalizes data for longitudinal economic analysis
  • Exposes structured outputs via REST APIs
  • Supports async processing with Celery

The system is designed to handle:

  • High-fidelity tables
  • Vector-based charts
  • Narrative numeric statements
  • Hierarchical document structures

🧠 Architecture Overview

PDF Upload
    ↓
Recursive Document Parser
    ↓
Extraction Engine
   ├── Table Extractor
   ├── Textual Metric Extractor
   ├── Chart Stub Extractor
   └── LLM-Assisted Interpretation
    ↓
Normalizer
    ↓
Confidence Scoring Engine
    ↓
Database Storage
    ↓
API Export (JSON / CSV)
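The stages above can be sketched as a simple chain of callables. The stage functions here are illustrative stand-ins, not the project's actual service APIs:

```python
# Minimal sketch of the pipeline: each stage transforms the output of the
# previous one. Real stages live in app/services/ (parser, extractor,
# normalizer, confidence scorer); these toy versions only show the shape.

def run_pipeline(pdf_text, stages):
    """Pass the document through each stage in order."""
    result = pdf_text
    for stage in stages:
        result = stage(result)
    return result

def parse(text):
    # Stand-in for the recursive document parser.
    return text.split("\n")

def extract(lines):
    # Stand-in for metric extraction: keep lines containing numbers.
    return [ln for ln in lines if any(ch.isdigit() for ch in ln)]

def normalize(lines):
    # Stand-in for the normalizer.
    return [ln.strip().lower() for ln in lines]

metrics = run_pipeline("Exports: 1.2bn\nNarrative text\n14 firms",
                       [parse, extract, normalize])
```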

📂 Project Structure

app/
 ├── api/
 │    ├── upload.py          # Upload endpoint
 │    ├── metrics.py         # Fetch structured metrics
 │    ├── export.py          # Export CSV
 │
 ├── services/
 │    ├── pdf_parser.py
 │    ├── recursive_parser.py
 │    ├── extractor.py
 │    ├── tokenizer.py
 │    ├── normalizer.py
 │    ├── confidence.py
 │    ├── chart_stub.py
 │    ├── llm_client.py
 │
 ├── workers/
 │    ├── celery_app.py
 │    ├── tasks.py
 │
 ├── utils/
 │    ├── file_utils.py
 │    ├── logger.py
 │
 ├── database.py
 ├── models.py
 ├── schemas.py
 ├── main.py

🚀 Features

✅ Recursive Parsing

Preserves document hierarchy for contextual metric extraction.

✅ Multi-Strategy Extraction

  • Structured table parsing
  • Numeric token extraction from text
  • Vector chart parsing (stub support)
  • LLM-based semantic interpretation
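As an illustration of the second strategy, numeric tokens with units can be pulled from narrative text with a pattern like the one below. The regex is a sketch; the project's actual tokenizer rules (in app/services/tokenizer.py) may differ:

```python
import re

# Illustrative pattern for euro amounts such as "€1.2 billion" or
# "€980 million"; not the project's actual extraction rules.
MONEY_RE = re.compile(r"€\s?(\d+(?:\.\d+)?)\s*(billion|million)", re.IGNORECASE)

def extract_amounts(text):
    """Return (value, unit) pairs found in a narrative sentence."""
    return [(float(value), unit.lower()) for value, unit in MONEY_RE.findall(text)]

pairs = extract_amounts("Exports reached €1.2 billion in 2022, up from €980 million.")
```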

✅ Source-of-Truth Metadata

Each metric includes:

  • Page number
  • Source text snippet
  • Extraction method
  • Confidence score

✅ Confidence Scoring

Scores are computed based on:

  • Structural reliability
  • Extraction method
  • Pattern matching certainty
  • LLM response consistency
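One way to blend the factors above is a weighted score per extraction method. The weights and base scores below are hypothetical; the real engine's formula (app/services/confidence.py) is not documented here:

```python
# Hypothetical per-method base scores; the actual values used by the
# confidence engine may differ.
METHOD_BASE = {"table_parser": 0.9, "text_regex": 0.7, "llm": 0.6}

def confidence(method, pattern_certainty, structural_bonus=0.0):
    """Blend a per-method base score with match certainty, capped at 1.0."""
    base = METHOD_BASE.get(method, 0.5)
    score = 0.6 * base + 0.4 * pattern_certainty + structural_bonus
    return round(min(score, 1.0), 2)

score = confidence("table_parser", pattern_certainty=0.95)
```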

✅ Async Support

Optional Celery-based background processing.
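The USE_CELERY toggle can be sketched as a dispatch that runs inline unless a queue is enabled. The function names are illustrative; the real task wiring lives in app/workers/:

```python
import os

# Sketch of the async toggle: dispatch to a background queue when
# USE_CELERY=true, otherwise run inline. enqueue() is a stand-in for the
# project's real Celery task's .delay() call.
def process_pdf(path):
    """Placeholder for the synchronous extraction run."""
    return {"path": path, "status": "processed"}

def submit(path, enqueue=None):
    """Run inline unless USE_CELERY=true and a queue function is provided."""
    if os.getenv("USE_CELERY", "false").lower() == "true" and enqueue:
        return enqueue(path)
    return process_pdf(path)

result = submit("uploads/report.pdf")
```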


📊 Output Schema

Example extracted metric:

{
  "metric_name": "Total Cyber Security Exports",
  "value": 1.2,
  "unit": "Billion EUR",
  "year": 2022,
  "source": {
    "page": 14,
    "text_snippet": "Exports reached €1.2 billion in 2022",
    "confidence": 0.91,
    "method": "table_parser"
  }
}
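The payload above maps naturally onto typed models. This dataclass mirror is a sketch; the project's actual schemas live in app/schemas.py and are likely Pydantic models:

```python
from dataclasses import dataclass, asdict

# Dataclass mirror of the example metric payload shown above.
@dataclass
class Source:
    page: int
    text_snippet: str
    confidence: float
    method: str

@dataclass
class Metric:
    metric_name: str
    value: float
    unit: str
    year: int
    source: Source

metric = Metric(
    metric_name="Total Cyber Security Exports",
    value=1.2,
    unit="Billion EUR",
    year=2022,
    source=Source(14, "Exports reached €1.2 billion in 2022", 0.91, "table_parser"),
)
payload = asdict(metric)  # nested dict, ready for JSON serialization
```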

🛠 Setup Instructions

1️⃣ Create Virtual Environment

python -m venv venv
source venv/bin/activate        # mac/linux
# venv\Scripts\activate         # windows

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Configure Environment

Create a .env file:

UPLOAD_DIR=uploads
ENVIRONMENT=development
REDIS_URL=
DATABASE_URL=sqlite:///./test.db
USE_CELERY=false
CEREBRAS_API_KEY=your_key_here
CEREBRAS_MODEL=llama-3.1-8b
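These variables can be read with a minimal stdlib loader like the sketch below; the project may instead use python-dotenv or pydantic-settings:

```python
import os

# Minimal settings loader for the .env variables above, defaulting to the
# development values shown. Illustrative only.
def load_settings(env=os.environ):
    return {
        "upload_dir": env.get("UPLOAD_DIR", "uploads"),
        "environment": env.get("ENVIRONMENT", "development"),
        "database_url": env.get("DATABASE_URL", "sqlite:///./test.db"),
        "use_celery": env.get("USE_CELERY", "false").lower() == "true",
    }

settings = load_settings({"UPLOAD_DIR": "uploads", "USE_CELERY": "false"})
```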

▶ Running the Application

🔹 Development Mode (Without Celery)

uvicorn app.main:app --reload

Access API docs:

http://localhost:8000/docs

🔹 Production Mode (With Celery)

Start worker:

celery -A app.workers.celery_app worker --pool=threads --concurrency=2 --loglevel=info

Start API:

uvicorn app.main:app --host 0.0.0.0 --port 8000

📡 API Endpoints

📤 Upload PDF

POST /upload

Uploads and processes the PDF.


📥 Get Extracted Metrics

GET /metrics

Returns structured JSON dataset.


πŸ“ Export CSV

GET /export/csv

Downloads extracted dataset in CSV format.


πŸ— Design Decisions

Why Recursive Parsing?

PDF documents contain nested structures (sections, subsections, tables). Recursive traversal preserves context and improves semantic accuracy.
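A toy recursive walk shows why hierarchy matters: each extracted value keeps the path of headings above it, which regex-over-flat-text would lose. The node shape here is invented for the example, not the parser's real data model:

```python
# Depth-first walk over a nested section tree, carrying the heading path so
# every text fragment stays attached to its context.
def walk(node, path=()):
    """Yield (heading_path, text) pairs depth-first."""
    here = path + (node["title"],)
    if node.get("text"):
        yield here, node["text"]
    for child in node.get("children", []):
        yield from walk(child, here)

doc = {"title": "Report", "children": [
    {"title": "Exports", "text": "€1.2 billion in 2022"},
]}
found = list(walk(doc))
```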

Why LLM Integration?

Certain complex charts and ambiguous text require semantic interpretation beyond regex-based extraction.

Why Confidence Scoring?

Downstream economic analysis requires traceability and reliability scoring.


📈 Scalability Considerations

  • Celery for background processing
  • Redis-backed task queue
  • Modular extraction services
  • Toggle-based async support
  • Database abstraction layer

🧪 Future Enhancements

  • True vector chart numeric extraction
  • File hashing for idempotency
  • Extraction versioning
  • Model performance logging
  • Structured economic taxonomy mapping

πŸ§‘β€πŸ’» Author

Rajendra Bisoi
Backend Engineer | Document Intelligence | Data Systems


About

Automated document intelligence pipeline that extracts, normalizes, and structures quantitative metrics from complex PDFs (tables, charts, and text) with source-of-truth metadata.
