RegDoc: AI Powered Regulatory Document Classifier

Hitachi Digital Services | Texas A&M Datathon 2025

Live Demo: regdoc.streamlit.app

Overview

RegDoc is an AI-powered document intelligence system that automatically analyzes and classifies regulatory and business documents into four categories: Public, Confidential, Highly Sensitive, and Unsafe.

It supports both multi-page PDFs and image-based documents, combining deterministic policy checks with LLM reasoning. The system was built for the Hitachi Digital Services x Texas A&M Datathon 2025, with the goal of fast, explainable, and audit-ready compliance classification.

The system combines:

- Text and image preprocessing
- Dynamic prompt tree generation from a configurable prompt library
- Dual-model validation using two LLMs
- Human-in-the-loop (HITL) review and overrides
- Citation-based evidence for each classification decision

System Architecture

High Level Flow

User Upload (PDF / Image)
        |
        v
Ingestion Layer (OCR + Text)
        |
        v
Heuristics and Safety Layer
        |
        v
LLM Classification Engine
        |
        v
Policy Rules and Overrides
        |
        v
Streamlit HITL Interface + Audit Log

Category Logic

| Category | Example Documents | Trigger Logic |
| --- | --- | --- |
| Public | Brochures, press releases, marketing content | Default if no PII or restricted terms detected |
| Confidential | Internal memos, project proposals, technical reports | Contains “internal use”, “restricted circulation”, or equipment terms |
| Highly Sensitive | Employment forms, application data with SSNs or credit cards | PII detection (SSN, account number, or address) |
| Unsafe | Explicit, violent, or illegal content | Matches unsafe keyword list (CSAM, self harm, violence) |

`kid_safe` is set to `False` when strong profanity is detected, even if the document is not otherwise unsafe.
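The trigger logic above can be read as an ordered rule check. The following is a minimal sketch, not RegDoc's actual code: the `Signals` class, its field names, and `apply_policy` are hypothetical, and the precedence order (Unsafe, then Highly Sensitive, then Confidential, then Public) is an assumption, since the README does not state how conflicting triggers are resolved.

```python
from dataclasses import dataclass


# Hypothetical container for heuristic findings; the field names are
# illustrative and not taken from the RegDoc codebase.
@dataclass
class Signals:
    has_unsafe_keyword: bool = False
    has_pii: bool = False
    has_internal_wording: bool = False
    has_equipment_terms: bool = False
    has_strong_profanity: bool = False


def apply_policy(signals: Signals) -> dict:
    """Map heuristic signals to a category and a kid_safe flag.

    Assumed precedence (not stated in the README): Unsafe first, then
    Highly Sensitive, then Confidential, with Public as the default.
    """
    if signals.has_unsafe_keyword:
        category = "Unsafe"
    elif signals.has_pii:
        category = "Highly Sensitive"
    elif signals.has_internal_wording or signals.has_equipment_terms:
        category = "Confidential"
    else:
        category = "Public"

    # Strong profanity clears kid_safe even when the category is not Unsafe.
    # Assumption: Unsafe documents are also treated as not kid-safe.
    kid_safe = not (signals.has_strong_profanity or category == "Unsafe")
    return {"category": category, "kid_safe": kid_safe}
```

For example, a marketing brochure containing strong profanity would stay Public under these rules while `kid_safe` becomes `False`.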

Step-by-Step Pipeline

1. Pre-processing
   - Extract text from PDFs using pdfplumber
   - OCR images with pytesseract
   - Count pages, images, and check legibility

2. Heuristic Detection
   - Identify PII (email, phone, SSN, credit card)
   - Flag profanity and unsafe keywords
   - Detect aircraft or serial numbers for equipment sensitivity

3. Prompt Construction
   - Select prompt sets from /prompts (public, sensitive, unsafe)
   - Build a context-aware system prompt for the LLM

4. LLM Inference
   - Run the primary model (LLaMA 3.1 8B Instruct)
   - If confidence < 0.6 → run the validator (LLaMA 3.1 70B Instruct)
   - Merge results and citations when the two models disagree

5. Policy Enforcement
   - Internal or restricted wording → Confidential  
   - SSN or PII → Highly Sensitive  
   - Unsafe keywords → Unsafe  
   - Sensitive equipment → Confidential  

6. Human-in-the-Loop Review
   - Reviewer validates, overrides, or approves classification  
   - Feedback stored in `history.json` for continuous improvement
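
The heuristic detection step (step 2) can be approximated with regular expressions. This is a sketch under stated assumptions: the patterns below are illustrative, US-centric, and deliberately simple, and `PII_PATTERNS` and `detect_pii` are hypothetical names. The patterns RegDoc actually uses are not shown in this README.

```python
import re

# Illustrative patterns only (an assumption); real-world PII detection
# usually needs stricter validation, e.g. a Luhn check for card numbers.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(
        r"(?<!\d)(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}(?!\d)"
    ),
    "credit_card": re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
}


def detect_pii(text: str) -> dict[str, list[str]]:
    """Return all matches per PII type, omitting types with no hits."""
    hits: dict[str, list[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits
```

Any non-empty result would then feed the policy step, where SSN or other PII maps to Highly Sensitive.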

Confidence Calibration and Human-in-the-Loop (HITL)

To minimize manual effort while maintaining accuracy, RegDoc uses confidence thresholds and reviewer feedback loops.

Confidence Thresholds

| Stage | Model | Confidence Range | Action |
| --- | --- | --- | --- |
| Primary | LLaMA 3.1 8B Instruct | ≥ 0.6 | Accept classification automatically |
| Validation | LLaMA 3.1 70B Instruct | < 0.6 | Re-evaluate and merge reasoning |
| Review | Human Reviewer | N/A | Optional override and comment |
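
The threshold routing in the table can be sketched as a small function. The 0.6 threshold comes from the table; everything else here is an assumption for illustration: the result keys (`category`, `confidence`, `citations`), the function name, and the merge rule (prefer the validator's category, pool citations from both models, flag disagreements for review).

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.6  # from the thresholds table

# A classifier takes document text and returns a dict with the hypothetical
# keys "category", "confidence", and "citations".
Classifier = Callable[[str], dict]


def classify_with_validation(
    text: str, primary: Classifier, validator: Classifier
) -> dict:
    """Run the primary model and escalate to the validator on low confidence."""
    first = primary(text)
    if first["confidence"] >= CONFIDENCE_THRESHOLD:
        # Confident enough: accept the primary result as-is.
        return {**first, "validated": False, "needs_review": False}

    second = validator(text)
    # Assumed merge rule: keep the validator's category and confidence,
    # and pool citations from both models without duplicates.
    citations = list(dict.fromkeys(first["citations"] + second["citations"]))
    return {
        "category": second["category"],
        "confidence": second["confidence"],
        "citations": citations,
        "validated": True,
        "needs_review": first["category"] != second["category"],
    }
```

Passing the models in as callables keeps the routing logic testable with stubs, without any API calls.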

HITL Workflow

1. The AI generates reasoning and citations for transparency.
2. The reviewer confirms or changes the classification.
3. Overrides and comments are saved in `history.json`.
4. Future prompts can be tuned using this feedback.
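
Saving a reviewer decision to `history.json` might look like the sketch below. The README does not document the file's schema, so the layout assumed here (a JSON array of entries) and the entry fields and function name are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_review(history_path, document, ai_category, final_category, comment=""):
    """Append one reviewer decision to the audit log and return the entry.

    Assumption: the log is a JSON array; a missing or empty file starts a new one.
    """
    path = Path(history_path)
    raw = path.read_text(encoding="utf-8") if path.exists() else ""
    entries = json.loads(raw) if raw.strip() else []

    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "document": document,
        "ai_category": ai_category,
        "final_category": final_category,
        "overridden": ai_category != final_category,
        "comment": comment,
    }
    entries.append(entry)
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entry
```

Keeping both the AI category and the final category in each entry is what lets the log serve as an audit trail and as feedback for prompt tuning.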

Benefits

- Reduced Manual Load: Only low-confidence cases need review.
- Explainable Decisions: Each classification includes reasoning and citations.
- Continuous Improvement: Reviewer feedback forms a retraining dataset.
- Audit Ready: Complete trace of AI + Human actions for compliance.

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Streamlit |
| Backend | Python 3.10+, pdfplumber, Pillow, pytesseract |
| AI Models | Meta LLaMA 3.1 8B and 70B Instruct via OpenRouter |
| Environment Config | `.env` file with `OPENROUTER_API_KEY` |
| Storage | JSON audit log (`history.json`) |
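
As a rough illustration of how the backend libraries listed above fit the pre-processing step, the sketch below extracts text from a PDF with pdfplumber and from an image with Pillow and pytesseract. The function names, the supported image suffixes, and the legibility heuristic are assumptions, not RegDoc's actual code. The OCR path also requires the Tesseract binary to be installed on the system.

```python
from pathlib import Path

# Assumed set of accepted image types; the README only says "image-based documents".
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def extract_text(path: str) -> dict:
    """Return extracted text and a page count for a PDF or image file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import pdfplumber  # imported lazily so image-only use does not need it

        with pdfplumber.open(path) as pdf:
            # extract_text() may yield nothing for pages without a text layer.
            pages = [page.extract_text() or "" for page in pdf.pages]
        return {"pages": len(pages), "text": "\n".join(pages)}
    if suffix in IMAGE_SUFFIXES:
        from PIL import Image
        import pytesseract

        with Image.open(path) as img:
            text = pytesseract.image_to_string(img)
        return {"pages": 1, "text": text}
    raise ValueError(f"Unsupported file type: {suffix or path}")


def is_legible(text: str, min_chars: int = 20) -> bool:
    """Crude legibility check (an assumption): enough alphanumeric characters."""
    return sum(ch.isalnum() for ch in text) >= min_chars
```

A scanned PDF with no text layer would return mostly empty text here; a fuller pipeline would presumably fall back to OCR on the rendered pages in that case.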

Installation and Setup

1. Clone the Repository

git clone <your-repository-url>
cd <your-project-folder>

2. Install and Create a Virtual Environment

Install `virtualenv`, then create and activate a virtual environment for the project.

3. Install Dependencies

Install all required dependencies:

pip install -r requirements.txt

4. Configure Environment Variables

Create a `.env` File

In the project root folder, create a file named `.env` and paste your OpenRouter API key inside it in the following format:

OPENROUTER_API_KEY=sk-or-v1-your-key-here
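
For reference, reading that key at startup could be done with a small standard-library loader like the one below. This is a sketch: `load_env_file` and `get_openrouter_key` are hypothetical helpers, and the README does not say how the app actually loads `.env` (many projects use a dedicated package for this instead).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines into os.environ; already-set variables win."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def get_openrouter_key() -> str:
    """Return the API key or fail early with a clear message."""
    key = os.environ.get("OPENROUTER_API_KEY", "")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set; add it to .env")
    return key
```

Failing early when the key is missing gives a clearer error than a rejected API request later in the pipeline.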

5. Run the Application

Launch the Streamlit app:

streamlit run app.py
