# Max Winning Project: Gemini-Centric Document Classification System

A comprehensive, enterprise-grade document classification system powered by Google's Gemini 2.0 Flash, featuring RAG (Retrieval-Augmented Generation), CAG (Context-Augmented Generation), and Solana blockchain audit trails.

## Features

- ✅ Policy Knowledge Base: Comprehensive category definitions, PII patterns, and SME-validated examples
- ✅ Multi-Modal Document Processing: PDF parsing with OCR for text and images
- ✅ Citation Mapping: Precise source location tracking with bounding boxes
- ✅ Gemini File Search Store: RAG-based policy grounding
- ✅ Dynamic Prompt Tree: Sequential classification flow (UNSAFE → CONFIDENTIAL → SENSITIVE → PUBLIC)
- ✅ RAG + CAG Grounding: Policy knowledge base + cached document content
- ✅ Structured JSON Output: Category, confidence, reasoning, and citations
- ✅ Dual-Layer Validation: Consensus-based auto-approval (90%+ confidence threshold)
- ✅ Solana Blockchain: Immutable audit trails on Solana devnet
- ✅ SQLite Audit Logs: Complete classification history and HITL reviews
- ✅ Web UI: Flask-based interface with HITL feedback loop
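
The sequential flow named in the Dynamic Prompt Tree feature can be pictured as an ordered series of checks that stops at the first match. The sketch below is only an illustration of that ordering: `classify_sequential` and the keyword checks are hypothetical stand-ins for the Gemini prompts the project sends, which this README does not show.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]


def classify_sequential(text: str, checks: List[Tuple[str, Check]]) -> str:
    """Return the first category whose check matches; default to PUBLIC."""
    for category, matches in checks:
        if matches(text):
            return category
    return "PUBLIC"


# Hypothetical keyword checks, ordered from most to least restrictive.
# A real system would ask the model at each step instead.
demo_checks: List[Tuple[str, Check]] = [
    ("UNSAFE", lambda t: "malware" in t.lower()),
    ("CONFIDENTIAL", lambda t: "ssn" in t.lower()),
    ("SENSITIVE", lambda t: "internal memo" in t.lower()),
]
```

Because the list is ordered, a document that matches several checks always receives the most restrictive category.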

## Architecture

    ┌─────────────────────────────────────────────────────────────┐
    │                    Document Upload (PDF)                    │
    └─────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │   Document Processing (PyMuPDF + OCR + Citation Mapping)    │
    └─────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │           Gemini Classifier (RAG + CAG Pipeline)            │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
    │  │ Policy RAG   │  │ Cached Doc   │  │ Dual Layer   │       │
    │  │ (File Search)│ +│ (CAG)        │ +│ Validation   │       │
    │  └──────────────┘  └──────────────┘  └──────────────┘       │
    └─────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │              Classification Result + Metadata               │
    └─────────┬───────────────────┬───────────────────────────────┘
              │                   │
              ▼                   ▼
       ┌──────────────┐    ┌──────────────┐
       │    Solana    │    │ SQLite Audit │
       │  Blockchain  │    │    Logger    │
       └──────────────┘    └──────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │           Web UI (Dashboard + HITL Review Queue)            │
    └─────────────────────────────────────────────────────────────┘

## Prerequisites

- Python 3.9+
- Tesseract OCR
- API keys:
  - Google Gemini API
  - Solana Devnet access

## Installation

Change into the project directory:

    cd gemini-classifier

Install Tesseract OCR.

macOS:

    brew install tesseract

Ubuntu/Debian:

    sudo apt-get update
    sudo apt-get install tesseract-ocr

Windows: Download and install Tesseract from https://github.com/UB-Mannheim/tesseract/wiki

Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the Python dependencies:

    pip install -r requirements.txt

Set your API keys in the `.env` file:

    GEMINI_API_KEY=<your-gemini-api-key>
    SOLANA_CLUSTER_URL=https://api.devnet.solana.com

Note: Never commit a real API key. In production, use environment variables or secure secret management.

Start the application:

    python main.py

The application will be available at:
- Main Upload: http://localhost:5000
- Dashboard: http://localhost:5000/dashboard
- HITL Queue: http://localhost:5000/hitl/queue
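
The `GEMINI_API_KEY` and `SOLANA_CLUSTER_URL` settings can be read from the process environment at startup so that a missing key fails early. The following is a minimal sketch, not the contents of `src/config.py`: `Settings` and `load_settings` are hypothetical names, and it assumes the `.env` values have already been exported into the environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    solana_cluster_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment and fail fast if the key is missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    # Fall back to the devnet URL used elsewhere in this README.
    url = env.get("SOLANA_CLUSTER_URL", "https://api.devnet.solana.com")
    return Settings(gemini_api_key=key, solana_cluster_url=url)
```

Passing the mapping as a parameter keeps the function easy to test without touching real environment variables.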

## Usage

### Classify a Document

- Navigate to http://localhost:5000
- Upload a PDF file (drag-and-drop or click to browse)
- Wait for processing (typically 5-15 seconds)
- Review the classification result, which includes:
  - Category (UNSAFE/CONFIDENTIAL/SENSITIVE/PUBLIC)
  - Confidence score
  - Reasoning and citations
  - Blockchain audit hash

### Human-in-the-Loop (HITL) Review

- Navigate to the HITL Queue (http://localhost:5000/hitl/queue)
- Click "Review Document" on any pending classification
- Verify or correct the classification
- Add reviewer notes
- Submit the review

Important: Corrected classifications are automatically added to the RAG knowledge base as new few-shot examples, improving future accuracy.
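
Adding a corrected classification to the few-shot examples could look roughly like the sketch below. The schema of `policies/few_shot_examples.json` is not shown in this README, so the layout used here (a JSON array of objects with `text`, `category`, and `reviewer_notes` fields) and the helper name `add_few_shot_example` are assumptions for illustration.

```python
import json
from pathlib import Path


def add_few_shot_example(path: Path, text: str, category: str, notes: str = "") -> int:
    """Append a reviewer-corrected example and return the new example count."""
    # Assumed layout: a JSON array of example objects.
    examples = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    examples.append({"text": text, "category": category, "reviewer_notes": notes})
    # Write to a temporary file first, then swap it in, so a crash
    # mid-write does not leave a truncated examples file behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(examples, indent=2), encoding="utf-8")
    tmp.replace(path)
    return len(examples)
```

After the file changes, the policy store would still need to be refreshed; that step depends on the project's RAG code and is not sketched here.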

## Classification Categories

### UNSAFE

- Harmful, violent, or threatening content
- Illegal activity instructions
- Malware or security exploits
- Action: Immediate rejection and escalation

### CONFIDENTIAL

- Trade secrets and proprietary algorithms
- Financial records (with SSN, credit cards)
- Legal documents (attorney-client privilege)
- M&A plans, executive compensation
- Source code and IP
- PII: SSN, credit cards, bank accounts, medical records, passports

### SENSITIVE

- Internal memos and communications
- Employee directories
- Draft documents
- Internal project plans
- Non-executive budgets
- PII: Emails, phone numbers, addresses, employee IDs

### PUBLIC

- Published marketing materials
- Public website content
- Press releases
- Open-source code
- Public documentation
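
The two PII tiers listed under CONFIDENTIAL and SENSITIVE lend themselves to a simple regex pre-check. The patterns below are illustrative only and are not the contents of `policies/pii_patterns.json`; `find_pii` and `pii_hint` are hypothetical helpers. The result is best treated as a hint for the classifier, since a regex cannot tell a published press contact from an internal address.

```python
import re
from typing import Dict, List

# Illustrative patterns; real deployments need broader, validated rules.
HIGH_RISK = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}
LOW_RISK = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every pattern name that matched, with the matched strings."""
    hits: Dict[str, List[str]] = {}
    for name, pattern in {**HIGH_RISK, **LOW_RISK}.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


def pii_hint(text: str) -> str:
    """Suggest a category based only on which PII tier was detected."""
    hits = find_pii(text)
    if any(name in HIGH_RISK for name in hits):
        return "CONFIDENTIAL"
    if hits:
        return "SENSITIVE"
    return "PUBLIC"
```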

## API Endpoints

Upload and classify a document:

    POST /upload
    Content-Type: multipart/form-data

    Response: {
      "document_id": "DOC_abc123...",
      "classification": "CONFIDENTIAL",
      "confidence": 0.95,
      "reasoning": "...",
      "citation": "...",
      "blockchain": {...},
      "audio_available": true
    }

Get classification statistics:

    GET /api/statistics

    Response: {
      "total_classifications": 42,
      "auto_approval_rate": 85.5,
      "avg_processing_time": 8.3,
      "by_category": {...}
    }

List classifications (paginated):

    GET /api/classifications?limit=100&offset=0

Get a single classification:

    GET /api/classification/<document_id>

Submit a HITL review:

    POST /hitl/submit

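
For scripted access to the read-only endpoints, a small standard-library client is enough. This is a sketch under assumptions: `BASE_URL` matches the local address used in this README, and the helper names are hypothetical. `fetch_json` only works while the Flask application is running.

```python
import json
from typing import Any
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # local development address from this README


def classifications_url(limit: int = 100, offset: int = 0) -> str:
    """Build the paginated listing URL."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{BASE_URL}/api/classifications?{query}"


def classification_url(document_id: str) -> str:
    """Build the URL for a single classification, escaping the ID."""
    return f"{BASE_URL}/api/classification/{quote(document_id, safe='')}"


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example (requires the server to be running):
# stats = fetch_json(f"{BASE_URL}/api/statistics")
# print(stats["total_classifications"])
```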
## 🧪 Testing
### Test with Sample Documents
Create test PDFs with different content types:
**Confidential Example:**

    CONFIDENTIAL - Board Meeting Minutes
    Acquisition Target: TechCorp
    Offer: $500M
    Employee Data: John Smith - SSN: 123-45-6789
    Credit Card: 4532-1234-5678-9010

**Public Example:**

    FOR IMMEDIATE RELEASE
    Product Launch Announcement
    Contact: press@company.com

### Verify System Components
1. **RAG Policy Upload**: Check console for "Policy uploaded successfully"
2. **Classification**: Verify JSON output with category, confidence, reasoning
3. **Blockchain**: Check for transaction hash (may be simulated if devnet is down)
4. **Database**: SQLite file at `data/audit_logs.db`
## 📁 Project Structure

    gemini-classifier/
    ├── main.py                        # Main entry point
    ├── requirements.txt               # Python dependencies
    ├── .env                           # Environment variables (API keys)
    ├── README.md                      # This file
    ├── src/
    │   ├── config.py                  # Configuration management
    │   ├── audit_logger.py            # SQLite audit logging
    │   ├── processing/
    │   │   ├── __init__.py
    │   │   └── document_processor.py  # PDF/OCR processing
    │   ├── classification/
    │   │   ├── __init__.py
    │   │   ├── policy_rag.py          # RAG knowledge base
    │   │   └── classifier.py          # Core AI classifier
    │   ├── blockchain/
    │   │   ├── __init__.py
    │   │   └── solana_audit.py        # Solana integration
    │   └── ui/
    │       ├── __init__.py
    │       └── app.py                 # Flask web application
    ├── policies/
    │   ├── categories.json            # Category definitions
    │   ├── pii_patterns.json          # PII detection patterns
    │   └── few_shot_examples.json     # SME-validated examples
    ├── templates/
    │   ├── base.html
    │   ├── index.html                 # Upload page
    │   ├── dashboard.html             # Statistics dashboard
    │   ├── hitl_queue.html            # Review queue
    │   └── hitl_review.html           # Review detail page
    └── data/
        ├── uploads/                   # Uploaded PDFs
        ├── cache/                     # Cached content & audio
        ├── audit_logs/                # Log files
        └── audit_logs.db              # SQLite database

## 🎓 Key Technologies
| Component | Technology | Purpose |
|-----------|-----------|---------|
| AI Model | Gemini 2.0 Flash | Fast, high-quality classification |
| RAG | Gemini File Search | Policy knowledge grounding |
| CAG | Gemini Caching API | Document context optimization |
| Blockchain | Solana (Devnet) | Immutable audit trails |
| Database | SQLite | Local audit logging |
| Web Framework | Flask | REST API & web UI |
| OCR | Tesseract + PyMuPDF | Multi-modal document processing |
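
Citation mapping with bounding boxes can build on the word-level output that PyMuPDF returns from `page.get_text("words")`, where each item is `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below works on tuples of that shape and finds the box enclosing a quoted phrase. `phrase_bbox` is a hypothetical helper and not necessarily how `document_processor.py` does it.

```python
from typing import List, Optional, Tuple

# Same shape as the items returned by PyMuPDF's page.get_text("words").
Word = Tuple[float, float, float, float, str, int, int, int]
BBox = Tuple[float, float, float, float]


def phrase_bbox(words: List[Word], phrase: str) -> Optional[BBox]:
    """Bounding box of the first run of consecutive words equal to `phrase`."""
    target = phrase.split()
    if not target:
        return None
    tokens = [w[4] for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            run = words[i:i + len(target)]
            # Union of the word boxes in the matched run.
            return (
                min(w[0] for w in run),
                min(w[1] for w in run),
                max(w[2] for w in run),
                max(w[3] for w in run),
            )
    return None
```

Matching here is exact and whitespace-tokenized; OCR output usually needs some normalization (case, punctuation) before a lookup like this is reliable.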
## 🔐 Security Considerations
1. **API Keys**: Never commit API keys to version control. Use environment variables.
2. **PII Detection**: High-risk PII triggers CONFIDENTIAL classification.
3. **Audit Trail**: All decisions are logged to SQLite and Solana blockchain.
4. **HITL Review**: Human oversight for low-confidence or mismatched validations.
5. **Safety Checks**: UNSAFE content is detected first and rejected immediately.
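
One common way to link a local log row to an on-chain entry is to store the same digest in both places. The project's SQLite schema and the data it writes to Solana are not shown in this README, so the `audit_log` table, its columns, and the helper names below are assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


def record_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def log_classification(conn: sqlite3.Connection, record: dict) -> str:
    """Insert one classification row and return its digest."""
    # Hypothetical schema; the real audit logger may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        "document_id TEXT PRIMARY KEY, category TEXT, "
        "confidence REAL, record_hash TEXT)"
    )
    digest = record_hash(record)
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
        (record["document_id"], record["classification"],
         record["confidence"], digest),
    )
    conn.commit()
    return digest
```

Recomputing `record_hash` later and comparing it with the stored value is a cheap way to detect a row that was edited after the fact.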
## 📈 Performance Metrics
- **Processing Speed**: ~5-15 seconds per document (depends on page count)
- **Auto-Approval Rate**: Target 85%+ with dual validation
- **Confidence Threshold**: 90% for auto-approval
## 🐛 Troubleshooting
### "Tesseract not found"
Install Tesseract OCR (see Installation section)
### "File processing failed"
Check that the PDF is not corrupted or password-protected
### "Blockchain recording error"
The system will create a simulated transaction hash if Solana devnet is unavailable. This is normal for demo purposes.
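
A fallback of this kind can be sketched as a wrapper around whatever function submits the transaction. Here `submit` is a hypothetical callable standing in for the project's Solana client call, and the `SIMULATED_` prefix is an assumption chosen so that simulated entries are never mistaken for real transaction signatures.

```python
import hashlib
import time
from typing import Callable


def record_with_fallback(payload: bytes, submit: Callable[[bytes], str]) -> dict:
    """Try the on-chain submit; fall back to a clearly labelled simulated hash."""
    try:
        return {"tx_hash": submit(payload), "simulated": False}
    except Exception as exc:  # e.g. network errors, RPC failures, timeouts
        # Mix in the current time so repeated failures yield distinct values.
        digest = hashlib.sha256(payload + str(time.time()).encode()).hexdigest()
        return {
            "tx_hash": f"SIMULATED_{digest}",
            "simulated": True,
            "error": str(exc),
        }
```

Keeping the `simulated` flag in the stored record makes it possible to filter out demo entries when the audit trail is reviewed.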
### Gemini API errors
- Check API key validity
- Verify quota/billing is enabled
- Ensure Gemini 2.0 Flash access is enabled
## 🤝 HITL Feedback Loop
The system implements a continuous improvement cycle:
1. Document is classified by AI
2. If confidence < 90% or dual validation mismatch → HITL queue
3. SME reviews and corrects classification
4. Correction is added to `policies/few_shot_examples.json`
5. Policy RAG is updated automatically
6. Future similar documents benefit from the correction
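
The routing rule in step 2 can be expressed in a few lines. This sketch assumes each validation layer reports its own category and confidence and that both must reach the 90% threshold; the README only states the threshold and the mismatch condition, so `LayerResult`, `route`, and the use of the lower of the two confidences are assumptions.

```python
from dataclasses import dataclass

AUTO_APPROVE_THRESHOLD = 0.90  # the 90% threshold stated in this README


@dataclass
class LayerResult:
    category: str
    confidence: float


def route(primary: LayerResult, secondary: LayerResult) -> str:
    """Return 'auto_approved' or 'hitl_queue' for a dual-validated result."""
    agree = primary.category == secondary.category
    # Assumption: the weaker of the two confidences decides.
    confident = min(primary.confidence, secondary.confidence) >= AUTO_APPROVE_THRESHOLD
    return "auto_approved" if agree and confident else "hitl_queue"
```

For example, two layers that both return PUBLIC at 0.95 and 0.92 are auto-approved, while a CONFIDENTIAL/SENSITIVE split goes to the review queue regardless of confidence.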
## 📝 License
This is a demonstration project for educational purposes.
## 🙏 Acknowledgments
- **Google Gemini**: Advanced AI classification engine
- **Solana**: Blockchain infrastructure for audit trails
- **Tesseract OCR**: Open-source OCR engine
## 📞 Support
For issues or questions about this implementation, please review:
1. This README
2. The code comments (extensively documented)
3. The policy JSON files in `policies/` directory
---
**Built with ❤️ using Gemini 2.0 Flash and Solana**