RAG-Qoder: Document Ingestion and Retrieval System

A document ingestion and retrieval system that processes PDFs, images, and office documents through a multi-stage pipeline covering OCR, text extraction, chunking, embedding generation, and indexing. The system supports multi-language processing with specialized handling for Arabic.

Features

  • Multi-format Document Processing: Support for PDFs, images, and office documents
  • Intelligent OCR Routing: Automatic language detection with specialized Arabic OCR processing
  • Multi-stage Processing Pipeline: Preprocessing → OCR → Table Extraction → Chunking → Embedding → Indexing
  • Duplicate Detection: Hash-based duplicate prevention
  • Hybrid Search: Combined text and vector search capabilities
  • Administrative Interface: Dataset management, monitoring, and document reprocessing
  • Robust Error Handling: Automatic retry logic with exponential backoff
  • Scalable Architecture: Microservices-based design with message queues

Architecture

The system follows a microservices architecture with the following components:

Core Services

  • Upload Service: Handles document uploads, validation, and duplicate detection
  • Processing Orchestrator: Coordinates multi-stage document processing
  • Search Service: Provides hybrid search capabilities
  • Admin Service: Manages datasets and system monitoring

Processing Engines

  • Language Detector: Identifies document language (Arabic, English, mixed)
  • OCR Router: Routes to appropriate OCR engine based on language
  • Text Normalizer: Unicode normalization and diacritic handling
  • Chunking Service: Segments text for retrieval optimization
  • Embedding Service: Generates semantic embeddings
  • Table Extraction: Detects and extracts table structures

Storage Layer

  • PostgreSQL: Document metadata, processing status, chunks, tables
  • Vector Database: Embeddings for semantic search (Qdrant/Weaviate)
  • Blob Storage: Original files and processing artifacts
  • Search Index: Full-text search (Elasticsearch)
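
The service boundaries above are declared as TypeScript interfaces in src/interfaces/services.ts. The sketch below is illustrative only; the interface and method names are assumptions, not the actual declarations.

// Illustrative sketch of processing-stage interfaces; names and signatures are assumptions.
export interface LanguageDetector {
  // Returns the dominant language of the extracted text.
  detect(text: string): Promise<"ar" | "en" | "mixed">;
}

export interface OcrRouter {
  // Picks an OCR endpoint (default or Arabic) based on the detected language
  // and returns the recognized text for a page image.
  recognize(pageImage: Buffer, language: string): Promise<string>;
}

export interface ChunkingService {
  // Splits normalized text into retrieval-sized chunks.
  chunk(text: string, maxTokens: number): Promise<string[]>;
}

export interface EmbeddingService {
  // Generates one embedding vector per chunk via the configured embedding endpoint.
  embed(chunks: string[], model: string): Promise<number[][]>;
}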

Prerequisites

  • Node.js >= 18.x
  • PostgreSQL >= 14.x
  • Redis >= 6.x
  • Docker (optional, for containerized services)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd RAG-Qoder
  2. Install dependencies

    npm install
  3. Configure environment

    cp .env.example .env
    # Edit .env with your configuration
  4. Set up database

    # Create PostgreSQL database
    createdb rag_qoder
    
    # Run migrations
    psql -d rag_qoder -f src/database/schema.sql
  5. Start Redis

    redis-server

Configuration

Edit .env file with your settings:

Database Configuration

DB_HOST=localhost
DB_PORT=5432
DB_NAME=rag_qoder
DB_USER=postgres
DB_PASSWORD=your_password

OCR Engines

DEFAULT_OCR_ENDPOINT=http://localhost:5000/ocr
ARABIC_OCR_ENDPOINT=http://localhost:5001/ocr

Embedding Service

EMBEDDING_SERVICE_ENDPOINT=http://localhost:8000/embed
EMBEDDING_MODEL_DEFAULT=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_MODEL_ARABIC=sentence-transformers/paraphrase-multilingual-mpnet-base-v2
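
Configuration is centralized in src/config.ts. A minimal sketch of how the variables above could be loaded (the actual file may validate and structure them differently):

// Minimal configuration sketch; the real src/config.ts may differ.
// Values fall back to the defaults documented above when a variable is unset.
export const config = {
  db: {
    host: process.env.DB_HOST ?? "localhost",
    port: Number(process.env.DB_PORT ?? 5432),
    name: process.env.DB_NAME ?? "rag_qoder",
    user: process.env.DB_USER ?? "postgres",
    password: process.env.DB_PASSWORD ?? "",
  },
  ocr: {
    defaultEndpoint: process.env.DEFAULT_OCR_ENDPOINT ?? "http://localhost:5000/ocr",
    arabicEndpoint: process.env.ARABIC_OCR_ENDPOINT ?? "http://localhost:5001/ocr",
  },
  embedding: {
    endpoint: process.env.EMBEDDING_SERVICE_ENDPOINT ?? "http://localhost:8000/embed",
    defaultModel: process.env.EMBEDDING_MODEL_DEFAULT ?? "sentence-transformers/all-MiniLM-L6-v2",
    arabicModel: process.env.EMBEDDING_MODEL_ARABIC ?? "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
  },
};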

Usage

Development Mode

npm run dev

Production Mode

npm run build
npm start

Running Tests

npm test              # Run all tests
npm run test:watch   # Run tests in watch mode

Tests currently cover the configuration, validation, and hashing utilities (33 tests passing).

API Endpoints

Upload Document

POST /api/v1/upload
Content-Type: multipart/form-data

{
  "file": <binary>,
  "dataset": "my-dataset",
  "metadata": {}
}
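
As an example, a client could submit a document with Node 18's built-in fetch and FormData. The base URL, port, and response shape below are assumptions:

// Hypothetical client call to the upload endpoint (Node 18+).
import { readFile } from "node:fs/promises";
import { basename } from "node:path";

async function uploadDocument(filePath: string, dataset: string) {
  const form = new FormData();
  form.append("file", new Blob([await readFile(filePath)]), basename(filePath));
  form.append("dataset", dataset);
  form.append("metadata", JSON.stringify({}));

  // Base URL and port are assumptions; adjust to your deployment.
  const res = await fetch("http://localhost:3000/api/v1/upload", {
    method: "POST",
    body: form, // fetch sets the multipart boundary automatically
  });
  return res.json(); // e.g. a document id and initial status; exact shape is an assumption
}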

Search Documents

POST /api/v1/search
Content-Type: application/json

{
  "text": "search query",
  "datasets": ["dataset1"],
  "limit": 10
}
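
A corresponding search call (again, base URL and response shape are assumptions):

// Hypothetical client call to the hybrid search endpoint.
async function search(text: string, datasets: string[], limit = 10) {
  const res = await fetch("http://localhost:3000/api/v1/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, datasets, limit }),
  });
  return res.json(); // ranked chunks/documents; exact shape is an assumption
}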

Get Document Status

GET /api/v1/documents/:id/status
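
Because processing is asynchronous, a client would typically poll this endpoint until the document reaches a terminal state. The status values below are assumptions, not the actual API contract:

// Hypothetical polling loop; base URL and status values are assumptions.
async function waitForProcessing(documentId: string, intervalMs = 5000) {
  for (;;) {
    const res = await fetch(`http://localhost:3000/api/v1/documents/${documentId}/status`);
    const { status } = await res.json();
    if (status === "completed" || status === "failed") return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}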

List Datasets

GET /api/v1/admin/datasets

Project Structure

RAG-Qoder/
├── src/
│   ├── index.ts                  # Application entry point
│   ├── config.ts                 # Configuration management
│   ├── types/                    # TypeScript type definitions
│   │   └── index.ts
│   ├── interfaces/               # Service interfaces
│   │   └── services.ts
│   ├── database/                 # Database layer
│   │   ├── schema.sql
│   │   ├── connection.ts
│   │   └── migrations/
│   │       └── run-migrations.ts
│   └── utils/                    # Utility functions
│       ├── logger.ts
│       ├── hash.ts
│       └── validation.ts
├── logs/                         # Application logs
├── uploads/                      # Uploaded files
├── temp/                         # Temporary processing files
├── storage/                      # Blob storage
├── dist/                         # Compiled JavaScript
├── package.json
├── tsconfig.json
└── .env.example

# To be implemented:
# - src/services/         # Service implementations
# - src/routes/           # API routes
# - src/workers/          # Background workers
# - tests/                # Test files
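
Duplicate detection is hash-based: a content hash is computed at upload time and compared against previously ingested documents. The helper below is an illustrative sketch, not the actual code in src/utils/hash.ts:

// Illustrative content-hash helper for duplicate detection (not the actual src/utils/hash.ts).
import { createHash } from "node:crypto";

export function contentHash(file: Buffer): string {
  // Stable SHA-256 digest of the raw bytes; identical uploads produce identical hashes.
  return createHash("sha256").update(file).digest("hex");
}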

Development Workflow

Task Implementation

The project follows a structured task-based implementation plan outlined in tasks.md:

  1. ✅ Set up project structure and core interfaces
  2. ⏳ Implement document upload and metadata management
  3. ⏳ Create processing orchestration system
  4. ⏳ Develop language detection and OCR routing
  5. ⏳ Implement text chunking and segmentation
  6. ⏳ Develop embedding and vector search system
  7. ⏳ Create table extraction system
  8. ⏳ Develop search and retrieval system
  9. ⏳ Implement administrative interface
  10. ⏳ Add monitoring and observability
  11. ⏳ Testing and validation

Processing Pipeline

Document Upload
    ↓
Validation & Duplicate Check
    ↓
Queue for Processing
    ↓
Preprocessing (PDF → Images)
    ↓
Language Detection
    ↓
OCR Routing (Arabic/Default)
    ↓
Text Normalization
    ↓
Table Extraction (Parallel)
    ↓
Text Chunking
    ↓
Embedding Generation
    ↓
Vector & Text Indexing
    ↓
Processing Complete
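
Conceptually, the orchestrator moves a document through these stages in order, with table extraction running in parallel to the text path. The sketch below is illustrative; the function names and signatures are assumptions:

// Simplified orchestration sketch; every member of `deps` stands in for a real service.
interface PipelineDeps {
  preprocess(id: string): Promise<Buffer[]>;                  // PDF → page images
  detectLanguage(pages: Buffer[]): Promise<string>;
  runOcr(pages: Buffer[], language: string): Promise<string>; // routed by language
  normalize(text: string, language: string): string;
  extractTables(pages: Buffer[]): Promise<unknown[]>;         // parallel branch
  chunkText(text: string): Promise<string[]>;
  embed(chunks: string[], language: string): Promise<number[][]>;
  index(id: string, chunks: string[], vectors: number[][], tables: unknown[]): Promise<void>;
}

async function processDocument(id: string, deps: PipelineDeps): Promise<void> {
  const pages = await deps.preprocess(id);
  const language = await deps.detectLanguage(pages);
  const [rawText, tables] = await Promise.all([
    deps.runOcr(pages, language),
    deps.extractTables(pages), // table extraction runs alongside OCR
  ]);
  const text = deps.normalize(rawText, language);
  const chunks = await deps.chunkText(text);
  const vectors = await deps.embed(chunks, language);
  await deps.index(id, chunks, vectors, tables);
}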

Language Support

Arabic Text Processing

  • Automatic detection of Arabic content
  • Specialized Arabic OCR engine routing
  • Optional diacritic removal
  • Arabic-aware text segmentation
  • Multilingual embedding models
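
The normalization step combines Unicode normalization with optional removal of Arabic diacritics (tashkeel). A minimal sketch, not the project's actual text normalizer:

// Minimal Arabic normalization sketch (not the project's actual normalizer).
const ARABIC_DIACRITICS = /[\u064B-\u0652\u0670]/g; // short vowels, shadda, sukun, superscript alef

export function normalizeArabic(text: string, removeDiacritics = true): string {
  const normalized = text.normalize("NFC"); // canonical Unicode normalization
  return removeDiacritics ? normalized.replace(ARABIC_DIACRITICS, "") : normalized;
}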

English and Other Languages

  • Default OCR engine for non-Arabic content
  • Standard text processing pipeline
  • Language-specific embedding models

Error Handling

The system implements robust error handling with:

  • Automatic Retry: Transient errors trigger retry with exponential backoff
  • Error Classification: Distinguishes between transient and permanent errors
  • State Preservation: Processing state persisted for recovery
  • Detailed Logging: Comprehensive error information for debugging
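
A retry helper along these lines (illustrative, not the project's actual implementation) captures the backoff and error-classification behavior described above:

// Illustrative retry helper with exponential backoff; not the project's actual code.
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean, // error classification hook
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxAttempts) throw err; // permanent error or out of retries
      const delay = baseDelayMs * 2 ** (attempt - 1);             // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}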

Monitoring

  • Health check endpoint: GET /health
  • Processing status tracking per document
  • Job history and failure tracking
  • Performance metrics collection

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

MIT

Support

For issues and questions, please open an issue on GitHub.
