PDF Q&A System with RAG

A production-ready PDF question-answering system with semantic search and LLM-powered answers. Works with any LLM provider (OpenAI, Ollama, etc.), or with no LLM at all (semantic search only).

Features

  • Semantic Search - Find relevant content by meaning, not keywords
  • Model-Agnostic RAG - Works with 6+ LLM providers (OpenAI, Ollama, Claude, etc.)
  • Local-First - Run completely offline with local models
  • Clean Architecture - Modular, testable, production-ready code

Quick Start

1. Setup

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Run Phase 1: Semantic Search (No LLM)

python main.py data/sample.pdf

Returns relevant text chunks for your questions.

3. Run Phase 2: RAG with LLM (Natural Language Answers)

python main_rag.py data/sample.pdf

Generates natural language answers using an LLM.

LLM Options

Local Models (Recommended)

Ollama - Easiest local setup:

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.2
python main_rag.py data/sample.pdf

OpenAI gpt-oss-20b - Open-weight model, runs via HuggingFace Transformers:

pip install transformers accelerate
python main_rag.py data/sample.pdf
# Select HuggingFace → openai/gpt-oss-20b

Cloud APIs

export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."

python main_rag.py data/sample.pdf

Project Structure

pdf-qa-system/
├── main.py              # Semantic search CLI
├── main_rag.py          # RAG with LLM CLI
├── test.py              # Quick test
├── requirements.txt     # Dependencies
│
├── data/                # Your PDF files
│   └── sample.pdf
│
├── src/                 # Core modules
│   ├── extract.py       # PDF → text
│   ├── chunk.py         # Text → chunks
│   ├── embed.py         # Chunks → vectors
│   ├── vector_store.py  # Vector database
│   ├── query.py         # Search interface
│   ├── llm_providers.py # LLM integrations
│   └── rag.py           # RAG pipeline
│
└── docs/                # Documentation
    └── ARCHITECTURE.md  # Technical details

Usage Examples

Semantic Search

$ python main.py data/sample.pdf
> What are the benefits?
[Shows 3 most relevant text chunks]

RAG with LLM

$ python main_rag.py data/sample.pdf
> What are the benefits?
💡 Based on the document, the main benefits include:
1. Wellness app with health tracking
2. Coverage up to Rs. 10 Lakhs
3. Accidental death coverage
...

Supported LLM Providers

Provider        Cost   Privacy      Setup
Ollama          Free   100% Local   ollama pull llama3.2
gpt-oss-20b     Free   100% Local   Auto-downloads
OpenAI          Paid   Cloud        Set OPENAI_API_KEY
Anthropic       Paid   Cloud        Set ANTHROPIC_API_KEY
HuggingFace     Free   100% Local   Auto-downloads
Local Server    Free   100% Local   Start vLLM/text-gen-webui
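
The get_available_llm() helper (see "Using as a Library" below) picks whichever provider is usable on your machine. A minimal sketch of a fallback order like this, where the function body, detection checks, and model names are illustrative rather than the actual src/llm_providers.py logic:

import os
import shutil

def pick_available_llm():
    # Illustrative detection order; the real logic in src/llm_providers.py may differ.
    if shutil.which("ollama"):                    # local Ollama binary installed
        return ("ollama", "llama3.2")
    if os.environ.get("OPENAI_API_KEY"):          # cloud: OpenAI
        return ("openai", "gpt-4o-mini")          # illustrative model name
    if os.environ.get("ANTHROPIC_API_KEY"):       # cloud: Anthropic
        return ("anthropic", "claude-sonnet")     # illustrative model name
    # Fall back to a local HuggingFace model (auto-downloads on first use).
    return ("huggingface", "openai/gpt-oss-20b")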

Using as a Library

from src import PDFParser, TextChunker, EmbeddingModel, VectorStore, RAGInterface

# Process PDF
parser = PDFParser()
text = parser.extract_text("document.pdf")

# Create embeddings
chunker = TextChunker()
chunks = chunker.chunk_text(text)
embedder = EmbeddingModel()
embeddings = embedder.embed_batch(chunks)

# Store in vector DB
store = VectorStore()
store.add_chunks(chunks, embeddings)

# Query
from src import get_available_llm
rag = RAGInterface(embedder, store, llm=get_available_llm())
result = rag.answer_question("What is this about?")
print(result['answer'])

Configuration

Edit settings in the respective modules:

  • Chunk size: src/chunk.py → TextChunker(chunk_size=500)
  • Number of results: src/query.py → QueryInterface(n_results=3)
  • Embedding model: src/embed.py → EmbeddingModel(model_name="...")
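
The same settings can be passed directly when using the modules as a library, for example (the model id and the QueryInterface import path are assumptions, shown for illustration):

from src import TextChunker, EmbeddingModel
from src.query import QueryInterface                      # assumed import path

chunker = TextChunker(chunk_size=500)                     # characters per chunk
embedder = EmbeddingModel(model_name="all-MiniLM-L6-v2")  # illustrative model id
query = QueryInterface(n_results=3)                       # top-3 chunks per question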

Requirements

  • Python 3.8+
  • 8GB RAM minimum (16GB+ recommended for large models)
  • 10GB disk space (for models)

Architecture

Retrieval Pipeline (Phase 1)

PDF → Extract → Chunk → Embed → Vector Store → Query → Results
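
Concretely, a query embeds the question with the same model used for the chunks, then retrieves the nearest stored chunks. A minimal sketch using the library classes (the search method name on VectorStore is an assumption; see src/query.py for the real interface):

from src import EmbeddingModel, VectorStore

embedder = EmbeddingModel()
store = VectorStore()  # assumes chunks were added earlier via store.add_chunks(...)

question_vec = embedder.embed_batch(["What are the benefits?"])[0]
top_chunks = store.search(question_vec, n_results=3)  # hypothetical method name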

RAG Pipeline (Phase 2)

PDF → Extract → Chunk → Embed → Vector Store
                                    ↓
Question → Embed → Search → Top Chunks → Prompt → LLM → Answer
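
The "Top Chunks → Prompt" step is plain string assembly: retrieved chunks become a context block and the question is appended. A rough sketch, where the template and the llm.generate call are illustrative (the actual version lives in src/rag.py):

from typing import List

def build_prompt(question: str, top_chunks: List[str]) -> str:
    # Join the retrieved chunks into a single context block.
    context = "\n\n".join(top_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

# answer = llm.generate(build_prompt(question, top_chunks))  # hypothetical LLM call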

Documentation

See docs/ARCHITECTURE.md for technical details.

License

MIT License

Acknowledgments

  • Docling - PDF parsing
  • Sentence Transformers - Embeddings
  • Chroma - Vector database
  • LangChain - Text splitting
  • Ollama - Local LLM runtime
