Graph-RAG Inference Engine for Automated Business Requirements Generation
Built at HackFest 2.0, GDG Cloud New Delhi. 50 hours, zero sleep, fourth place.
I am Kavin Thakur, a second-year CS student specialising in ML/AI at Chitkara University, currently working as a Visiting Researcher (AI/ML) at National Ilan University in Taiwan. My research there focuses on real-time computer vision pipelines and edge AI inference on NVIDIA RTX 6000 HPC clusters. What I love about this hackathon project is how naturally it integrates everything I work on. Computer vision taught me how to think about feature extraction, RAG is just retrieval with a different modality, and the VRAM engineering is the same constraint-solving I do daily on HPC systems.
Team M-2.5 built this backend from scratch over 50 consecutive hours at HackFest 2.0. Every architectural decision (the knowledge graph integration, the VRAM constraint engineering, the parallel WebSocket streaming, the research paper implementations) was designed and debugged under competition pressure with a live judging deadline.
This is not the first system I have built under pressure. I won INR 1 Lakh at Swasth-a-thon building a medical voice pathology AI on-device with privacy constraints. I reached the national finals at AMD x IIT Bombay AI Sprint in the top 1.5% of 4000+ entries building an LLM scheduling agent on MI300 GPUs. I lead IEEE CIET as Technical Head and serve as AI/ML Executive at GDG Chitkara. I also served as a Data Science Intern at Infomo Global after my first semester.
I care about building things that actually work, not demos, not API wrappers, but real pipelines with real engineering decisions and real constraints behind them.
If you are from Turgon AI and reading this while evaluating internship candidates, I would genuinely love to talk. Every component of this codebase is something I can explain, defend and improve. I have context on every decision made at 3 AM when it mattered.
NOTE: The backend requires 24 GB of VRAM, so it is currently offline!
Frontend (live): brd-agent-cmka.vercel.app
Backend repo (this one): 100% Python. Frontend repo: auraCodesKM/brdAgent (private, DM for access)
BRD Agent transforms unstructured corporate communications (emails, meeting transcripts, Slack exports) into complete, professional Business Requirements Documents. Eight sections. Under three minutes. Fully local. Zero external API calls. Zero data leakage.
This repository is the backend inference engine, 100% Python. The frontend is Next.js with WebSocket streaming and Mermaid diagram rendering.
Most RAG systems retrieve text and summarise it. That produces generic output that paraphrases the input.
BRD generation requires something fundamentally different. The system must understand who said what, detect when two people contradict each other across separate documents, and produce formally structured output with every claim attributed to its source.
We solved this with three components working together.
LightRAG Knowledge Graph stores entities and relationships, not just text chunks. Kenneth Lay becomes a node. SSO becomes a node. The edge between them records that Kenneth Lay mandated SSO. When Jeff Skilling later states the SSO deadline is January 15th and Kenneth Lay said December 31st, the graph detects that contradiction structurally through traversal. Prompt engineering alone cannot do this.
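The structural detection described above can be sketched with plain dictionaries. This is not LightRAG's actual storage or API; `find_conflicts` and the edge tuples are illustrative stand-ins for graph traversal over (speaker, entity, attribute, value) claims:

```python
from collections import defaultdict

def find_conflicts(edges):
    """Group claims by (entity, attribute); flag any group with more
    than one distinct value as a structural contradiction."""
    claims = defaultdict(set)
    sources = defaultdict(list)
    for speaker, entity, attribute, value in edges:
        claims[(entity, attribute)].add(value)
        sources[(entity, attribute)].append((speaker, value))
    # Only keep groups where stakeholders disagree on the value.
    return {key: sources[key] for key, values in claims.items() if len(values) > 1}

edges = [
    ("Kenneth Lay",   "SSO", "deadline", "December 31st"),
    ("Jeff Skilling", "SSO", "deadline", "January 15th"),
    ("Kenneth Lay",   "SSO", "mandated", "true"),
]
conflicts = find_conflicts(edges)
# The ("SSO", "deadline") group is flagged; the uncontested
# ("SSO", "mandated") claim is not.
```

The key point is that the contradiction is found by comparing edges between the same node pair, not by hoping a prompt notices two dates disagree.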
A-RAG Parallel Fetch hits FAISS dense vectors and the LightRAG graph simultaneously. Dense retrieval finds semantically similar chunks. Graph retrieval finds entity relationships. BGE-M3 cross-encoder reranks the merged results by relevance to the specific BRD section being generated.
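A minimal asyncio sketch of that fetch-and-rerank flow. The two retrievers and the score-sort "reranker" are stand-ins for FAISS, LightRAG, and the BGE cross-encoder, with hardcoded results so the shape of the pipeline is visible:

```python
import asyncio

async def dense_fetch(query):
    # Stand-in for FAISS dense similarity search.
    await asyncio.sleep(0)
    return [("chunk: SSO rollout email thread", 0.82),
            ("chunk: unrelated budget memo", 0.41)]

async def graph_fetch(query):
    # Stand-in for a LightRAG entity/relationship lookup.
    await asyncio.sleep(0)
    return [("edge: Kenneth Lay -mandated-> SSO", 0.90)]

def rerank(query, candidates, top_k=2):
    # Stand-in for the BGE cross-encoder: here we just sort by the
    # retriever's own score; the real model rescores every candidate
    # against the specific BRD section prompt.
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

async def a_rag_fetch(query):
    # Both retrievers run concurrently, then the merged pool is reranked.
    dense, graph = await asyncio.gather(dense_fetch(query), graph_fetch(query))
    return rerank(query, dense + graph)

results = asyncio.run(a_rag_fetch("SSO authentication requirements"))
```

`asyncio.gather` is what makes the fetch "parallel": neither retriever waits on the other before the merge step runs.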
Four research techniques applied at generation time improve output quality measurably over baseline prompting, explained in full below.
This section matters to me because these were not papers I cited for credibility. I read them during the hackathon because we had specific problems to solve and these papers had the answers. Each one directly changed how we built something.
Lost in the Middle: How Language Models Use Long Contexts. Liu et al., Stanford / UC Berkeley, 2023. arxiv.org/abs/2307.03172
What it found: When you give a language model a long context window, it pays strong attention to content at the very beginning and very end, and systematically ignores content in the middle. The paper measured this across multiple models and tasks and the degradation was severe and consistent.
Why we read it: Our BRD sections require the model to process 35+ retrieved entities plus relationship context plus formatting instructions simultaneously. We were seeing the model produce well-formatted openings and then gradually lose structure mid-section, sometimes reverting to raw graph output halfway through a requirement list.
What we changed: We restructured every prompt so that the most important formatting instruction appears twice, once before the context block and once after it. The retrieved knowledge graph context sits in the middle. The model now attends to the instruction at both ends of the context window. Section structure degradation dropped immediately.
Leviathan, Kalman, Matias. Google, 2025 arxiv.org/abs/2512.14982
What it found: Repeating the input prompt improves performance across popular models (Gemini, GPT, Claude, DeepSeek) without increasing the number of generated tokens or adding any latency cost. The finding is simple and surprisingly powerful: just send the prompt twice.
Why we read it: We were seeing formatting drift on our longest sections. The model would start a Functional Requirements table correctly and then gradually lose structure mid-section, dropping acceptance criteria columns, skipping source attribution, reverting to prose. The context block was too long and the original instruction was washing out.
What we changed: Every section prompt now sends the core instruction twice, once before the context block and once after it, immediately before the generation boundary. The instruction the model sees last is the one it attends to most strongly. Combined with the sandwich pattern from the Lost in the Middle finding, formatting consistency across all eight sections became reliable rather than probabilistic. Zero added latency, measurable improvement in section structure.
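The sandwich-plus-repetition pattern reduces to a small prompt builder. `sandwich_prompt` is a hypothetical helper written for this README, not code from the repo, but it shows the exact layout: instruction, context block, instruction again at the generation boundary:

```python
def sandwich_prompt(instruction, context):
    """Place the critical instruction both before and after the context
    block, so the model attends to it at both ends of the window
    (Lost in the Middle + prompt repetition)."""
    return "\n\n".join([
        instruction,
        "--- RETRIEVED CONTEXT ---",
        context,
        "--- END CONTEXT ---",
        instruction,  # repeated immediately before generation begins
    ])

prompt = sandwich_prompt(
    "Write the Functional Requirements section as a Markdown table "
    "with columns: ID, Requirement, Acceptance Criteria, Source.",
    "Kenneth Lay mandated SSO. Jeff Skilling proposed January 15th...",
)
```

The retrieved knowledge graph context, the part most likely to be ignored mid-window, sits in the middle; the instruction the model sees last is the one it attends to most strongly.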
Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang et al., Google Brain, 2022. arxiv.org/abs/2203.11171
What it found: When a model generates the same answer multiple times with different random seeds, correct reasoning paths converge and incorrect hallucinated paths diverge. Taking the majority answer or intersection across multiple generations is significantly more reliable than any single generation pass.
Why we read it: Functional Requirements was our most hallucination-prone section. The model would invent plausible-sounding requirements (specific latency numbers, database capacities, compliance standards) that had no basis in the source emails. One generation pass was not enough to catch these because they sounded entirely reasonable.
What we changed: Functional Requirements now generates twice at temperature 0.3 and temperature 0.7. A merge pass on Qwen 7B reads both outputs and retains only requirements that appear in both. A requirement that appears in only one generation is almost certainly hallucinated. The section went from regularly inventing 2-3 fabricated requirements to zero across all our test runs.
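In the real system that merge pass runs on Qwen 7B; the sketch below approximates it with a literal set intersection to make the logic concrete. `normalize` and `merge_generations` are illustrative stand-ins:

```python
import re

def normalize(req):
    """Whitespace- and case-insensitive key so trivially different
    phrasings of the same requirement still match."""
    return re.sub(r"\s+", " ", req.lower()).strip(" .")

def merge_generations(gen_a, gen_b):
    """Keep only requirements present in BOTH generations; a requirement
    appearing in one pass only is treated as a likely hallucination."""
    kept_b = {normalize(r) for r in gen_b}
    return [r for r in gen_a if normalize(r) in kept_b]

run_a = ["Platform must support 50,000 concurrent users.",
         "All transactions encrypted with AES-256.",
         "Database must scale to 40 TB."]            # only in run A: dropped
run_b = ["Platform must support 50,000  concurrent users",
         "All transactions encrypted with AES-256."]
merged = merge_generations(run_a, run_b)
# merged keeps the two requirements that both runs agree on
```

The LLM merge pass is more forgiving than exact intersection (it matches paraphrases), but the selection principle is the same: agreement across independent samples is the filter.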
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting. Adams et al., Columbia University / Salesforce Research, 2023. arxiv.org/abs/2309.04269
What it found: Asking a model to write a dense, information-rich document in one pass produces lower-quality output than separating the process into distinct passes: first extract the facts, then write prose from those facts. The two-pass approach produces documents that are more information-dense, more accurate, and better grounded in the source material.
Why we read it: We were running all generation on Qwen 32B and hitting two problems simultaneously. First, generation was slow: 20+ seconds per section with the full model doing both fact extraction and writing in one pass. Second, output quality was inconsistent because the model was context-switching between retrieval reasoning and formal writing within the same generation pass.
What we changed: Pass 1 runs on Qwen 7B (lightweight, 80 tokens per second) and extracts concrete facts: names, dates, numbers, and stated requirements from the knowledge graph context. Pass 2 runs on Qwen 32B and writes formal BRD prose from that clean fact sheet rather than from raw email text. The result was 30% faster total generation, because 7B is far faster than 32B for the extraction pass, and the 32B writing quality improved because it was working from structured facts rather than noisy communications.
This also solved our VRAM problem. 7B takes 5GB VRAM. 32B takes 21GB. Running them sequentially with an asyncio semaphore(3) keeps peak usage at 26GB, just within what Ollama can gracefully manage on the RTX 4090.
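The two-pass flow under a shared semaphore can be sketched as below. `call_model` is a stub standing in for the actual Ollama HTTP call, and the section list is abbreviated to three of the eight; only the orchestration pattern is real:

```python
import asyncio

# Caps how many inference contexts Ollama holds at once, keeping
# peak VRAM bounded regardless of how many sections are in flight.
INFERENCE_SLOTS = asyncio.Semaphore(3)

async def call_model(model, prompt):
    # Stand-in for an HTTP call to Ollama; returns a tagged stub
    # response instead of real tokens.
    async with INFERENCE_SLOTS:
        await asyncio.sleep(0)
        return f"[{model}] draft"

async def generate_section(section, context):
    # Pass 1: fast fact extraction on the small model.
    facts = await call_model("qwen2.5:7b", f"Extract facts for {section}:\n{context}")
    # Pass 2: formal BRD prose on the large model, from the fact sheet.
    return section, await call_model("qwen2.5:32b", f"Write {section} from:\n{facts}")

async def generate_brd(context):
    sections = ["Executive Summary", "Scope", "Functional Requirements"]
    return dict(await asyncio.gather(
        *(generate_section(s, context) for s in sections)))

brd = asyncio.run(generate_brd("…knowledge graph context…"))
```

All sections are launched concurrently with `asyncio.gather`, but the semaphore ensures that at most three model contexts are resident at any moment, which is what keeps the peak under the 4090's 24 GB plus a small Ollama overflow.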
Qwen 2.5 32B per active inference context:

```
Model weights:       18 GB VRAM
KV cache:             3 GB VRAM
Total:               21 GB VRAM
RTX 4090 available:  24 GB VRAM
```

Naive approach, all 8 sections simultaneously:

```
8 x 21 GB = 168 GB required
Available =  24 GB
Result    = immediate OOM crash
```

Semaphore limiting alone was insufficient:

```
2 concurrent x 21 GB = 42 GB
Overflow to RAM      = 18 GB
Result = 8 t/s degraded throughput vs 25 t/s target
```

The Chain of Density two-pass architecture solved it:

```
Pass 1: Qwen 7B fact extraction          VRAM:  5 GB   Speed: 80 t/s
Pass 2: Qwen 32B professional writing    VRAM: 21 GB   Speed: 25 t/s

Peak with asyncio Semaphore(3):
  32B writing:     21 GB
  7B extracting:    5 GB
  Total:           26 GB
  Ollama offload:   2 GB (graceful RAM overflow)

Result: stable, no OOM, full throughput
```
First sections appear in 15 seconds. All 8 sections complete in under 3 minutes. 30% faster than single-model approach.
```mermaid
graph TD
    subgraph Ingestion["Index Time"]
        A["Raw Doc: EML/PDF/CSV"] -->|Text Extraction| B["Clean Markdown"]
        B -->|Semantic Sliding Window| C["Text Chunks"]
        C -->|LightRAG Entity Extraction| D["Knowledge Graph Nodes/Edges"]
        C --> F[("FAISS Dense Vectors")]
        D --> H[("LightRAG Graph DB")]
    end
    subgraph Generation["Query Time"]
        I["Client Query via REST/WS"] --> L["Query Formulation"]
        L --> M["A-RAG Parallel Fetch"]
        M -->|Dense| F
        M -->|Graph| H
        F & H --> N["Context Aggregation"]
        N -->|BGE-M3 Reranking| O["Top K Chunks"]
        O -->|Prompt Repetition| R["Ollama: Qwen 2.5 32B"]
        R -->|WebSocket Streaming| S["Frontend UI"]
    end
```
Honest accounting of pipeline state during judging.
Working during live judging:

- LightRAG knowledge graph: entity extraction, relationship mapping, conflict detection across documents
- Qwen 2.5 32B local inference via Ollama, 25 t/s sustained
- Qwen 2.5 7B fast extraction pass, 80 t/s
- BGE-M3 embeddings, 1024-dimensional dense vectors
- BGE-Reranker-Large cross-encoder reranking
- FAISS vector store, semantic similarity retrieval
- Eight parallel WebSocket streams, simultaneous section generation
- All four research techniques, implemented and active
- Conflict detection: real contradictions found across Enron email threads
- Traceability matrix: JSON requirement-to-source mapping

Graceful fallbacks:

- MinerU vision parsing encountered environment issues and fell back cleanly to standard text extraction
- Redis semantic caching hit free-tier limits; the system routed entirely through LightRAG and FAISS with no measurable performance impact
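The eight-stream fan-out can be sketched with pure asyncio, using a shared queue as a stand-in for the per-section WebSocket connections (three sections shown for brevity; the tokens and the `None` end-of-stream marker are illustrative):

```python
import asyncio

SECTIONS = ["Executive Summary", "Scope", "Functional Requirements"]

async def stream_section(name, out):
    # In the real backend each section streams tokens over its own
    # WebSocket; here a shared queue stands in for the socket.
    for token in f"{name} section body".split():
        await out.put((name, token))
    await out.put((name, None))  # end-of-stream marker

async def run_streams():
    out = asyncio.Queue()
    tasks = [asyncio.create_task(stream_section(s, out)) for s in SECTIONS]
    finished, tokens = 0, []
    while finished < len(SECTIONS):
        name, token = await out.get()
        if token is None:
            finished += 1          # one section's stream completed
        else:
            tokens.append((name, token))
    await asyncio.gather(*tasks)
    return tokens

tokens = asyncio.run(run_streams())
```

Because every section streams independently, the frontend can render the first tokens of one section while the others are still generating, which is why first output appears well before the full document completes.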
```
CONFLICT-1: Mobile Access Scope
Stakeholders: Kenneth Lay vs Jeff Skilling
  Kenneth Lay:   "Board requires mobile access for executives.
                  iOS and Android apps needed immediately."
  Jeff Skilling: "Mobile is out of scope for Phase 1.
                  Engineering cannot deliver mobile AND
                  authentication by December 31st."
Status: [UNRESOLVED] Resolution required by Friday
Risk:   [HIGH]

CONFLICT-2: Authentication Deadline
Stakeholders: Kenneth Lay vs Jeff Skilling
  Kenneth Lay:   Platform launch December 31st
  Jeff Skilling: January 15th proposed for Phase 1 auth only
Status: [UNRESOLVED]
Risk:   [HIGH] On critical path for platform launch
```
```json
{
  "status": "success",
  "matrix": [
    {
      "req_id": "REQ-001",
      "description": "Platform must handle 50,000 concurrent users during peak trading hours.",
      "source": "kenneth.lay@enron.com, Q4 Platform Strategy",
      "owner": "Enron Board"
    },
    {
      "req_id": "REQ-002",
      "description": "Real-time P&L display with data refresh under 500 milliseconds.",
      "source": "louise.kitchen@enron.com, Dashboard Requirements",
      "owner": "Trading Floor"
    },
    {
      "req_id": "REQ-009",
      "description": "All financial transactions encrypted with AES-256.",
      "source": "jeff.skilling@enron.com, Security Mandate",
      "owner": "Security Team"
    }
  ]
}
```

| Component | Technology |
|---|---|
| API Framework | FastAPI + Uvicorn |
| LLM Inference | Ollama, Qwen 2.5 32B |
| Fast Extraction | Ollama, Qwen 2.5 7B |
| Embeddings | BAAI/BGE-M3 |
| Reranking | BGE-Reranker-Large |
| Knowledge Graph | LightRAG |
| Vector Store | FAISS |
| Streaming | WebSockets |
| Cache | Redis (Upstash) |
| Hardware | NVIDIA RTX 4090 24GB |
(Note: While this repository houses the Python/FastAPI backend, the following screenshots showcase our custom-built Next.js frontend. Because the frontend repository is private, we are highlighting the complete user interface here to demonstrate the end-to-end capabilities of the BRD Agent.)
The primary landing page introducing the BRD Agent. This page explains the capabilities of the system, including Advanced RAG operations, local 32B model inference, and the 500K+ Enron dataset processing.
The Dashboard is the central command center for the application. Here, users get a high-level overview of their generation workspace, including active RAG pipelines, processed documents, and quick links to Data Ingestion and BRD Generation.
This visually detailed section of the application breaks down the core components powering the backend, including the Qwen 32B Local Engine, MinerU Ingestion, CRAG Conflict Resolution, Traceability Matrices, and Instant Generation pipelines.
The Data Ingestion interface where users can upload and process unstructured corporate data. The UI displays the active data streams being chunked and parsed into the LightRAG GraphDB and FAISS vector databases.
The onboarding user flow that outlines the intuitive three-step process for generating a complete Business Requirements Document: Select Pipelines, Run Generation, and Review the final structure.
The core BRD Generator workspace. This is where parallel WebSocket connections stream the generated document sections in real-time.
When you press Generate Document, the system displays its internal retrieval and synthesis steps. You can watch word-by-word as the raw Markdown tokens stream down from the backend (with "thinking" steps like Queueing, Retrieving corporate context..., and Synthesizing professional document block...) and are instantly rendered into formatted UI components.
A deep dive into the perfectly rendered Markdown output generated by the RAG pipeline. These detail views showcase the system gracefully handling complex formatting like robust functional specifications, acceptance criteria tables, error handling metrics, and conflict resolutions.
Visualizations of the project architecture, along with the GDG Cloud HackFest contact information from the presentation deck.
Clone and configure:

```bash
git clone https://github.com/auraCodesKM/brdAgentBackend.git
cd brdAgentBackend
cp .env.example .env
python download_models.py
```

Pull models via Ollama:

```bash
ollama pull qwen2.5:32b
ollama pull qwen2.5:7b
```

Run the backend:

```bash
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python -m uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
```

Verify the pipeline:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/conflicts
curl http://localhost:8000/traceability
```

Minimum hardware: a GPU with 24 GB VRAM. The backend server has been taken offline after the competition.
Fourth place. HackFest 2.0, GDG Cloud New Delhi.
50 consecutive hours. Real Enron corpus, 500K emails filtered and ingested semantically. Real knowledge graph with entity conflict detection. Real local inference at 25 t/s. Professional 8-section BRDs generated in under three minutes during live judging.





















