adhilabu/graphRAG

🔗 GraphRAG

A Graph-enhanced Retrieval Augmented Generation system for analyzing Tech Business News

Python 3.11+ LlamaIndex Neo4j License


📖 Overview

GraphRAG combines a knowledge graph with vector embeddings to answer questions about tech business news. Traditional RAG systems rely solely on semantic similarity; GraphRAG also follows entity relationships (for example, which company acquired which), so the retrieved context includes related entities and not only similar text.

✨ Key Features

  • Hybrid Retrieval — Combines vector search (Qdrant) with graph traversal (Neo4j) using RRF fusion
  • Smart Entity Extraction — Automatically extracts companies, people, products, and events from documents
  • 2-Hop Graph Context — Retrieves related entities for comprehensive context
  • Intelligent Reranking — Uses Cohere reranker with Gemini fallback for optimal results
  • PDF Processing — Native support for PDF document ingestion
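
To make the hybrid retrieval step concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF), the method named above for merging vector and graph results. It is an illustration under assumptions, not the project's implementation: the function name, the chunk IDs, and the constant k=60 (a commonly used value) are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["chunk_a", "chunk_b", "chunk_c"]  # hypothetical Qdrant results
graph_hits = ["chunk_b", "chunk_d", "chunk_a"]   # hypothetical Neo4j results

fused = reciprocal_rank_fusion([vector_hits, graph_hits])
print([doc_id for doc_id, _ in fused])
# → ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Because RRF uses only rank positions, it can merge the two result lists even though vector similarity scores and graph-derived scores are not on a comparable scale.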

🏗️ Architecture

flowchart TB
    subgraph Input
        PDF[📄 PDF Documents]
    end
    
    subgraph Processing
        Loader[PDF Loader] --> Chunker[Text Splitter]
        Chunker --> Extractor[Entity Extractor]
        Chunker --> Embedder[Gemini Embeddings]
    end
    
    subgraph Storage
        Extractor --> |Entities & Relations| Neo4j[(Neo4j)]
        Embedder --> |Vectors| Qdrant[(Qdrant)]
    end
    
    subgraph Retrieval
        Query[🔍 User Query] --> VecSearch[Vector Search]
        Query --> GraphSearch[Graph Traversal]
        VecSearch --> Qdrant
        GraphSearch --> Neo4j
        Qdrant --> |Semantic Matches| Fusion[RRF Fusion]
        Neo4j --> |2-Hop Context| Fusion
        Fusion --> Rerank[Reranker]
        Rerank --> Response[📝 Response]
    end
    
    PDF --> Loader
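
The Text Splitter stage in the diagram can be pictured as a sliding window with overlap. The sketch below is a simplification under assumptions: it counts characters, whereas the project's splitter may count tokens or respect sentence boundaries. The function name is hypothetical, and the defaults mirror the CHUNK_SIZE and CHUNK_OVERLAP values listed under Configuration.

```python
def split_text(text, chunk_size=1024, chunk_overlap=128):
    """Split text into windows of chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```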

📂 Project Structure

GraphRAG/
├── 📄 docker-compose.yml     # Neo4j & Qdrant infrastructure
├── 📄 requirements.txt       # Python dependencies
├── 📄 pyproject.toml         # Project metadata
├── 📄 .env.example           # Environment template
├── 📁 src/
│   ├── config.py             # Settings management
│   ├── 📁 schema/            # Graph schema definitions
│   ├── 📁 extraction/        # PDF loading & entity extraction
│   ├── 📁 storage/           # Neo4j & Qdrant clients
│   └── 📁 retrieval/         # Hybrid retriever
├── 📁 scripts/
│   └── ingest_documents.py   # Document ingestion script
├── 📁 tests/                 # Test suite
└── 📁 data/sample/           # Sample documents

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • Google Gemini API Key

1. Clone & Setup

git clone <repository-url>
cd GraphRAG

2. Start Infrastructure

docker-compose up -d

This starts:

| Service       | Port | Purpose           |
|---------------|------|-------------------|
| Neo4j Browser | 7474 | Web UI            |
| Neo4j Bolt    | 7687 | Driver connection |
| Qdrant HTTP   | 6333 | REST API          |
| Qdrant gRPC   | 6334 | gRPC API          |

3. Configure Environment

cp .env.example .env

Edit .env with your API keys:

# Required
GOOGLE_API_KEY=your_google_api_key_here

# Optional (for enhanced features)
OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here

4. Install Dependencies

pip install -r requirements.txt

5. Ingest Documents

python scripts/ingest_documents.py data/sample/

6. Query the System

from src.retrieval import HybridRetriever

retriever = HybridRetriever()
results = retriever.retrieve("What companies did Apple acquire?")

for result in results:
    print(result)

📊 Graph Schema

Node Types

| Type    | Properties                           |
|---------|--------------------------------------|
| Company | name, ticker, industry, headquarters |
| Person  | name, title, role                    |
| Product | name, category, launch_date          |
| Event   | name, date, type, location           |
| Article | title, published_date, source        |

Relationships

Person  ──[LEADS|FOUNDED|WORKS_AT]──▶ Company
Company ──[ACQUIRED|INVESTED_IN|SUED_BY|PARTNERS_WITH|COMPETES_WITH]──▶ Company
Company ──[LAUNCHED]──▶ Product
*       ──[MENTIONED_IN]──▶ Article
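
To show what 2-hop graph context means on this schema, the sketch below runs a breadth-first search over a small in-memory edge list. Everything in it is illustrative: the entity names, the `neighbors_within` helper, and the choice to ignore edge direction are assumptions, and the Cypher string is only one plausible way to express the same traversal in Neo4j, not the query the project runs.

```python
from collections import deque

# One plausible Cypher equivalent (an assumption, not the project's query).
TWO_HOP_CYPHER = "MATCH (n {name: $name})-[*1..2]-(m) RETURN DISTINCT m"

# Hypothetical edges following the relationship patterns above.
EDGES = [
    ("Person A", "LEADS", "Company X"),
    ("Company X", "ACQUIRED", "Company Y"),
    ("Company Y", "LAUNCHED", "Product P"),
    ("Company X", "MENTIONED_IN", "Article 1"),
]


def neighbors_within(edges, start, max_hops=2):
    """Return {node: hop_distance} for nodes within max_hops of start.

    Edge direction is ignored, so context is gathered both ways.
    """
    adjacency = {}
    for source, _relation, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)

    del distances[start]  # the start node is not its own context
    return distances


context = neighbors_within(EDGES, "Person A")
print(sorted(context.items()))
# → [('Article 1', 2), ('Company X', 1), ('Company Y', 2)]
```

Product P is three hops from Person A, so it falls outside the 2-hop context.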

⚙️ Configuration

| Variable       | Default                 | Description           |
|----------------|-------------------------|-----------------------|
| GOOGLE_API_KEY | required                | Gemini API key        |
| OPENAI_API_KEY | optional                | For GPT-4o extraction |
| NEO4J_URI      | bolt://localhost:7687   | Neo4j connection      |
| NEO4J_USERNAME | neo4j                   | Neo4j username        |
| NEO4J_PASSWORD | graphrag_password       | Neo4j password        |
| QDRANT_HOST    | localhost               | Qdrant host           |
| QDRANT_PORT    | 6333                    | Qdrant port           |
| COHERE_API_KEY | optional                | For reranking         |
| GEMINI_MODEL   | models/gemini-2.0-flash | LLM model             |
| CHUNK_SIZE     | 1024                    | Text chunk size       |
| CHUNK_OVERLAP  | 128                     | Chunk overlap         |
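
As an illustration of how these variables and defaults fit together, the sketch below loads them into a typed settings object using only the standard library. The `Settings` class and `load_settings` function are hypothetical and may differ from what src/config.py actually does; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    openai_api_key: Optional[str]
    cohere_api_key: Optional[str]
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    qdrant_host: str
    qdrant_port: int
    gemini_model: str
    chunk_size: int
    chunk_overlap: int


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    google_api_key = env.get("GOOGLE_API_KEY")
    if not google_api_key:
        raise ValueError("GOOGLE_API_KEY is required")
    return Settings(
        google_api_key=google_api_key,
        openai_api_key=env.get("OPENAI_API_KEY"),  # optional
        cohere_api_key=env.get("COHERE_API_KEY"),  # optional
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD", "graphrag_password"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        gemini_model=env.get("GEMINI_MODEL", "models/gemini-2.0-flash"),
        chunk_size=int(env.get("CHUNK_SIZE", "1024")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "128")),
    )


# Pass an explicit mapping here so the example runs without a real .env.
settings = load_settings({"GOOGLE_API_KEY": "demo-key", "QDRANT_PORT": "6335"})
print(settings.qdrant_port, settings.chunk_size)
# → 6335 1024
```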

🧪 Testing

# Run all tests
pytest

# Run with verbose output
pytest -v

# Skip integration tests
pytest -m "not integration"

📚 Tech Stack

| Component    | Technology                          |
|--------------|-------------------------------------|
| LLM          | Google Gemini 2.0 Flash             |
| Embeddings   | Gemini text-embedding-004           |
| Graph Store  | Neo4j 5.15                          |
| Vector Store | Qdrant 1.7                          |
| Framework    | LlamaIndex 0.10+                    |
| Reranking    | Cohere (primary), Gemini (fallback) |
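
The primary/fallback reranking arrangement can be sketched as a simple try-then-fall-back wrapper. The README does not say when the project switches from Cohere to Gemini, so treating any exception from the primary as the trigger is an assumption. Both rerankers below are hypothetical stand-ins that make no real API calls.

```python
def rerank_with_fallback(query, documents, primary, fallback):
    """Return primary(query, documents); use fallback if primary raises."""
    try:
        return primary(query, documents)
    except Exception:
        # Assumed trigger: any failure, e.g. a missing key or a network error.
        return fallback(query, documents)


def unavailable_reranker(query, documents):
    """Stand-in for a primary reranker whose API key is not configured."""
    raise RuntimeError("COHERE_API_KEY is not set")


def overlap_reranker(query, documents):
    """Stand-in fallback: order documents by words shared with the query."""
    query_words = set(query.lower().split())

    def score(document):
        return len(query_words & set(document.lower().split()))

    # sorted() is stable, so equal scores keep their original order.
    return sorted(documents, key=score, reverse=True)


docs = ["Qdrant release notes", "Apple completes acquisition", "Apple earnings"]
ranked = rerank_with_fallback(
    "apple acquisition", docs, unavailable_reranker, overlap_reranker
)
print(ranked)
# → ['Apple completes acquisition', 'Apple earnings', 'Qdrant release notes']
```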

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ❤️ using LlamaIndex, Neo4j, and Qdrant
