
Graph RAG Pipeline - Proof of Concept

A locally-executable Graph RAG (Retrieval-Augmented Generation) pipeline that constructs knowledge graphs from Wikipedia articles and enables interactive exploration and analysis.

🎯 Project Overview

This proof of concept demonstrates:

  • Knowledge Graph Construction: Automated entity extraction and relationship mapping from Wikipedia articles
  • Multi-Model NER: Combining GLiNER and spaCy for comprehensive entity recognition (see the sketch after this list)
  • Interactive Visualization: Graph exploration through web-based interfaces
  • Real-time Validation: Streamlit-based interface for graph exploration and analysis
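
To make the multi-model approach concrete, here is a minimal sketch of merging GLiNER and spaCy results. It is illustrative only: the GLiNER checkpoint and the zero-shot label set are placeholder choices, not necessarily the ones this repo uses.

import spacy
from gliner import GLiNER

text = "Geoffrey Hinton worked on deep learning at the University of Toronto."

# spaCy NER: fixed label set from the pretrained pipeline
nlp = spacy.load("en_core_web_sm")
spacy_ents = {(e.start_char, e.end_char, e.text, e.label_) for e in nlp(text).ents}

# GLiNER zero-shot NER: labels are chosen at inference time
gliner = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
labels = ["person", "organization", "concept"]  # placeholder label set
gliner_ents = {
    (e["start"], e["end"], e["text"], e["label"])
    for e in gliner.predict_entities(text, labels, threshold=0.5)
}

# Naive union merge; a fuller pipeline would also reconcile overlapping spans
for start, end, surface, label in sorted(spacy_ents | gliner_ents):
    print(f"{surface!r} [{label}] @ {start}-{end}")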

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • 8GB RAM minimum
  • 10GB free disk space (for models)

Installation

  1. Clone the repository:
git clone https://github.com/chris7jackson/graph-rag-poc.git
cd graph-rag-poc
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
  4. Install Ollama and download a model:
# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve

# In another terminal, pull a model
ollama pull phi3:mini

Basic Usage

Option 1: Using Makefile (Recommended)

# Complete setup
make setup

# Run the complete pipeline
make run-pipeline

# Or run individual steps
make ingest
make extract
make build

# Launch validation interface
make validate

# Clean up generated files
make clean

Available Commands:

  • make setup - Complete setup (install + models)
  • make run-pipeline - Run complete pipeline
  • make ingest - Ingest Wikipedia articles
  • make extract - Extract entities
  • make build - Build knowledge graph
  • make validate - Launch validation interface
  • make clean - Clean up generated files

Option 2: Direct Commands

# 1. Ingest Wikipedia articles
python -m src.ingestion.wikipedia --topics "Artificial Intelligence,Machine Learning" --max-articles 10

# 2. Extract entities and build graph
python -m src.extraction.pipeline --input data/articles --output data/graphs

# 3. Launch validation interface
streamlit run src/validation/app.py

Interactive Demo

Run the Jupyter notebook for a step-by-step demonstration:

jupyter notebook notebooks/demo.ipynb

πŸ“ Project Structure

graph-rag-poc/
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── setup.py                  # Package setup
│
├── src/                     # Source code
│   ├── __init__.py
│   ├── ingestion/           # Wikipedia data ingestion
│   │   ├── __init__.py
│   │   └── wikipedia.py     # Wikipedia API wrapper
│   │
│   ├── extraction/          # Entity extraction pipeline
│   │   ├── __init__.py
│   │   ├── gliner_extractor.py  # GLiNER model
│   │   ├── spacy_extractor.py   # spaCy NER
│   │   └── pipeline.py          # Combined pipeline
│   │
│   ├── graph/               # Graph construction
│   │   ├── __init__.py
│   │   └── builder.py       # NetworkX graph builder
│   │
│   ├── validation/          # Interactive validation UI
│   │   ├── __init__.py
│   │   └── app.py          # Streamlit application
│   │
│   └── query/              # Query interface (planned)
│       └── __init__.py
│
├── data/                   # Data storage
│   ├── articles/          # Wikipedia articles (JSON)
│   ├── entities/          # Extracted entities (JSON)
│   └── graphs/           # Graph files (pickle, graphml, html)
│
├── notebooks/            # Jupyter notebooks
│   └── demo.ipynb       # Complete pipeline demonstration
│
├── configs/              # Configuration files
│   └── pipeline.yaml    # Pipeline configuration
│
├── docs/                # Additional documentation
│   └── ARCHITECTURE.md  # System architecture
│
└── tests/              # Unit tests
    └── test_ingestion.py

🔧 Features

✅ Current Features (Working)

  • Wikipedia Article Ingestion: Robust article fetching with search fallback (see the sketch after this list)
  • Multi-Model Entity Extraction: GLiNER + spaCy for comprehensive NER
  • Knowledge Graph Construction: NetworkX-based graph with entities and relationships
  • Interactive Visualization: PyVis-based HTML visualizations
  • Streamlit Validation Interface: Web-based graph exploration and analysis
  • Graph Statistics: Comprehensive metrics and entity analysis
  • Data Export: GraphML and pickle formats for interoperability
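
The search fallback amounts to: try the exact page title first, then fall back to a full-text search. A minimal sketch using the wikipedia package (an assumption on my part; the actual wrapper in src/ingestion/wikipedia.py may use a different client):

import wikipedia

def fetch_article(topic: str):
    """Fetch a Wikipedia page by title, falling back to search on failure."""
    try:
        return wikipedia.page(topic, auto_suggest=False)
    except (wikipedia.DisambiguationError, wikipedia.PageError):
        # Fall back to full-text search and take the top hit, if any
        results = wikipedia.search(topic)
        return wikipedia.page(results[0], auto_suggest=False) if results else None

page = fetch_article("Artificial Intelligence")
if page:
    print(page.title, len(page.content), "characters")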

🔄 Planned Features

  • LLM Query Interface: Ollama-powered natural language querying
  • Advanced Relationship Extraction: Beyond co-occurrence analysis
  • Multi-source Data Ingestion: Support for other data sources
  • Production Database: Neo4j integration for large-scale graphs
  • API Endpoints: RESTful API for programmatic access

πŸ—οΈ Architecture

The pipeline consists of four main stages:

  1. Data Ingestion: Fetches Wikipedia articles with intelligent search fallback
  2. Entity Extraction: Multi-model NER using GLiNER and spaCy
  3. Graph Construction: Builds a NetworkX graph with entities and co-occurrence relationships (sketched below)
  4. Validation & Exploration: Interactive Streamlit interface for graph analysis
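
Stage 3 links entities that appear in the same article. The sketch below shows the idea with NetworkX, including the GraphML/pickle export mentioned under Features; the per-article entity data here is a placeholder, not the repo's actual schema:

from itertools import combinations
import pickle
import networkx as nx

# Placeholder output of the extraction stage: entities grouped by article
article_entities = {
    "Artificial Intelligence": ["Alan Turing", "machine learning", "DARPA"],
    "Machine Learning": ["machine learning", "neural network", "Alan Turing"],
}

G = nx.Graph()
for article, entities in article_entities.items():
    G.add_nodes_from(set(entities))
    # Every pair of entities co-occurring in an article gets an edge
    for a, b in combinations(sorted(set(entities)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Export in the interoperable formats the pipeline produces
nx.write_graphml(G, "graph.graphml")
with open("graph.pickle", "wb") as f:
    pickle.dump(G, f)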

📊 Current Performance

| Metric               | Value | Status     |
|----------------------|-------|------------|
| Articles processed   | 4     | ✅ Working |
| Entities extracted   | 481   | ✅ Working |
| Graph nodes          | 481   | ✅ Working |
| Graph edges          | 2,001 | ✅ Working |
| Entity types         | 16    | ✅ Working |
| Connected components | 17    | ✅ Working |

🧪 Testing

Run the test suite:

# Run all tests
pytest

# Run specific test module
pytest tests/test_ingestion.py

# Run with coverage
pytest --cov=src --cov-report=html
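
Tests follow standard pytest conventions. For orientation, a hypothetical test in the spirit of tests/test_ingestion.py; the fetch_article helper is an assumed name mirroring the ingestion sketch above, not necessarily the repo's actual API:

# Hypothetical import; the real module API may differ
from src.ingestion.wikipedia import fetch_article

def test_fetch_article_returns_content():
    article = fetch_article("Artificial Intelligence")
    assert article is not None
    assert len(article.content) > 0

def test_fetch_article_handles_unknown_topic():
    assert fetch_article("zzz-definitely-not-a-real-topic-zzz") is None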

📈 Usage Examples

1. Quick Start with Makefile

# Complete setup and run
make setup
make run-pipeline
make validate

2. Fetch Articles on AI Topics

python -m src.ingestion.wikipedia --topics "Deep Learning,Computer Vision,Natural Language Processing" --max-articles 5

3. Extract Entities and Build Graph

python -m src.extraction.pipeline --input data/articles --output data/graphs

4. Explore the Graph Interactively

streamlit run src/validation/app.py

5. Run the Complete Demo

jupyter notebook notebooks/demo.ipynb

πŸ” Graph Analysis Features

The Streamlit validation interface provides:

  • Graph Overview: Statistics, density, and entity distributions
  • Entity Management: Search, filter, and analyze entities by type
  • Relationship Analysis: Explore connections and relationship types
  • Interactive Visualization: Generate custom graph visualizations
  • Data Export: Download entities and relationships as CSV
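
Internally, a page like this comes down to loading the pickled graph, showing summary metrics, and rendering a PyVis view. A minimal Streamlit sketch (file paths are placeholders; src/validation/app.py is the real interface):

import pickle
import streamlit as st
import streamlit.components.v1 as components
from pyvis.network import Network

st.title("Knowledge Graph Explorer")

# Placeholder path; the pipeline writes graphs under data/graphs/
with open("data/graphs/graph.pickle", "rb") as f:
    G = pickle.load(f)

nodes_col, edges_col = st.columns(2)
nodes_col.metric("Nodes", G.number_of_nodes())
edges_col.metric("Edges", G.number_of_edges())

# Substring search over entity names
query = st.text_input("Search entities")
if query:
    matches = [n for n in G.nodes if query.lower() in str(n).lower()]
    st.write({n: sorted(G.neighbors(n)) for n in matches[:10]})

# Interactive PyVis rendering of a small subgraph
subgraph = G.subgraph(list(G.nodes)[:100])
net = Network(height="600px")
net.from_nx(subgraph)
net.save_graph("graph.html")
with open("graph.html") as f:
    components.html(f.read(), height=620)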

πŸ› οΈ Troubleshooting

Common Issues

  1. "No module named 'spacy'"

    make setup
    # or manually:
    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
  2. "Ollama port conflict"

    ps aux | grep ollama
    kill -9 <PID>
    ollama serve
  3. "No articles found"

    • Try different topic names
    • Check internet connection
    • Verify Wikipedia API access
  4. "make: command not found" (Windows)

    • Install Make for Windows via Chocolatey: choco install make
    • Or use the direct commands instead of Makefile
    • Or use WSL (Windows Subsystem for Linux)

🤝 Contributing

This is a proof-of-concept project. Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • GLiNER team for zero-shot NER capabilities
  • spaCy for robust NLP pipeline
  • Ollama for local LLM deployment
  • NetworkX for graph manipulation
  • PyVis for interactive visualizations
  • Streamlit for web interface framework

🚧 Current Status

Phase: Proof of Concept ✅
Status: Fully Functional
Last Updated: August 2024

✅ Completed Features

  • Wikipedia article ingestion with search fallback
  • Multi-model entity extraction (GLiNER + spaCy)
  • Knowledge graph construction
  • Interactive Streamlit validation interface
  • Graph visualization and analysis
  • Comprehensive testing and error handling

🔄 Next Steps

  • LLM query interface integration
  • Advanced relationship extraction
  • Multi-source data ingestion
  • Production database integration

📧 Contact

For questions or feedback, please open an issue on GitHub.

