
Graph RAG Pipeline - Proof of Concept

A locally-executable Graph RAG (Retrieval-Augmented Generation) pipeline that constructs knowledge graphs from Wikipedia articles and enables interactive exploration and analysis.

🎯 Project Overview

This proof of concept demonstrates:

  • Knowledge Graph Construction: Automated entity extraction and relationship mapping from Wikipedia articles
  • Multi-Model NER: Combining GLiNER and spaCy for comprehensive entity recognition (see the sketch after this list)
  • Interactive Visualization: Graph exploration through web-based interfaces
  • Real-time Validation: Streamlit-based interface for graph exploration and analysis
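
To make the multi-model approach concrete, here is a minimal sketch of merging GLiNER and spaCy results. It is illustrative only: the GLiNER checkpoint and the zero-shot label set are placeholder choices, not necessarily the ones this repo uses.

import spacy
from gliner import GLiNER

text = "Geoffrey Hinton worked on deep learning at the University of Toronto."

# spaCy NER: fixed label set from the pretrained pipeline
nlp = spacy.load("en_core_web_sm")
spacy_ents = {(e.start_char, e.end_char, e.text, e.label_) for e in nlp(text).ents}

# GLiNER zero-shot NER: labels are chosen at inference time
gliner = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
labels = ["person", "organization", "concept"]  # placeholder label set
gliner_ents = {
    (e["start"], e["end"], e["text"], e["label"])
    for e in gliner.predict_entities(text, labels, threshold=0.5)
}

# Naive union merge; a fuller pipeline would also reconcile overlapping spans
for start, end, surface, label in sorted(spacy_ents | gliner_ents):
    print(f"{surface!r} [{label}] @ {start}-{end}")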

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • 8GB RAM minimum
  • 10GB free disk space (for models)

Installation

  1. Clone the repository:
git clone https://github.com/chris7jackson/graph-rag-poc.git
cd graph-rag-poc
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
  4. Install Ollama and download a model:
# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve

# In another terminal, pull a model
ollama pull phi3:mini

Basic Usage

Option 1: Using Makefile (Recommended)

# Complete setup
make setup

# Run the complete pipeline
make run-pipeline

# Or run individual steps
make ingest
make extract
make build

# Launch validation interface
make validate

# Clean up generated files
make clean

Available Commands:

  • make setup - Complete setup (install + models)
  • make run-pipeline - Run complete pipeline
  • make ingest - Ingest Wikipedia articles
  • make extract - Extract entities
  • make build - Build knowledge graph
  • make validate - Launch validation interface
  • make clean - Clean up generated files

Option 2: Direct Commands

# 1. Ingest Wikipedia articles
python -m src.ingestion.wikipedia --topics "Artificial Intelligence,Machine Learning" --max-articles 10

# 2. Extract entities and build graph
python -m src.extraction.pipeline --input data/articles --output data/graphs

# 3. Launch validation interface
streamlit run src/validation/app.py

Interactive Demo

Run the Jupyter notebook for a step-by-step demonstration:

jupyter notebook notebooks/demo.ipynb

πŸ“ Project Structure

graph-rag-poc/
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── setup.py                  # Package setup
│
├── src/                     # Source code
│   ├── __init__.py
│   ├── ingestion/           # Wikipedia data ingestion
│   │   ├── __init__.py
│   │   └── wikipedia.py     # Wikipedia API wrapper
│   │
│   ├── extraction/          # Entity extraction pipeline
│   │   ├── __init__.py
│   │   ├── gliner_extractor.py  # GLiNER model
│   │   ├── spacy_extractor.py   # spaCy NER
│   │   └── pipeline.py          # Combined pipeline
│   │
│   ├── graph/               # Graph construction
│   │   ├── __init__.py
│   │   └── builder.py       # NetworkX graph builder
│   │
│   ├── validation/          # Interactive validation UI
│   │   ├── __init__.py
│   │   └── app.py          # Streamlit application
│   │
│   └── query/              # Query interface (planned)
│       └── __init__.py
│
├── data/                   # Data storage
│   ├── articles/          # Wikipedia articles (JSON)
│   ├── entities/          # Extracted entities (JSON)
│   └── graphs/           # Graph files (pickle, graphml, html)
│
├── notebooks/            # Jupyter notebooks
│   └── demo.ipynb       # Complete pipeline demonstration
│
├── configs/              # Configuration files
│   └── pipeline.yaml    # Pipeline configuration
│
├── docs/                # Additional documentation
│   └── ARCHITECTURE.md  # System architecture
│
└── tests/              # Unit tests
    └── test_ingestion.py

🔧 Features

✅ Current Features (Working)

  • Wikipedia Article Ingestion: Robust article fetching with search fallback (see the sketch after this list)
  • Multi-Model Entity Extraction: GLiNER + spaCy for comprehensive NER
  • Knowledge Graph Construction: NetworkX-based graph with entities and relationships
  • Interactive Visualization: PyVis-based HTML visualizations
  • Streamlit Validation Interface: Web-based graph exploration and analysis
  • Graph Statistics: Comprehensive metrics and entity analysis
  • Data Export: GraphML and pickle formats for interoperability
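
The search fallback amounts to: try the exact page title first, then fall back to a full-text search. A minimal sketch using the wikipedia package (an assumption on my part; the actual wrapper in src/ingestion/wikipedia.py may use a different client):

import wikipedia

def fetch_article(topic: str):
    """Fetch a Wikipedia page by title, falling back to search on failure."""
    try:
        return wikipedia.page(topic, auto_suggest=False)
    except (wikipedia.DisambiguationError, wikipedia.PageError):
        # Fall back to full-text search and take the top hit, if any
        results = wikipedia.search(topic)
        return wikipedia.page(results[0], auto_suggest=False) if results else None

page = fetch_article("Artificial Intelligence")
if page:
    print(page.title, len(page.content), "characters")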

🔄 Planned Features

  • LLM Query Interface: Ollama-powered natural language querying
  • Advanced Relationship Extraction: Beyond co-occurrence analysis
  • Multi-source Data Ingestion: Support for other data sources
  • Production Database: Neo4j integration for large-scale graphs
  • API Endpoints: RESTful API for programmatic access

πŸ—οΈ Architecture

The pipeline consists of four main stages:

  1. Data Ingestion: Fetches Wikipedia articles with intelligent search fallback
  2. Entity Extraction: Multi-model NER using GLiNER and spaCy
  3. Graph Construction: Builds a NetworkX graph with entities and co-occurrence relationships (sketched below)
  4. Validation & Exploration: Interactive Streamlit interface for graph analysis
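
Stage 3 links entities that appear in the same article. The sketch below shows the idea with NetworkX, including the GraphML/pickle export mentioned under Features; the per-article entity data here is a placeholder, not the repo's actual schema:

from itertools import combinations
import pickle
import networkx as nx

# Placeholder output of the extraction stage: entities grouped by article
article_entities = {
    "Artificial Intelligence": ["Alan Turing", "machine learning", "DARPA"],
    "Machine Learning": ["machine learning", "neural network", "Alan Turing"],
}

G = nx.Graph()
for article, entities in article_entities.items():
    G.add_nodes_from(set(entities))
    # Every pair of entities co-occurring in an article gets an edge
    for a, b in combinations(sorted(set(entities)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Export in the interoperable formats the pipeline produces
nx.write_graphml(G, "graph.graphml")
with open("graph.pickle", "wb") as f:
    pickle.dump(G, f)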

📊 Current Performance

| Metric               | Value | Status     |
|----------------------|-------|------------|
| Articles processed   | 4     | ✅ Working |
| Entities extracted   | 481   | ✅ Working |
| Graph nodes          | 481   | ✅ Working |
| Graph edges          | 2,001 | ✅ Working |
| Entity types         | 16    | ✅ Working |
| Connected components | 17    | ✅ Working |

🧪 Testing

Run the test suite:

# Run all tests
pytest

# Run specific test module
pytest tests/test_ingestion.py

# Run with coverage
pytest --cov=src --cov-report=html
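
Tests follow standard pytest conventions. For orientation, a hypothetical test in the spirit of tests/test_ingestion.py; the fetch_article helper is an assumed name mirroring the ingestion sketch above, not necessarily the repo's actual API:

# Hypothetical import; the real module API may differ
from src.ingestion.wikipedia import fetch_article

def test_fetch_article_returns_content():
    article = fetch_article("Artificial Intelligence")
    assert article is not None
    assert len(article.content) > 0

def test_fetch_article_handles_unknown_topic():
    assert fetch_article("zzz-definitely-not-a-real-topic-zzz") is None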

📈 Usage Examples

1. Quick Start with Makefile

# Complete setup and run
make setup
make run-pipeline
make validate

2. Fetch Articles on AI Topics

python -m src.ingestion.wikipedia --topics "Deep Learning,Computer Vision,Natural Language Processing" --max-articles 5

3. Extract Entities and Build Graph

python -m src.extraction.pipeline --input data/articles --output data/graphs

4. Explore the Graph Interactively

streamlit run src/validation/app.py

5. Run the Complete Demo

jupyter notebook notebooks/demo.ipynb

πŸ” Graph Analysis Features

The Streamlit validation interface provides:

  • Graph Overview: Statistics, density, and entity distributions
  • Entity Management: Search, filter, and analyze entities by type
  • Relationship Analysis: Explore connections and relationship types
  • Interactive Visualization: Generate custom graph visualizations
  • Data Export: Download entities and relationships as CSV
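
Internally, a page like this comes down to loading the pickled graph, showing summary metrics, and rendering a PyVis view. A minimal Streamlit sketch (file paths are placeholders; src/validation/app.py is the real interface):

import pickle
import streamlit as st
import streamlit.components.v1 as components
from pyvis.network import Network

st.title("Knowledge Graph Explorer")

# Placeholder path; the pipeline writes graphs under data/graphs/
with open("data/graphs/graph.pickle", "rb") as f:
    G = pickle.load(f)

nodes_col, edges_col = st.columns(2)
nodes_col.metric("Nodes", G.number_of_nodes())
edges_col.metric("Edges", G.number_of_edges())

# Substring search over entity names
query = st.text_input("Search entities")
if query:
    matches = [n for n in G.nodes if query.lower() in str(n).lower()]
    st.write({n: sorted(G.neighbors(n)) for n in matches[:10]})

# Interactive PyVis rendering of a small subgraph
subgraph = G.subgraph(list(G.nodes)[:100])
net = Network(height="600px")
net.from_nx(subgraph)
net.save_graph("graph.html")
with open("graph.html") as f:
    components.html(f.read(), height=620)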

πŸ› οΈ Troubleshooting

Common Issues

  1. "No module named 'spacy'"

    make setup
    # or manually:
    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
  2. "Ollama port conflict"

    ps aux | grep ollama
    kill -9 <PID>
    ollama serve
  3. "No articles found"

    • Try different topic names
    • Check internet connection
    • Verify Wikipedia API access
  4. "make: command not found" (Windows)

    • Install Make for Windows via Chocolatey: choco install make
    • Or use the direct commands instead of Makefile
    • Or use WSL (Windows Subsystem for Linux)

🤝 Contributing

This is a proof-of-concept project. Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • GLiNER team for zero-shot NER capabilities
  • spaCy for robust NLP pipeline
  • Ollama for local LLM deployment
  • NetworkX for graph manipulation
  • PyVis for interactive visualizations
  • Streamlit for web interface framework

🚧 Current Status

Phase: Proof of Concept ✅
Status: Fully Functional
Last Updated: August 2024

✅ Completed Features

  • Wikipedia article ingestion with search fallback
  • Multi-model entity extraction (GLiNER + spaCy)
  • Knowledge graph construction
  • Interactive Streamlit validation interface
  • Graph visualization and analysis
  • Comprehensive testing and error handling

🔄 Next Steps

  • LLM query interface integration
  • Advanced relationship extraction
  • Multi-source data ingestion
  • Production database integration

📧 Contact

For questions or feedback, please open an issue on GitHub.

