A locally executable Graph RAG (Retrieval-Augmented Generation) pipeline that constructs knowledge graphs from Wikipedia articles and enables interactive exploration and analysis.
This proof of concept demonstrates:
- Knowledge Graph Construction: Automated entity extraction and relationship mapping from Wikipedia articles
- Multi-Model NER: Combining GLiNER and spaCy for comprehensive entity recognition
- Interactive Visualization: Graph exploration through web-based interfaces
- Real-time Validation: Streamlit-based interface for graph exploration and analysis
Prerequisites:
- Python 3.9+
- 8GB RAM minimum
- 10GB free disk space (for models)
- Clone the repository:
git clone https://github.com/yourusername/graph-rag-poc.git
cd graph-rag-poc
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
- Install Ollama and download a model:
# Install Ollama (macOS)
brew install ollama
# Start Ollama service
ollama serve
# In another terminal, pull a model
ollama pull phi3:mini
# Complete setup
make setup
# Run the complete pipeline
make run-pipeline
# Or run individual steps
make ingest
make extract
make build
# Launch validation interface
make validate
# Clean up generated files
make clean
Available Commands:
- make setup: Complete setup (install + models)
- make run-pipeline: Run the complete pipeline
- make ingest: Ingest Wikipedia articles
- make extract: Extract entities
- make build: Build the knowledge graph
- make validate: Launch the validation interface
- make clean: Clean up generated files
# 1. Ingest Wikipedia articles
python -m src.ingestion.wikipedia --topics "Artificial Intelligence,Machine Learning" --max-articles 10
# 2. Extract entities and build graph
python -m src.extraction.pipeline --input data/articles --output data/graphs
# 3. Launch validation interface
streamlit run src/validation/app.py
Run the Jupyter notebook for a step-by-step demonstration:
jupyter notebook notebooks/demo.ipynb
graph-rag-poc/
├── README.md                    # Project documentation
├── requirements.txt             # Python dependencies
├── setup.py                     # Package setup
│
├── src/                         # Source code
│   ├── __init__.py
│   ├── ingestion/               # Wikipedia data ingestion
│   │   ├── __init__.py
│   │   └── wikipedia.py         # Wikipedia API wrapper
│   │
│   ├── extraction/              # Entity extraction pipeline
│   │   ├── __init__.py
│   │   ├── gliner_extractor.py  # GLiNER model
│   │   ├── spacy_extractor.py   # spaCy NER
│   │   └── pipeline.py          # Combined pipeline
│   │
│   ├── graph/                   # Graph construction
│   │   ├── __init__.py
│   │   └── builder.py           # NetworkX graph builder
│   │
│   ├── validation/              # Interactive validation UI
│   │   ├── __init__.py
│   │   └── app.py               # Streamlit application
│   │
│   └── query/                   # Query interface (planned)
│       └── __init__.py
│
├── data/                        # Data storage
│   ├── articles/                # Wikipedia articles (JSON)
│   ├── entities/                # Extracted entities (JSON)
│   └── graphs/                  # Graph files (pickle, graphml, html)
│
├── notebooks/                   # Jupyter notebooks
│   └── demo.ipynb               # Complete pipeline demonstration
│
├── configs/                     # Configuration files
│   └── pipeline.yaml            # Pipeline configuration
│
├── docs/                        # Additional documentation
│   └── ARCHITECTURE.md          # System architecture
│
└── tests/                       # Unit tests
    └── test_ingestion.py
Implemented:
- Wikipedia Article Ingestion: Robust article fetching with search fallback
- Multi-Model Entity Extraction: GLiNER + spaCy for comprehensive NER
- Knowledge Graph Construction: NetworkX-based graph with entities and relationships
- Interactive Visualization: PyVis-based HTML visualizations
- Streamlit Validation Interface: Web-based graph exploration and analysis
- Graph Statistics: Comprehensive metrics and entity analysis
- Data Export: GraphML and pickle formats for interoperability
Planned:
- LLM Query Interface: Ollama-powered natural language querying (see the sketch after this list)
- Advanced Relationship Extraction: Beyond co-occurrence analysis
- Multi-source Data Ingestion: Support for other data sources
- Production Database: Neo4j integration for large-scale graphs
- API Endpoints: RESTful API for programmatic access
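The LLM query interface is still on the roadmap; as a rough illustration of how it could work, the sketch below pulls an entity's neighborhood out of the pickled graph and passes it as context to Ollama's local REST API (POST /api/generate on port 11434). The function name, prompt format, and graph path are all hypothetical.

```python
# Hypothetical sketch of the planned LLM query interface (not implemented).
import pickle

import networkx as nx
import requests  # assumed available; not necessarily in requirements.txt

with open("data/graphs/graph.pickle", "rb") as f:  # assumed output path
    G: nx.Graph = pickle.load(f)

def ask(entity: str, question: str, model: str = "phi3:mini") -> str:
    """Answer a question about an entity using its graph neighborhood as context."""
    neighbors = ", ".join(G.neighbors(entity)) if entity in G else "none found"
    prompt = (
        f"Context: '{entity}' is connected to: {neighbors}.\n"
        f"Question: {question}\nAnswer:"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("Machine learning", "Which related entities stand out, and why?"))
```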
The pipeline consists of four main stages:
- Data Ingestion: Fetches Wikipedia articles with intelligent search fallback
- Entity Extraction: Multi-model NER using GLiNER and spaCy
- Graph Construction: Builds NetworkX graph with entities and co-occurrence relationships
- Validation & Exploration: Interactive Streamlit interface for graph analysis
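Conceptually, stages 2 and 3 boil down to a handful of library calls. The sketch below is illustrative rather than a copy of src/extraction/ and src/graph/: it assumes GLiNER's from_pretrained/predict_entities API, a made-up label set, and co-occurrence edges weighted by how often two entities share an article.

```python
# Minimal sketch of entity extraction (stage 2) and graph construction (stage 3).
# Model id, labels, and output paths are assumptions, not the project's actual config.
import itertools
import pickle

import networkx as nx
import spacy
from gliner import GLiNER

gliner_model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")  # assumed model id
nlp = spacy.load("en_core_web_sm")
LABELS = ["person", "organization", "technology", "concept"]  # illustrative

def extract_entities(text: str) -> set:
    """Union of GLiNER (zero-shot) and spaCy (pretrained) entities as (text, label) pairs."""
    ents = {(e["text"], e["label"]) for e in gliner_model.predict_entities(text, LABELS)}
    ents |= {(ent.text, ent.label_) for ent in nlp(text).ents}
    return ents

def build_graph(articles):
    """Link entities that co-occur in the same article, weighting repeated pairs."""
    G = nx.Graph()
    for text in articles:
        ents = extract_entities(text)
        for name, label in ents:
            G.add_node(name, label=label)
        for (a, _), (b, _) in itertools.combinations(ents, 2):
            weight = G.edges[a, b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=weight)
    return G

G = build_graph(["Alan Turing studied machine learning at the University of Manchester."])
nx.write_graphml(G, "data/graphs/graph.graphml")   # interoperable export
with open("data/graphs/graph.pickle", "wb") as f:  # fast reload for the validation UI
    pickle.dump(G, f)
```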
| Metric | Value | Status |
|---|---|---|
| Articles processed | 4 | ✅ Working |
| Entities extracted | 481 | ✅ Working |
| Graph nodes | 481 | ✅ Working |
| Graph edges | 2,001 | ✅ Working |
| Entity types | 16 | ✅ Working |
| Connected components | 17 | ✅ Working |
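For context, these metrics map directly onto standard NetworkX calls. A quick way to recompute them from a pickled graph (assuming, as in the sketch above, that nodes carry a label attribute):

```python
import pickle

import networkx as nx

with open("data/graphs/graph.pickle", "rb") as f:
    G = pickle.load(f)

print("Graph nodes:", G.number_of_nodes())
print("Graph edges:", G.number_of_edges())
print("Entity types:", len({d.get("label") for _, d in G.nodes(data=True)}))
print("Connected components:", nx.number_connected_components(G))
print("Density:", round(nx.density(G), 4))
```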
Run the test suite:
# Run all tests
pytest
# Run specific test module
pytest tests/test_ingestion.py
# Run with coverage
pytest --cov=src --cov-report=html
# Complete setup and run
make setup
make run-pipeline
make validate
# Ingest custom topics
python -m src.ingestion.wikipedia --topics "Deep Learning,Computer Vision,Natural Language Processing" --max-articles 5
# Extract entities and build the graph
python -m src.extraction.pipeline --input data/articles --output data/graphs
# Explore the result
streamlit run src/validation/app.py
jupyter notebook notebooks/demo.ipynb
The Streamlit validation interface provides:
- Graph Overview: Statistics, density, and entity distributions
- Entity Management: Search, filter, and analyze entities by type
- Relationship Analysis: Explore connections and relationship types
- Interactive Visualization: Generate custom graph visualizations
- Data Export: Download entities and relationships as CSV
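A stripped-down version of the overview and entity-table views fits in a few lines of Streamlit. This is only a sketch of the idea, with an assumed graph path and node attributes, not the actual src/validation/app.py:

```python
# Minimal Streamlit sketch of the validation interface's overview tab.
import pickle

import networkx as nx
import pandas as pd
import streamlit as st

@st.cache_resource
def load_graph(path: str = "data/graphs/graph.pickle"):  # assumed path
    with open(path, "rb") as f:
        return pickle.load(f)

G = load_graph()

st.title("Graph RAG Validation")
col1, col2, col3 = st.columns(3)
col1.metric("Nodes", G.number_of_nodes())
col2.metric("Edges", G.number_of_edges())
col3.metric("Density", f"{nx.density(G):.4f}")

# Searchable entity table; assumes each node has a "label" attribute.
entities = pd.DataFrame(
    [{"entity": n, "type": d.get("label", "?"), "degree": G.degree(n)}
     for n, d in G.nodes(data=True)]
)
query = st.text_input("Filter entities")
if query:
    entities = entities[entities["entity"].str.contains(query, case=False)]
st.dataframe(entities.sort_values("degree", ascending=False))
```

Launch it the same way as the real app (streamlit run <file>); recent Streamlit versions also expose a CSV download button in the dataframe toolbar, which covers the export feature.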
Troubleshooting common issues:
- "No module named 'spacy'"
  make setup
  # or manually:
  pip install -r requirements.txt
  python -m spacy download en_core_web_sm
- "Ollama port conflict"
  ps aux | grep ollama
  kill -9 <PID>
  ollama serve
- "No articles found"
  - Try different topic names
  - Check your internet connection
  - Verify Wikipedia API access
- "make: command not found" (Windows)
  - Install Make for Windows via Chocolatey: choco install make
  - Or run the commands directly instead of using the Makefile
  - Or use WSL (Windows Subsystem for Linux)
This is a proof-of-concept project. Contributions are welcome!
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- GLiNER team for zero-shot NER capabilities
- spaCy for robust NLP pipeline
- Ollama for local LLM deployment
- NetworkX for graph manipulation
- PyVis for interactive visualizations
- Streamlit for web interface framework
- GLiNER: Generalist Model for Named Entity Recognition
- spaCy Documentation
- NetworkX Documentation
- Ollama Documentation
- PyVis Documentation
- Streamlit Documentation
Phase: Proof of Concept ✅
Status: Fully Functional
Last Updated: August 2024
Completed:
- Wikipedia article ingestion with search fallback
- Multi-model entity extraction (GLiNER + spaCy)
- Knowledge graph construction
- Interactive Streamlit validation interface
- Graph visualization and analysis
- Comprehensive testing and error handling
Planned:
- LLM query interface integration
- Advanced relationship extraction
- Multi-source data ingestion
- Production database integration
For questions or feedback, please open an issue on GitHub.