CodeSage is a local codebase intelligence system that combines Retrieval-Augmented Generation (RAG) with static analysis to facilitate code exploration, refactoring, and quality assessment. By integrating Abstract Syntax Tree (AST) parsing, vector embeddings, and dependency graph traversal, the system provides contextual responses to architectural queries and calculates complexity metrics autonomously.
Users can query the codebase using natural language. The system implements a Retrieval-Augmented Generation pipeline:
- Vector Search: Converts queries into dense vector embeddings and retrieves relevant code segments from a local FAISS index.
- Graph Augmentation: Extracts relational context (imports, class hierarchies, function calls) from an in-memory NetworkX graph.
- LLM Reasoning: Passes the aggregated context to a Large Language Model (via Groq) to synthesize technically accurate responses based strictly on the provided codebase context.
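The retrieval and prompt-assembly steps above can be sketched as follows, with NumPy cosine similarity standing in for the FAISS index; the snippet corpus, vector dimension, and function names are illustrative, not CodeSage's actual API:

```python
import numpy as np

# Toy corpus standing in for indexed code chunks; the 384-dim vectors mimic
# all-MiniLM-L6-v2 output but are random here (no model download needed).
rng = np.random.default_rng(0)
snippets = [
    "def load_config(path): ...",
    "class UserRepository: ...",
    "def build_ast_graph(src): ...",
]
embeddings = rng.normal(size=(len(snippets), 384))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    """Top-k snippets by cosine similarity (stand-in for a FAISS search)."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q
    return [snippets[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query, context_snippets):
    """Aggregate retrieved code into the context block sent to the LLM."""
    context = "\n\n".join(context_snippets)
    return (
        "Answer strictly from the codebase context below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

hits = retrieve(rng.normal(size=384))
prompt = build_prompt("Where is the config loaded?", hits)
```

In the real pipeline the query embedding would come from the same sentence-transformers model that built the index, and the graph context would be appended alongside the retrieved snippets.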
The system constructs a relational dependency graph mapping files, classes, and methods. Users can supply a target module to compute its fan-in, revealing which modules transitively depend on it and assigning an overall modification risk tier (often referred to as a "blast radius" analysis).
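A minimal sketch of the blast-radius computation, using a plain adjacency dict in place of the NetworkX graph; the module names and the `blast_radius` helper are hypothetical:

```python
from collections import deque

# Hypothetical edge list: "A -> B" means module A imports module B.
deps = {
    "api.py": ["services.py"],
    "cli.py": ["services.py"],
    "services.py": ["models.py", "utils.py"],
    "models.py": [],
    "utils.py": [],
}

# Invert the edges so we can ask "who depends on X?" (fan-in).
dependents = {mod: [] for mod in deps}
for mod, imports in deps.items():
    for target in imports:
        dependents[target].append(mod)

def blast_radius(target):
    """Every module that transitively depends on `target` (reverse BFS)."""
    seen, queue = set(), deque([target])
    while queue:
        for dep in dependents[queue.popleft()]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# blast_radius("models.py") → {"services.py", "api.py", "cli.py"}
```

A risk tier could then be derived from the size of the returned set: the more dependents a module has, the riskier it is to modify.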
A static analysis engine utilizes Python's native ast module to evaluate code quality without executing it. It profiles the source code against explicit programmatic heuristics:
- Cyclomatic complexity tracking (branching density and nested control structures).
- Function length anomalies.
- Parameter count thresholds.
- Multiple-exit-point detection.
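The heuristics above can be approximated with the stdlib `ast` module. This sketch uses a rough McCabe-style count, not necessarily CodeSage's exact scoring; the `analyze` helper and its thresholds are illustrative:

```python
import ast

# Constructs that add a branch point in a rough McCabe-style count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def cyclomatic_complexity(func_node):
    """Rough McCabe score: 1 plus one point per branching construct."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(func_node))

def analyze(source):
    """Per-function heuristics: complexity, length, params, exit points."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            report[node.name] = {
                "complexity": cyclomatic_complexity(node),
                "length": node.end_lineno - node.lineno + 1,
                "params": len(node.args.args),
                "exit_points": sum(
                    isinstance(n, ast.Return) for n in ast.walk(node)
                ),
            }
    return report

sample = """
def grade(score, bonus):
    if score > 90:
        return "A"
    if score > 80:
        return "B"
    return "C"
"""
report = analyze(sample)
# report["grade"] → complexity 3, params 2, exit_points 3
```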
CodeSage integrates specialized LLM-driven agents and background services:
- Bug Hunter: Ingests unformatted error tracebacks, traverses the vector index to identify the failing source file, and proposes programmatic fixes.
- Auto-Refactor: Analyzes complex or unstructured code blocks and streams optimized, statically typed, and PEP-8 compliant suggestions back to the client interface.
- Event-Driven Synchronization: A `watchdog` daemon monitors the active directory tree for filesystem modifications, automatically computing incremental semantic vectors and updating the AST graph to prevent index staleness.
- Query Logging: Integrates with `pymongo` to persistently log historical interactions, metadata, and LLM responses into a MongoDB database without blocking the visualization thread.
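The Bug Hunter's first step, pulling the failing frame out of a raw traceback before querying the vector index, might look like this stdlib sketch; the traceback text and the `locate_failure` helper are illustrative:

```python
import re

TRACEBACK = '''Traceback (most recent call last):
  File "services.py", line 42, in get_user
    return cache[user_id]
KeyError: 7
'''

def locate_failure(tb_text):
    """Extract (file, line, function) from an unformatted traceback.

    The innermost frame is the failure site; in the full pipeline its
    file path would seed the vector-index lookup for the fix proposal.
    """
    frames = re.findall(r'File "([^"]+)", line (\d+), in (\S+)', tb_text)
    if not frames:
        return None
    path, lineno, func = frames[-1]
    return {"file": path, "line": int(lineno), "function": func}

hit = locate_failure(TRACEBACK)
# → {'file': 'services.py', 'line': 42, 'function': 'get_user'}
```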
- Extraction & Mapping: Native Python `ast`, `networkx`.
- Vector Engine: `sentence-transformers` (all-MiniLM-L6-v2) generating embeddings stored in `faiss-cpu`.
- Reasoning API: Groq (`llama-3.1-8b-instant`).
- Storage Configuration: MongoDB (`mongodb://127.0.0.1:27018`).
- Interface Layer: Streamlit runtime.
1. Environment Initialization

Ensure Python 3.10+ is installed on the host system.

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

2. Environment Variables Configuration
Rename the `.env.example` template to `.env` and configure the required API values:
```
GROQ_API_KEY=gsk_your_key_here
LLM_MODEL=llama-3.1-8b-instant
EMBEDDING_MODEL=all-MiniLM-L6-v2
```

3. MongoDB Provisioning
To enable the UI Query History feature, initialize an isolated MongoDB daemon on port 27018:
```shell
mkdir -p ~/mongodb-codesage
mongod --port 27018 --dbpath ~/mongodb-codesage
```

The system must build its cached embeddings and graph nodes before the interface can answer queries.
1. Calculate Base Index
Recursively parse the target directory to generate .index, .json, and .pkl artifacts.
```shell
python3 main.py index
```

2. Launch Visualization Client

Start the main application dashboard locally.
```shell
python3 main.py ui
```

The Streamlit client will bind to localhost:8501 by default.
3. Run Watchdog Daemon (Concurrent Task)

Optionally spin up a concurrent terminal to maintain index integrity as source code is modified in real time.
```shell
python3 main.py watch
```