A powerful application for indexing and querying code repositories using AI. This tool provides intelligent code search, explanation, and documentation capabilities using advanced RAG (Retrieval-Augmented Generation) techniques.
- Specialized Code Embeddings
  - CodeBERT and GraphCodeBERT integration
  - AST-based code embeddings
  - Cross-language code embeddings
  - Function-level and class-level embeddings
  - Language-specific parsers for multiple languages
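Function-level embeddings start from function-level chunks. A minimal sketch of that chunking step using only Python's standard `ast` module (the embedding model itself, e.g. CodeBERT, would then be applied to each chunk and is assumed here):

```python
import ast
import textwrap

def extract_function_chunks(source: str) -> list[dict]:
    """Split Python source into function-level chunks suitable for embedding."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                # Exact source text of the function body (Python 3.8+)
                "code": ast.get_source_segment(source, node),
            })
    return chunks

sample = textwrap.dedent("""
    def add(a, b):
        return a + b

    class Calc:
        def mul(self, a, b):
            return a * b
""")
for chunk in extract_function_chunks(sample):
    print(chunk["name"])  # add, then mul
```

Class-level chunks would be extracted the same way by also matching `ast.ClassDef`.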
- Enhanced Code Analysis
  - AST-based code structure analysis
  - Code dependency tracking
  - Code flow analysis
  - Code complexity metrics
  - Type inference and validation
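As an illustration of the complexity metrics, here is a rough cyclomatic-complexity approximation over Python's `ast` (a sketch, not this project's actual implementation; it counts branch points and may over-count `try`/`except` pairs):

```python
import ast

# Node types treated as decision points (an approximation)
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.ExceptHandler, ast.BoolOp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of branch points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

code = """
def classify(n):
    if n < 0:
        return "negative"
    for i in range(n):
        if i % 2 == 0:
            pass
    return "done"
"""
print(cyclomatic_complexity(code))  # 1 + if + for + if = 4
```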
- Advanced Query Processing
  - Query intent detection (explain, bug, feature, usage, implementation)
  - Query reformulation for better search results
  - Dynamic context window sizing
  - Conversation history support
  - Context-aware responses
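Intent detection over the categories above could be sketched with simple keyword rules (hypothetical; the real system may use an LLM or a trained classifier instead):

```python
def detect_intent(query: str) -> str:
    """Rough keyword-based intent detection; falls back to 'explain'."""
    q = query.lower()
    rules = [
        ("bug", ("error", "bug", "fail", "exception", "null")),
        ("usage", ("how do i use", "usage", "invoke")),
        ("implementation", ("implement", "how does", "internals")),
        ("feature", ("add support", "new feature")),
    ]
    for intent, keywords in rules:
        if any(k in q for k in keywords):
            return intent
    return "explain"

print(detect_intent("Why might I be getting this error?"))  # bug
print(detect_intent("What does this module do?"))           # explain
```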
- Hybrid Search
  - Dense and sparse retriever combination
  - Code-specific reranking using BM25
  - Vector similarity search
  - Code knowledge graph integration
  - Multi-stage retrieval pipeline
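Combining dense and sparse retrievers requires merging two ranked lists. One common fusion method (an assumption here, not necessarily what this project uses) is reciprocal rank fusion:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists; k=60 is the conventional constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly in any list accumulate more score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical chunk IDs from a dense (vector) and a sparse (BM25) retriever
dense = ["utils.py:chunk", "store.py:add", "rag.py:query"]
sparse = ["rag.py:query", "utils.py:chunk", "llm.py:call"]
print(reciprocal_rank_fusion([dense, sparse]))
```

A chunk that appears near the top of both lists ("utils.py:chunk") outranks one that tops only a single list.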
- Comprehensive Documentation
  - API documentation generation
  - Module documentation
  - Code examples
  - Project README generation
  - Documentation site generation
  - Code explanation capabilities
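API documentation generation typically starts by harvesting docstrings; a minimal sketch with the standard `ast` module (the real documenter presumably does much more):

```python
import ast

def extract_api_docs(source: str) -> list[tuple[str, str]]:
    """Collect (name, first docstring line) pairs for functions and classes."""
    tree = ast.parse(source)
    docs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node) or "(no docstring)"
            docs.append((node.name, doc.splitlines()[0]))
    return docs

source = '''
class Store:
    """In-memory vector store."""
    def add(self, item):
        """Add an item to the store."""
'''
for name, doc in extract_api_docs(source):
    print(f"{name}: {doc}")
```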
- Analytics System
  - System metrics tracking
  - Query performance analytics
  - Usage statistics
  - Metrics retention management
  - Performance monitoring
- Enhanced Logging
  - Rotating log files
  - Component-specific logging
  - Detailed error tracking
  - Performance monitoring
  - Debug information
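Rotating, component-specific log files can be built on the standard library's `RotatingFileHandler`; a sketch (file names, sizes, and the format string are assumptions, not this project's exact configuration):

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path

def get_component_logger(name: str, log_dir: str) -> logging.Logger:
    """Create a component-specific logger with a size-based rotating file."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        Path(log_dir) / f"{name}.log",
        maxBytes=1_000_000,   # rotate after ~1 MB
        backupCount=5,        # keep name.log.1 ... name.log.5
    )
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()
log = get_component_logger("github_scanner", log_dir)
log.info("scan started")
print((Path(log_dir) / "github_scanner.log").exists())  # True
```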
    ai-code-context/
    ├── app/
    │   ├── analytics/                 # Analytics and monitoring system
    │   │   ├── monitor.py             # System metrics tracking
    │   │   └── metrics.py             # Performance metrics
    │   ├── config/                    # Configuration management
    │   │   └── settings.py            # Application settings
    │   ├── docs/                      # Documentation generation
    │   │   └── auto_documenter.py     # Auto-documentation system
    │   ├── github/                    # GitHub integration
    │   │   ├── repo_scanner.py        # Repository scanning
    │   │   └── indexer.py             # Code indexing
    │   ├── rag/                       # RAG system components
    │   │   ├── advanced_rag.py        # Advanced RAG implementation
    │   │   ├── query_optimizer.py     # Query optimization
    │   │   └── code_explainer.py      # Code explanation
    │   ├── utils/                     # Utility functions
    │   │   ├── code_chunker.py        # Code chunking
    │   │   ├── llm.py                 # LLM integration
    │   │   └── text_processing.py     # Text processing
    │   └── vector_store/              # Vector storage
    │       └── chroma_store.py        # ChromaDB implementation
    ├── logs/                          # Application logs
    ├── metrics/                       # Analytics metrics
    ├── docs/                          # Generated documentation
    ├── .env                           # Environment variables
    ├── requirements.txt               # Dependencies
    └── README.md                      # This file
- Repository Indexing: GitHub Repository → Scanner → Code Chunker → Vector Store
- Query Processing: User Query → Query Optimizer → RAG System → LLM → Response
- Documentation Generation: Code → Auto Documenter → Documentation Site
- Clone the repository:

      git clone https://github.com/yourusername/ai-code-context.git
      cd ai-code-context

- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Set up environment variables:

      cp .env.example .env
      # Edit .env with your configuration
Here's a quick guide to get up and running:

- Set up environment variables:

      cp .env.example .env  # Edit .env with your GitHub token and OpenAI/Anthropic API key

- Index a repository:

      python -m app.main index --repo owner/repo

- Query the codebase:

      python -m app.main query --query "How does X work?"

That's it! You'll get a natural language explanation of the code based on your query. For more advanced usage, see the Usage section below.
Configure the application by creating a `.env` file in the project root directory. You can copy the `.env.example` file and modify it as needed.

- `GITHUB_ACCESS_TOKEN`: Your GitHub access token for repository access
- `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`: At least one LLM API key is required
- `GITHUB_REPOSITORY`: Default repository to index, in "owner/repo" format
- `GITHUB_BRANCH`: Default branch to index (defaults to "main")
- `SUPPORTED_FILE_TYPES`: Comma-separated list of file extensions to index (e.g., "py,js,ts,jsx,tsx")
- `EXCLUDED_DIRS`: Directories to exclude from indexing (e.g., "node_modules,.git,__pycache__")
- `MODEL_NAME`: LLM model to use (default: "gpt-4")
- `USE_OPENAI`: Set to "true" to use OpenAI models
- `USE_ANTHROPIC`: Set to "true" to use Anthropic Claude models
- `TEMPERATURE`: Controls response randomness (0.0-1.0, default: 0.7)
- `MAX_TOKENS`: Maximum tokens in generated responses (default: 4000)
- `CHUNK_SIZE`: Size of code chunks for processing (default: 1000)
- `CHUNK_OVERLAP`: Overlap between chunks (default: 200)
- `MIN_CHUNK_SIZE`: Minimum size for each chunk (default: 100)
- `MAX_CHUNK_SIZE`: Maximum size for each chunk (default: 2000)
- `QUERY_REFORMULATION`: Enable query reformulation (default: true)
- `CONVERSATION_HISTORY`: Enable conversation history (default: true)
- `MAX_HISTORY_TURNS`: Maximum conversation turns to remember (default: 5)
- `CONTEXT_WINDOW`: Number of code snippets to include in context (default: 3)
- `CHROMA_PERSISTENCE_DIR`: Directory for ChromaDB persistence (default: "./chroma_db")
- `CHROMA_COLLECTION_NAME`: Collection name in ChromaDB (default: "code_chunks")
- `SIMILARITY_METRIC`: Similarity metric for vector search (default: "cosine")
- `USE_DISTRIBUTED_STORE`: Whether to use distributed storage (default: false)
- `LOG_LEVEL`: Logging verbosity (default: "INFO")
- `LOG_DIR`: Directory for log files (default: "logs")
- `TRACK_SYSTEM_METRICS`: Enable system metrics tracking (default: true)
- `TRACK_QUERY_METRICS`: Enable query metrics tracking (default: true)

For more advanced configuration options, see the `.env.example` file.
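The chunking parameters above can be illustrated with a character-based sliding-window chunker (a sketch only; the project's actual chunker is presumably token- or AST-aware, and a real implementation would merge rather than drop a too-short tail):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200,
               min_chunk_size: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap, mirroring CHUNK_SIZE,
    CHUNK_OVERLAP, and MIN_CHUNK_SIZE (character-based for simplicity)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        # Skip a tail shorter than the minimum, unless it is the only chunk
        if len(chunk) >= min_chunk_size or not chunks:
            chunks.append(chunk)
    return chunks

text = "x" * 2450
chunks = chunk_text(text, chunk_size=1000, overlap=200, min_chunk_size=100)
print([len(c) for c in chunks])  # [1000, 1000, 850]
```

With the defaults, each chunk starts 800 characters after the previous one, so adjacent chunks share a 200-character overlap that preserves context across boundaries.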
The application needs to index a GitHub repository before it can answer questions about the code.

    python -m app.main index --repo owner/repo --branch main

Parameters:

- `--repo`: The GitHub repository to index, in the format "owner/repo" (overrides GITHUB_REPOSITORY from .env)
- `--branch`: The branch to index (overrides GITHUB_BRANCH from .env; defaults to "main")

Example:

    python -m app.main index --repo microsoft/TypeScript --branch main
Once a repository is indexed, you can query it with natural language questions.

    python -m app.main query --query "your question here"

Parameters:

- `--query`: Your natural language question about the code (required)
- `--history`: JSON string of conversation history for contextual queries (optional)
- `--show-snippets`: Display code snippets in the output (optional, off by default)
- `--explain`: Generate detailed explanations of the code snippets (optional, off by default)
- `--generate-docs`: Generate documentation based on the query (optional, off by default)

Example - Basic query:

    python -m app.main query --query "How are React hooks used for state management?"

Example - Show code snippets:

    python -m app.main query --query "How are React hooks used for state management?" --show-snippets

Example - With code explanations:

    python -m app.main query --query "How are React hooks used for state management?" --explain

Example - With conversation history:

    python -m app.main query --query "How are they implemented?" --history '[{"query": "What are React hooks?", "answer": "React hooks are functions that..."}]'
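Because `--history` takes a JSON string, hand-writing the quoting is error-prone. One way to build the argument safely from Python (a hypothetical helper script, not part of the CLI itself):

```python
import json
import shlex

# Prior conversation turns, as expected by the --history flag
history = [
    {"query": "What are React hooks?",
     "answer": "React hooks are functions that let components use state..."},
]
history_arg = json.dumps(history)

# shlex.quote handles shell escaping of the embedded double quotes
command = (
    "python -m app.main query "
    "--query 'How are they implemented?' "
    f"--history {shlex.quote(history_arg)}"
)
print(command)
```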
The output is structured as follows:

- Response: A natural language explanation answering your question
- Code Snippets (optional, with `--show-snippets`): Relevant code from the repository
- Code Explanations (optional, with `--explain`): Detailed explanation of each code snippet
Combining multiple flags:

    python -m app.main query --query "Explain the implementation of useState hook" --show-snippets --explain

For documentation generation:

    python -m app.main query --query "Generate documentation for the repository" --generate-docs
The application maintains separate log files for different components:

- `logs/app.log`: Main application logs
- `logs/github_scanner.log`: GitHub scanning logs
- `logs/vector_store.log`: Vector store operations
- `logs/llm.log`: LLM interactions
- `logs/auto_documenter.log`: Documentation generation logs
- LLM Service Unavailable
  - Check your API keys in `.env`
  - Verify network connectivity
  - Check service status
- Vector Store Errors
  - Verify the ChromaDB installation
  - Check disk space
  - Verify permissions
- Documentation Generation Failures
  - Check file permissions
  - Verify the output directory exists
  - Check for syntax errors in the code
- Indexing Large Repositories
  - Adjust chunk size and overlap
  - Use batch processing
  - Monitor memory usage
- Query Performance
  - Enable caching
  - Optimize the context window size
  - Use an appropriately sized model
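The "enable caching" tip can be as simple as memoizing answers to repeated queries in-process; a sketch using `functools.lru_cache` (the pipeline function here is a stand-in for the real RAG + LLM call):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def answer_query(query: str) -> str:
    """Stand-in for the expensive retrieval + LLM pipeline (hypothetical)."""
    # Imagine retrieval and an LLM call here; repeated queries skip the work.
    return f"answer for: {query}"

answer_query("How does chunking work?")
answer_query("How does chunking work?")  # second call is served from cache
print(answer_query.cache_info().hits)    # 1
```

Note this caches exact query strings only; semantically similar queries would still miss.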
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for GPT models
- Anthropic for Claude models
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- CodeBERT for code-specific embeddings
Here are some examples of how to use the application for different purposes:

Index the repository and ask questions to quickly understand the codebase:

    python -m app.main index --repo owner/repo
    python -m app.main query --query "What is the high-level architecture of this project?"
    python -m app.main query --query "What are the main components and how do they interact?"

Other example queries:

    python -m app.main query --query "How do I implement authentication in this codebase?"
    python -m app.main query --query "What's the pattern for adding a new API endpoint?"
    python -m app.main query --query "Why might I be getting this error: [paste error message]"
    python -m app.main query --query "What could cause this function to return null in these cases?"
    python -m app.main query --query "How are React hooks used in this project?"
    python -m app.main query --query "What design patterns are used for handling async operations?"
    python -m app.main query --query "What areas of this codebase might need refactoring?"
    python -m app.main query --query "Are there any potential security vulnerabilities in the authentication system?"
    python -m app.main query --query "What's the code style and contribution process for this project?"
    python -m app.main query --query "How are tests structured and implemented in this project?"