A comprehensive Retrieval-Augmented Generation (RAG) system that intelligently queries multiple knowledge sources using LangChain agents. This project demonstrates how to build a sophisticated AI assistant that can seamlessly switch between different data sources based on user queries.
This project implements a fully functional LangChain Agent that performs RAG across multiple knowledge sources:
- Wikipedia Tool - Real-time encyclopedia access
- Arxiv Paper Search Tool - Academic paper retrieval
- LangSmith Documentation - Web-based vector store integration
The system uses LangChain's agent capabilities with OpenAI's LLM to intelligently route queries to the most appropriate data source.
```
RAG-WITH-MULTIPLE-DATA-SOURCES/
├── .env               # API keys (OpenAI)
├── .gitignore         # Ignore venv, .env, __pycache__
├── agents.ipynb       # Main notebook for running the agent
├── README.md          # Project documentation
├── requirements.txt   # Dependencies
└── venv/              # Virtual environment (ignored)
```
- Tool-based Agent: Uses LangChain's `create_openai_tools_agent` for dynamic tool selection
- Context-aware Routing: Automatically chooses the best data source based on query context
- Multi-source Integration: Seamlessly combines structured and unstructured data sources
- FAISS Vector Store: High-performance similarity search for documentation
- Document Processing: Automated chunking with `RecursiveCharacterTextSplitter`
- Semantic Embeddings: OpenAI embeddings for accurate document retrieval
- Wikipedia API: Instant access to encyclopedia entries
- Arxiv API: Academic paper metadata and abstracts
- Web Content Loading: Dynamic documentation scraping and indexing
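For intuition, the chunking step can be sketched in plain Python. This is a simplified illustration of fixed-size chunking with overlap; the real `RecursiveCharacterTextSplitter` additionally tries to break on separators (paragraphs, then sentences) before falling back to raw character counts, and `chunk_text` is a hypothetical helper, not part of this project:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks where consecutive chunks
    share chunk_overlap characters (simplified sketch)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 2500-character document yields three overlapping chunks
docs = chunk_text("".join(str(i % 10) for i in range(2500)))
```

The overlap means a sentence falling on a chunk boundary is still retrievable in full from at least one chunk.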
| Component | Technology |
|---|---|
| LLM Framework | LangChain |
| Language Model | OpenAI GPT-3.5 Turbo |
| Vector Database | FAISS |
| Embeddings | OpenAI Embeddings |
| External APIs | Wikipedia, Arxiv |
| Document Processing | WebBaseLoader, RecursiveCharacterTextSplitter |
- Python 3.8+
- OpenAI API key
- Git
1. Clone the repository

   ```bash
   git clone https://github.com/your-username/rag-multi-agent-langchain.git
   cd rag-multi-agent-langchain
   ```

2. Create and activate a virtual environment

   ```bash
   # Linux/Mac
   python -m venv venv
   source venv/bin/activate

   # Windows
   python -m venv venv
   venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables

   Create a `.env` file in the root directory:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   ```

5. Launch Jupyter Notebook

   ```bash
   jupyter notebook agents.ipynb
   ```

6. Run the agent

   Execute the cells step-by-step, then try these example queries:

   ```python
   # Query LangSmith documentation
   response = agent_executor.invoke({
       "input": "Tell me about LangSmith's key features"
   })

   # Search for academic papers
   response = agent_executor.invoke({
       "input": "What's the paper 1605.08386 about?"
   })

   # General knowledge queries
   response = agent_executor.invoke({
       "input": "Explain quantum computing"
   })
   ```
- Query Analysis: The agent analyzes the user's input to determine intent
- Tool Selection: Based on context, it chooses between Wikipedia, Arxiv, or vector store
- Information Retrieval: Executes the appropriate retrieval strategy
- Response Generation: Synthesizes information into a coherent response
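The routing step above can be caricatured as a keyword dispatcher. In the actual project the LLM itself selects a tool via OpenAI function calling; this stdlib sketch (with hypothetical tool names) only illustrates the decision the agent makes:

```python
def route_query(query):
    """Pick a tool from keyword heuristics (illustrative only; the real
    agent lets the LLM choose a tool via OpenAI function calling)."""
    q = query.lower()
    tokens = q.split()
    # crude arXiv-ID check: digits around a dot, e.g. 1605.08386
    if "arxiv" in q or any("." in t and t.replace(".", "").isdigit() for t in tokens):
        return "arxiv_search"
    if "langsmith" in q:
        return "langsmith_retriever"   # documentation questions hit the vector store
    return "wikipedia"                 # general-knowledge fallback
```

With the example queries from the installation section, a paper ID routes to Arxiv, a LangSmith question routes to the vector store, and everything else falls back to Wikipedia.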
```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Document loading and processing
loader = WebBaseLoader(urls)
documents = loader.load()

# Text splitting for optimal chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)

# Vector store creation
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
```
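Under the hood, the vector store ranks chunks by embedding similarity. A naive stdlib version of that ranking (cosine similarity over tiny hand-made vectors, purely illustrative; FAISS does the same job with optimized indexes over high-dimensional OpenAI embeddings) looks like:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_sim(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

Retrieval then reduces to embedding the query and returning the chunks behind the top-k indices.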
- Caching: Implements intelligent caching for repeated queries
- Chunking Strategy: Optimized chunk sizes for better retrieval accuracy
- Parallel Processing: Concurrent API calls where possible
- Memory Management: Efficient vector store operations
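The caching idea above can be sketched with `functools.lru_cache`; the project could equally use LangChain's built-in LLM cache. The `CALLS` counter and `cached_answer` here are hypothetical stand-ins for the real agent call:

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how often the "expensive" path actually runs

@lru_cache(maxsize=256)
def cached_answer(query):
    """Memoize answers so a repeated query never re-invokes the agent."""
    CALLS["count"] += 1  # stand-in for the real agent_executor.invoke call
    return f"answer to: {query}"
```

Asking the same question twice runs the expensive path only once; the second call is served from the cache.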
- API Key Management: Secure storage in environment variables
- Rate Limiting: Respectful API usage with built-in limits
- Error Handling: Robust error handling for external API failures
- Data Privacy: No sensitive data stored in version control
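Reading the key safely at runtime can be as simple as fetching it from the environment and failing loudly when it is missing. This is a minimal sketch; `require_api_key` is a hypothetical helper, and the notebook itself would typically load `.env` via `python-dotenv` first:

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    """Fetch a secret from the environment; never hard-code it in source."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Failing at startup with a clear message beats a cryptic authentication error deep inside an API call.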
- PDF Document Support - Direct PDF parsing and indexing
- Conversation Memory - Multi-turn conversation context
- Custom Data Sources - Integration with proprietary databases
- Web Interface - Streamlit/Gradio deployment
- Performance Monitoring - Query analytics and optimization
- Multi-language Support - Internationalization capabilities
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black .
flake8 .
```
- Research Assistant: Academic paper discovery and summarization
- Documentation Helper: Technical documentation queries
- Knowledge Base: General information retrieval
- Educational Tool: Multi-source learning assistance
OpenAI API Key Error

```bash
# Ensure your .env file is properly configured
export OPENAI_API_KEY=your_key_here
```

Dependency Issues

```bash
# Reinstall dependencies
pip install --upgrade -r requirements.txt
```

Vector Store Performance

```bash
# Clear cache and rebuild
rm -rf .faiss_cache/
```
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain Team for the excellent framework
- OpenAI for powerful language models
- FAISS for efficient vector operations
- Wikipedia & Arxiv for open knowledge access
Chitraksh Suri
- LinkedIn: Connect with me
- GitHub: @chitrakshsuri
- Email: chitraksh.suri@example.com
Built with ❤️ using LangChain and OpenAI