This project demonstrates a simple RAG pipeline with self-contained functions:
- Pull source documents into a SQLite database
- Chunk documents
- Embed documents with HuggingFace, OpenAI, or Google Generative AI
- Save and load vectors using Chroma (local vector database)
Requirements:

- Python 3.10+
- Install dependencies:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Configuration: create a `.env` file in the project root if you use a hosted embeddings provider:

```
# For OpenAI embeddings
OPENAI_API_KEY=sk-...

# For Google Generative AI embeddings
GOOGLE_API_KEY=your-google-genai-api-key
```

The CLI automatically loads this `.env` before running.
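That loading step is equivalent to something like the following sketch (assuming the `python-dotenv` package; the project's actual loader may differ):

```python
# Sketch: read key=value pairs from .env into the process environment.
# Assumes python-dotenv; the repo may load .env differently.
import os
from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current working directory
print("OpenAI key set:", os.getenv("OPENAI_API_KEY") is not None)
```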
Project layout:

- `rag_pipeline/db.py`: SQLite initialization and raw document storage
- `rag_pipeline/ingest.py`: Ingest files from a directory into SQLite
- `rag_pipeline/pipeline.py`: Chunking, embedding, and Chroma vector store utilities
- `rag_pipeline/main.py`: Example script wiring all steps together
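To illustrate the raw-document storage idea behind `db.py`, here is a minimal sketch with an illustrative schema (the real module's schema and helpers may differ):

```python
# Sketch: store raw documents keyed by path in SQLite.
# The table name and columns are illustrative, not the repo's actual schema.
import sqlite3

conn = sqlite3.connect("./rag_raw_docs.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents (path TEXT PRIMARY KEY, content TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO documents VALUES (?, ?)",
    ("raw_docs/example.txt", "example contents"),
)
conn.commit()
conn.close()
```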
Usage:

- Prepare a folder with your documents (e.g., `.txt`, `.md`, `.pdf`).
- You can also point `--source_dir` to a public S3 bucket using an `s3://bucket/prefix` URL. Anonymous access is used (no credentials), so the bucket/objects must be publicly readable (see the anonymous-access sketch after the S3 example below).
- Run the pipeline example (build and persist Chroma to `--vectorstore_path`):
```bash
python -m rag_pipeline.main \
  --source_dir ./raw_docs \
  --db_path ./rag_raw_docs.sqlite \
  --vectorstore_path ./chroma_store \
  --chunk_size 1000 \
  --chunk_overlap 200 \
  --embeddings_provider huggingface \
  --model_name sentence-transformers/all-MiniLM-L6-v2
```
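Under the hood, `--chunk_size`/`--chunk_overlap` and the embeddings flags map onto steps like this sketch (assuming LangChain's splitter and HuggingFace embeddings; the repo's wiring may differ):

```python
# Sketch: chunk one document and embed the chunks.
# Assumes langchain-text-splitters and langchain-huggingface are installed.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

text = open("raw_docs/example.txt", encoding="utf-8").read()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(text)

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.embed_documents(chunks)  # one vector per chunk
print(len(chunks), "chunks,", len(vectors[0]), "dimensions")  # 384-dim for MiniLM
```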
Public S3 example:

```bash
python -m rag_pipeline.main \
  --source_dir s3://your-public-bucket/path \
  --db_path ./rag_raw_docs.sqlite \
  --vectorstore_path ./chroma_store \
  --chunk_size 1000 \
  --chunk_overlap 200 \
  --embeddings_provider huggingface \
  --model_name sentence-transformers/all-MiniLM-L6-v2
```
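Anonymous S3 reads work roughly like this sketch (assuming `boto3` is the client; the repo's ingest code may differ):

```python
# Sketch: list and fetch objects from a public bucket without credentials.
# Assumes boto3; UNSIGNED disables request signing for anonymous access.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="your-public-bucket", Prefix="path/")
for obj in resp.get("Contents", []):
    body = s3.get_object(Bucket="your-public-bucket", Key=obj["Key"])["Body"].read()
    print(obj["Key"], len(body), "bytes")
```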
- Example query against the saved vectorstore:

```bash
python -m rag_pipeline.main \
  --vectorstore_path ./chroma_store \
  --embeddings_provider huggingface \
  --model_name sentence-transformers/all-MiniLM-L6-v2 \
  --query "What does this corpus talk about?"
```
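Programmatically, the query step corresponds to something like this sketch (assuming LangChain's Chroma wrapper; details may differ from the repo):

```python
# Sketch: reopen the persisted store and run a similarity search.
# Assumes langchain-chroma and langchain-huggingface.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(persist_directory="./chroma_store", embedding_function=embedder)

for doc in store.similarity_search("What does this corpus talk about?", k=4):
    print(doc.page_content[:120])
```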
Notes:

- Embeddings providers:
  - HuggingFace (default): `--embeddings_provider huggingface` and a sentence-transformers model name. No API key needed.
  - OpenAI: `--embeddings_provider openai` and a model like `text-embedding-3-small`. Requires `OPENAI_API_KEY` in your environment.
  - Google Generative AI: `--embeddings_provider google` and a model like `models/text-embedding-004`. Requires `GOOGLE_API_KEY` in your environment.
- Chroma persistence is handled via a `chromadb.PersistentClient` at `--vectorstore_path`. Use the same path for build and query.
- Ensure the same `--model_name` is used for build and query so embedding spaces match.
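For reference, the persistence mechanism that note describes looks like this (the collection name is illustrative):

```python
# Sketch: open the on-disk Chroma store directly with chromadb.
# The collection name below is illustrative, not the repo's actual name.
import chromadb

client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("rag_chunks")
print(collection.count(), "vectors stored")
```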
Possible extensions:

- Swap Chroma for another LangChain-supported vector DB.
- Replace the default embeddings model with your preferred one.
- Wire the retriever into your favorite LLM for a full RAG QA chain.
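As a starting point for that last step, here is a minimal sketch assuming LangChain with an OpenAI chat model (any LLM works; `gpt-4o-mini` is just an example and requires `OPENAI_API_KEY`):

```python
# Sketch: retrieve top chunks from Chroma and ask an LLM to answer from them.
# Assumes langchain-chroma, langchain-huggingface, and langchain-openai.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = Chroma(
    persist_directory="./chroma_store", embedding_function=embedder
).as_retriever(search_kwargs={"k": 4})

question = "What does this corpus talk about?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))

llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```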