This project demonstrates a simple RAG pipeline with self-contained functions:
- Pull source documents into a SQLite database
- Chunk documents
- Embed documents with HuggingFace, OpenAI, or Google Generative AI
- Save and load vectors using Chroma (local vector database)
Requirements:

- Python 3.10+
- Install dependencies:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Configuration: create a `.env` file in the project root if you use a hosted embeddings provider:

```
# For OpenAI embeddings
OPENAI_API_KEY=sk-...

# For Google Generative AI embeddings
GOOGLE_API_KEY=your-google-genai-api-key
```

The CLI automatically loads this `.env` before running.
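That loading step is equivalent to something like the following sketch (assuming the `python-dotenv` package; the project's actual loader may differ):

```python
# Sketch: read key=value pairs from .env into the process environment.
# Assumes python-dotenv; the repo may load .env differently.
import os
from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current working directory
print("OpenAI key set:", os.getenv("OPENAI_API_KEY") is not None)
```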
Project layout:

- `rag_pipeline/db.py`: SQLite initialization and raw document storage
- `rag_pipeline/ingest.py`: Ingest files from a directory into SQLite
- `rag_pipeline/pipeline.py`: Chunking, embedding, and Chroma vector store utilities
- `rag_pipeline/main.py`: Example script wiring all steps together
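To illustrate the raw-document storage idea behind `db.py`, here is a minimal sketch with an illustrative schema (the real module's schema and helpers may differ):

```python
# Sketch: store raw documents keyed by path in SQLite.
# The table name and columns are illustrative, not the repo's actual schema.
import sqlite3

conn = sqlite3.connect("./rag_raw_docs.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents (path TEXT PRIMARY KEY, content TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO documents VALUES (?, ?)",
    ("raw_docs/example.txt", "example contents"),
)
conn.commit()
conn.close()
```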
Usage:

- Prepare a folder with your documents (e.g., `.txt`, `.md`, `.pdf`).
- You can also point `--source_dir` to a public S3 bucket using an `s3://bucket/prefix` URL. Anonymous access is used (no credentials), so the bucket/objects must be publicly readable (see the anonymous-access sketch after the S3 example below).
- Run the pipeline example (build and persist Chroma to `--vectorstore_path`):
```bash
python -m rag_pipeline.main \
  --source_dir ./raw_docs \
  --db_path ./rag_raw_docs.sqlite \
  --vectorstore_path ./chroma_store \
  --chunk_size 1000 \
  --chunk_overlap 200 \
  --embeddings_provider huggingface \
  --model_name sentence-transformers/all-MiniLM-L6-v2
```
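Under the hood, `--chunk_size`/`--chunk_overlap` and the embeddings flags map onto steps like this sketch (assuming LangChain's splitter and HuggingFace embeddings; the repo's wiring may differ):

```python
# Sketch: chunk one document and embed the chunks.
# Assumes langchain-text-splitters and langchain-huggingface are installed.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

text = open("raw_docs/example.txt", encoding="utf-8").read()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(text)

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.embed_documents(chunks)  # one vector per chunk
print(len(chunks), "chunks,", len(vectors[0]), "dimensions")  # 384-dim for MiniLM
```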
Public S3 example:

```bash
python -m rag_pipeline.main \
  --source_dir s3://your-public-bucket/path \
  --db_path ./rag_raw_docs.sqlite \
  --vectorstore_path ./chroma_store \
  --chunk_size 1000 \
  --chunk_overlap 200 \
  --embeddings_provider huggingface \
  --model_name sentence-transformers/all-MiniLM-L6-v2
```
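Anonymous S3 reads work roughly like this sketch (assuming `boto3` is the client; the repo's ingest code may differ):

```python
# Sketch: list and fetch objects from a public bucket without credentials.
# Assumes boto3; UNSIGNED disables request signing for anonymous access.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="your-public-bucket", Prefix="path/")
for obj in resp.get("Contents", []):
    body = s3.get_object(Bucket="your-public-bucket", Key=obj["Key"])["Body"].read()
    print(obj["Key"], len(body), "bytes")
```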
- Example query against the saved vectorstore:

```bash
python -m rag_pipeline.main \
  --vectorstore_path ./chroma_store \
  --embeddings_provider huggingface \
  --model_name sentence-transformers/all-MiniLM-L6-v2 \
  --query "What does this corpus talk about?"
```
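Programmatically, the query step corresponds to something like this sketch (assuming LangChain's Chroma wrapper; details may differ from the repo):

```python
# Sketch: reopen the persisted store and run a similarity search.
# Assumes langchain-chroma and langchain-huggingface.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(persist_directory="./chroma_store", embedding_function=embedder)

for doc in store.similarity_search("What does this corpus talk about?", k=4):
    print(doc.page_content[:120])
```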
Notes:

- Embeddings providers:
  - HuggingFace (default): `--embeddings_provider huggingface` and a sentence-transformers model name. No API key needed.
  - OpenAI: `--embeddings_provider openai` and a model like `text-embedding-3-small`. Requires `OPENAI_API_KEY` in your environment.
  - Google Generative AI: `--embeddings_provider google` and a model like `models/text-embedding-004`. Requires `GOOGLE_API_KEY` in your environment.
- Chroma persistence is handled via a `chromadb.PersistentClient` at `--vectorstore_path`. Use the same path for build and query.
- Ensure the same `--model_name` is used for build and query so embedding spaces match.
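For reference, the persistence mechanism that note describes looks like this (the collection name is illustrative):

```python
# Sketch: open the on-disk Chroma store directly with chromadb.
# The collection name below is illustrative, not the repo's actual name.
import chromadb

client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("rag_chunks")
print(collection.count(), "vectors stored")
```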
Possible extensions:

- Swap Chroma for another LangChain-supported vector DB.
- Replace the default embeddings model with your preferred one.
- Wire the retriever into your favorite LLM for a full RAG QA chain.
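As a starting point for that last step, here is a minimal sketch assuming LangChain with an OpenAI chat model (any LLM works; `gpt-4o-mini` is just an example and requires `OPENAI_API_KEY`):

```python
# Sketch: retrieve top chunks from Chroma and ask an LLM to answer from them.
# Assumes langchain-chroma, langchain-huggingface, and langchain-openai.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = Chroma(
    persist_directory="./chroma_store", embedding_function=embedder
).as_retriever(search_kwargs={"k": 4})

question = "What does this corpus talk about?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))

llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```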