A minimal, production-style Retrieval-Augmented Generation (RAG) service.
This repo shows how to:
- Ingest raw text into chunks
- Embed those chunks with OpenAI
- Index them with FAISS
- Expose a clean FastAPI endpoint for semantic questions
It’s intentionally small so you can read it end-to-end in one sitting and still see all the moving parts of a real RAG pipeline.
From a founder / hiring-manager perspective, this repo demonstrates that you can:

- Design a clean, modular RAG pipeline
  - Separate ingest, retrieval, and LLM orchestration
  - Use FAISS and OpenAI correctly, without leaking secrets
- Ship a small, production-shaped microservice
  - FastAPI + uvicorn
  - Clear I/O contract (`/query` endpoint, typed request/response models)
  - Artifacts and secrets handled via `.gitignore` and `.env`
- Explain the architecture like an engineer, not a vibes coder
  - Every file has a single, obvious responsibility
  - The stack is easy to extend into a bigger “Semantic Runtime” later
Directories and files:
```text
semantic-rag-microstack/
├── .gitignore
├── requirements.txt
├── README.md
└── src/
    └── semantic_rag/
        ├── api.py         # FastAPI app and /query endpoint
        ├── data.txt       # Source corpus to index (plain text)
        ├── ingest.py      # Build FAISS index + docs.npy from data.txt
        ├── llm.py         # LLM wrapper for answering with context chunks
        ├── query.py       # CLI helper to test the pipeline
        ├── retriever.py   # FAISS-based semantic retriever
        ├── faiss.index    # (ignored) FAISS index artifact
        └── docs.npy       # (ignored) Chunked text store
```
- `ingest.py` reads `data.txt`
  - Splits it into chunks
  - Embeds chunks with `text-embedding-3-small`
  - Stores vectors in `faiss.index`
  - Stores chunk texts in `docs.npy`
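  For orientation, here is a minimal sketch of that step; the chunking strategy, function names, and file handling are illustrative assumptions, not the repo's exact code:

  ```python
  # Illustrative outline of the ingest step (not the repo's actual ingest.py)
  import numpy as np
  import faiss
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
      """Naive fixed-size chunking; the real splitter may differ."""
      return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

  def embed(texts: list[str]) -> np.ndarray:
      resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
      return np.array([d.embedding for d in resp.data], dtype="float32")

  text = open("data.txt", encoding="utf-8").read()
  chunks = chunk_text(text)
  vectors = embed(chunks)

  index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search, fine at this scale
  index.add(vectors)

  faiss.write_index(index, "faiss.index")
  np.save("docs.npy", np.array(chunks, dtype=object))
  print(f"Chunked into {len(chunks)} pieces")
  ```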
- `retriever.py` loads `faiss.index` + `docs.npy`
  - Given a query, embeds it and performs vector search
  - Returns top-k chunks (and optionally scores)
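  A sketch of the retrieval path over those artifacts; `retrieve` and its signature are illustrative, not necessarily what `retriever.py` exposes:

  ```python
  # Illustrative top-k semantic search over the FAISS index
  import numpy as np
  import faiss
  from openai import OpenAI

  EMBEDDING_MODEL = "text-embedding-3-small"

  client = OpenAI()
  index = faiss.read_index("faiss.index")
  docs = np.load("docs.npy", allow_pickle=True)

  def retrieve(query: str, top_k: int = 4) -> list[tuple[str, float]]:
      """Embed the query and return (chunk, distance) pairs for the nearest chunks."""
      resp = client.embeddings.create(model=EMBEDDING_MODEL, input=[query])
      q = np.array([resp.data[0].embedding], dtype="float32")
      distances, ids = index.search(q, top_k)
      return [(docs[i], float(d)) for i, d in zip(ids[0], distances[0])]
  ```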
- `llm.py` wraps the OpenAI Chat Completions API
  - Given `query` + `context_chunks`, it prompts a chat model (e.g. `gpt-4.1-mini`)
  - Returns a concise, context-grounded answer
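  A sketch of that answering step; the prompt wording and function name are assumptions:

  ```python
  # Illustrative context-grounded answer generation
  from openai import OpenAI

  MODEL_NAME = "gpt-4.1-mini"
  client = OpenAI()

  def answer(query: str, context_chunks: list[str]) -> str:
      """Prompt the chat model with the retrieved chunks and return its answer."""
      context = "\n\n".join(context_chunks)
      messages = [
          {"role": "system",
           "content": "Answer concisely using only the provided context."},
          {"role": "user",
           "content": f"Context:\n{context}\n\nQuestion: {query}"},
      ]
      resp = client.chat.completions.create(model=MODEL_NAME, messages=messages)
      return resp.choices[0].message.content
  ```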
- `api.py` exposes `/query` via FastAPI
  - Request: `{ "query": "...", "top_k": 4 }`
  - Response: `{ "answer": "..." }`
```bash
git clone https://github.com/<your-username>/semantic-rag-microstack.git
cd semantic-rag-microstack

python -m venv .venv
source .venv/bin/activate    # macOS / Linux
# or
.venv\Scripts\activate       # Windows

pip install -r requirements.txt
```
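`requirements.txt` is not reproduced here, but given the stack described in this README it presumably includes something along these lines:

```text
fastapi
uvicorn
openai
faiss-cpu
numpy
python-dotenv
pydantic
```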
Create a `.env` file in the project root (same folder as `README.md`):

```text
OPENAI_API_KEY=sk-your-real-key-here
```

The `.env` file is ignored by git via `.gitignore`, so your secret key never leaves your machine.
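The `.gitignore` itself is not shown here; given the ignored artifacts listed in the project tree, it presumably covers at least:

```text
.env
.venv/
__pycache__/
src/semantic_rag/faiss.index
src/semantic_rag/docs.npy
```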
Edit `src/semantic_rag/data.txt` with whatever corpus you want to index
(e.g. your own architecture notes, product docs, FAQs).
From the project root:

```bash
cd src/semantic_rag
python ingest.py
cd ../../
```
You should see log output similar to:

```text
Chunked into N pieces
Saved index to faiss.index
Saved chunk texts to docs.npy
```
From the project root:

```bash
uvicorn src.semantic_rag.api:app --reload
```

You should see FastAPI / uvicorn startup logs.
In another terminal (still in the project root):

```bash
curl -X POST "http://127.0.0.1:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the Universal Semantic Runtime?"}'
```

Example response:

```json
{
  "answer": "The Universal Semantic Runtime (USR) is a core layer in the Universal Semantic Systems architecture that handles planning, routing, and governance over semantic operations."
}
```
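If you prefer Python to curl, the equivalent request (using `requests`, which is not one of this repo's dependencies) looks like:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/query",
    json={"query": "What is the Universal Semantic Runtime?", "top_k": 4},
    timeout=30,
)
print(resp.json()["answer"])
```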
You can also use the built-in CLI helper:

```bash
cd src/semantic_rag
python query.py "What is the Universal Semantic Runtime?"
cd ../../
```
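`query.py` is a thin CLI wrapper around retrieval plus generation; a minimal version consistent with that description might look like this (argument handling and imports are assumptions):

```python
# Illustrative CLI helper: retrieve chunks, then answer from them
import sys

from retriever import retrieve  # assumed local helpers, as in the sketches above
from llm import answer

if __name__ == "__main__":
    question = sys.argv[1] if len(sys.argv) > 1 else "What is in this corpus?"
    chunks = [chunk for chunk, _score in retrieve(question)]
    print(answer(question, chunks))
```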
You can customize model choices in:

- `retriever.py`: `EMBEDDING_MODEL = "text-embedding-3-small"`
- `llm.py`: `MODEL_NAME = "gpt-4.1-mini"`

Both modules load the `OPENAI_API_KEY` from `.env` using `python-dotenv`.
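The dotenv wiring follows the standard pattern; roughly (the exact code in those modules may differ):

```python
# Typical python-dotenv loading pattern
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the project root into the process environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```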
Some natural next steps:

- Walk a directory and ingest all `.md` / `.txt` files (see the sketch below)
- Include chunk text, index, and score in the response for transparency
- Re-order retrieved chunks with a second reranking pass if needed
- Add API keys or JWT, basic rate limiting, and logging
- Log queries, response latency, and token usage
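For the first item, a hypothetical `load_corpus` helper could replace the single-file read in `ingest.py`:

```python
# Hypothetical extension of ingest.py: ingest every .md / .txt file under a directory
from pathlib import Path

def load_corpus(root: str) -> str:
    """Concatenate all Markdown and plain-text files under `root`."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in {".md", ".txt"}:
            parts.append(path.read_text(encoding="utf-8"))
    return "\n\n".join(parts)

# text = load_corpus("docs/")  # then chunk, embed, and index as in ingest.py
```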
This repo is deliberately small so you can evolve it into:
- A mini semantic runtime for a single product
- A reference microservice inside a larger USS/USR stack
MIT License. Use it, fork it, extend it.