VisionQuery is a lightweight text-to-image semantic search system that serves a pretrained multimodal model behind an API, indexes image embeddings, and returns ranked results via cosine similarity.
It’s a backend-first MVP focused on clarity and explainability: simple vector storage, clean endpoints, containerized dev, and basic observability.
- Ingest images into an embedding index
- Embed text + images into the same CLIP vector space
- Search by natural language and retrieve the most similar images
- Expose metrics for monitoring (Prometheus-compatible)
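Because text and image embeddings share one vector space, ranking comes down to comparing directions. The following is a minimal, standard-library-only sketch of the cosine similarity score; the toy vectors and the function name `cosine_similarity` are illustrative assumptions, not code from this repo.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up); real CLIP vectors are much larger.
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

The score ignores vector length and depends only on direction, which is why a text query and an image can be compared directly once both are embedded.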
flowchart LR
C[Client<br/>curl / UI] -->|HTTP| API[FastAPI Backend]
API --> E[CLIP Embedder<br/>text + image]
API --> VS[Vector Store<br/>in-memory cosine]
API -->|/metrics| P[Prometheus]
P --> G[Grafana]
subgraph Data
IMG[(Local Images<br/>/data/images)]
end
IMG -->|ingest path| API
E -->|embeddings| VS
VS -->|top-k results| API
sequenceDiagram
autonumber
participant U as User
participant A as FastAPI
participant M as CLIP Embedder
participant V as Vector Store
U->>A: POST /ingest/image (path)
A->>M: embed(image)
M-->>A: image_embedding
A->>V: add(image_embedding, metadata)
A-->>U: {status: ok}
U->>A: POST /search (query, top_k)
A->>M: embed(text)
M-->>A: text_embedding
A->>V: cosine_search(text_embedding, top_k)
V-->>A: ranked_results
A-->>U: results + similarity scores
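The `add` and `cosine_search` steps in the sequence above can be sketched as a small in-memory store. This is an assumption-laden illustration, not the contents of `vector_store.py`: the class name `InMemoryVectorStore`, the tuple-based return shape, and the sample paths are all hypothetical.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vec]


class InMemoryVectorStore:  # hypothetical name, for illustration only
    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._metadata: List[Dict[str, Any]] = []

    def add(self, embedding: Sequence[float], metadata: Dict[str, Any]) -> None:
        # Normalize once at ingest time so search is a plain dot product.
        self._vectors.append(_normalize(embedding))
        self._metadata.append(metadata)

    def cosine_search(
        self, query: Sequence[float], top_k: int = 5
    ) -> List[Tuple[float, Dict[str, Any]]]:
        q = _normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), meta)
            for vec, meta in zip(self._vectors, self._metadata)
        ]
        # Highest similarity first; a linear scan is fine for small indexes.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add([1.0, 0.0], {"path": "data/images/car.jpg"})   # made-up paths
store.add([0.0, 1.0], {"path": "data/images/dog.jpg"})
store.add([1.0, 1.0], {"path": "data/images/both.jpg"})

for score, meta in store.cosine_search([1.0, 0.1], top_k=2):
    print(round(score, 3), meta["path"])
```

Every query scans the whole index, so cost grows linearly with the number of images. That is acceptable for an MVP and is the part an approximate-nearest-neighbor library would replace.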
- API: FastAPI
- Embeddings: pretrained CLIP (text + image)
- Search: in-memory vector store (cosine similarity)
- Observability: Prometheus + Grafana
- Infra: Docker + Docker Compose
This is the meaningful layout (excluding node_modules/, build artifacts, etc.).
VisionQuery/
├── backend/
│   ├── app/
│   │   ├── main.py           # API routes + orchestration + metrics
│   │   ├── embeddings.py     # CLIP embedding (text + image)
│   │   └── vector_store.py   # in-memory similarity search
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/                 # optional React UI
├── data/
│   └── images/               # local images (not tracked)
├── monitoring/
│   └── prometheus/
│       └── prometheus.yml    # Prometheus scrape config
├── docker-compose.yml
└── README.md
GET /health
Returns a simple status payload for Docker + monitoring checks.
POST /ingest/image
Body:
{ "path": "data/images/example.jpg" }

Embeds the image and stores its vector + metadata in the in-memory index.
POST /search
Body:
{ "query": "a red car on the street", "top_k": 5 }

Returns the top-k most similar images with cosine similarity scores.
GET /metrics
Prometheus-compatible metrics (request counts, latency).
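To illustrate what "Prometheus-compatible" output looks like, here is a minimal standard-library sketch that counts requests and renders them in the Prometheus text exposition format. The metric name `http_requests_total` and the helper names are hypothetical; the README does not list the real metric names, and a real service would normally use a Prometheus client library instead of formatting lines by hand.

```python
from collections import Counter

# Hypothetical metric name, chosen only for this sketch.
METRIC = "http_requests_total"

# Maps (method, path) to the number of requests seen so far.
request_counts: Counter = Counter()


def record_request(method: str, path: str) -> None:
    """Increment the counter for one handled request."""
    request_counts[(method, path)] += 1


def render_metrics() -> str:
    """Render the counters as Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Total HTTP requests by method and path.",
        f"# TYPE {METRIC} counter",
    ]
    for (method, path), count in sorted(request_counts.items()):
        lines.append(f'{METRIC}{{method="{method}",path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


record_request("POST", "/search")
record_request("POST", "/search")
record_request("GET", "/health")
print(render_metrics())
```

Prometheus scrapes this text on an interval and derives rates from the ever-increasing counter values, so the application only needs to count, never to compute per-second figures itself.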
- Docker
- Docker Compose
From the repo root:
docker compose up --build

- Backend API: http://localhost:8000
- Frontend UI: http://localhost:5173
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default: admin / admin)
docker compose down      # stop and remove the containers
docker compose down -v   # also remove volumes

Docker is recommended for consistent infra. Use local runs for faster iteration.
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

cd frontend
npm install
npm run dev

curl -X POST http://localhost:8000/ingest/image \
  -H "Content-Type: application/json" \
  -d '{"path":"data/images/example.jpg"}'

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query":"a dog on a beach","top_k":5}'

- Lazy model loading: the CLIP model loads on first request to keep container startup fast and health checks reliable.
- In-memory index: simple to understand and easy to swap later (FAISS / pgvector).
- Backend-first: the core deliverable is a clean API and system design; the UI is optional.
- Index resets on restart (no persistent vector store)
- No authentication / access control
- No batching or async inference
- Not tuned for large-scale indexing
These tradeoffs keep the system small, explainable, and interview-friendly.
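The lazy model loading mentioned in the design notes can be sketched with a small thread-safe wrapper. This is one possible approach under stated assumptions, not the code in `embeddings.py`: `LazyModel` and `fake_clip_loader` are hypothetical names, and the loader is a stand-in so the example runs without downloading a model.

```python
import threading
from typing import Callable, List, Optional


class LazyModel:
    """Defers an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> object:
        # Double-checked locking: concurrent first requests trigger one load.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model


load_calls: List[int] = []


def fake_clip_loader() -> object:
    # Stand-in for the real (slow) CLIP load; records each invocation.
    load_calls.append(1)
    return {"name": "fake-clip"}


model = LazyModel(fake_clip_loader)
print(model.loaded)                   # False: nothing loaded at startup
model.get()
model.get()
print(model.loaded, len(load_calls))  # True 1: loaded once, then reused
```

With this pattern, a `/health` handler can respond without touching the model, while the first `/ingest/image` or `/search` call pays the one-time load cost.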
- Replace the in-memory store with FAISS or pgvector
- Persist metadata + vectors in a database
- Add image upload support (instead of file paths)
- Add tracing + richer dashboards
- Batch embedding + async job queue for higher throughput
- CLIP (OpenAI): Learning Transferable Visual Models From Natural Language Supervision
- FastAPI documentation
- Prometheus instrumentation + exposition formats
- Grafana documentation
- Docker Compose documentation