Vector DB StackAI - Take-Home Task

Overview

This project implements a REST API for a Vector Database using FastAPI, with a Rust (PyO3) core for Brute Force, IVF Flat, and IVF PQ indexing plus an official Python SDK. It supports CRUD for libraries, documents, and chunks, indexing, and kNN search. Data is stored in-memory with RW locks for concurrency and persisted to disk as numpy snapshots. No external vector DB libs used, as per guidelines.

Key features:

Fixed schemas with Pydantic validation and Value Objects for invariants (non-empty IDs, valid embeddings).
Three indexing algorithms implemented in Rust and exposed to Python via PyO3: Brute Force, IVF Flat, and IVF PQ with documented complexities.
Thread-safe operations via per-library RWLock.
Dockerized API.
Python SDK for easy interaction.
On-disk snapshots (data/<library_id>/) with atomic manifest + NumPy arrays (ids.npy, vectors.npy, etc.) and metadata indices.

Extra: Domain-Driven Design elements (VO, services decoupling), FP style where applicable (itertools for traversals, but kept simple).

Setup

Use uv for dependency management (fast, modern alternative to pip/poetry).

Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh (or see the uv installation docs for Windows or for installation with pip).
Sync all packages: uv sync --all-packages
- This installs every workspace member (api, core, sdk, rindex, dashboard) and keeps their lockfile in sync.
- Re-run after adding dependencies or pulling changes that modify pyproject.toml/uv.lock.
Build the Rust core library (rindex) locally: uv run poe rindex-develop
(Optional) Install pre-commit: uv run pre-commit install

Development Tasks (Poe the Poet)

Run from root: uv run poe <task>

Core workspace tasks (root pyproject.toml):

format: Format code with Ruff.
lint: Lint and auto-fix with Ruff.
typecheck: Static typing with basedpyright.
test: Execute the aggregated test suite (core, api, sdk).
pre-commit: Convenience wrapper for lint → format → tests → typecheck.
all: Runs format, lint, typecheck, test sequentially.

Package-specific tasks (invoked with uv run poe <task> as well, thanks to the included configuration):

rindex-develop: Build the PyO3 extension in editable mode (packages/rindex).
api-test, core-test, sdk-test: Scoped pytest runs per package.
dashboard: Launch the Streamlit UI (packages/streamlit).

Discover the full list at any time: uv run poe --list

For watch mode typecheck: uv run poe typecheck --watch

Running the API

Local

Use the existing Poe task so the right module path and reload flags stay in sync:

uv run poe api-serve

Open http://localhost:8000/docs for Swagger UI.

Docker

Build the image (multi-stage, ~1 GB final size; this can be optimized with cargo build --release, further build optimizations, and a refined .dockerignore):

docker build -t vector-db-api .

Run the container, exposing the FastAPI service on port 8000:

docker run --rm -p 8000:8000 vector-db-api

The image already contains the compiled rindex extension and all Python workspace dependencies (including NumPy). Snapshots created by the API are written inside the container under /app/data; mount a host volume if you need persistence between runs.

Streamlit Playground

Spin up a local UI that exercises the API through the SDK.

Start the API (see above).
Run the dashboard: uv run poe dashboard
Point it to your API base URL (defaults to http://localhost:8000).
The UI lets you create/delete libraries, documents, and chunks, rebuild indexes, and run searches.
Embeddings are generated automatically with the sentence-transformers/all-MiniLM-L6-v2 model (cached locally on first use).
It surfaces API errors inline (e.g., duplicate IDs, empty indexes), making it easy to confirm validation flows without crafting HTTP calls manually.

Design Choices

Indexing Algorithms

Implemented three algorithms in Rust (PyO3 bindings) without external vector DB libs:

Brute Force:
- Build: O(1) - just store list.
- Query: O(N * d) time (N chunks, d dim), O(N) space.
- Exact, simple baseline. Good for small N.
IVF Flat (Inverted File Flat):
- Build: O(N log N) via K-Means clustering into n_lists centroids.
- Query: O(n_probes * list_size * d) on average, O(N * d) in the worst case; probes the nearest centroids and scans their lists exactly.
- Space: O(N * d).
- Uses K-Means for partitioning vectors into inverted lists; exact search within probed lists. Configurable n_lists (default 16), n_probes (default ~4). Good for medium-large N, balances speed and exactness.
IVF PQ (Inverted File with Product Quantization):
- Build: O(N log N + N * num_subvectors * num_codewords) for K-Means + PQ training.
- Query: O(n_probes * list_size * num_subvectors) approximately, using precomputed codes and codebooks for fast distance estimation.
- Space: O(N * num_subvectors * log2(num_codewords)) bits for the codes, plus centroids and codebooks; highly compressed.
- Combines IVF clustering with PQ: residuals quantized into codes per subvector (default num_subvectors=4, num_codewords=16). Approximate but tunable precision vs. space. Ideal for high-dim (e.g., 768) and large N.
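
To make the difference between the exact baseline and the IVF Flat query path concrete, here is a minimal pure-Python sketch. It is not the Rust implementation: the function names (l2, assign, kmeans, brute_force_knn, ivf_flat_query), the plain Lloyd-style K-Means, and the L2 metric are all assumptions chosen for illustration.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(vectors, query, k):
    """Exact kNN: score every stored vector, O(N * d) per query."""
    scored = sorted((l2(v, query), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]


def assign(vectors, centroids):
    """Put each vector index into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def kmeans(vectors, n_lists, iters=10, seed=0):
    """Illustrative Lloyd-style K-Means; returns (centroids, inverted lists)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    dim = len(vectors[0])
    for _ in range(iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # an empty list keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][j] for i in members) / len(members)
                    for j in range(dim)
                ]
    # Final assignment so the lists match the final centroids.
    return centroids, assign(vectors, centroids)


def ivf_flat_query(vectors, centroids, lists, query, k, n_probes):
    """Scan only the n_probes lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:n_probes] for i in lists[c]]
    scored = sorted((l2(vectors[i], query), i) for i in candidates)
    return [i for _, i in scored[:k]]
```

With n_probes equal to n_lists every list is scanned, so the result matches brute force; lowering n_probes trades recall for fewer distance computations.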

Trade-offs: Brute for accuracy/small data and as a reference; IVF Flat for faster exact queries; IVF PQ for scalable approx search with small index size. All indices expose save/load to persist snapshots. Python facades map string IndexVectorID values to sequential integers (stored in id_map.json) before delegating to Rust.
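
The string-to-integer mapping mentioned above could look roughly like the following sketch. The class name IdMap, its methods, and the JSON layout (a flat object from string ID to integer) are assumptions for illustration; the actual id_map.json format may differ.

```python
import json


class IdMap:
    """Hypothetical bidirectional map between string IDs and sequential ints."""

    def __init__(self):
        self._to_int = {}
        self._to_str = []

    def add(self, vector_id: str) -> int:
        """Return the integer for vector_id, allocating the next one if new."""
        if vector_id not in self._to_int:
            self._to_int[vector_id] = len(self._to_str)
            self._to_str.append(vector_id)
        return self._to_int[vector_id]

    def lookup(self, internal_id: int) -> str:
        """Translate an integer returned by the index back to its string ID."""
        return self._to_str[internal_id]

    def dumps(self) -> str:
        """Serialize as a JSON object (assumed layout, not the real file format)."""
        return json.dumps(self._to_int)

    @classmethod
    def loads(cls, payload: str) -> "IdMap":
        """Rebuild the map, re-adding IDs in their original integer order."""
        id_map = cls()
        for vector_id, _ in sorted(json.loads(payload).items(), key=lambda kv: kv[1]):
            id_map.add(vector_id)
        return id_map
```

A facade would call add for each chunk before handing integers to the index, and lookup on each search hit before returning results to the caller.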

Persistence & Snapshots

Each library snapshot lives in data/<library_id>/ and contains:
- manifest.json (written by the Rust core) describing index type, metric, params, version.
- NumPy arrays (ids.npy, vectors.npy, centroids.npy, etc.) for index payloads.
- id_map.json (string -> integer mapping used inside Rust indices).
- metadata.json (vector -> metadata/doc metadata dump) and meta_index.json (inverted metadata index for quick filtering).
IndexingService.build_index saves a fresh snapshot; ensure_index lazily reloads from disk when registry state is missing or outdated.
Snapshots are written atomically (manifest.json renamed last) so partially written indexes are avoided.
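
The "manifest renamed last" idea can be sketched in Python as follows. This is an illustration under assumptions, not the Rust core's code: atomic_write, write_snapshot, and snapshot_is_complete are hypothetical helpers, and payloads are raw bytes here rather than NumPy arrays to keep the example stdlib-only.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # make sure bytes reach the disk first
        # Rename is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_snapshot(directory: Path, payloads: dict[str, bytes], manifest: dict) -> None:
    """Write every payload file first, and the manifest only at the very end."""
    directory.mkdir(parents=True, exist_ok=True)
    for name, data in payloads.items():
        atomic_write(directory / name, data)
    atomic_write(directory / "manifest.json", json.dumps(manifest).encode("utf-8"))


def snapshot_is_complete(directory: Path) -> bool:
    """A snapshot without a manifest is treated as partially written."""
    return (directory / "manifest.json").is_file()
```

A loader that checks for the manifest before reading anything else will never see a half-written snapshot, because the manifest only appears after all payload files are in place.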

The index type is selected per library via index_type; the default is brute.

Concurrency

InMemoryLibraryRepo uses RWLock per library ID.
Reads (get, search): acquire_read (multiple concurrent).
Writes (create/update/delete, index): acquire_write (exclusive; waits for active readers to finish).
Ensures no races on shared library state.
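
For readers unfamiliar with the pattern, a per-library readers-writer lock can be sketched with threading.Condition as below. Only the method names acquire_read and acquire_write come from the description above; release_read, release_write, and the LockRegistry class are hypothetical, and this simple variant favors readers (a steady stream of readers can delay a writer), which may not match the project's lock.

```python
import threading


class RWLock:
    """Minimal readers-writer lock: many readers or one writer at a time."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:  # readers only wait for an active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:  # wait for exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class LockRegistry:
    """Hypothetical registry handing out one RWLock per library ID."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def for_library(self, library_id: str) -> RWLock:
        with self._guard:  # protects creation of new locks
            return self._locks.setdefault(library_id, RWLock())
```

Keeping one lock per library means a write to one library does not block reads or searches on another.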

Things taken from DDD

Value Objects: ID (non-empty str), Embedding (non-empty numbers, normalizable).
Invariants/Preconditions: Pydantic validators (e.g., lib name non-empty, valid chunks).
Entities: Library/Document/Chunk with identity.
Services: Decouple API from logic (IndexingService, QueryService).
Callee decides: expose primitives (e.g., raw embeddings) alongside VO.
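
As a rough illustration of the Value Object idea, the sketch below enforces the two invariants named above (non-empty ID, non-empty and normalizable embedding). It uses frozen stdlib dataclasses instead of Pydantic to stay dependency-free, so it is an assumption about shape rather than the project's actual models; the finite-number check and the normalized method name are also illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ID:
    """Identifier that can never be empty or whitespace-only."""

    value: str

    def __post_init__(self):
        if not self.value.strip():
            raise ValueError("ID must be a non-empty string")


@dataclass(frozen=True)
class Embedding:
    """Immutable vector of finite numbers with at least one component."""

    values: tuple[float, ...]

    def __post_init__(self):
        if not self.values:
            raise ValueError("Embedding must not be empty")
        if not all(math.isfinite(v) for v in self.values):
            raise ValueError("Embedding must contain only finite numbers")

    def normalized(self) -> "Embedding":
        """Return a unit-length copy; a zero vector cannot be normalized."""
        norm = math.sqrt(sum(v * v for v in self.values))
        if norm == 0:
            raise ValueError("Cannot normalize a zero vector")
        return Embedding(tuple(v / norm for v in self.values))
```

Because the objects are frozen and validated at construction, any code that receives an ID or Embedding can rely on the invariants without re-checking them.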

Tooling Choices

uv: Fast resolver/installer, workspace support for monorepo.
ruff: Extremely fast linter and formatter that combines the roles of flake8, isort, pycodestyle, pyflakes, and pylint in one tool.
pytest: De facto standard for testing in Python.
basedpyright: For type checking (stricter than mypy, faster, no Node.js needed because it installs via wheel). ty or pyrefly may become better options once their typing conformance improves.
- Mode: "recommended" (strict but practical).
  - Stricter checks catch more errors.
  - Watch mode: uv run poe typecheck --watch for live feedback.
  - Prefer # pyright:ignore over # type: ignore for specificity.
- Vs mypy: See Pyright vs Mypy.
poethepoet: Task runner, since the Python ecosystem has no default one; used to run all sorts of defined tasks/scripts.

Limitations & Extras

In-memory storage with on-disk snapshots for persistence (data/<library_id>/ with manifest.json, NumPy arrays, and metadata indices).
Basic IVF PQ (approximate distances via decoding; no advanced techniques such as ADC lookup tables or OPQ).
No auth (extensible via headers).
Tests: Unit (indices/services), integration (API via TestClient).
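
The "approximate distances via decoding" limitation can be illustrated with a small product-quantization sketch: each subvector is replaced by the index of its nearest codeword, and distances are estimated against the reconstructed vector. Everything here (function names, hand-written codebooks, squared-L2 codeword selection) is an assumption for illustration; it skips the IVF residual step and codebook training entirely.

```python
import math


def split(vec, num_subvectors):
    """Cut a vector into equal-length contiguous subvectors."""
    step = len(vec) // num_subvectors
    return [vec[i * step:(i + 1) * step] for i in range(num_subvectors)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each subvector with the index of its nearest codeword."""
    return [
        min(range(len(book)), key=lambda j: sq_dist(sub, book[j]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen codewords."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


def approx_distance(query, codes, codebooks):
    """Estimate the distance by decoding first (no lookup tables)."""
    return math.sqrt(sq_dist(query, pq_decode(codes, codebooks)))


# Two subvectors of dimension 2, two codewords each (hand-picked, not trained).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
codes = pq_encode([0.9, 1.2, 1.8, 2.1], codebooks)
```

Only the small integer codes need to be stored per vector, which is where the space savings come from; the price is that distances are computed against the reconstruction rather than the original vector.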
