A log-structured, hierarchical storage engine with O(1) lookup, built in C++ and exposed over gRPC — so any language can use it as a persistent backend.
Stratum stores structured data in a hierarchy of three linked tiers: a document (any HTML/text blob), a chain of input nodes attached to that document, and a chain of output nodes attached to each input. All writes are append-only to log-structured segment files on disk. An in-memory hash index gives O(1) reads. A background compaction thread merges and garbage-collects old segments automatically.
Most databases make you pick between:
- A full relational DB (too heavy, forces a schema upfront)
- A key-value store (too flat, you have to model relationships yourself)
- A document store (good for the top level, awkward for linked chains of sub-documents)
Stratum is purpose-built for the pattern: one document → many ordered inputs → one output per input, where both inputs and outputs can be any data type (integer, float, string, list, map). It handles durability, compaction, and concurrency internally, and speaks gRPC so your Python, Go, or Node backend calls it like a local function.
Your application (Python / Go / Node / …)
│
│ gRPC (any language, auto-generated client)
▼
Stratum server ─────────────────────────────────────
│ │
│ In-memory hash index │
│ unordered_map<doc_id → {byte_offset, meta}> │
│ O(1) lookup — only id + offset loaded into RAM │
│ │
│ Segment manager (append-only log) │
│ active.seg seg_002.seg seg_001.seg merged.seg │
│ │
│ Compactor thread (background) │
│ Merges segments, GCs tombstoned records │
──────────────────────────────────────────────────────
│
▼
Disk (log-structured segment files)
Document blobs · Input linked list · Output linked list
Each document is stored as a blob at a known byte offset. The document record contains a pointer to the first node in the input chain. Each input node stores its value (any supported type) and two pointers — one to the next input node, one to its corresponding output node.
Document ──→ Input node 1 ──→ Input node 2 ──→ Input node 3 ──→ …
                  │                │                │
                  ▼                ▼                ▼
            Output node 1    Output node 2    Output node 3
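The pointer layout in the diagram can be sketched as a small in-memory model. This is only an illustration: the class and field names (`Document`, `InputNode`, `OutputNode`, `first_input`, `next_input`, `output`) are hypothetical and are not taken from types.h, and Python object references stand in for whatever pointer representation the engine uses on disk.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional, Tuple


@dataclass
class OutputNode:
    value: Any


@dataclass
class InputNode:
    value: Any
    next_input: Optional["InputNode"] = None  # pointer to the next input node
    output: Optional[OutputNode] = None       # pointer to the matching output node


@dataclass
class Document:
    blob: str
    first_input: Optional[InputNode] = None   # head of the input chain


def walk(doc: Document) -> Iterator[Tuple[Any, Any]]:
    """Yield (input value, output value or None) in chain order."""
    node = doc.first_input
    while node is not None:
        yield node.value, (node.output.value if node.output else None)
        node = node.next_input


# Two inputs; only the first has an output attached.
second = InputNode("some string input")
first = InputNode([2, 7, 11, 15], next_input=second, output=OutputNode([0, 1]))
doc = Document("<h1>My Document</h1>", first_input=first)

pairs = list(walk(doc))
```

Walking the chain costs one hop per input node, which is why fetching every input for a document is a linear traversal rather than a single lookup.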
All writes (including updates) are appended to the end of the active segment. Old versions of records become garbage and are reclaimed by the compactor. Deletes write a tombstone record; the original data is never mutated.
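A minimal sketch of this write path, assuming a made-up record header (record ID, tombstone flag, payload length) and an in-memory buffer standing in for active.seg. The class name `AppendOnlyLog` and the header layout are assumptions for illustration, not Stratum's actual on-disk format.

```python
import io
import struct

# Hypothetical header: 8-byte record id, 1-byte tombstone flag, 4-byte payload length.
HEADER = struct.Struct("<QBI")


class AppendOnlyLog:
    def __init__(self) -> None:
        self.segment = io.BytesIO()  # stands in for active.seg
        self.index = {}              # record id -> byte offset of the latest version

    def _append(self, record_id: int, tombstone: bool, payload: bytes) -> int:
        # Always write at the end; existing bytes are never modified.
        offset = self.segment.seek(0, io.SEEK_END)
        header = HEADER.pack(record_id, int(tombstone), len(payload))
        self.segment.write(header + payload)
        return offset

    def put(self, record_id: int, payload: bytes) -> None:
        # An update is just another append; the index now points at the new copy.
        self.index[record_id] = self._append(record_id, False, payload)

    def delete(self, record_id: int) -> None:
        # A delete appends a tombstone and drops the index entry.
        self._append(record_id, True, b"")
        self.index.pop(record_id, None)

    def get(self, record_id: int):
        offset = self.index.get(record_id)  # hash lookup
        if offset is None:
            return None
        self.segment.seek(offset)           # one seek to the record
        _, _, length = HEADER.unpack(self.segment.read(HEADER.size))
        return self.segment.read(length)


log = AppendOnlyLog()
log.put(1, b"<h1>v1</h1>")
log.put(1, b"<h1>v2</h1>")  # the first version is now garbage
log.put(2, b"x")
log.delete(2)               # tombstone; the bytes for record 2 remain in the log
```

After these calls the buffer still holds all four records, including the stale version and the tombstone. Reclaiming that space is left to compaction.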
Both input nodes and output nodes can hold any of:
| Type | Example |
|---|---|
| int64 | 42 |
| double | 3.14 |
| string | "hello world" |
| list[int] | [2, 7, 11, 15] |
| list[string] | ["foo", "bar"] |
| map[string, string] | {"key": "value"} |
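One way a value of any of these types could be framed for storage is with a type tag, a length, and a checksum. The sketch below is an assumption-heavy illustration: the numeric tags, the JSON payload encoding, and the function names `type_name`, `encode`, and `decode` are all invented here. The only detail taken from the project is that its serialisation layer is described as providing CRC-32 and encode/decode.

```python
import json
import struct
import zlib

# Hypothetical type tags; the real ones are whatever types.h defines.
TAGS = {
    "int64": 1, "double": 2, "string": 3,
    "list[int]": 4, "list[string]": 5, "map[string, string]": 6,
}


def type_name(value) -> str:
    """Map a Python value onto one of the supported type names."""
    if type(value) is int and -2**63 <= value < 2**63:
        return "int64"
    if type(value) is float:
        return "double"
    if type(value) is str:
        return "string"
    if type(value) is list and all(type(v) is int for v in value):
        return "list[int]"  # note: an empty list lands here
    if type(value) is list and all(type(v) is str for v in value):
        return "list[string]"
    if type(value) is dict and all(
        type(k) is str and type(v) is str for k, v in value.items()
    ):
        return "map[string, string]"
    raise TypeError(f"unsupported value: {value!r}")


def encode(value) -> bytes:
    """Frame = tag (1 byte) + length (4 bytes) + payload + CRC-32 of all of that."""
    payload = json.dumps(value).encode("utf-8")
    body = struct.pack("<BI", TAGS[type_name(value)], len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def decode(frame: bytes):
    """Verify the checksum, then decode the payload."""
    body = frame[:-4]
    (crc,) = struct.unpack("<I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch: record is corrupt")
    _tag, length = struct.unpack_from("<BI", body)
    return json.loads(body[5:5 + length])


frame = encode([2, 7, 11, 15])
value = decode(frame)
```

The checksum lets a reader detect a torn or corrupted record before trusting its contents, which matters for a log that is only ever appended to.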
stratum/
├── CMakeLists.txt
├── Dockerfile
├── docker-compose.yml
├── proto/
│ └── lse.proto ← gRPC contract (edit this to add RPCs)
├── include/lse/
│ ├── types.h ← on-disk structs, ValueVariant
│ ├── serialisation.h ← CRC-32, encode/decode
│ ├── segment.h ← append-only segment file
│ ├── compactor.h ← background GC thread
│ └── storage_engine.h ← public C++ API
├── src/
│ ├── serialisation.cpp
│ ├── segment.cpp
│ ├── compactor.cpp
│ ├── storage_engine.cpp
│ └── grpc/
│ ├── lse_service.h ← gRPC service declaration
│ ├── lse_service.cpp ← every RPC wired to StorageEngine
│ └── server_main.cpp ← binary entry point
└── client/
└── python/
├── lse_client.py ← Python wrapper (import this)
├── lse_pb2.py ← auto-generated stubs
├── lse_pb2_grpc.py ← auto-generated stubs
└── requirements.txt
# C++ build tools + gRPC
sudo apt install -y \
build-essential cmake pkg-config \
libgrpc++-dev libprotobuf-dev \
protobuf-compiler protobuf-compiler-grpc
# Python client
pip install grpcio grpcio-tools
git clone https://github.com/yourname/stratum
cd stratum
docker-compose up --build
The server starts on port 50051. Data persists in a named Docker volume across restarts. To stop: docker-compose down. To wipe data entirely: docker-compose down -v.
git clone https://github.com/yourname/stratum
cd stratum
# 1. Generate gRPC C++ code from the proto file
protoc --proto_path=proto \
--cpp_out=src/grpc \
--grpc_out=src/grpc \
--plugin=protoc-gen-grpc=/usr/bin/grpc_cpp_plugin \
proto/lse.proto
# 2. Build
mkdir build && cd build
cmake .. && make -j$(nproc)
# 3. Start the server
./lse_server --port=50051 --data-dir=/path/to/your/data
Copy client/python/lse_client.py, lse_pb2.py, and lse_pb2_grpc.py into your project, then:
from lse_client import LseClient
lse = LseClient("localhost:50051")
# ── Create a document ─────────────────────────────────────────────────────────
doc_id = lse.create_problem(
    "<h1>My Document</h1><p>Any HTML or text content.</p>",
    name="Example",
    category="tutorial",
    version="1.0"
)
# ── Read it back ──────────────────────────────────────────────────────────────
content = lse.get_problem_html(doc_id)
meta = lse.get_problem_meta(doc_id) # {"name": "Example", "category": "tutorial", …}
# ── Update content (old version becomes garbage — GC handles it) ──────────────
lse.update_problem_html(doc_id, "<h1>Updated content</h1>")
lse.update_problem_column(doc_id, "version", "1.1")
# ── Add input nodes (any data type) ──────────────────────────────────────────
node1 = lse.add_test_case(doc_id, [2, 7, 11, 15]) # list of ints
node2 = lse.add_test_case(doc_id, "some string input") # string
node3 = lse.add_test_case(doc_id, {"key": "value"}) # map
# ── Attach output nodes ───────────────────────────────────────────────────────
lse.set_expected_output(doc_id, node1, [0, 1]) # list answer
lse.set_expected_output(doc_id, node2, "output string") # string answer
# ── Read all input nodes for a document ──────────────────────────────────────
nodes = lse.get_all_test_cases(doc_id)
for n in nodes:
    print(n["tc_id"], n["value"])
# ── Read a specific output node ───────────────────────────────────────────────
output = lse.get_expected_output(doc_id, node1)
print(output["value"]) # [0, 1]
# ── Update a node value ───────────────────────────────────────────────────────
lse.update_test_case(doc_id, node1, [3, 2, 4])
lse.update_expected_output(doc_id, node1, [1, 2])
# ── Delete ────────────────────────────────────────────────────────────────────
lse.delete_expected_output(doc_id, node2)
lse.delete_test_case(doc_id, node3)
lse.delete_problem(doc_id)
# ── Admin ─────────────────────────────────────────────────────────────────────
stats = lse.get_stats()
# {"problems_in_index": 12, "active_segment_bytes": 4096, "segment_count": 3}
lse.flush() # force active segment to disk
lse.compact_now() # trigger immediate compaction (normally runs automatically)
All operations are defined in proto/lse.proto. The full service:

Documents:

| RPC | Description |
|---|---|
| CreateProblem | Create a document with optional metadata columns |
| GetProblemHtml | Fetch the document blob (O(1) index lookup + 1 disk seek) |
| GetProblemMeta | Fetch metadata only; no disk seek for the blob |
| UpdateProblemHtml | Append new version; old becomes garbage |
| UpdateProblemColumn | Update or add a metadata column |
| DeleteProblem | Tombstone a document |
| ListProblems | Return all live document IDs |


Input nodes:

| RPC | Description |
|---|---|
| AddTestCase | Append a new input node to a document's chain |
| GetTestCase | Fetch a single input node (O(1) cache hit) |
| GetAllTestCases | Walk the full linked list for a document |
| UpdateTestCase | Rewrite a node's value |
| DeleteTestCase | Tombstone a node |


Output nodes:

| RPC | Description |
|---|---|
| SetExpectedOutput | Attach an output node to an input node |
| GetExpectedOutput | Fetch the output node for an input node |
| UpdateExpectedOutput | Rewrite an output node's value |
| DeleteExpectedOutput | Remove the output node link |


Admin:

| RPC | Description |
|---|---|
| GetStats | Index size, active segment size, segment count |
| Flush | Flush active segment to disk |
| CompactNow | Force immediate compaction (blocks until complete) |

Because the API is defined in lse.proto, you can generate a client in any language gRPC supports.
Go:
protoc --proto_path=proto --go_out=client/go --go-grpc_out=client/go proto/lse.proto
Node.js:
npm install @grpc/grpc-js @grpc/proto-loader
# then load proto dynamically; no codegen needed
Java, Kotlin, Rust, C#, and Ruby all follow the same pattern: one .proto file, one codegen command, done.
Stratum uses a strategy inspired by Bitcask and LSM-trees.
- All writes go to active.seg in append-only fashion.
- When active.seg reaches a configured size threshold, it is sealed (renamed to seg_NNN.seg) and a fresh active.seg is opened.
- The background compactor wakes every 30 seconds and checks total segment size.
- When total size exceeds the compaction threshold, it scans all sealed segments, keeps only the most recent version of each record (highest timestamp wins per record ID), discards tombstoned records, writes a single merged.seg, and atomically removes the old segments.
- The in-memory index reloads from the merged segment.
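The sealing step can be sketched as follows. This is a simplified model under stated assumptions: `SegmentManager` and its fields are hypothetical names, byte buffers stand in for files, and the size check is assumed to run after each append.

```python
class SegmentManager:
    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # size at which the active segment is sealed
        self.active = bytearray()   # stands in for active.seg
        self.sealed = {}            # sealed segment name -> contents
        self._next_number = 1

    def append(self, record: bytes) -> None:
        self.active += record
        if len(self.active) >= self.threshold:
            self._seal()

    def _seal(self) -> None:
        # "Rename" active.seg to seg_NNN.seg and start a fresh active segment.
        name = f"seg_{self._next_number:03d}.seg"
        self.sealed[name] = bytes(self.active)
        self.active = bytearray()
        self._next_number += 1


manager = SegmentManager(threshold=10)
for _ in range(3):
    manager.append(b"aaaa")  # the third append crosses the threshold and seals
manager.append(b"bb")        # lands in the fresh active segment
```

Sealed segments are never written to again, which is what allows a compactor to read them without coordinating with the writer on every append.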
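Because every update and delete is a single append, writes are O(1) regardless of how much data is stored. Disk usage grows with write history between compactions, and each compaction shrinks it back to roughly the size of the live data.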
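The merge rule used by compaction (latest timestamp wins per record ID, tombstoned records are dropped) can be sketched in a few lines. The `Record` tuple and the `compact` function are hypothetical, segments are modelled as plain lists, and a position in the merged list stands in for a byte offset.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    record_id: int
    timestamp: int
    payload: Optional[bytes]  # None marks a tombstone


def compact(sealed_segments: List[List[Record]]) -> Tuple[List[Record], Dict[int, int]]:
    """Merge sealed segments into one list of live records plus a rebuilt index."""
    latest: Dict[int, Record] = {}
    for segment in sealed_segments:
        for rec in segment:
            current = latest.get(rec.record_id)
            # Highest timestamp wins; on a tie the first record seen is kept.
            if current is None or rec.timestamp > current.timestamp:
                latest[rec.record_id] = rec

    # Drop records whose newest version is a tombstone.
    merged = sorted(
        (r for r in latest.values() if r.payload is not None),
        key=lambda r: r.record_id,
    )
    # Rebuild the index from the merged output.
    index = {r.record_id: position for position, r in enumerate(merged)}
    return merged, index


seg_001 = [Record(1, 10, b"a"), Record(2, 11, b"b")]
seg_002 = [Record(1, 20, b"a2"), Record(2, 21, None), Record(3, 22, b"c")]

merged, index = compact([seg_001, seg_002])
```

Here record 1 survives in its newer version, record 2 disappears because its latest entry is a tombstone, and record 3 is carried over unchanged.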
| Flag | Environment variable | Default | Description |
|---|---|---|---|
| --port | LSE_PORT | 50051 | gRPC listen port |
| --data-dir | LSE_DATA_DIR | /data | Directory for segment files |
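The table does not say which source wins when both a flag and its environment variable are set. The sketch below assumes the common convention that a command-line flag overrides the environment variable, which in turn overrides the default. That precedence is an assumption rather than documented behaviour, and `resolve_config` is a hypothetical helper written in Python for illustration, not part of the server.

```python
import argparse


def resolve_config(argv, environ):
    """Resolve settings with assumed precedence: flag > environment variable > default."""
    parser = argparse.ArgumentParser(prog="lse_server")
    # Environment variables supply the defaults, so an explicit flag overrides them.
    parser.add_argument("--port", type=int,
                        default=int(environ.get("LSE_PORT", "50051")))
    parser.add_argument("--data-dir",
                        default=environ.get("LSE_DATA_DIR", "/data"))
    return parser.parse_args(argv)


# The flag sets the port; the data directory falls back to the environment.
config = resolve_config(
    ["--port=7000"],
    {"LSE_PORT": "6000", "LSE_DATA_DIR": "/srv/lse"},
)
```

In a real process the arguments would come from `sys.argv[1:]` and `os.environ`. They are passed in explicitly here so the resolution order is easy to test.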
Contributions are welcome. The codebase is structured so each concern is isolated:
- Add a new RPC → edit proto/lse.proto, add the handler in src/grpc/lse_service.cpp
- Change storage format → edit include/lse/types.h and src/serialisation.cpp
- Tune compaction → edit src/compactor.cpp
Please open an issue before submitting a large PR.
MIT
