electro-geek/Stratum

Stratum

Built with: C++ · gRPC · Protocol Buffers · CMake · Docker · Python

A log-structured, hierarchical storage engine with O(1) average-case index lookup, built in C++ and exposed over gRPC, so any language with gRPC support can use it as a persistent backend.

Stratum stores structured data in a hierarchy of three linked tiers: a document (any HTML/text blob), a chain of input nodes attached to that document, and an output node attached to each input. All writes are append-only to log-structured segment files on disk. An in-memory hash index maps each document ID to its byte offset, so a read is an O(1) average-case lookup followed by a single disk seek. A background compaction thread merges and garbage-collects old segments automatically.


Why Stratum?

Most databases make you pick between:

  • A full relational DB (too heavy, forces a schema upfront)
  • A key-value store (too flat, you have to model relationships yourself)
  • A document store (good for the top level, awkward for linked chains of sub-documents)

Stratum is purpose-built for one pattern: one document → many ordered inputs → one output per input, where inputs and outputs can each hold any supported value type (integer, float, string, list, map). It handles durability, compaction, and concurrency internally, and speaks gRPC, so your Python, Go, or Node backend calls it much like a local function.


Architecture

Your application (Python / Go / Node / …)
        │
        │  gRPC (any language, auto-generated client)
        ▼
  Stratum server  ─────────────────────────────────────
  │                                                    │
  │  In-memory hash index                              │
  │  unordered_map<doc_id → {byte_offset, meta}>       │
  │  O(1) lookup — only id + offset loaded into RAM    │
  │                                                    │
  │  Segment manager (append-only log)                 │
  │  active.seg  seg_002.seg  seg_001.seg  merged.seg  │
  │                                                    │
  │  Compactor thread (background)                     │
  │  Merges segments, GCs tombstoned records           │
  ──────────────────────────────────────────────────────
        │
        ▼
  Disk  (log-structured segment files)
  Document blobs · Input linked list · Output linked list
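
The index-plus-segment design in the diagram can be illustrated with a small Python sketch. It is only a model of the idea (hash lookup, then one seek into an append-only file). TinySegment and its methods are hypothetical names, and the real engine's C++ record layout is not shown here.

```python
import os
import tempfile


class TinySegment:
    """Append-only file plus an in-memory index of doc_id -> (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        # "ab" creates the file if needed; writes always land at the end.
        self._file = open(path, "ab")

    def put(self, doc_id, blob: bytes):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(blob)
        self._file.flush()
        # Only the id, offset, and length stay in RAM, not the blob itself.
        self.index[doc_id] = (offset, len(blob))

    def get(self, doc_id) -> bytes:
        offset, length = self.index[doc_id]  # O(1) average-case hash lookup
        with open(self.path, "rb") as f:
            f.seek(offset)                   # one seek to the record
            return f.read(length)

    def close(self):
        self._file.close()


# Usage: two documents written back to back, then read by id.
workdir = tempfile.mkdtemp()
seg = TinySegment(os.path.join(workdir, "active.seg"))
seg.put(1, b"<h1>a</h1>")
seg.put(2, b"hello")
second = seg.get(2)
seg.close()
```

Keeping only offsets in memory is what lets the index stay small even when the blobs themselves are large.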

On-disk data model

Each document is stored as a blob at a known byte offset. The document record contains a pointer to the first node in the input chain. Each input node stores its value (any supported type) and two pointers — one to the next input node, one to its corresponding output node.

Document ──→ Input node 1 ──→ Input node 2 ──→ Input node 3 ──→ …
                  │                 │                 │
                  ▼                 ▼                 ▼
             Output node 1    Output node 2    Output node 3
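
A minimal Python sketch of walking this chain follows. The dict stands in for byte offsets on disk, and the class names, field names, and the NULL sentinel are assumptions made for illustration, not the structs in types.h.

```python
from dataclasses import dataclass
from typing import Any

NULL = -1  # hypothetical "no pointer" sentinel


@dataclass
class OutputNode:
    value: Any


@dataclass
class InputNode:
    value: Any
    next_input: int = NULL  # offset of the next input node
    output: int = NULL      # offset of this input's output node


@dataclass
class Document:
    blob: str
    first_input: int = NULL  # offset of the head of the input chain


def walk_inputs(store, doc_offset):
    """Yield (input value, output value or None) in chain order."""
    offset = store[doc_offset].first_input
    while offset != NULL:
        node = store[offset]
        out = store[node.output].value if node.output != NULL else None
        yield node.value, out
        offset = node.next_input


# Usage: keys play the role of byte offsets.
store = {
    0: Document("<h1>My Document</h1>", first_input=100),
    100: InputNode([2, 7, 11, 15], next_input=200, output=150),
    150: OutputNode([0, 1]),
    200: InputNode("some string input"),
}
pairs = list(walk_inputs(store, 0))
```

Reading every input for a document costs one hop per node, which is why a full walk is linear in chain length while a single-node fetch is not.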

All writes (including updates) are appended to the end of the active segment. Old versions of records become garbage and are reclaimed by the compactor. Deletes write a tombstone record; the original data is never mutated.
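
To make the append-and-tombstone behaviour concrete, here is a small in-memory sketch. AppendOnlyLog, TOMBSTONE, and garbage_count are hypothetical names, and a Python list stands in for the active segment.

```python
TOMBSTONE = object()  # hypothetical marker for a delete record


class AppendOnlyLog:
    def __init__(self):
        self.records = []  # stands in for the active segment
        self.index = {}    # record_id -> position of the newest version

    def _append(self, record_id, value):
        self.records.append((record_id, value))
        return len(self.records) - 1

    def put(self, record_id, value):
        # An update is just another append; the index moves to the new copy.
        self.index[record_id] = self._append(record_id, value)

    def delete(self, record_id):
        # The old data stays where it is; a tombstone is appended after it.
        self._append(record_id, TOMBSTONE)
        self.index.pop(record_id, None)

    def get(self, record_id):
        pos = self.index.get(record_id)
        return None if pos is None else self.records[pos][1]

    def garbage_count(self):
        # Anything the index no longer points at can be reclaimed later.
        return len(self.records) - len(self.index)


log = AppendOnlyLog()
log.put("doc", "v1")
log.put("doc", "v2")   # "v1" becomes garbage
log.put("other", "x")
log.delete("other")    # "x" and the tombstone become garbage
```

Nothing is ever overwritten in place, so a crash mid-write can at worst leave a partial record at the tail of the log.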

Supported value types

Both input nodes and output nodes can hold any of:

| Type | Example |
| --- | --- |
| int64 | 42 |
| double | 3.14 |
| string | "hello world" |
| list[int] | [2, 7, 11, 15] |
| list[string] | ["foo", "bar"] |
| map[string, string] | {"key": "value"} |
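
On the client side, a pre-flight check against this table might look like the sketch below. stratum_type is a hypothetical helper, not part of lse_client.py, and the choices to reject bool and to treat an empty list as list[int] are assumptions of the sketch.

```python
def stratum_type(value):
    """Return the table's type name for a Python value, or raise TypeError."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool):
        raise TypeError("bool is not in the supported list")
    if isinstance(value, int):
        if not -2**63 <= value < 2**63:
            raise TypeError("integer does not fit in int64")
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # An empty list satisfies the first check and is reported as list[int].
        if all(isinstance(v, int) and not isinstance(v, bool) for v in value):
            return "list[int]"
        if all(isinstance(v, str) for v in value):
            return "list[string]"
        raise TypeError("lists must contain only ints or only strings")
    if isinstance(value, dict):
        if all(isinstance(k, str) and isinstance(v, str)
               for k, v in value.items()):
            return "map[string, string]"
        raise TypeError("maps must be string to string")
    raise TypeError(f"unsupported type: {type(value).__name__}")


kinds = [stratum_type(v) for v in (42, 3.14, "hello world", [2, 7, 11, 15],
                                   ["foo", "bar"], {"key": "value"})]
```

Checking before the call turns a server-side error into an immediate local TypeError.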

File structure

stratum/
├── CMakeLists.txt
├── Dockerfile
├── docker-compose.yml
├── proto/
│   └── lse.proto               ← gRPC contract (edit this to add RPCs)
├── include/lse/
│   ├── types.h                 ← on-disk structs, ValueVariant
│   ├── serialisation.h         ← CRC-32, encode/decode
│   ├── segment.h               ← append-only segment file
│   ├── compactor.h             ← background GC thread
│   └── storage_engine.h        ← public C++ API
├── src/
│   ├── serialisation.cpp
│   ├── segment.cpp
│   ├── compactor.cpp
│   ├── storage_engine.cpp
│   └── grpc/
│       ├── lse_service.h       ← gRPC service declaration
│       ├── lse_service.cpp     ← every RPC wired to StorageEngine
│       └── server_main.cpp     ← binary entry point
└── client/
    └── python/
        ├── lse_client.py       ← Python wrapper (import this)
        ├── lse_pb2.py          ← auto-generated stubs
        ├── lse_pb2_grpc.py     ← auto-generated stubs
        └── requirements.txt
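
The tree above lists serialisation.h as the home of CRC-32 and encode/decode. As a rough idea of what checksummed record framing can look like, here is a Python sketch. The layout (4-byte length, 4-byte CRC-32, payload, little-endian) is an assumption for illustration and is not taken from the Stratum source.

```python
import struct
import zlib

# Hypothetical header: payload length and CRC-32, both unsigned 32-bit LE.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_record(buf: bytes, offset: int = 0):
    """Return (payload, next_offset); raise ValueError if the record is bad."""
    if len(buf) - offset < HEADER.size:
        raise ValueError("truncated header")
    length, crc = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    payload = buf[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    return payload, start + length


# Usage: two records in one buffer, decoded in sequence.
buf = encode_record(b"hello") + encode_record(b"world!")
first, nxt = decode_record(buf)
second, end = decode_record(buf, nxt)
```

A checksum per record lets a reader detect a torn or corrupted write at the tail of a segment instead of returning bad data.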

Getting started

Prerequisites

# C++ build tools + gRPC
sudo apt install -y \
  build-essential cmake pkg-config \
  libgrpc++-dev libprotobuf-dev \
  protobuf-compiler protobuf-compiler-grpc

# Python client
pip install grpcio grpcio-tools

Option A — Run with Docker (recommended)

git clone https://github.com/yourname/stratum
cd stratum
docker-compose up --build

The server starts on port 50051. Data persists in a named Docker volume across restarts. To stop: docker-compose down. To wipe data entirely: docker-compose down -v.

Option B — Build and run locally

git clone https://github.com/yourname/stratum
cd stratum

# 1. Generate gRPC C++ code from the proto file
protoc --proto_path=proto \
       --cpp_out=src/grpc \
       --grpc_out=src/grpc \
       --plugin=protoc-gen-grpc=/usr/bin/grpc_cpp_plugin \
       proto/lse.proto

# 2. Build
mkdir build && cd build
cmake .. && make -j$(nproc)

# 3. Start the server
./lse_server --port=50051 --data-dir=/path/to/your/data

Usage — Python client

Copy client/python/lse_client.py, lse_pb2.py, and lse_pb2_grpc.py into your project, then:

from lse_client import LseClient

lse = LseClient("localhost:50051")

# ── Create a document ─────────────────────────────────────────────────────────
doc_id = lse.create_problem(
    "<h1>My Document</h1><p>Any HTML or text content.</p>",
    name="Example",
    category="tutorial",
    version="1.0"
)

# ── Read it back ──────────────────────────────────────────────────────────────
content = lse.get_problem_html(doc_id)
meta    = lse.get_problem_meta(doc_id)   # {"name": "Example", "category": "tutorial", …}

# ── Update content (old version becomes garbage — GC handles it) ──────────────
lse.update_problem_html(doc_id, "<h1>Updated content</h1>")
lse.update_problem_column(doc_id, "version", "1.1")

# ── Add input nodes (any data type) ──────────────────────────────────────────
node1 = lse.add_test_case(doc_id, [2, 7, 11, 15])      # list of ints
node2 = lse.add_test_case(doc_id, "some string input")  # string
node3 = lse.add_test_case(doc_id, {"key": "value"})     # map

# ── Attach output nodes ───────────────────────────────────────────────────────
lse.set_expected_output(doc_id, node1, [0, 1])          # list answer
lse.set_expected_output(doc_id, node2, "output string") # string answer

# ── Read all input nodes for a document ──────────────────────────────────────
nodes = lse.get_all_test_cases(doc_id)
for n in nodes:
    print(n["tc_id"], n["value"])

# ── Read a specific output node ───────────────────────────────────────────────
output = lse.get_expected_output(doc_id, node1)
print(output["value"])   # [0, 1]

# ── Update a node value ───────────────────────────────────────────────────────
lse.update_test_case(doc_id, node1, [3, 2, 4])
lse.update_expected_output(doc_id, node1, [1, 2])

# ── Delete ────────────────────────────────────────────────────────────────────
lse.delete_expected_output(doc_id, node2)
lse.delete_test_case(doc_id, node3)
lse.delete_problem(doc_id)

# ── Admin ─────────────────────────────────────────────────────────────────────
stats = lse.get_stats()
# {"problems_in_index": 12, "active_segment_bytes": 4096, "segment_count": 3}

lse.flush()        # force active segment to disk
lse.compact_now()  # trigger immediate compaction (normally runs automatically)

gRPC API reference

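All operations are defined in proto/lse.proto. The RPC names use Problem, TestCase, and ExpectedOutput for what this README calls a document, an input node, and an output node. The full service: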

Documents

| RPC | Description |
| --- | --- |
| CreateProblem | Create a document with optional metadata columns |
| GetProblemHtml | Fetch the document blob (O(1) index lookup + 1 disk seek) |
| GetProblemMeta | Fetch metadata only (no disk seek for the blob) |
| UpdateProblemHtml | Append new version; old becomes garbage |
| UpdateProblemColumn | Update or add a metadata column |
| DeleteProblem | Tombstone a document |
| ListProblems | Return all live document IDs |

Input nodes

| RPC | Description |
| --- | --- |
| AddTestCase | Append a new input node to a document's chain |
| GetTestCase | Fetch a single input node (O(1) cache hit) |
| GetAllTestCases | Walk the full linked list for a document |
| UpdateTestCase | Rewrite a node's value |
| DeleteTestCase | Tombstone a node |

Output nodes

| RPC | Description |
| --- | --- |
| SetExpectedOutput | Attach an output node to an input node |
| GetExpectedOutput | Fetch the output node for an input node |
| UpdateExpectedOutput | Rewrite an output node's value |
| DeleteExpectedOutput | Remove the output node link |

Admin

| RPC | Description |
| --- | --- |
| GetStats | Index size, active segment size, segment count |
| Flush | Flush active segment to disk |
| CompactNow | Force immediate compaction (blocks until complete) |

Using from other languages

Because the API is defined in lse.proto, you can generate a client in any language gRPC supports.

Go (requires the protoc-gen-go and protoc-gen-go-grpc plugins on your PATH):

protoc --proto_path=proto --go_out=client/go --go-grpc_out=client/go proto/lse.proto

Node.js:

npm install @grpc/grpc-js @grpc/proto-loader
# then load proto dynamically — no codegen needed

Java, Kotlin, Rust, C#, and Ruby follow the same pattern: generate a client from the single .proto file using that language's gRPC tooling.


How compaction works

Stratum uses a strategy inspired by Bitcask and LSM-trees.

  1. All writes go to active.seg in append-only fashion.
  2. When active.seg reaches a configured size threshold, it is sealed (renamed to seg_NNN.seg) and a fresh active.seg is opened.
  3. The background compactor wakes every 30 seconds and checks total segment size.
  4. When total size exceeds the compaction threshold, it scans all sealed segments, keeps only the most recent version of each record (highest timestamp wins per record ID), discards tombstoned records, writes a single merged.seg, and atomically removes the old segments.
  5. The in-memory index reloads from the merged segment.

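Because every update and delete is a single append, write cost does not depend on how much data is already stored. Between compactions, disk usage grows with write history; each compaction brings it back to roughly the size of the live data.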
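
The merge in step 4 can be sketched in a few lines of Python. Record, compact, and the use of None as a tombstone are assumptions of this sketch. The real compactor works on segment files rather than lists, and how it breaks timestamp ties is not stated here.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    record_id: str
    timestamp: int
    value: Optional[bytes]  # None marks a tombstone (assumption of this sketch)


def compact(segments):
    """Merge sealed segments: newest timestamp wins, tombstones are dropped."""
    latest = {}
    for segment in segments:
        for rec in segment:
            current = latest.get(rec.record_id)
            # On equal timestamps the first record seen is kept.
            if current is None or rec.timestamp > current.timestamp:
                latest[rec.record_id] = rec
    # A tombstone that wins removes the record from the output entirely.
    merged = sorted((r for r in latest.values() if r.value is not None),
                    key=lambda r: r.timestamp)
    # Rebuild the index so it points into the merged segment.
    index = {r.record_id: pos for pos, r in enumerate(merged)}
    return merged, index


seg_001 = [Record("a", 1, b"a1"), Record("b", 2, b"b1")]
seg_002 = [Record("a", 3, b"a2"), Record("b", 4, None), Record("c", 5, b"c1")]
merged, index = compact([seg_001, seg_002])
```

Here five input records shrink to two: the stale version of "a" and both records for "b" are discarded.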


Configuration

| Flag | Environment variable | Default | Description |
| --- | --- | --- | --- |
| --port | LSE_PORT | 50051 | gRPC listen port |
| --data-dir | LSE_DATA_DIR | /data | Directory for segment files |
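
The table does not say which source wins when both a flag and its environment variable are set. The Python sketch below models one common convention (flag over environment variable over default) purely as an assumption. resolve_config is a hypothetical helper, for example for a launcher script, and does not describe the server's actual C++ parsing.

```python
import argparse


def resolve_config(argv, environ):
    """Resolve settings with assumed precedence: flag > env var > default."""
    parser = argparse.ArgumentParser(prog="lse_server")
    # The environment value, if present, becomes the default for the flag.
    parser.add_argument("--port", type=int,
                        default=int(environ.get("LSE_PORT", "50051")))
    parser.add_argument("--data-dir",
                        default=environ.get("LSE_DATA_DIR", "/data"))
    return parser.parse_args(argv)


defaults = resolve_config([], {})
from_env = resolve_config([], {"LSE_PORT": "6000"})
from_flag = resolve_config(["--port=7000"], {"LSE_PORT": "6000"})
```

In a real script the arguments would be sys.argv[1:] and os.environ; passing them in explicitly keeps the function easy to test.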

Contributing

Contributions are welcome. The codebase is structured so each concern is isolated:

  • Add a new RPC → edit proto/lse.proto, add the handler in src/grpc/lse_service.cpp
  • Change storage format → edit include/lse/types.h and src/serialisation.cpp
  • Tune compaction → edit src/compactor.cpp

Please open an issue before submitting a large PR.


License

MIT
