Database Format
This document describes the on-disk layout and binary formats used by the Photosphere media database (current format).
Database version: The database version is determined by the version field in .db/files.dat. The current format is version 6. The legacy format is version 5 (see Database-Format-Legacy.md). Most psi commands only work with version 6. The command psi upgrade migrates a database from older versions (including legacy version 5) to version 6.
Important: Do not modify database files manually. Use the Photosphere CLI (psi) for all operations.
A database is a single root directory. All paths below are relative to that root.
| Path | Description |
|---|---|
| README.md | Auto-generated warning and usage instructions. |
| .db/ | Database integrity data and media files metadata; see below. |
| asset/ | Original imported media files (one file per asset, keyed by asset UUID, no extension). |
| display/ | Display-sized derivatives (e.g. max 1000px, JPEG). One file per asset, keyed by UUID. |
| thumb/ | Thumbnail derivatives (e.g. max 300px, JPEG). One file per asset, keyed by UUID. |
The directories asset, display and thumb contain media files (photos and videos). The .db/ directory contains data to verify the database and the metadata for media files.
When the database is encrypted, all files under the database root are stored in the encrypted file format (see §5).
    <database root>/
      README.md
      .db/
        files.dat                 # Files Merkle tree (versioned + type + checksum, then encrypted)
        config.json               # Config (origin, etc.) (optional)
        write.lock
        encryption.pub            # Optional encryption marker - the public key if the db is encrypted.
        bson/                     # BSON database root
          db.dat                  # Database Merkle tree
          collections/
            metadata/             # "metadata" collection
              shards/
                <shardId>         # Shard data
                <shardId>.dat     # Shard Merkle tree
              collection.dat      # Collection Merkle tree
          indexes/
            metadata/
              hash_asc/
                tree.dat
                <pageId>          # UUID-named leaf pages
              photoDate_desc/
                tree.dat
                <pageId>
      asset/
        <uuid>                    # Original media (encrypted)
      display/
        <uuid>                    # Display media.
      thumb/
        <uuid>                    # Thumbnail media.
The .db/ directory contains all control and structured data: the BSON database, files Merkle tree, config (including origin), and lock/marker files. It is used to validate the integrity of the database and stores metadata about media files.
| Path | Description |
|---|---|
| .db/bson/ | BSON database root: all structured metadata and indexes (see §3). |
| .db/files.dat | Files Merkle tree: For each file under asset/, display/, and thumb/ only (the tree does not include an entry for itself or other .db/ files), stores hash, length, and lastModified of the logical (plain/decrypted) content. Used to verify integrity and compare databases; plain and encrypted databases with the same content compare equal. |
| .db/config.json | Configuration file: JSON object with an origin field (path to the database this copy was replicated from) and room for other settings (see §6). Used for sync, repair, and fulfilling missing files. |
| .db/write.lock | Write lock (when held). |
| .db/encryption.pub | Optional marker: copy of the public key used for encryption (enables “this DB is encrypted” detection). |
Serialized files under .db/ use the versioned serialized format (version, type, payload, checksum) before encryption (see §4).
Structured metadata is stored in a BSON-based layout with sharded collections and sort indexes; its root is .db/bson/.
| Path | Description |
|---|---|
| db.dat | Database Merkle tree. Used to verify the integrity of the database and compare databases for differences. |
| collections/ | One subdirectory per collection; each contains a shards/ subdirectory (shard files and shard Merkle trees <shardId>.dat) and collection.dat at the collection root (see §3.2). |
| indexes/ | One subdirectory per sort index, named <collectionName>/<fieldName>_<direction>/; each contains B-tree metadata and leaf page files (see §3.4). |
A collection contains records that share the same schema or purpose, so different kinds of data can be stored and queried separately. Each collection is a directory under .db/bson/collections/ (e.g. .db/bson/collections/metadata for the asset metadata collection). The Photosphere app uses a single collection named metadata where each record describes a media file (photo or video).
- shards/: Subdirectory containing all shard data for the collection:
  - Shard files: one file per shard, named by shard ID (e.g. 0, 1, …, 96). No extension. Shard ID is md5(recordId)[0:8] % numShards (default 100 shards). Path: .db/bson/collections/<collectionName>/shards/<shardId>.
  - Shard Merkle trees: next to each shard file: <shardId>.dat (e.g. 96.dat). Used to build the collection Merkle tree. Path: .db/bson/collections/<collectionName>/shards/<shardId>.dat.
- Collection Merkle tree: .db/bson/collections/<collectionName>/collection.dat (e.g. .db/bson/collections/metadata/collection.dat), at the collection root. Aggregates shard root hashes.
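The shard ID formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say whether `[0:8]` refers to hex characters or bytes, nor in which textual form the record ID is hashed, so the sketch assumes the first 8 hex characters of the MD5 digest and a dash-less, lower-case ID. The function names `shard_id` and `shard_paths` are hypothetical.

```python
import hashlib


def shard_id(record_id: str, num_shards: int = 100) -> int:
    """Map a record ID to a shard number in the range [0, num_shards)."""
    # Assumption: the ID is hashed in its dash-less, lower-case hex form.
    normalized = record_id.replace("-", "").lower()
    digest_hex = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    # Assumption: "[0:8]" means the first 8 hex characters (32 bits).
    return int(digest_hex[:8], 16) % num_shards


def shard_paths(collection: str, record_id: str, num_shards: int = 100) -> tuple[str, str]:
    """Return (shard data path, shard Merkle tree path) relative to the database root."""
    base = f".db/bson/collections/{collection}/shards/{shard_id(record_id, num_shards)}"
    return base, base + ".dat"
```

Because the shard number depends only on the record ID and the shard count, any reader can locate a record's shard file without consulting an index.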
A shard is a file that holds many collection records; records are distributed across shards by shard ID (shard ID formula and collection layout: see §3.2). Shards exist because storing each record as its own file is expensive: every file occupies at least one filesystem block (typically 4 KiB on Linux). Packing many records into one shard uses far less space than storing them individually (say, as one JSON file per record).
Shard files use the versioned serialized format (see §4): version, type, payload, then SHA-256 checksum. The payload (e.g. version 2) is:
- [4 bytes], Record count (uint32 LE).
- For each record (sorted by _id):
  - [16 bytes], Record ID as raw UUID bytes (no dashes, 16 bytes hex decoded).
  - [BSON], Record fields (BSON document; _id is stored separately).
  - [BSON], Metadata: { timestamp?, fields? } for field-level timestamps.
Record IDs are normalized to 32 hex characters (the UUID without dashes, 16 bytes when decoded) for shard keying; on read they are formatted back to the standard UUID string.
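As a rough illustration of the shard payload layout described above, the following Python sketch walks a payload and splits it into records without decoding the BSON itself. It relies on the fact that every BSON document begins with its own total length as a little-endian int32. The function names are hypothetical, and the sketch assumes the payload has already been extracted from the versioned envelope (and decrypted, if applicable).

```python
import struct
import uuid


def read_bson_document(buf: bytes, offset: int) -> tuple[bytes, int]:
    """Return the raw bytes of one BSON document and the offset just past it."""
    # Every BSON document starts with its total length (int32, little-endian).
    (length,) = struct.unpack_from("<i", buf, offset)
    if length < 5 or offset + length > len(buf):
        raise ValueError("truncated or invalid BSON document")
    return buf[offset:offset + length], offset + length


def parse_shard_payload(payload: bytes) -> list[tuple[str, bytes, bytes]]:
    """Split a shard payload into (record ID, fields BSON, metadata BSON) tuples."""
    (count,) = struct.unpack_from("<I", payload, 0)
    offset = 4
    records = []
    for _ in range(count):
        # 16 raw bytes, formatted back to the standard dashed UUID string.
        record_id = str(uuid.UUID(bytes=payload[offset:offset + 16]))
        offset += 16
        fields, offset = read_bson_document(payload, offset)
        metadata, offset = read_bson_document(payload, offset)
        records.append((record_id, fields, metadata))
    return records
```

The raw BSON slices can then be handed to any BSON decoder; keeping decoding separate means the framing logic needs no third-party library.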
Sort indexes exist so that ordered and range queries (e.g. “list by date”, “find by hash”) can be answered without scanning the whole collection: the index keeps records ordered by the indexed field, and the B-tree supports efficient lookup and pagination. Photosphere uses two indexes on the metadata collection: hash (asc) and photoDate (desc). The hash index is needed to look up an asset by content hash (e.g. for deduplication, verify, or finding an existing record before import). The photoDate index is needed to list or browse assets by capture date (e.g. timeline view, newest first).
Sort indexes live under .db/bson/indexes/<collectionName>/<fieldName>_<direction>/ (e.g. .db/bson/indexes/metadata/hash_asc/, .db/bson/indexes/metadata/photoDate_desc/). The direction (asc or desc) and type (date, string, number) determine how values are compared (dates as timestamps, strings lexicographically, numbers numerically). The B-tree’s keys are the indexed values; leaf pages hold index entries (record ID, value, and a copy of the record’s fields).
- tree.dat: B-tree metadata and node descriptors. Versioned serialized format with type and checksum.
- <pageId>: Leaf page files; page IDs are UUIDs. Each file is a versioned serialized page of index entries.
- build.checkpoint: Optional JSON checkpoint for incremental index builds (also stored in encrypted form if the DB is encrypted).
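The comparison rules for sort indexes (dates as timestamps, strings lexicographically, numbers numerically, with the direction flipping the order) might look like the sketch below. It is not the Photosphere implementation: the helper names are hypothetical, and it assumes date values arrive as ISO 8601 strings or `datetime` objects.

```python
from datetime import datetime
from functools import cmp_to_key


def to_comparable(value, value_type: str):
    """Convert an indexed value into something Python can order directly."""
    if value_type == "date":
        # Assumption: dates are ISO 8601 strings (or datetime objects).
        if isinstance(value, str):
            value = datetime.fromisoformat(value.replace("Z", "+00:00"))
        return value.timestamp()
    if value_type == "number":
        return float(value)
    return str(value)  # strings compare lexicographically


def make_comparator(value_type: str, direction: str):
    """Build a comparator for an index such as photoDate_desc or hash_asc."""
    sign = 1 if direction == "asc" else -1

    def compare(a, b) -> int:
        ka, kb = to_comparable(a, value_type), to_comparable(b, value_type)
        return sign * ((ka > kb) - (ka < kb))

    return compare


# Usage: newest-first ordering, as a photoDate (desc) index would provide.
dates = ["2024-05-01T00:00:00.000Z", "2026-02-01T12:00:00.000Z"]
newest_first = sorted(dates, key=cmp_to_key(make_comparator("date", "desc")))
```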
tree.dat payload (version 2):
- totalEntries (uint32), totalPages (uint32).
- rootPageId (buffer/length-prefixed string).
- fieldName, direction (buffer/length-prefixed strings).
- type (uint8): 0 = none, 1 = date, 2 = string, 3 = number.
- Reserved 8 bytes (uint64).
- nodeCount (uint32).
- For each node (by sorted pageId): pageId, node (keys, children, nextLeaf, previousLeaf), etc.
Leaf page file payload (version 1):
- Record count (uint32 LE).
- For each entry: record ID (length-prefixed buffer, UTF-8), value (BSON { value }), record fields (BSON document).
Every serialized file (Merkle trees, BSON shards, sort index trees and pages, etc.) uses a single layout so that readers can verify and dispatch by type.
- [4 bytes], Version (uint32 LE).
- [4 bytes], Type code (4-character ASCII, 32 bits). Identifies the kind of file. Each file kind has a distinct 4-byte ASCII code (e.g. FTRE = files Merkle tree, BDBT = BSON database tree, SHAR = collection shard, COLT = collection and per-shard Merkle tree, IDXT = index B-tree metadata, IDXP = index leaf page). Stored in the same byte order as the rest of the file (e.g. little-endian as a uint32). Writers and readers use the same code for each kind. Readers use the type code to route to the correct deserializer or reject unknown types.
- [payload], Version- and type-specific payload.
- [32 bytes], SHA-256 checksum of the concatenation: version + type + payload.
Primitives are little-endian (uint32, int32, uint64, int64); strings and buffers are length-prefixed; documents use BSON (lengths 32-bit where length-prefixed).
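The envelope described above (version, type code, payload, SHA-256 checksum) can be illustrated with a small pack/unpack pair. This is a sketch, not the project's serializer: the function names are hypothetical, and the type code is passed through as the raw 4 bytes found on disk, since the text leaves open how the ASCII characters map onto the little-endian uint32.

```python
import hashlib
import struct

CHECKSUM_SIZE = 32  # SHA-256 digest length in bytes


def pack_serialized(version: int, type_code: bytes, payload: bytes) -> bytes:
    """Build version + type + payload, followed by the SHA-256 of those bytes."""
    if len(type_code) != 4:
        raise ValueError("type code must be exactly 4 bytes")
    body = struct.pack("<I", version) + type_code + payload
    return body + hashlib.sha256(body).digest()


def unpack_serialized(blob: bytes) -> tuple[int, bytes, bytes]:
    """Verify the checksum and return (version, raw type code bytes, payload)."""
    if len(blob) < 4 + 4 + CHECKSUM_SIZE:
        raise ValueError("file too short for the versioned format")
    body, checksum = blob[:-CHECKSUM_SIZE], blob[-CHECKSUM_SIZE:]
    if hashlib.sha256(body).digest() != checksum:
        raise ValueError("checksum mismatch")
    (version,) = struct.unpack_from("<I", body, 0)
    return version, body[4:8], body[8:]
```

A reader would typically check the returned type code against the kinds it knows before handing the payload to a type-specific deserializer.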
Almost all files under the database root are encrypted when the database is encrypted. The following files are always stored in plain text (never encrypted):
- README.md
- .db/config.json
- .db/encryption.pub
All other files (asset, display, thumb, .db/files.dat, .db/bson/*, .db/write.lock) are stored in the encrypted file format. There is no mixed encrypted/unencrypted layout for these files.
Encrypted file format:
Each encrypted file uses one of two formats:
- New format: A fixed header (unencrypted), then the encrypted payload. The header lets the app identify the format and which key was used without decrypting.
- Legacy format: No header; the file starts directly with the encrypted payload. Readers treat such files as encrypted with the default key (see Encryption).
New-format header (44 bytes):
- [4 bytes], Tag (e.g. PSEN), 4-character ASCII. If the first 4 bytes are not this tag, the file is treated as legacy format.
- [4 bytes], Format version (uint32 LE).
- [4 bytes], Encryption type (4-character ASCII, e.g. A2CB).
- [32 bytes], Key hash (SHA-256 of the public key used to encrypt this file, for key lookup).
Encrypted payload:
- For new format, the payload immediately follows the 44-byte header. For legacy format, the payload starts at byte 0.
- Payload layout is the same in both cases: RSA-wrapped AES key (512 bytes), IV (16 bytes), then AES-256-CBC ciphertext. The plaintext that is encrypted is the full serialized content (e.g. the versioned serialized blob with version, type, payload, and checksum, or a raw media blob).
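To make the two encrypted layouts concrete, here is a sketch that inspects raw on-disk bytes and splits them into header fields and payload parts without decrypting anything. It assumes the tag is literally `PSEN` (the text gives it only as an example); the names `EncryptedFile` and `split_encrypted_file` are hypothetical.

```python
import struct
from typing import NamedTuple, Optional

HEADER_SIZE = 44        # 4 (tag) + 4 (version) + 4 (encryption type) + 32 (key hash)
TAG = b"PSEN"           # Assumption: the example tag from the text is the real one.
WRAPPED_KEY_SIZE = 512  # RSA-wrapped AES key
IV_SIZE = 16            # AES-CBC initialization vector


class EncryptedFile(NamedTuple):
    format_version: Optional[int]     # None for legacy (header-less) files
    encryption_type: Optional[bytes]  # e.g. b"A2CB"; None for legacy files
    key_hash: Optional[bytes]         # SHA-256 of the public key; None for legacy
    wrapped_key: bytes
    iv: bytes
    ciphertext: bytes


def split_encrypted_file(data: bytes) -> EncryptedFile:
    """Split raw file bytes into header fields and encrypted payload parts."""
    version = enc_type = key_hash = None
    body = data  # legacy format: the payload starts at byte 0
    if data[:4] == TAG:
        if len(data) < HEADER_SIZE:
            raise ValueError("truncated encryption header")
        (version,) = struct.unpack_from("<I", data, 4)
        enc_type, key_hash = data[8:12], data[12:HEADER_SIZE]
        body = data[HEADER_SIZE:]
    if len(body) < WRAPPED_KEY_SIZE + IV_SIZE:
        raise ValueError("truncated encrypted payload")
    iv_end = WRAPPED_KEY_SIZE + IV_SIZE
    return EncryptedFile(version, enc_type, key_hash,
                         body[:WRAPPED_KEY_SIZE], body[WRAPPED_KEY_SIZE:iv_end], body[iv_end:])
```

The key hash returned for new-format files is what a reader would use to pick the matching private key before attempting to unwrap the AES key.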
Raw storage: When the application needs to inspect the on-disk bytes of a file before decryption, for example, to read the encryption header to determine which key was used, it uses a raw storage instance. Raw storage is a storage object created without an encryption layer: it returns the exact bytes stored on disk, including any encryption header. This is in contrast to the normal (encrypted) storage instance, which transparently decrypts file content on read. Raw storage is used exclusively for header inspection and must never be used to read or interpret file payloads that require decryption.
A database stores optional configuration in .db/config.json, a JSON file under the .db directory. This file is created when the database is initialized (psi init) or when it is upgraded (psi upgrade); if it does not exist after upgrade, an empty object {} is written.
| Field | Type | Description |
|---|---|---|
| origin | string (optional) | Path or URI of the database this copy was replicated from. Set on the replica when you run psi replicate (the replica’s origin is the source path). Can be set manually with psi set-origin <path>. |
| lastReplicatedAt | string (optional) | ISO 8601 date-time when this database was last replicated (i.e. when it was written as a replica from a source). Updated on the replica after each psi replicate. |
| lastSyncedAt | string (optional) | ISO 8601 date-time when this database was last synchronized with another. Updated on both sides after each psi sync. |
| lastModifiedAt | string (optional) | ISO 8601 date-time when this database was last modified locally (e.g. adding an asset, removing an asset, or editing metadata). Updated by psi add, psi remove, and by API writes. |
    {
      "origin": "/path/to/source/database",
      "lastReplicatedAt": "2026-02-01T12:00:00.000Z",
      "lastSyncedAt": "2026-02-01T14:30:00.000Z",
      "lastModifiedAt": "2026-02-01T10:15:00.000Z"
    }

The origin value is used as the default for:
- psi sync: If --dest is omitted, the other database is taken from origin.
- psi replicate: If --dest is omitted, the destination is taken from origin (e.g. replicate from a copy back to its source).
- psi repair: If --source is omitted, the repair source is taken from origin.
- psi compare: If --dest is omitted, the second database is taken from origin.
Commands psi origin and psi set-origin <path> display and set the origin field.
If the database was not created by replication and no origin has been set, origin may be absent; in that case, sync/replicate/repair/compare require an explicit --dest or --source.
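The fallback to origin described above amounts to a small lookup. The sketch below is illustrative only and is not how psi is implemented; `load_config` and `resolve_other_database` are hypothetical names, and the config file is read as plain JSON because .db/config.json is never encrypted.

```python
import json
from pathlib import Path
from typing import Optional


def load_config(db_root: str) -> dict:
    """Read .db/config.json, treating a missing file as an empty config."""
    config_path = Path(db_root) / ".db" / "config.json"
    if not config_path.exists():
        return {}
    return json.loads(config_path.read_text(encoding="utf-8"))


def resolve_other_database(db_root: str, explicit: Optional[str], flag: str = "--dest") -> str:
    """Prefer an explicit path; otherwise fall back to the configured origin."""
    if explicit:
        return explicit
    origin = load_config(db_root).get("origin")
    if not origin:
        raise ValueError(f"no origin configured; pass {flag} explicitly")
    return origin
```

For example, `resolve_other_database("/photos/replica", None, flag="--source")` would return the replica's origin for a repair, or raise if none has been set.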
A database can be full or partial. The layout and file formats are identical; the difference is which files are physically present on disk.
Full database: All asset files are stored; asset/, display/, and thumb/ each have one file per asset. The BSON database under .db/bson/ and all of .db/ are complete. This is the normal state after import or after a full psi replicate.
Partial database: Created with psi replicate --partial. The following files are copied from the source:
| File(s) | Purpose |
|---|---|
| README.md | Auto-generated usage instructions. |
| .db/files.dat | Complete files Merkle tree; lists all asset, display, and thumb paths (even those not yet fetched). |
| .db/config.json | Config including origin (path to the source database). |
| .db/bson/*.dat | All collection and sort-index Merkle trees (db.dat, collection.dat, shard .dat files, index tree.dat files). |
The following files are not copied during a partial replication:
| File(s) | Why omitted |
|---|---|
| asset/<uuid> | Original media files; large and not needed for browsing. |
| display/<uuid> | Display-sized derivatives; fetched on demand when a photo is opened. |
| thumb/<uuid> | Thumbnails; not copied in partial mode, fetched on demand for the gallery grid. |
| .db/bson/collections/<name>/shards/<id> (no extension) | Shard data files containing asset records; fetched on demand when the gallery loads. |
| Sort index leaf pages (UUID-named files under indexes/) | Leaf pages; fetched on demand as the gallery paginates. |
The isPartial: true flag is written into the database metadata embedded in .db/files.dat. This flag tells tools that missing files are expected (e.g. psi verify does not report them as errors).
When the backend serves a partial database to the GUI, missing files are fetched transparently from the origin database on first access and cached locally for subsequent requests. No special action is needed from the user or the GUI.
The origin path is read from .db/config.json (origin field). If origin is absent or the database is not partial, the backend reads files locally as normal.
| Resource | Trigger |
|---|---|
| Sort index leaf pages | When the gallery loads and paginates through assets. |
| Shard data files | When the gallery reads asset metadata records. |
| thumb/<uuid> | When a thumbnail is displayed in the gallery grid. |
| display/<uuid> | When a photo is opened at full size. |
| asset/<uuid> | When the original file is downloaded or exported. |
- First access: The backend fetches the file from origin and writes it to the local database directory. Subsequent accesses are served from local storage with no round-trip to origin.
- Large files (e.g. 7 GB videos): readStream uses a tee-stream so origin data is forwarded directly to the HTTP response and simultaneously written to the local cache; the file is never buffered in memory.
- Cache write failures: If writing to the local cache fails (e.g. disk full), the data is still streamed to the caller from origin. The file will be fetched from origin again on the next request.
- Concurrent requests: If two requests arrive for the same missing file simultaneously, both may fetch from origin independently. The writes are idempotent (identical data), so no corruption occurs.
    GUI request
        │
        ▼
    Backend (LazyOriginStorage)
        │
        ├── file in local cache? ──yes──► stream from local
        │
        └── no ──► fetch from origin ──► tee ──► stream to caller
                                             └──► write to local cache
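The flow in the diagram can be approximated with a read-through generator. This is a simplified sketch, not the LazyOriginStorage implementation: it treats both the local database and the origin as plain directories, ignores encryption, and the function name `read_stream`, the chunk size, and the write-to-temporary-file-then-rename step are all assumptions made for the example.

```python
import uuid
from pathlib import Path
from typing import Iterator

CHUNK_SIZE = 64 * 1024  # stream in small chunks so large files never sit in memory


def read_stream(local_root: str, origin_root: str, rel_path: str) -> Iterator[bytes]:
    """Yield a file's bytes, pulling from origin and caching locally on a miss."""
    local = Path(local_root) / rel_path
    if local.exists():  # cache hit: serve from local storage
        with local.open("rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                yield chunk
        return

    # Cache miss: tee each origin chunk to the caller and to a temporary file.
    tmp = local.with_name(f"{local.name}.{uuid.uuid4().hex}.partial")
    cache = None
    try:
        local.parent.mkdir(parents=True, exist_ok=True)
        cache = tmp.open("wb")
    except OSError:
        cache = None  # cannot cache; still stream from origin

    completed = False
    try:
        with (Path(origin_root) / rel_path).open("rb") as src:
            while chunk := src.read(CHUNK_SIZE):
                if cache is not None:
                    try:
                        cache.write(chunk)
                    except OSError:  # e.g. disk full: give up caching only
                        cache.close()
                        cache = None
                yield chunk
        completed = True
    finally:
        if cache is not None:
            cache.close()
        if cache is not None and completed:
            tmp.replace(local)  # publish only a fully written file
        else:
            tmp.unlink(missing_ok=True)
```

Writing to a uniquely named temporary file means an interrupted download never leaves a truncated file that later looks like a valid cache entry, and two concurrent requests for the same file do not write over each other.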
    psi replicate --db <source-path> --dest <local-path> --partial

The destination will have isPartial: true in .db/files.dat and an origin entry in .db/config.json pointing to the source. This is all the backend needs to perform lazy-pulls automatically.
To convert a partial database into a full one, run a normal psi replicate from the origin to the partial database (without --partial). This fills in all missing files in bulk.
psi verify reads the isPartial flag and skips integrity checks for files that are absent from a partial database. A missing shard, thumb, display, or asset file in a partial database is reported as unmodified (expected), not as an error.
The metadata collection stores asset records. Main fields:
- _id, UUID string.
- origFileName, origPath?, contentType, width, height, hash.
- coordinates?, location?, duration?, fileDate, photoDate?, uploadDate.
- properties?, labels?, description?, deleted?.
- micro, base64 micro thumbnail.
- color, [number, number, number] (e.g. dominant color).
Sort indexes used in practice: hash (asc), photoDate (desc).