Database Format
This document describes the on-disk layout and binary formats used by the Photosphere media database (current format).
Database version: The database version is determined by the version field in .db/files.dat. The current format is version 6. The legacy format is version 5 (see Database-Format-Legacy.md). Most psi commands only work with version 6. The command psi upgrade migrates a database from older versions (including legacy version 5) to version 6.
Important: Do not modify database files manually. Use the Photosphere CLI (psi) for all operations.
A database is a single root directory. All paths below are relative to that root.
| Path | Description |
|---|---|
| README.md | Auto-generated warning and usage instructions. |
| .db/ | Database-level control and integrity data; see below. |
| asset/ | Original imported media files (one file per asset, keyed by asset UUID, no extension). |
| display/ | Display-sized derivatives (e.g. max 1000px, JPEG). One file per asset, keyed by UUID. |
| thumb/ | Thumbnail derivatives (e.g. max 300px, JPEG). One file per asset, keyed by UUID. |
The .db/ directory contains all control and structured data: the BSON database, files Merkle tree, config (including origin), and lock/marker files. Media directories (asset/, display/, thumb/) contain only binary blobs (one per asset).
    <database root>/
      README.md
      .db/
        files.dat                 # Files Merkle tree (versioned + type + checksum, then encrypted)
        config.json               # Config (origin, etc.) (optional)
        write.lock
        encryption.pub            # Optional encryption marker
        bson/                     # BSON database root
          db.dat                  # Database Merkle tree
          collections/
            metadata/             # "metadata" collection
              shards/
                <shardId>         # Shard data
                <shardId>.dat     # Shard Merkle tree
              collection.dat      # Collection Merkle tree
          indexes/
            metadata/
              hash_asc/
                tree.dat
                <pageId>          # UUID-named leaf pages
              photoDate_desc/
                tree.dat
                <pageId>
      asset/
        <uuid>                    # Original media (encrypted)
      display/
        <uuid>                    # Display media.
      thumb/
        <uuid>                    # Thumbnail media.
| Path | Description |
|---|---|
| .db/bson/ | BSON database root: all structured metadata and indexes (see §3). |
| .db/files.dat | Files Merkle tree: hashes of asset/display/thumb files and database metadata; used for sync and verify. |
| .db/config.json | Configuration file: JSON object with an origin field (path to the database this copy was replicated from) and room for other settings (see §6). Used for sync, repair, and fulfilling missing files. |
| .db/write.lock | Write lock (when held). |
| .db/encryption.pub | Optional marker: copy of the public key used for encryption (enables “this DB is encrypted” detection). |
All files under .db/ (and everywhere else) are stored in the encrypted file format when the database is encrypted (see §5). Serialized files under .db/ use the versioned serialized format (version, type, payload, checksum) before encryption (see §4).
Structured metadata is stored in a BSON-based layout with sharded collections and sort indexes; its root is .db/bson/.
Summary of contents under .db/bson/:
| Path | Description |
|---|---|
| db.dat | Database Merkle tree: one root hash per collection (see §3.1). |
| collections/ | One subdirectory per collection; each contains a shards/ subdirectory (shard files and shard Merkle trees <shardId>.dat) and collection.dat at the collection root (see §3.2, §3.3). |
| indexes/ | One subdirectory per index, named <collectionName>/<fieldName>_<direction>/; each contains B-tree metadata and leaf page files (see §3.4). |
| Path | Description |
|---|---|
| .db/bson/db.dat | Database Merkle tree: one root hash per collection; used for replication and integrity. |
Format: versioned serialized file (version, type, payload, checksum). Payload is the Merkle tree serialization (e.g. current tree version 5).
Each collection is a directory under .db/bson/collections/ (e.g. .db/bson/collections/metadata for the asset metadata collection). The Photosphere app uses a single collection named metadata whose records are asset documents.
- shards/: Subdirectory containing all shard data for the collection:
  - Shard files: one file per shard, named by shard ID (e.g. 0, 1, …, 96). No extension. Shard ID is md5(recordId)[0:8] % numShards (default 100 shards). Path: .db/bson/collections/<collectionName>/shards/<shardId>.
  - Shard Merkle trees: next to each shard file, <shardId>.dat (e.g. 96.dat). Used to build the collection Merkle tree. Path: .db/bson/collections/<collectionName>/shards/<shardId>.dat.
- Collection Merkle tree: .db/bson/collections/<collectionName>/collection.dat (e.g. .db/bson/collections/metadata/collection.dat), at the collection root. Aggregates shard root hashes.
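The shard assignment rule can be illustrated with a small sketch. It assumes that "[0:8]" means the first 8 hex characters of the MD5 digest, read as an unsigned integer, and that the record ID is hashed in its UTF-8 string form; this document does not confirm either detail, and the function names are hypothetical.

```python
import hashlib

DEFAULT_NUM_SHARDS = 100  # default shard count stated in the text


def shard_id_for(record_id: str, num_shards: int = DEFAULT_NUM_SHARDS) -> int:
    """Map a record ID to a shard ID in the range [0, num_shards).

    Assumption: "[0:8]" selects the first 8 hex characters of the MD5 digest.
    """
    digest_hex = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return int(digest_hex[:8], 16) % num_shards


def shard_path(collection_name: str, record_id: str) -> str:
    """Build the shard file path for a record, relative to the database root."""
    shard_id = shard_id_for(record_id)
    return f".db/bson/collections/{collection_name}/shards/{shard_id}"
```

Because the shard ID depends only on the record ID and the shard count, any reader can locate a record's shard without consulting an index.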
A shard is a file that holds many collection records; records are distributed across shards by shard ID (see §3.2). Shard files use the versioned serialized format (see §4): version, type, payload, then SHA-256 checksum. The payload (e.g. version 2) is:
- [4 bytes]: Record count (uint32 LE).
- For each record (sorted by _id):
  - [16 bytes]: Record ID as raw UUID bytes (no dashes, 16 bytes hex decoded).
  - [BSON]: Record fields (BSON document; _id is stored separately).
  - [BSON]: Metadata: { timestamp?, fields? } for field-level timestamps.
Record IDs are normalized to 16 raw bytes (the UUID's 32 hex digits without dashes, hex decoded) for shard keying; on read they are formatted back to the standard UUID string.
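The shard payload layout can be walked without a BSON library, because every BSON document begins with its own total size as an int32 LE. The sketch below splits a payload (the bytes after the version and type fields and before the checksum) into records and leaves the BSON documents undecoded. The function names are hypothetical, and the sketch assumes the payload has already been decrypted and checksum-verified.

```python
import struct
import uuid


def read_bson_bytes(buf: bytes, offset: int) -> tuple[bytes, int]:
    """Slice one BSON document out of buf without decoding it.

    A BSON document starts with its total size as an int32 LE (size included).
    """
    if offset + 4 > len(buf):
        raise ValueError("truncated BSON length prefix")
    (size,) = struct.unpack_from("<i", buf, offset)
    if size < 5 or offset + size > len(buf):
        raise ValueError("invalid BSON document length")
    return buf[offset:offset + size], offset + size


def split_shard_payload(payload: bytes) -> list[tuple[str, bytes, bytes]]:
    """Return (record_id, fields_bson, metadata_bson) for each record."""
    (count,) = struct.unpack_from("<I", payload, 0)
    offset = 4
    records = []
    for _ in range(count):
        raw_id = payload[offset:offset + 16]
        if len(raw_id) != 16:
            raise ValueError("truncated record ID")
        offset += 16
        fields, offset = read_bson_bytes(payload, offset)
        metadata, offset = read_bson_bytes(payload, offset)
        # Format the 16 raw bytes back into the standard dashed UUID string.
        records.append((str(uuid.UUID(bytes=raw_id)), fields, metadata))
    return records
```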
Sort indexes live under .db/bson/indexes/<collectionName>/<fieldName>_<direction>/ (e.g. .db/bson/indexes/metadata/hash_asc/, .db/bson/indexes/metadata/photoDate_desc/), separate from collection data, which lives under .db/bson/collections/<collectionName>/ (shards under shards/, collection Merkle tree collection.dat at the collection root). Each index orders collection records by the indexed field’s value in the given direction (asc or desc). The index type (date, string, number) determines how values are compared for ordering: dates as timestamps, strings lexicographically, numbers numerically. The B-tree’s keys are these values; leaf pages hold index entries (record ID, value, and a copy of the record’s fields) in sorted order, so pagination and range queries can be served without scanning the whole collection.
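The ordering rule can be sketched as a comparison key chosen by index type. The numeric type codes follow the tree.dat description in this document (1 = date, 2 = string, 3 = number); the function names are hypothetical, and details such as tie-breaking between equal values or the exact string collation are not specified here, so Python's default code-point ordering stands in for "lexicographic".

```python
from datetime import datetime, timezone

TYPE_DATE, TYPE_STRING, TYPE_NUMBER = 1, 2, 3  # codes from the tree.dat layout


def comparison_key(value, value_type: int):
    """Turn an indexed value into something directly comparable."""
    if value_type == TYPE_DATE:
        return value.timestamp()  # dates compare as timestamps
    if value_type == TYPE_STRING:
        return value              # strings compare lexicographically
    if value_type == TYPE_NUMBER:
        return float(value)       # numbers compare numerically
    raise ValueError(f"unsupported index type: {value_type}")


def sort_entries(entries, value_type: int, direction: str):
    """Sort (record_id, value) pairs the way an index of this type would."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return sorted(
        entries,
        key=lambda entry: comparison_key(entry[1], value_type),
        reverse=(direction == "desc"),
    )
```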
Each index directory contains:
- tree.dat: B-tree metadata and node descriptors. Versioned serialized format with type and checksum.
- <pageId>: Leaf page files; page IDs are UUIDs. Each file is a versioned serialized page of index entries.
- build.checkpoint: Optional JSON checkpoint for incremental index builds (also stored in encrypted form if the DB is encrypted).
tree.dat payload (version 2):
- totalEntries (uint32), totalPages (uint32).
- rootPageId (buffer/length-prefixed string).
- fieldName, direction (buffer/length-prefixed strings).
- type (uint8): 0 = none, 1 = date, 2 = string, 3 = number.
- Reserved 8 bytes (uint64).
- nodeCount (uint32).
- For each node (by sorted pageId): pageId, node (keys, children, nextLeaf, previousLeaf), etc.
Leaf page file payload (version 1):
- Record count (uint32 LE).
- For each entry: record ID (length-prefixed buffer, UTF-8), value (BSON { value }), record fields (BSON document).
Every serialized file (Merkle trees, BSON shards, sort index trees and pages, etc.) uses a single layout so that readers can verify and dispatch by type.
Layout (always used):
- [4 bytes]: Version (uint32 LE).
- [4 bytes]: Type code (uint32 LE). Identifies the kind of file (e.g. files Merkle tree, BSON DB tree, collection shard, sort index tree, sort index page).
- [payload]: Version- and type-specific payload.
- [32 bytes]: SHA-256 checksum of the concatenation: version + type + payload.
Primitives are little-endian (uint32, int32, uint64, int64); strings and buffers are length-prefixed; documents use BSON (lengths 32-bit where length-prefixed). Type codes are assigned centrally so that all readers can reject unknown types or route to the correct deserializer.
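As an illustration of the layout above, the following sketch wraps and unwraps a payload in the version, type, payload, checksum envelope. The function names are hypothetical, and the type code values used by Photosphere are not listed in this document, so any value passed here is a placeholder.

```python
import hashlib
import struct

HEADER_SIZE = 8     # 4-byte version + 4-byte type code
CHECKSUM_SIZE = 32  # SHA-256 digest length


def serialize_file(version: int, type_code: int, payload: bytes) -> bytes:
    """Build [version][type][payload][sha256(version + type + payload)]."""
    body = struct.pack("<II", version, type_code) + payload
    return body + hashlib.sha256(body).digest()


def deserialize_file(data: bytes) -> tuple[int, int, bytes]:
    """Verify the checksum and return (version, type_code, payload)."""
    if len(data) < HEADER_SIZE + CHECKSUM_SIZE:
        raise ValueError("file too short to be a serialized file")
    body, checksum = data[:-CHECKSUM_SIZE], data[-CHECKSUM_SIZE:]
    if hashlib.sha256(body).digest() != checksum:
        raise ValueError("checksum mismatch")
    version, type_code = struct.unpack_from("<II", body, 0)
    return version, type_code, body[HEADER_SIZE:]
```

A reader would call `deserialize_file` first and then dispatch on the returned type code to the matching payload parser, rejecting any code it does not recognize.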
All files under the database root are encrypted when the database is encrypted. There is no mixed encrypted/unencrypted layout: asset, display, thumb, and the entire .db/ tree (including .db/files.dat, .db/bson/*, .db/config.json, .db/write.lock, .db/encryption.pub) are stored in the encrypted file format.
Encrypted file format:
Each encrypted file consists of a clear header (so the app can identify encryption and key without decrypting), followed by the encrypted payload.
Clear header:
- [4 bytes]: Encrypted file format version (uint32 LE).
- [1 byte]: Encryption type code (e.g. 0 = none, 1 = RSA + AES-256-CBC per file).
- [32 bytes]: SHA-256 hash of the public key used to encrypt this file. Used to match the file to a key and to detect key mismatch.
Encrypted payload (after the header):
- Payload is encrypted with the same scheme as the legacy format: per-file RSA-encrypted AES-256 key, IV, then AES-256-CBC ciphertext. The plaintext that is encrypted is the full serialized content (e.g. the versioned serialized blob with version, type, payload, and checksum, or a raw media blob).
So a reader can: (1) read the clear header to see format version, encryption type, and public-key hash; (2) select the correct private key (if any); (3) decrypt the payload; (4) if the payload is a serialized file, verify checksum and dispatch on type.
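Steps (1) and (2) of that reading sequence can be sketched as below. The names are hypothetical, and the sketch assumes the key hash is SHA-256 over whatever byte encoding of the public key the application uses (this document does not say whether that is PEM, DER, or something else). Decryption itself is left out.

```python
import hashlib
import struct
from typing import NamedTuple

HEADER_FORMAT = "<IB32s"  # uint32 LE version, 1-byte type code, 32-byte key hash
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 37 bytes


class EncryptedHeader(NamedTuple):
    format_version: int
    encryption_type: int
    public_key_hash: bytes


def parse_encrypted_header(data: bytes) -> tuple[EncryptedHeader, bytes]:
    """Split a file into its clear header and the still-encrypted payload."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file too short to contain an encrypted-file header")
    header = EncryptedHeader(*struct.unpack_from(HEADER_FORMAT, data, 0))
    return header, data[HEADER_SIZE:]


def key_matches(header: EncryptedHeader, public_key_bytes: bytes) -> bool:
    """Check whether a candidate public key is the one named in the header.

    Assumption: the hash covers the same key encoding the writer used.
    """
    return hashlib.sha256(public_key_bytes).digest() == header.public_key_hash
```

With several keys available, a caller could try `key_matches` against each one and only attempt decryption with the key that matches.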
A database can record its origin: the database it was replicated from (e.g. a path or URI). The origin is stored in .db/config.json, a JSON configuration file. The file contains an object with at least an origin field whose value is the path (or URI) to the origin database. Other fields may be added to this file for future configuration (e.g. sync options, repair preferences).
Example:
    {
      "origin": "/path/to/source/database"
    }

The origin is used to:
- Sync — Know which remote database to sync with.
- Repair — Know which source to use when repairing or validating files.
- Fulfil missing files — For partial databases, know where to fetch missing asset/display files when the user browses the gallery or when filling lazily.
If the database was not created by replication, .db/config.json may be absent, or present without an origin field.
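A reader of the config file has to treat both the file and the origin field as optional. The sketch below shows one way to do that; the function name is hypothetical, and it assumes an unencrypted database (in an encrypted database the file would first have to be decrypted).

```python
import json
from pathlib import Path
from typing import Optional


def read_origin(db_root: str) -> Optional[str]:
    """Return the origin recorded in .db/config.json, or None if there is none."""
    config_path = Path(db_root) / ".db" / "config.json"
    if not config_path.is_file():
        return None  # database was not created by replication
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if not isinstance(config, dict):
        return None
    origin = config.get("origin")
    return origin if isinstance(origin, str) else None
```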
A Merkle tree over asset/display/thumb paths and related metadata is stored at .db/files.dat.
- Path: .db/files.dat.
- Content: Sort tree + Merkle tree + optional database metadata (e.g. filesImported, deletedAssetIds, isPartial).
- Serialization: Versioned serialized format (version, type, payload, checksum). Payload is the same logical content as in the legacy format (e.g. tree version 5). The version field in this file is the database version; version 6 denotes the current format described in this document.
- Leaf names: Paths like asset/<uuid>, display/<uuid>, thumb/<uuid>; leaves store content hash and metadata for verify/sync.
A database can be full or partial. The layout and file formats are the same; the difference is which files are present on disk.
Full database: All media files are stored: asset/, display/, and thumb/ each have one file per asset. The BSON database under .db/bson/ and the rest of .db/ are complete. This is the normal case after import or after a full replicate.
Partial database: Only thumb files and root-level files (e.g. README.md) are stored. The asset/ and display/ directories are missing or sparse. The BSON database under .db/bson/ is still complete (all asset records and indexes). Partial databases are created by replicating with the partial option.
The partial flag is stored in the files Merkle tree (.db/files.dat) in the database metadata: isPartial: true. Tools use this to treat missing asset/display files as expected (verify) and to only copy thumb and root-level files when syncing to a partial target. Missing files can be filled in lazily (e.g. download from the origin database as the user views photos in the gallery) or in bulk via a full replicate.
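The effect of the partial flag on verification can be sketched as follows. This is a hypothetical helper, not the verify implementation: it only shows that, in a partial database, missing asset/ and display/ files are expected while missing thumb/ files are still reported.

```python
from typing import Iterable, List, Set

MEDIA_DIRS = ("asset", "display", "thumb")
OPTIONAL_WHEN_PARTIAL = ("asset", "display")


def unexpected_missing_files(
    asset_ids: Iterable[str], present_paths: Set[str], is_partial: bool
) -> List[str]:
    """List the media paths a verify pass should flag as missing."""
    missing = []
    for asset_id in asset_ids:
        for directory in MEDIA_DIRS:
            path = f"{directory}/{asset_id}"
            if path in present_paths:
                continue
            if is_partial and directory in OPTIONAL_WHEN_PARTIAL:
                continue  # expected gap in a partial database
            missing.append(path)
    return missing
```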
The metadata collection stores asset records. Main fields:
- _id: UUID string.
- origFileName, origPath?, contentType, width, height, hash.
- coordinates?, location?, duration?, fileDate, photoDate?, uploadDate.
- properties?, labels?, description?, deleted?.
- micro: base64 micro thumbnail.
- color: [number, number, number] (e.g. dominant color).
Sort indexes used in practice: hash (asc), photoDate (desc).
Migration from the legacy format (version 5; see Database-Format-Legacy.md) to the current format (version 6) is performed by the existing command psi upgrade. The database version is stored in the files Merkle tree: in version 5 that file is .db/tree.dat; in version 6 it is .db/files.dat. After a successful upgrade, the file is .db/files.dat and the version is 6.
The following changes are applied when converting a database from version 5 to version 6.
- BSON database location: Move the entire BSON tree from metadata/ at the database root to .db/bson/. That is:
  - Move metadata/db.dat → .db/bson/db.dat
  - Create .db/bson/collections/ and move collection data from metadata/metadata/: create .db/bson/collections/metadata/shards/, move shard files and <shardId>.dat into shards/, move collection.dat to .db/bson/collections/metadata/collection.dat.
  - Create .db/bson/indexes/ and move metadata/sort_indexes/ (legacy name in v5) → .db/bson/indexes/
  - Remove the now-empty metadata/ directory.
- Files Merkle tree: Moves from .db/tree.dat (version 5) to .db/files.dat (version 6); re-serialize using the new versioned format (version, type, checksum) and, if the DB is encrypted, wrap in the new encrypted file format (header + encrypted payload).
- Versioned serialized files: For every serialized file (Merkle trees, BSON shards, index trees and pages under .db/bson/indexes/):
  - Add a type field (4 bytes, uint32 LE) after the version field. Assign and use a stable type code per kind of file.
  - Ensure a checksum is always present: 32 bytes SHA-256(version + type + payload) after the payload. Legacy files that were stored without checksum must be read with the legacy deserializer, then re-serialized with the new layout (version, type, payload, checksum).
  - Legacy “no checksum” option: No longer used. All new serialized files have a checksum.
- Encrypt everything: In the legacy format, only certain paths (asset, display, thumb, metadata, README) were encrypted; .db/ was not. In the current format, all files are encrypted when the database is encrypted, including .db/files.dat, .db/bson/**, .db/config.json, .db/write.lock, .db/encryption.pub.
- Encrypted file header: For each encrypted file, prepend the clear header before the existing encrypted payload:
  - 4 bytes: encrypted file format version
  - 1 byte: encryption type code
  - 32 bytes: SHA-256 hash of the public key used for encryption

  Then store the existing per-file encrypted payload (RSA-wrapped key + IV + ciphertext) as-is, so the payload continues to decrypt to the same plaintext (e.g. the versioned serialized blob or raw media).
- Origin / config: If the database was replicated from another, create or update .db/config.json with an origin field set to the path (or URI) of the source database. If not replicated, .db/config.json may be omitted or may exist without an origin field. Other configuration can be added to this JSON file as needed.
- Type codes: Maintain a central registry of type codes for all serialized file kinds (files Merkle tree, BSON db tree, collection shard, index tree, index leaf page, etc.) and use them consistently in both serialization and deserialization. Note: .db/config.json is plain JSON, not versioned serialized, so it does not use type codes.
- Create .db/bson/, .db/bson/collections/, and .db/bson/indexes/; move/copy all BSON data from metadata/ to .db/bson/ (db.dat at bson root; collection dirs under collections/, each with shards under shards/ and collection.dat at collection root; index dirs under indexes/). Re-serialize each file to the new versioned format (version, type, payload, checksum) if not already.
- Re-serialize .db/files.dat to the new versioned format (version, type, payload, checksum), and set the database version in that file to 6.
- If encryption is enabled, re-wrap every file (including under .db/) in the new encrypted format: clear header (version, encryption type, public key hash) + existing encrypted payload. Ensure .db/ is no longer written in the clear.
- Optionally write or update .db/config.json with an origin field if the DB has a known replication source.
- Remove the legacy metadata/ directory and any legacy unencrypted .db/ files that have been replaced.
The psi upgrade command is the only psi command that runs against databases older than version 6. It reads the database version from the version field in the files Merkle tree (in v5 that file is .db/tree.dat; in v6 it is .db/files.dat). When the version is 5 (or older), the command must perform the following to convert the database to version 6.
Version check (already present):
- Load the files Merkle tree from the path that exists (.db/tree.dat for v5, .db/files.dat for v6) and determine the current version (e.g. via loadTreeVersion or by loading the tree and reading merkleTree.version).
- If version is already 6, exit successfully without changes.
- If version is greater than 6, exit with an error asking the user to update the CLI.
- If version is 5 or less, proceed with upgrade.
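The version check above amounts to a three-way decision. The sketch below is an illustration only, with hypothetical function names; it assumes the version is the first 4 bytes (uint32 LE) of the plaintext tree file, so an encrypted file would need to be decrypted first.

```python
import struct

CURRENT_VERSION = 6


def read_version(serialized: bytes) -> int:
    """Read the leading uint32 LE version field of a serialized tree file."""
    if len(serialized) < 4:
        raise ValueError("file too short to contain a version field")
    (version,) = struct.unpack_from("<I", serialized, 0)
    return version


def plan_upgrade(version: int) -> str:
    """Decide what an upgrade run should do for a given database version."""
    if version == CURRENT_VERSION:
        return "nothing-to-do"  # already current: exit successfully
    if version > CURRENT_VERSION:
        raise RuntimeError("database is newer than this CLI; please update the CLI")
    return "upgrade"            # version 5 or older: run the migration
```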
Steps the upgrade command must perform for v5 → v6:
- Acquire write lock on the database (e.g. .db/write.lock) so no other process modifies it during upgrade.
- Legacy v5 cleanup (already implemented for older upgrades):
  - Fill in missing lastModified on tree leaves from file metadata where possible.
  - If an assets/ directory exists, move its contents to asset/ and update the files Merkle tree accordingly.
  - Create README.md at the database root if it does not exist.
  - If the database is encrypted, ensure .db/encryption.pub exists (e.g. copy the public key into .db/ as a marker).
  - Rebuild the files Merkle tree in sorted order, excluding legacy paths such as metadata/ and assets/ from the tree (so they are no longer referenced as file leaves).
- Move BSON database from metadata/ to .db/bson/:
  - Create the .db/bson/ directory and the .db/bson/collections/ subdirectory.
  - Move (or copy then delete) metadata/db.dat → .db/bson/db.dat.
  - For the collection: create .db/bson/collections/metadata/shards/; move shard files and <shardId>.dat from metadata/metadata/ into shards/; move metadata/metadata/collection.dat → .db/bson/collections/metadata/collection.dat.
  - Create .db/bson/indexes/ and move metadata/sort_indexes/ → .db/bson/indexes/.
  - Remove the now-empty metadata/ directory. All BSON data is now under .db/bson/ (db.dat at root; collections under .db/bson/collections/ with shards under <collectionName>/shards/; indexes under .db/bson/indexes/).
- Re-serialize all serialized files into the v6 format (version, type, checksum):
  - For every file under .db/bson/ (db.dat at root; under collections/<name>/shards/: shard files and <shardId>.dat; at collections/<name>/: collection.dat; under indexes/<collectionName>/<fieldName>_<direction>/: index tree.dat and leaf page files): read with the legacy deserializer (no type field, optional checksum), then write using the new layout: 4 bytes version, 4 bytes type code, payload, 32 bytes SHA-256(version + type + payload). Use a stable type code per kind of file (see §10.4).
  - Re-serialize .db/files.dat in the new format (version, type, payload, checksum). Set the database version in the saved tree to 6 (so the first 4 bytes of the serialized tree file, or the version property when written, are 6).
- Encryption (when the database is encrypted):
  - Switch to a single storage backend for the whole database root (no separate unencrypted metadataStorage). All reads and writes for the rest of the upgrade must go through the encrypted backend.
  - For every file under the database root (asset, display, thumb, README, and the entire .db/ tree including .db/files.dat, .db/bson/**, .db/config.json, .db/write.lock, .db/encryption.pub): ensure it is stored in the v6 encrypted format. That is: clear header (4 bytes format version, 1 byte encryption type code, 32 bytes SHA-256 of public key) followed by the existing encrypted payload (RSA-wrapped key + IV + AES-256-CBC ciphertext). Files that were previously unencrypted (e.g. under .db/ in v5) must be encrypted and given this header; files that were already encrypted need the header prepended and the payload left as-is.
  - After this, the database has no mixed encrypted/unencrypted layout: all files are encrypted when a key is in use.
- Origin / config (optional): If the database has a known replication source (e.g. passed in or recorded elsewhere), create or update .db/config.json with an origin field set to the path (or URI) of that source. If not replicated, omit the file or leave it without an origin field.
- Rebuild BSON database Merkle tree: Rebuild the BSON database Merkle tree (e.g. buildDatabaseMerkleTree) using the new BSON root .db/bson/, with collections under .db/bson/collections/ and indexes under .db/bson/indexes/ (not metadata/ at database root). Save the result to .db/bson/db.dat in the v6 serialized format. If encryption is enabled, this write goes through the single encrypted storage backend.
- Update files Merkle tree metadata: Set databaseMetadata.filesImported from the actual count of files under asset/ (or equivalent). Ensure the tree’s version property is 6 before saving.
- Save the files Merkle tree: Write .db/files.dat in the v6 serialized format (version, type, payload, checksum), with version 6. If encryption is enabled, write through the encrypted backend so .db/files.dat is stored in the v6 encrypted file format (clear header + encrypted payload).
- Release the write lock.
After a successful run, the database version in .db/files.dat is 6, all BSON data lives under .db/bson/ (db.dat, collections/, indexes/), all serialized files use version+type+checksum, and (when encryption is used) all files are encrypted with the v6 encrypted file header. Other psi commands can then operate on the database.
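The BSON relocation step reduces to a path mapping from the legacy metadata/ tree to .db/bson/. The sketch below expresses that mapping as a pure function so it can be checked without touching the file system. The function name is hypothetical, and it assumes that legacy collection files sit directly under metadata/<collectionName>/ and that the contents of metadata/sort_indexes/ keep their relative layout when moved.

```python
import re
from typing import Optional

LEGACY_INDEX_PREFIX = "metadata/sort_indexes/"


def map_legacy_bson_path(legacy_path: str) -> Optional[str]:
    """Map a v5 path under metadata/ to its v6 location under .db/bson/.

    Returns None for paths that are not part of the legacy BSON tree.
    """
    if legacy_path == "metadata/db.dat":
        return ".db/bson/db.dat"
    # Indexes are checked before collections, since "sort_indexes" would
    # otherwise be mistaken for a collection name.
    if legacy_path.startswith(LEGACY_INDEX_PREFIX):
        return ".db/bson/indexes/" + legacy_path[len(LEGACY_INDEX_PREFIX):]
    match = re.fullmatch(r"metadata/([^/]+)/(.+)", legacy_path)
    if match is None:
        return None
    collection, name = match.groups()
    if name == "collection.dat":
        return f".db/bson/collections/{collection}/collection.dat"
    # Everything else in a collection directory is a shard file or shard tree.
    return f".db/bson/collections/{collection}/shards/{name}"
```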
The following code changes allow the application to load (and, as needed, create and update) databases in the current format.
- BSON root: Stop using a root-level metadata/ directory for the BSON database. Use .db/bson/ as the BSON storage root. Collection data lives under .db/bson/collections/<collectionName>/ (shard files and shard Merkle trees under <collectionName>/shards/, collection Merkle tree collection.dat at collection root); index data lives under .db/bson/indexes/<collectionName>/<fieldName>_<direction>/. All call sites that open the BSON database, a collection, or an index should use the appropriate prefix (e.g. .db/bson/ for the DB, .db/bson/collections/metadata for the metadata collection with shards at metadata/shards/, .db/bson/indexes/metadata/hash_asc for an index) instead of metadata/ and metadata/sort_indexes/.
- Single storage backend: When opening a database, pass a single storage instance that represents the database root. All paths (asset, display, thumb, .db/bson/, .db/files.dat, .db/config.json, etc.) are resolved under that root. Encryption, when enabled, applies to this single backend so that every file read/write goes through the same encrypted layer.
- Serialization layer: Extend the serialization library (or the layer that writes Merkle trees, BSON shards, index tree and leaf page files under .db/bson/indexes/) so that every serialized file is written as: [version (4)][type (4)][payload][checksum (32)]. Add a type code parameter (or constant per call site) so each kind of file has a distinct type. On read, read version and type first; verify checksum after reading the payload; dispatch to the correct deserializer based on type (and version if needed).
- Checksum: Always compute and verify the SHA-256 checksum for serialized files. Remove or bypass the “no checksum” code path for the current format.
- Type registry: Introduce a central registry of type codes (e.g. enum or constants) for: files Merkle tree, BSON database tree, collection shard, index tree, index leaf page, etc. Use the same codes in writers and readers. Config (.db/config.json) is JSON, not versioned serialized, so it does not use type codes.
- Encrypt all files: Remove the two-storage setup (encrypted assetStorage + unencrypted metadataStorage). Use one storage instance for the whole database root. When encryption is enabled, wrap that single backend with the encrypted storage implementation so that all files (including .db/files.dat, .db/bson/**, .db/config.json, .db/write.lock, .db/encryption.pub) are read and written through the encryption layer.
- Encrypted file header: When writing an encrypted file, prepend the clear header: format version (4 bytes), encryption type code (1 byte), public key hash (32 bytes). When reading, read this header first to detect encryption and key; then decrypt the remainder of the file (existing RSA + AES-256-CBC per-file scheme). Update key selection logic to use the public key hash from the header (e.g. to choose the right key when multiple keys exist or to prompt for the correct key).
- Read/write config: Add support for reading and writing .db/config.json (JSON). The file has at least an optional origin field (path or URI to the database this copy was replicated from). Expose the origin to sync, repair, and fulfil-missing-file logic. Other fields may be added for future use.
- Use origin: In sync, repair, and lazy-fulfil flows, use the origin value from .db/config.json as the default remote or source when the user has not specified another. This allows partial replicas to fetch missing asset/display files from the database they were replicated from.
- Detect version: On open, read the version field from .db/files.dat to determine the database version. Version 5 is the legacy format (BSON under metadata/ at database root, optional checksum, mixed encryption). Version 6 is the current format: BSON under .db/bson/ with collections under .db/bson/collections/ (shards under <collectionName>/shards/, collection.dat at collection root) and indexes under .db/bson/indexes/; files Merkle tree at .db/files.dat; version+type+checksum on all serialized files; all files encrypted when encryption is enabled, with encrypted file header; .db/config.json with optional origin field.
- psi upgrade: Most psi commands only work with version 6. The command psi upgrade is the exception: it runs against databases of older versions and migrates them to version 6 (applying the changes described in §10). After upgrade, the database version in .db/files.dat is set to 6 and other commands can operate on it.