# Photosphere Database Format (Legacy)

This document describes the **legacy** on-disk layout and binary formats used by the Photosphere media database. This is **database version 5**. The database version is stored in the **version field** of `.db/tree.dat`.

For the current format (version 6), see [Database-Format.md](Database-Format.md). To migrate a version 5 database to version 6, use **`psi upgrade`**.

**Important:** Do not modify database files manually. Use the Photosphere CLI (`psi`) for all operations.

---

## 1. Top-level directory layout

A database is a single root directory. All paths below are relative to that root.

| Path | Description |
|------|-------------|
| `README.md` | Auto-generated warning and usage instructions. |
| `.db/` | Database-level control and integrity data. |
| `.db/tree.dat` | **Files Merkle tree**: hashes of asset/display/thumb files and metadata; used for sync and verify. |
| `asset/` | Original imported media files (one file per asset, keyed by asset UUID, no extension). |
| `display/` | Display-sized derivatives (e.g. max 1000px, JPEG). One file per asset, keyed by UUID. |
| `thumb/` | Thumbnail derivatives (e.g. max 300px, JPEG). One file per asset, keyed by UUID. |
| `metadata/` | **BSON database root**: all structured metadata and indexes live under this prefix. |

---

## 2. BSON database under `metadata/`

Structured metadata is stored in a BSON-based layout with sharded collections and sort indexes; its root is `metadata/`.

### 2.1 Database Merkle tree

| Path | Description |
|------|-------------|
| `metadata/db.dat` | **Database Merkle tree**: one root hash per collection; used for replication and integrity. |

Format: versioned Merkle tree serialization (current version 5), stored without a trailing checksum.

### 2.2 Collections

Each collection is a directory under `metadata/` (e.g. `metadata/metadata` for the asset metadata collection).
The Photosphere app uses a single collection named `metadata` whose records are asset documents.

#### Collection directory contents

- **Shard files**: one file per shard, named by shard ID (e.g. `0`, `1`, …, `96`). No extension. Shard ID is `md5(recordId)[0:8] % numShards` (default 100 shards).
- **Shard Merkle trees**: next to each shard file: `<shardId>.dat` (e.g. `96.dat`). Used to build the collection Merkle tree.
- **Collection Merkle tree**: `metadata/<collectionName>/collection.dat` (e.g. `metadata/metadata/collection.dat`). Aggregates shard root hashes.

### 2.3 Shard file format (collection shards)

Shard files are versioned binary blobs with an optional SHA-256 checksum.

**Generic serialized file layout (when checksum is enabled):**

- `[4 bytes]`: Version (uint32 LE).
- `[payload]`: Version-specific payload.
- `[32 bytes]`: SHA-256 checksum of `version + payload`.

**Shard payload (version 2; version 1 is legacy, fields-only):**

- `[4 bytes]`: Record count (uint32 LE).
- For each record (sorted by `_id`):
  - `[16 bytes]`: Record ID as raw UUID bytes (the UUID's hex digits without dashes, decoded to 16 bytes).
  - `[BSON]`: Record fields (BSON document; `_id` is stored separately).
  - `[BSON]`: Metadata (version 2 only): `{ timestamp?, fields? }` for field-level timestamps.

Record IDs are normalized to their 16-byte form (the UUID's hex digits without dashes) for shard keying; on read they are formatted back to the standard UUID string.

### 2.4 Sort indexes

Sort indexes live under `metadata/sort_indexes/<collectionName>/<fieldName>_<direction>/` (e.g. `metadata/sort_indexes/metadata/hash_asc/`, `metadata/sort_indexes/metadata/photoDate_desc/`).

Each index directory contains:

- **`tree.dat`**: B-tree metadata and node descriptors (version 2). Same versioned + checksummed wrapper as above.
- **`<pageId>`**: Leaf page files; page IDs are UUIDs. Each file is a serialized page of index entries (version 1, with checksum).
- **`build.checkpoint`**: Optional JSON checkpoint for incremental index builds.

**`tree.dat` payload (version 2):**

- `totalEntries` (uint32), `totalPages` (uint32).
- `rootPageId` (buffer/length-prefixed string).
- `fieldName`, `direction` (buffer/length-prefixed strings).
- `type` (uint8): 0 = none, 1 = date, 2 = string, 3 = number.
- Reserved 8 bytes (uint64).
- `nodeCount` (uint32).
- For each node (by sorted pageId):
  - `pageId` (length-prefixed buffer).
  - Node: legacy 4-byte skip, BSON `{ keys }`, `children.length` (uint32), child IDs (length-prefixed strings), `nextLeaf`, `previousLeaf` (length-prefixed strings).

**Leaf page file payload (version 1):**

- Record count (uint32 LE).
- For each entry:
  - Record ID (length-prefixed buffer, UTF-8).
  - Value (BSON `{ value }`).
  - Record fields (BSON document).

---

## 3. Files Merkle tree (`.db/tree.dat`)

A separate Merkle tree over asset/display/thumb paths and related metadata is stored at **`.db/tree.dat`** (relative to the database root).

- **Path:** `.db/tree.dat`.
- **Content:** Sort tree + Merkle tree + optional database metadata (e.g. `filesImported`, `deletedAssetIds`, `isPartial`).
- **Serialization:** Same as other Merkle tree files (version 5); stored without a trailing checksum.
- **Leaf names:** Paths like `asset/<uuid>`, `display/<uuid>`, `thumb/<uuid>`; leaves store content hash and metadata for verify/sync.

---

## 4. Partial vs full databases

A database can be **full** or **partial**. The layout and file formats are the same; the difference is which files are present on disk.

**Full database:** All asset files are stored: `asset/`, `display/`, and `thumb/` each have one file per asset. The BSON metadata collection and `.db/` are complete. This is the normal case after import or after a full replicate.

**Partial database:** Only **thumb** files and root-level files (e.g. `README.md`) are stored. The `asset/` and `display/` directories are missing or sparse: original and display-sized media are not on disk.
The BSON metadata under `metadata/` is still complete (all asset records and indexes are present), so the catalog is intact; only the full-size and display-size binaries are omitted. Partial databases are created by replicating with the partial option (e.g. “only copy thumb directory assets”).

The **partial flag** is stored in the files Merkle tree: in `.db/tree.dat`, the database metadata has `isPartial: true` when the database is partial. Tools use this to:

- **Verify:** Treat missing `asset/` and `display/` files as expected, not as removed or corrupt.
- **Sync:** When syncing *to* a partial database, only copy thumb and root-level files; do not copy asset or display files into the partial target.

So a partial database has the same directory structure and metadata as a full one, but only thumbnails (and optionally README) on disk. Missing files can be filled in lazily as required. For example, when a user browses the photo gallery, missing asset or display files can be downloaded from a remote database as they are viewed; they can also be fetched in bulk via a full replicate.

---

## 5. Versioned file layout

Versioned binary files use one of two layouts:

- **With checksum:** `[4 bytes version][payload][32 bytes SHA-256(version+payload)]`.
- **Without checksum:** `[4 bytes version][payload]` (used for Merkle tree files).

Primitives are little-endian (uint32, int32, uint64, int64); strings and buffers are length-prefixed; documents use BSON (lengths 32-bit where length-prefixed).

---

## 6. Optional encryption

The database can use an **encrypted storage** backend for media and BSON data. When encryption is enabled, only certain paths are encrypted; the rest stay plain.

**Encrypted** (when a key is provided): everything under `asset/`, `display/`, `thumb/`, and `metadata/`, plus `README.md`. These are read and written through a storage backend that applies encryption.

**Unencrypted:** the `.db/` directory (e.g. `tree.dat`, `write.lock`, `encryption.pub`).
This is always stored in the clear so the application can detect that the database is encrypted and prompt for a key without needing the key to read the directory.

Each encrypted file is stored as:

- **\[512 bytes\]**: RSA-encrypted AES-256 key (decrypt with the private key to obtain the per-file symmetric key).
- **\[16 bytes\]**: AES initialization vector (IV).
- **\[remaining bytes\]**: Payload encrypted with AES-256-CBC using the decrypted key and IV.

The decrypted payload is the same as the unencrypted file (e.g. a versioned serialized blob or raw media).

---

## 7. Asset record shape

The metadata collection stores asset records. Main fields:

- `_id`: UUID string.
- `origFileName`, `origPath?`, `contentType`, `width`, `height`, `hash`.
- `coordinates?`, `location?`, `duration?`, `fileDate`, `photoDate?`, `uploadDate`.
- `properties?`, `labels?`, `description?`, `deleted?`.
- `micro`: base64 micro thumbnail.
- `color`: `[number, number, number]` (e.g. dominant color).

Sort indexes used in practice: `hash` (asc), `photoDate` (desc).

---

## 8. Summary diagram

```
<database-root>/
  README.md
  .db/
    tree.dat                # Files Merkle tree
  asset/                    # Original media (no extension)
  display/
  thumb/
  metadata/                 # BSON database (BDB) root
    db.dat                  # Database Merkle tree
    metadata/               # "metadata" collection
      <shardId>             # Shard data (e.g. 96)
      <shardId>.dat         # Shard Merkle tree (e.g. 96.dat)
      collection.dat        # Collection Merkle tree
    sort_indexes/
      metadata/
        hash_asc/
          tree.dat
          <pageId>          # UUID-named leaf pages
        photoDate_desc/
          tree.dat
```
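
As a worked illustration of the versioned file layout in section 5, the following Python sketch wraps a payload in the checksummed layout, verifies it on read, and reads the version of a checksum-less file. It is not taken from the Photosphere codebase: the function names `wrap_versioned`, `unwrap_versioned`, and `read_version` are hypothetical, and the sketch assumes the checksum covers exactly the 4 version bytes followed by the payload, as the layout states.

```python
import hashlib
import struct
from typing import Tuple

VERSION_SIZE = 4    # uint32, little-endian
CHECKSUM_SIZE = 32  # SHA-256 digest length in bytes


def wrap_versioned(version: int, payload: bytes) -> bytes:
    """Build [4 bytes version][payload][32 bytes SHA-256(version + payload)]."""
    body = struct.pack("<I", version) + payload
    return body + hashlib.sha256(body).digest()


def unwrap_versioned(blob: bytes) -> Tuple[int, bytes]:
    """Verify the trailing checksum and return (version, payload)."""
    if len(blob) < VERSION_SIZE + CHECKSUM_SIZE:
        raise ValueError("file too short for version and checksum")
    body, stored = blob[:-CHECKSUM_SIZE], blob[-CHECKSUM_SIZE:]
    if hashlib.sha256(body).digest() != stored:
        raise ValueError("checksum mismatch")
    (version,) = struct.unpack_from("<I", body, 0)
    return version, body[VERSION_SIZE:]


def read_version(blob: bytes) -> int:
    """Read the leading version field; works for both layouts,
    including checksum-less files such as Merkle tree files."""
    if len(blob) < VERSION_SIZE:
        raise ValueError("file too short for version field")
    (version,) = struct.unpack_from("<I", blob, 0)
    return version
```

For example, `unwrap_versioned(wrap_versioned(2, b"abc"))` returns `(2, b"abc")`, while altering a byte of the wrapped data makes `unwrap_versioned` raise `ValueError`. This is for understanding the format only; real databases should still be read and written through `psi`.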