Database Format

Photosphere Database Format

This document describes the on-disk layout and binary formats used by the Photosphere media database (current format).

Database version: The database version is determined by the version field in .db/files.dat. The current format is version 6. The legacy format is version 5 (see Database-Format-Legacy.md). Most psi commands only work with version 6. The command psi upgrade migrates a database from older versions (including legacy version 5) to version 6.

Important: Do not modify database files manually. Use the Photosphere CLI (psi) for all operations.

1. Top-level directory layout

A database is a single root directory. All paths below are relative to that root.

Path	Description
`README.md`	Auto-generated warning and usage instructions.
`.db/`	Database integrity data and media file metadata (see §2).
`asset/`	Original imported media files (one file per asset, keyed by asset UUID, no extension).
`display/`	Display-sized derivatives (e.g. max 1000px, JPEG). One file per asset, keyed by UUID.
`thumb/`	Thumbnail derivatives (e.g. max 300px, JPEG). One file per asset, keyed by UUID.

The asset/, display/, and thumb/ directories contain media files (photos and videos). The .db/ directory contains the data used to verify the database and the metadata for media files.

When the database is encrypted, all files under the database root except README.md, .db/config.json, and .db/encryption.pub are stored in the encrypted file format (see §5).

Example database structure

<database root>/
  README.md
  .db/
    files.dat             # Files Merkle tree (versioned + type + checksum, then encrypted)
    config.json           # Config (origin, etc.) (optional)
    write.lock
    encryption.pub        # Optional encryption marker - the public key if the db is encrypted.
    bson/                 # BSON database root
      db.dat              # Database Merkle tree
      collections/
        metadata/         # "metadata" collection
          shards/
            <shardId>     # Shard data
            <shardId>.dat # Shard Merkle tree
          collection.dat  # Collection Merkle tree
      indexes/
        metadata/
          hash_asc/
            tree.dat
            <pageId>      # UUID-named leaf pages
          photoDate_desc/
            tree.dat
            <pageId>
  asset/
    <uuid>                # Original media (encrypted if the database is encrypted)
  display/
    <uuid>                # Display media
  thumb/
    <uuid>                # Thumbnail media

2. The `.db/` directory

The .db/ directory contains all control and structured data: the BSON database, files Merkle tree, config (including origin), and lock/marker files. It is used to validate the integrity of the database and stores metadata about media files.

Directory content

Path	Description
`.db/bson/`	BSON database root: all structured metadata and indexes (see §3).
`.db/files.dat`	Files Merkle tree: For each file under `asset/`, `display/`, and `thumb/` only (the tree does not include an entry for itself or other `.db/` files), stores hash, length, and lastModified of the logical (plain/decrypted) content. Used to verify integrity and compare databases; plain and encrypted databases with the same content compare equal.
`.db/config.json`	Configuration file: JSON object with an `origin` field (path to the database this copy was replicated from) and room for other settings (see §6). Used for sync, repair, and fulfilling missing files.
`.db/write.lock`	Write lock (when held).
`.db/encryption.pub`	Optional marker: copy of the public key used for encryption (enables “this DB is encrypted” detection).

Serialized files under .db/ use the versioned serialized format (version, type, payload, checksum) before encryption (see §4).

3. BSON database under `.db/bson/`

Structured metadata is stored in a BSON-based layout with sharded collections and sort indexes; its root is .db/bson/.

Directory content

Path	Description
`db.dat`	Database Merkle tree. Used to verify the integrity of the database and compare databases for differences.
`collections/`	One subdirectory per collection; each contains a `shards/` subdirectory (shard files and shard Merkle trees `<shardId>.dat`) and `collection.dat` at the collection root (see §3.2).
`indexes/`	One subdirectory per sort index, named `<collectionName>/<fieldName>_<direction>/`; each contains B-tree metadata and leaf page files (see §3.4).

3.2 Collection

A collection contains records that share the same schema or purpose, so different kinds of data can be stored and queried separately. Each collection is a directory under .db/bson/collections/ (e.g. .db/bson/collections/metadata for the asset metadata collection). The Photosphere app uses a single collection named metadata where each record describes a media file (photo or video).

Directory content

shards/: Subdirectory containing all shard data for the collection:
- Shard files: one file per shard, named by shard ID (e.g. 0, 1, …, 96). No extension. Shard ID is md5(recordId)[0:8] % numShards (default 100 shards). Path: .db/bson/collections/<collectionName>/shards/<shardId>.
- Shard Merkle trees: one <shardId>.dat file next to each shard file (e.g. 96.dat). Used to build the collection Merkle tree. Path: .db/bson/collections/<collectionName>/shards/<shardId>.dat.
Collection Merkle tree: .db/bson/collections/<collectionName>/collection.dat (e.g. .db/bson/collections/metadata/collection.dat), at the collection root. Aggregates shard root hashes.
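The shard ID formula above can be sketched as follows. This is a minimal illustration, not the Photosphere implementation: it assumes that `[0:8]` means the first 8 hexadecimal characters of the MD5 digest, read as an unsigned integer, and that the record ID is hashed as its UTF-8 string form. The function names `shard_id` and `shard_paths` are hypothetical.

```python
import hashlib
from typing import Tuple


def shard_id(record_id: str, num_shards: int = 100) -> int:
    """Map a record ID to a shard number in the range [0, num_shards)."""
    digest_hex = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    # Assumption: "[0:8]" is the first 8 hex characters, parsed as a base-16 integer.
    return int(digest_hex[:8], 16) % num_shards


def shard_paths(collection: str, record_id: str, num_shards: int = 100) -> Tuple[str, str]:
    """Return (shard data path, shard Merkle tree path) relative to the database root."""
    base = f".db/bson/collections/{collection}/shards/{shard_id(record_id, num_shards)}"
    return base, base + ".dat"
```

Because the shard ID depends only on the record ID and the shard count, a reader can locate a record's shard without consulting any index.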

3.3 Shard

A shard is a file that holds many collection records; records are distributed across shards by shard ID (see §3.2 for the shard ID formula and collection layout). Shards exist because storing each database record as its own file is very expensive: every file would take at least 4 KB on disk (at least on Linux), however small the record. Packing many records into each shard uses far less space than storing them individually (say, as one JSON file per record).

Shard file format

Shard files use the versioned serialized format (see §4): version, type, payload, then SHA-256 checksum. The payload (e.g. version 2) is:

[4 bytes], Record count (uint32 LE).
For each record (sorted by _id):
- [16 bytes], Record ID as raw UUID bytes (no dashes, 16 bytes hex decoded).
- [BSON], Record fields (BSON document; _id is stored separately).
- [BSON], Metadata: { timestamp?, fields? } for field-level timestamps.

Record IDs are normalized to 16 raw bytes (the UUID's 32 hex digits with dashes removed, hex-decoded) for shard keying; on read they are formatted back to the standard UUID string.
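The payload layout above can be walked with a short reader. This is a sketch under stated assumptions: the two BSON documents of each record are assumed to be stored back to back with no extra framing, and the code relies only on the standard BSON rule that a document starts with its own total size as a little-endian int32. The documents are returned as raw bytes rather than decoded, and the names `read_bson_bytes` and `parse_shard_payload` are hypothetical.

```python
import struct
import uuid
from typing import List, Tuple


def read_bson_bytes(buf: bytes, offset: int) -> Tuple[bytes, int]:
    """Return the raw bytes of one BSON document at offset, plus the next offset."""
    (size,) = struct.unpack_from("<i", buf, offset)
    # The smallest valid BSON document is 5 bytes (int32 size + terminating NUL).
    if size < 5 or offset + size > len(buf):
        raise ValueError("invalid BSON document length")
    return buf[offset:offset + size], offset + size


def parse_shard_payload(payload: bytes) -> List[Tuple[str, bytes, bytes]]:
    """Split a shard payload into (record ID, fields BSON, metadata BSON) tuples."""
    (count,) = struct.unpack_from("<I", payload, 0)
    offset = 4
    records = []
    for _ in range(count):
        if offset + 16 > len(payload):
            raise ValueError("truncated record ID")
        # 16 raw bytes are formatted back to the dashed UUID string.
        record_id = str(uuid.UUID(bytes=payload[offset:offset + 16]))
        offset += 16
        fields, offset = read_bson_bytes(payload, offset)
        metadata, offset = read_bson_bytes(payload, offset)
        records.append((record_id, fields, metadata))
    return records
```

The input here is the payload only, that is, the bytes left after the version and type header and the trailing checksum of the versioned serialized format (§4) have been removed.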

3.4 Sort index

Sort indexes exist so that ordered and range queries (e.g. “list by date”, “find by hash”) can be answered without scanning the whole collection: the index keeps records ordered by the indexed field, and the B-tree supports efficient lookup and pagination. Photosphere uses two indexes on the metadata collection: hash (asc) and photoDate (desc). The hash index is needed to look up an asset by content hash (e.g. for deduplication, verify, or finding an existing record before import). The photoDate index is needed to list or browse assets by capture date (e.g. timeline view, newest first).

Sort indexes live under .db/bson/indexes/<collectionName>/<fieldName>_<direction>/ (e.g. .db/bson/indexes/metadata/hash_asc/, .db/bson/indexes/metadata/photoDate_desc/). The direction (asc or desc) and type (date, string, number) determine how values are compared (dates as timestamps, strings lexicographically, numbers numerically). The B-tree’s keys are the indexed values; leaf pages hold index entries (record ID, value, and a copy of the record’s fields).

Directory content

tree.dat: B-tree metadata and node descriptors. Versioned serialized format with type and checksum.
<pageId>: Leaf page files; page IDs are UUIDs. Each file is a versioned serialized page of index entries.
build.checkpoint: Optional JSON checkpoint for incremental index builds (also stored in encrypted form if the DB is encrypted).

tree.dat payload (version 2):

totalEntries (uint32), totalPages (uint32).
rootPageId (buffer/length-prefixed string).
fieldName, direction (buffer/length-prefixed strings).
type (uint8): 0 = none, 1 = date, 2 = string, 3 = number.
Reserved 8 bytes (uint64).
nodeCount (uint32).
For each node (by sorted pageId): pageId, node (keys, children, nextLeaf, previousLeaf), etc.
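The fixed part of the tree.dat payload (everything before the node list) can be read as in the sketch below. It assumes that length-prefixed strings carry a uint32 little-endian length followed by UTF-8 bytes, in line with the primitives described in §4; the node encoding is not specified here, so the reader stops at nodeCount. The names `read_string` and `parse_tree_header` are hypothetical.

```python
import struct
from typing import Any, Dict, Tuple

# Value of the "type" byte, as listed above.
INDEX_TYPES = {0: "none", 1: "date", 2: "string", 3: "number"}


def read_string(buf: bytes, offset: int) -> Tuple[str, int]:
    """Read a string with an assumed uint32 LE length prefix; return (text, next offset)."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start, end = offset + 4, offset + 4 + length
    if end > len(buf):
        raise ValueError("truncated string")
    return buf[start:end].decode("utf-8"), end


def parse_tree_header(payload: bytes) -> Tuple[Dict[str, Any], int]:
    """Parse the tree.dat fields up to nodeCount; return (header, offset of first node)."""
    total_entries, total_pages = struct.unpack_from("<II", payload, 0)
    offset = 8
    root_page_id, offset = read_string(payload, offset)
    field_name, offset = read_string(payload, offset)
    direction, offset = read_string(payload, offset)
    (type_code,) = struct.unpack_from("<B", payload, offset)
    offset += 1
    offset += 8  # skip the reserved uint64
    (node_count,) = struct.unpack_from("<I", payload, offset)
    offset += 4
    header = {
        "totalEntries": total_entries,
        "totalPages": total_pages,
        "rootPageId": root_page_id,
        "fieldName": field_name,
        "direction": direction,
        "type": INDEX_TYPES.get(type_code, "unknown"),
        "nodeCount": node_count,
    }
    return header, offset
```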

Leaf page file payload (version 1):

Record count (uint32 LE).
For each entry: record ID (length-prefixed buffer, UTF-8), value (BSON { value }), record fields (BSON document).

4. Versioned serialized file layout

Every serialized file (Merkle trees, BSON shards, sort index trees and pages, etc.) uses a single layout so that readers can verify and dispatch by type.

Format

[4 bytes], Version (uint32 LE).
[4 bytes], Type code (4-character ASCII). Identifies the kind of file; each kind has a distinct code (e.g. FTRE = files Merkle tree, BDBT = BSON database tree, SHAR = collection shard, COLT = collection and per-shard Merkle tree, IDXT = index B-tree metadata, IDXP = index leaf page). Stored in the same byte order as the rest of the file (e.g. little-endian as a uint32). Readers use the type code to route to the correct deserializer or to reject unknown types.
[payload], Version- and type-specific payload.
[32 bytes], SHA-256 checksum of the concatenation: version + type + payload.

Primitives are little-endian (uint32, int32, uint64, int64); strings and buffers are length-prefixed; documents use BSON (lengths 32-bit where length-prefixed).
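A reader for this layout can verify the checksum before touching the payload. The sketch below is illustrative only: it assumes the four type-code characters appear on disk in reading order (if a writer packs the code as a little-endian integer they may appear reversed), and the names `Envelope`, `build_envelope`, and `parse_envelope` are hypothetical.

```python
import hashlib
import struct
from typing import NamedTuple

HEADER_SIZE = 8     # 4-byte version + 4-byte type code
CHECKSUM_SIZE = 32  # SHA-256 digest


class Envelope(NamedTuple):
    version: int
    type_code: str
    payload: bytes


def build_envelope(version: int, type_code: str, payload: bytes) -> bytes:
    """Serialize version + type + payload and append the SHA-256 of those bytes."""
    code = type_code.encode("ascii")
    if len(code) != 4:
        raise ValueError("type code must be exactly 4 ASCII characters")
    body = struct.pack("<I", version) + code + payload
    return body + hashlib.sha256(body).digest()


def parse_envelope(data: bytes) -> Envelope:
    """Verify the trailing checksum, then split the header from the payload."""
    if len(data) < HEADER_SIZE + CHECKSUM_SIZE:
        raise ValueError("file too short for the versioned serialized format")
    body, checksum = data[:-CHECKSUM_SIZE], data[-CHECKSUM_SIZE:]
    if hashlib.sha256(body).digest() != checksum:
        raise ValueError("checksum mismatch")
    (version,) = struct.unpack_from("<I", body, 0)
    # Assumption: the type code is stored as four ASCII bytes in reading order.
    type_code = body[4:8].decode("ascii")
    return Envelope(version, type_code, body[HEADER_SIZE:])
```

Checking the checksum first means a truncated or corrupted file is rejected before any type-specific deserializer runs.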

5. Encryption

Almost all files under the database root are encrypted when the database is encrypted. The following files are always stored in plain text (never encrypted):

README.md
.db/config.json
.db/encryption.pub

All other files (everything under asset/, display/, and thumb/, plus .db/files.dat, .db/bson/*, and .db/write.lock) are stored in the encrypted file format. There is no mixed encrypted/unencrypted layout for these files.

Encrypted file format:

Each encrypted file uses one of two formats:

New format: A fixed header (unencrypted), then the encrypted payload. The header lets the app identify the format and which key was used without decrypting.
Legacy format: No header; the file starts directly with the encrypted payload. Readers treat such files as encrypted with the default key (see Encryption).

New-format header (44 bytes):

[4 bytes], Tag (e.g. PSEN), 4-character ASCII. If the first 4 bytes are not this tag, the file is treated as legacy format.
[4 bytes], Format version (uint32 LE).
[4 bytes], Encryption type (4-character ASCII, e.g. A2CB).
[32 bytes], Key hash (SHA-256 of the public key used to encrypt this file, for key lookup).

Encrypted payload:

For new format, the payload immediately follows the 44-byte header. For legacy format, the payload starts at byte 0.
Payload layout is the same in both cases: RSA-wrapped AES key (512 bytes), IV (16 bytes), then AES-256-CBC ciphertext. The plaintext that is encrypted is the full serialized content (e.g. the versioned serialized blob with version, type, payload, and checksum, or a raw media blob).
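The header and payload layout can be separated without any key material, which is what header inspection needs. The sketch below only slices the bytes and does not decrypt. It assumes the tag is the PSEN value given as an example above; the names `EncryptedFile` and `split_encrypted_file` are hypothetical.

```python
import struct
from typing import NamedTuple, Optional

TAG = b"PSEN"           # assumption: the example tag from the header description
HEADER_SIZE = 44        # 4 (tag) + 4 (format version) + 4 (encryption type) + 32 (key hash)
WRAPPED_KEY_SIZE = 512  # RSA-wrapped AES key
IV_SIZE = 16


class EncryptedFile(NamedTuple):
    format_version: Optional[int]   # None for legacy files
    encryption_type: Optional[str]  # None for legacy files
    key_hash: Optional[bytes]       # None for legacy files (default key)
    wrapped_key: bytes
    iv: bytes
    ciphertext: bytes


def split_encrypted_file(raw: bytes) -> EncryptedFile:
    """Split raw on-disk bytes into header fields and payload parts."""
    format_version = encryption_type = key_hash = None
    body = raw
    if raw[:4] == TAG:
        if len(raw) < HEADER_SIZE:
            raise ValueError("truncated encryption header")
        (format_version,) = struct.unpack_from("<I", raw, 4)
        encryption_type = raw[8:12].decode("ascii")
        key_hash = raw[12:HEADER_SIZE]
        body = raw[HEADER_SIZE:]
    # Legacy files have no header: the payload starts at byte 0.
    if len(body) < WRAPPED_KEY_SIZE + IV_SIZE:
        raise ValueError("truncated encrypted payload")
    wrapped_key = body[:WRAPPED_KEY_SIZE]
    iv = body[WRAPPED_KEY_SIZE:WRAPPED_KEY_SIZE + IV_SIZE]
    ciphertext = body[WRAPPED_KEY_SIZE + IV_SIZE:]
    return EncryptedFile(format_version, encryption_type, key_hash, wrapped_key, iv, ciphertext)
```

The key hash returned for new-format files is what a reader would compare against the SHA-256 of each known public key to pick the matching private key.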

Raw storage: When the application needs to inspect the on-disk bytes of a file before decryption (for example, to read the encryption header and determine which key was used), it uses a raw storage instance. Raw storage is a storage object created without an encryption layer: it returns the exact bytes stored on disk, including any encryption header. The normal (encrypted) storage instance, by contrast, transparently decrypts file content on read. Raw storage is used exclusively for header inspection and must never be used to read or interpret file payloads that require decryption.

6. Config (`.db/config.json`)

A database stores optional configuration in .db/config.json, a JSON file under the .db directory. This file is created when the database is initialized (psi init) or when it is upgraded (psi upgrade); if it does not exist after upgrade, an empty object {} is written.

6.1 Fields

Field	Type	Description
`origin`	string (optional)	Path or URI of the database this copy was replicated from. Set on the replica when you run `psi replicate` (the replica’s `origin` is the source path). Can be set manually with `psi set-origin <path>`.
`lastReplicatedAt`	string (optional)	ISO 8601 date-time when this database was last replicated (i.e. when it was written as a replica from a source). Updated on the replica after each `psi replicate`.
`lastSyncedAt`	string (optional)	ISO 8601 date-time when this database was last synchronized with another. Updated on both sides after each `psi sync`.
`lastModifiedAt`	string (optional)	ISO 8601 date-time when this database was last modified locally (e.g. adding an asset, removing an asset, or editing metadata). Updated by `psi add`, `psi remove`, and by API writes.

6.2 Example

{
  "origin": "/path/to/source/database",
  "lastReplicatedAt": "2026-02-01T12:00:00.000Z",
  "lastSyncedAt": "2026-02-01T14:30:00.000Z",
  "lastModifiedAt": "2026-02-01T10:15:00.000Z"
}

6.3 Use of `origin`

The origin value is used as the default for:

psi sync: If --dest is omitted, the other database is taken from origin.
psi replicate: If --dest is omitted, the destination is taken from origin (e.g. replicate from a copy back to its source).
psi repair: If --source is omitted, the repair source is taken from origin.
psi compare: If --dest is omitted, the second database is taken from origin.

Commands psi origin and psi set-origin <path> display and set the origin field.

If the database was not created by replication and no origin has been set, origin may be absent; in that case, sync/replicate/repair/compare require an explicit --dest or --source.
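The fallback rule described in this section can be sketched as below. This is not how psi is implemented; it only illustrates the precedence (explicit argument first, then origin from .db/config.json, otherwise an error). The function name `resolve_other_database` is hypothetical.

```python
import json
import os
from typing import Optional


def resolve_other_database(db_root: str, explicit: Optional[str] = None) -> str:
    """Return the explicit --dest/--source value if given, else the configured origin."""
    if explicit:
        return explicit
    config_path = os.path.join(db_root, ".db", "config.json")
    try:
        with open(config_path, "r", encoding="utf-8") as f:
            config = json.load(f)
    except FileNotFoundError:
        config = {}  # a missing config file is treated like an empty one
    origin = config.get("origin")
    if not origin:
        raise ValueError("no origin is set; pass --dest or --source explicitly")
    return origin
```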

7. Partial (lazy) databases

A database can be full or partial. The layout and file formats are identical; the difference is which files are physically present on disk.

7.1 Full vs partial

Full database: All asset files are stored; asset/, display/, and thumb/ each have one file per asset. The BSON database under .db/bson/ and all of .db/ are complete. This is the normal state after import or after a full psi replicate.

Partial database: Created with psi replicate --partial. The following files are copied from the source:

File(s)	Purpose
`README.md`	Auto-generated usage instructions.
`.db/files.dat`	Complete files Merkle tree; lists all asset, display, and thumb paths (even those not yet fetched).
`.db/config.json`	Config including `origin` (path to the source database).
`.db/bson/*.dat`	All `.dat` files under `.db/bson/`: the database, collection, and shard Merkle trees (`db.dat`, `collection.dat`, shard `.dat` files) and the sort-index B-tree metadata (index `tree.dat` files).

The following files are not copied during a partial replication:

File(s)	Why omitted
`asset/<uuid>`	Original media files; large and not needed for browsing.
`display/<uuid>`	Display-sized derivatives; fetched on demand when a photo is opened.
`thumb/<uuid>`	Thumbnails; fetched on demand for the gallery grid.
`.db/bson/collections/<name>/shards/<id>` (no extension)	Shard data files containing asset records; fetched on demand when the gallery loads.
Sort index leaf pages (UUID-named files under `indexes/`)	Leaf pages; fetched on demand as the gallery paginates.

The isPartial: true flag is written into the database metadata embedded in .db/files.dat. This flag tells tools that missing files are expected (e.g. psi verify does not report them as errors).

7.2 Lazy-pull from origin

When the backend serves a partial database to the GUI, missing files are fetched transparently from the origin database on first access and cached locally for subsequent requests. No special action is needed from the user or the GUI.

The origin path is read from .db/config.json (origin field). If origin is absent or the database is not partial, the backend reads files locally as normal.

What is fetched lazily and when

Resource	Trigger
Sort index leaf pages	When the gallery loads and paginates through assets.
Shard data files	When the gallery reads asset metadata records.
`thumb/<uuid>`	When a thumbnail is displayed in the gallery grid.
`display/<uuid>`	When a photo is opened at full size.
`asset/<uuid>`	When the original file is downloaded or exported.

Cache behaviour

First access: The backend fetches the file from origin and writes it to the local database directory. Subsequent accesses are served from local storage with no round-trip to origin.
Large files (e.g. 7 GB videos): readStream uses a tee-stream so that origin data is forwarded directly to the HTTP response and simultaneously written to the local cache; the file is never buffered in memory.
Cache write failures: If writing to the local cache fails (e.g. disk full), the data is still streamed to the caller from origin. The file will be fetched from origin again on the next request.
Concurrent requests: If two requests arrive for the same missing file simultaneously, both may fetch from origin independently. The writes are idempotent (identical data), so no corruption occurs.

Diagram

GUI request
    │
    ▼
Backend (LazyOriginStorage)
    │
    ├── file in local cache? ──yes──► stream from local
    │
    └── no ──► fetch from origin ──► tee ──► stream to caller
                                         └──► write to local cache
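The flow in the diagram can be sketched as a chunked generator. This is an illustration in Python rather than the backend's own code, and it treats the origin as a local file path for simplicity. Writing to a temporary file and renaming it into place on completion is an assumption of this sketch, chosen so that a half-written file is never mistaken for a cached one; the names `read_stream`, `_open_cache`, and `_discard` are hypothetical.

```python
import os
import tempfile
from typing import Iterator

CHUNK_SIZE = 64 * 1024  # small chunks, so large files are never held in memory


def _open_cache(local_path: str):
    """Try to open a temp file beside local_path; return (file, path) or (None, None)."""
    try:
        directory = os.path.dirname(local_path) or "."
        os.makedirs(directory, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        return os.fdopen(fd, "wb"), tmp_path
    except OSError:
        return None, None


def _discard(cache, tmp_path) -> None:
    """Best-effort cleanup of a partially written cache file."""
    try:
        if cache is not None:
            cache.close()
    except OSError:
        pass
    try:
        if tmp_path is not None:
            os.remove(tmp_path)
    except OSError:
        pass


def read_stream(local_path: str, origin_path: str) -> Iterator[bytes]:
    """Yield file content from the local cache, or tee it from origin into the cache."""
    if os.path.exists(local_path):
        with open(local_path, "rb") as local:
            while chunk := local.read(CHUNK_SIZE):
                yield chunk
        return

    cache, tmp_path = _open_cache(local_path)
    complete = False
    try:
        with open(origin_path, "rb") as origin:
            while chunk := origin.read(CHUNK_SIZE):
                if cache is not None:
                    try:
                        cache.write(chunk)
                    except OSError:
                        # Cache write failed (e.g. disk full): keep streaming to the caller.
                        _discard(cache, tmp_path)
                        cache, tmp_path = None, None
                yield chunk
        complete = True
    finally:
        if cache is not None and complete:
            try:
                cache.close()
                os.replace(tmp_path, local_path)  # publish the finished file
            except OSError:
                _discard(None, tmp_path)
        elif cache is not None:
            _discard(cache, tmp_path)  # caller stopped early or origin read failed
```

If two callers fetch the same missing file at once, each writes its own temporary file and the last rename wins; since both contain identical data, the result is the same either way.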

7.3 Creating a partial database

psi replicate --db <source-path> --dest <local-path> --partial

The destination will have isPartial: true in .db/files.dat and an origin entry in .db/config.json pointing to the source. This is all the backend needs to perform lazy-pulls automatically.

To convert a partial database into a full one, run a normal psi replicate from the origin to the partial database (without --partial). This fills in all missing files in bulk.

7.4 Verify behaviour

psi verify reads the isPartial flag and skips integrity checks for files that are absent from a partial database. A missing shard, thumb, display, or asset file in a partial database is reported as unmodified (expected), not as an error.

8. Asset record shape

The metadata collection stores asset records. Main fields:

_id, UUID string.
origFileName, origPath?, contentType, width, height, hash.
coordinates?, location?, duration?, fileDate, photoDate?, uploadDate.
properties?, labels?, description?, deleted?.
micro, base64 micro thumbnail.
color, [number, number, number] (e.g. dominant color).

Sort indexes used in practice: hash (asc), photoDate (desc).
