Ashley Davis edited this page Mar 14, 2026 · 11 revisions

Photosphere Database Format

This document describes the on-disk layout and binary formats used by the Photosphere media database (current format).

Database version: The database version is determined by the version field in .db/files.dat. The current format is version 6. The legacy format is version 5 (see Database-Format-Legacy.md). Most psi commands only work with version 6. The command psi upgrade migrates a database from older versions (including legacy version 5) to version 6.

Important: Do not modify database files manually. Use the Photosphere CLI (psi) for all operations.

1. Top-level directory layout

A database is a single root directory. All paths below are relative to that root.

Path        Description
README.md   Auto-generated warning and usage instructions.
.db/        Database integrity data and media files metadata; see below.
asset/      Original imported media files (one file per asset, keyed by asset UUID, no extension).
display/    Display-sized derivatives (e.g. max 1000px, JPEG). One file per asset, keyed by UUID.
thumb/      Thumbnail derivatives (e.g. max 300px, JPEG). One file per asset, keyed by UUID.

The asset, display and thumb directories contain media files (photos and videos). The .db/ directory contains the data used to verify the database and the metadata for media files.

When the database is encrypted, all files under the database root are stored in the encrypted file format (see §5).

Example database structure

<database root>/
  README.md
  .db/
    files.dat             # Files Merkle tree (versioned + type + checksum, then encrypted)
    config.json           # Config (origin, etc.) (optional)
    write.lock
    encryption.pub        # Optional encryption marker - the public key if the db is encrypted.
    bson/                 # BSON database root
      db.dat              # Database Merkle tree
      collections/
        metadata/         # "metadata" collection
          shards/
            <shardId>     # Shard data
            <shardId>.dat # Shard Merkle tree
          collection.dat  # Collection Merkle tree
      indexes/
        metadata/
          hash_asc/
            tree.dat
            <pageId>      # UUID-named leaf pages
          photoDate_desc/
            tree.dat
            <pageId>
  asset/
    <uuid>                # Original media (encrypted)
  display/
    <uuid>                # Display media.
  thumb/
    <uuid>                # Thumbnail media.

2. The .db/ directory

The .db/ directory contains all control and structured data: the BSON database, files Merkle tree, config (including origin), and lock/marker files. It is used to validate the integrity of the database and stores metadata about media files.

Directory content

Path                 Description
.db/bson/            BSON database root: all structured metadata and indexes (see §3).
.db/files.dat        Files Merkle tree: for each file under asset/, display/, and thumb/ only (the tree does not include an entry for itself or other .db/ files), stores hash, length, and lastModified of the logical (plain/decrypted) content. Used to verify integrity and compare databases; plain and encrypted databases with the same content compare equal.
.db/config.json      Configuration file: JSON object with an origin field (path to the database this copy was replicated from) and room for other settings (see §6). Used for sync, repair, and fulfilling missing files.
.db/write.lock       Write lock (when held).
.db/encryption.pub   Optional marker: copy of the public key used for encryption (enables “this DB is encrypted” detection).

Serialized files under .db/ use the versioned serialized format (version, type, payload, checksum) before encryption (see §4).

3. BSON database under .db/bson/

Structured metadata is stored in a BSON-based layout with sharded collections and sort indexes; its root is .db/bson/.

Directory content

Path           Description
db.dat         Database Merkle tree. Used to verify the integrity of the database and compare databases for differences.
collections/   One subdirectory per collection; each contains a shards/ subdirectory (shard files and shard Merkle trees <shardId>.dat) and collection.dat at the collection root (see §3.2).
indexes/       One subdirectory per sort index, named <collectionName>/<fieldName>_<direction>/; each contains B-tree metadata and leaf page files (see §3.4).

3.2 Collection

A collection contains records that share the same schema or purpose, so different kinds of data can be stored and queried separately. Each collection is a directory under .db/bson/collections/ (e.g. .db/bson/collections/metadata for the asset metadata collection). The Photosphere app uses a single collection named metadata where each record describes a media file (photo or video).

Directory content

  • shards/ — Subdirectory containing all shard data for the collection:
    • Shard files — one file per shard, named by shard ID (e.g. 0, 1, …, 96). No extension. Shard ID is md5(recordId)[0:8] % numShards (default 100 shards). Path: .db/bson/collections/<collectionName>/shards/<shardId>.
    • Shard Merkle trees — next to each shard file: <shardId>.dat (e.g. 96.dat). Used to build the collection Merkle tree. Path: .db/bson/collections/<collectionName>/shards/<shardId>.dat.
  • Collection Merkle tree: .db/bson/collections/<collectionName>/collection.dat (e.g. .db/bson/collections/metadata/collection.dat), at the collection root. Aggregates shard root hashes.
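
The shard ID formula above could be sketched as follows. The text does not say whether [0:8] selects the first 8 hex characters or the first 8 bytes of the MD5 digest, nor whether the record ID is hashed with or without dashes. This sketch assumes the first 8 hex characters, parsed as a base-16 integer, and hashes the ID string exactly as given. The function name shard_id is hypothetical.

```python
import hashlib


def shard_id(record_id: str, num_shards: int = 100) -> int:
    """Map a record ID to a shard number in the range [0, num_shards)."""
    digest_hex = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    # Assumption: "[0:8]" means the first 8 hex characters (32 bits) of the digest.
    return int(digest_hex[:8], 16) % num_shards


# The same record ID always maps to the same shard file.
print(shard_id("123e4567-e89b-12d3-a456-426614174000"))
```

Because the shard is derived only from the record ID, a reader can locate a record's shard file without consulting any index.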

3.3 Shard

A shard is a file that holds many collection records; records are distributed across shards by shard ID (for the shard ID formula and collection layout, see §3.2). Shards exist because storing each record as its own file is expensive: every file occupies at least one file-system block (typically 4 KB, at least on Linux). Packing many records into one shard uses far less space than storing them individually (say, as one JSON file per record).

Shard file format

Shard files use the versioned serialized format (see §4): version, type, payload, then SHA-256 checksum. The payload (e.g. version 2) is:

  • [4 bytes] — Record count (uint32 LE).
  • For each record (sorted by _id):
    • [16 bytes] — Record ID as raw UUID bytes (no dashes, 16 bytes hex decoded).
    • [BSON] — Record fields (BSON document; _id is stored separately).
    • [BSON] — Metadata: { timestamp?, fields? } for field-level timestamps.

Record IDs are normalized to hex without dashes (32 hex characters, stored as 16 raw bytes) for shard keying; on read they are formatted back to the standard UUID string.
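
As a rough illustration of the payload layout above, the following sketch walks a shard payload and splits it into records without decoding the BSON. It relies on the fact that every BSON document begins with its own total size as a little-endian int32. The function names are hypothetical, and the input is assumed to be the payload only (version, type, and checksum already stripped, see §4).

```python
import struct
import uuid


def read_bson_document(buf: bytes, offset: int) -> tuple[bytes, int]:
    """Return the raw bytes of one BSON document and the offset just past it."""
    # A BSON document starts with its total size (int32 LE), including the size field.
    (size,) = struct.unpack_from("<i", buf, offset)
    if size < 5 or offset + size > len(buf):
        raise ValueError("invalid BSON document length")
    return buf[offset:offset + size], offset + size


def parse_shard_payload(payload: bytes) -> list[tuple[str, bytes, bytes]]:
    """Split a shard payload into (record ID, fields BSON, metadata BSON) tuples."""
    (count,) = struct.unpack_from("<I", payload, 0)  # record count, uint32 LE
    offset = 4
    records = []
    for _ in range(count):
        if offset + 16 > len(payload):
            raise ValueError("truncated record ID")
        # 16 raw bytes, formatted back to the standard dashed UUID string.
        record_id = str(uuid.UUID(bytes=payload[offset:offset + 16]))
        offset += 16
        fields, offset = read_bson_document(payload, offset)
        metadata, offset = read_bson_document(payload, offset)
        records.append((record_id, fields, metadata))
    return records
```

The raw BSON bytes can then be handed to any BSON decoder; none is used here so that the sketch depends only on the standard library.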

3.4 Sort index

Sort indexes exist so that ordered and range queries (e.g. “list by date”, “find by hash”) can be answered without scanning the whole collection: the index keeps records ordered by the indexed field, and the B-tree supports efficient lookup and pagination. Photosphere uses two indexes on the metadata collection: hash (asc) and photoDate (desc). The hash index is needed to look up an asset by content hash (e.g. for deduplication, verify, or finding an existing record before import). The photoDate index is needed to list or browse assets by capture date (e.g. timeline view, newest first).

Sort indexes live under .db/bson/indexes/<collectionName>/<fieldName>_<direction>/ (e.g. .db/bson/indexes/metadata/hash_asc/, .db/bson/indexes/metadata/photoDate_desc/). The direction (asc or desc) and type (date, string, number) determine how values are compared (dates as timestamps, strings lexicographically, numbers numerically). The B-tree’s keys are the indexed values; leaf pages hold index entries (record ID, value, and a copy of the record’s fields).
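
The comparison rules above (dates as timestamps, strings lexicographically, numbers numerically, reversed for desc) can be illustrated with a small sketch. It is not the B-tree implementation; it only shows how type and direction could determine ordering. The function names are hypothetical, and Python's string comparison (by code point) is assumed to be an acceptable stand-in for the lexicographic comparison the index uses.

```python
from datetime import datetime, timezone


def index_sort_key(value, value_type: str):
    """Return a comparable key for an indexed value of the given type."""
    if value_type == "date":
        return value.timestamp()  # dates compare as timestamps
    if value_type == "number":
        return float(value)       # numbers compare numerically
    if value_type == "string":
        return str(value)         # strings compare lexicographically
    raise ValueError(f"unsupported index type: {value_type}")


def order_values(values, value_type: str, direction: str):
    """Order values the way an index with this type and direction would."""
    if direction not in ("asc", "desc"):
        raise ValueError(f"unsupported direction: {direction}")
    return sorted(
        values,
        key=lambda v: index_sort_key(v, value_type),
        reverse=(direction == "desc"),
    )


# Example: a photoDate (desc) ordering lists the newest capture date first.
dates = [
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 6, 1, tzinfo=timezone.utc),
]
newest_first = order_values(dates, "date", "desc")
```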

Directory content

  • tree.dat — B-tree metadata and node descriptors. Versioned serialized format with type and checksum.
  • <pageId> — Leaf page files; page IDs are UUIDs. Each file is a versioned serialized page of index entries.
  • build.checkpoint — Optional JSON checkpoint for incremental index builds (also stored in encrypted form if the DB is encrypted).

tree.dat payload (version 2):

  • totalEntries (uint32), totalPages (uint32).
  • rootPageId (buffer/length-prefixed string).
  • fieldName, direction (buffer/length-prefixed strings).
  • type (uint8): 0 = none, 1 = date, 2 = string, 3 = number.
  • Reserved 8 bytes (uint64).
  • nodeCount (uint32).
  • For each node (by sorted pageId): pageId, node (keys, children, nextLeaf, previousLeaf), etc.

Leaf page file payload (version 1):

  • Record count (uint32 LE).
  • For each entry: record ID (length-prefixed buffer, UTF-8), value (BSON { value }), record fields (BSON document).

4. Versioned serialized file layout

Every serialized file (Merkle trees, BSON shards, sort index trees and pages, etc.) uses a single layout so that readers can verify and dispatch by type.

Format

  • [4 bytes] — Version (uint32 LE).
  • [4 bytes]: Type code (4-character ASCII, 32 bits). Identifies the kind of file. Each file kind has a distinct 4-byte ASCII code (e.g. FTRE = files Merkle tree, BDBT = BSON database tree, SHAR = collection shard, COLT = collection Merkle tree, IDXT = index B-tree metadata, IDXP = index leaf page). Stored in the same byte order as the rest of the file (e.g. little-endian as a uint32). Writers and readers use the same code for each kind. Readers use the type code to route to the correct deserializer or reject unknown types.
  • [payload] — Version- and type-specific payload.
  • [32 bytes] — SHA-256 checksum of the concatenation: version + type + payload.

Integer primitives (uint32, int32, uint64, int64) are little-endian; strings and buffers are length-prefixed (with a 32-bit length); documents use BSON.
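
The layout above could be written and verified along these lines. This is a minimal sketch, not the Photosphere implementation: the function names are hypothetical, and the type code is handled as four raw bytes without interpreting how the ASCII characters are ordered on disk.

```python
import hashlib
import struct

CHECKSUM_SIZE = 32  # SHA-256 digest length in bytes


def pack_serialized(version: int, type_code: bytes, payload: bytes) -> bytes:
    """Build version + type + payload, followed by the SHA-256 of those bytes."""
    if len(type_code) != 4:
        raise ValueError("type code must be exactly 4 bytes")
    body = struct.pack("<I", version) + type_code + payload
    return body + hashlib.sha256(body).digest()


def unpack_serialized(blob: bytes) -> tuple[int, bytes, bytes]:
    """Verify the trailing checksum and return (version, type code, payload)."""
    if len(blob) < 4 + 4 + CHECKSUM_SIZE:
        raise ValueError("file too short")
    body, checksum = blob[:-CHECKSUM_SIZE], blob[-CHECKSUM_SIZE:]
    if hashlib.sha256(body).digest() != checksum:
        raise ValueError("checksum mismatch")
    (version,) = struct.unpack_from("<I", body, 0)  # uint32 LE
    return version, body[4:8], body[8:]
```

A reader would typically call unpack_serialized first and then dispatch on the returned type code, rejecting codes it does not recognise.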

5. Encryption

All files under the database root are encrypted when the database is encrypted. There is no mixed encrypted/unencrypted layout: asset, display, thumb, and the entire .db/ tree (including .db/files.dat, .db/bson/*, .db/config.json, .db/write.lock, .db/encryption.pub) are stored in the encrypted file format.

Encrypted file format:

Each encrypted file uses one of two formats:

  • Current format: A fixed header (unencrypted), then the encrypted payload. The header lets the app identify the format and which key was used without decrypting.
  • Legacy format: No header; the file starts directly with the encrypted payload. Readers treat such files as encrypted with the default key (see Encryption).

Current-format header (44 bytes):

  • [4 bytes]: Tag (e.g. PSEN), 4-character ASCII. If the first 4 bytes are not this tag, the file is treated as legacy format.
  • [4 bytes]: Format version (uint32 LE).
  • [4 bytes]: Encryption type (4-character ASCII, e.g. A2CB).
  • [32 bytes]: Key hash (SHA-256 of the public key used to encrypt this file, for key lookup).

Encrypted payload:

  • For current format, the payload immediately follows the 44-byte header. For legacy format, the payload starts at byte 0.
  • Payload layout is the same in both cases: RSA-wrapped AES key (512 bytes), IV (16 bytes), then AES-256-CBC ciphertext. The plaintext that is encrypted is the full serialized content (e.g. the versioned serialized blob with version, type, payload, and checksum, or a raw media blob).

Reading:

  • (1) If file length < 4 bytes → error.
  • (2) If first 4 bytes ≠ tag → legacy format: decrypt payload with the default key.
  • (3) If first 4 bytes = tag: if length < 44 → error; else read version, type, and 32-byte key hash; look up the private key by hash in the key map; decrypt the payload.
  • (4) If the payload is a serialized file, verify checksum and dispatch on type.
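
The format-detection part of the reading procedure could look roughly like the sketch below. It only splits a file into header fields, wrapped key, IV, and ciphertext; key lookup and the RSA/AES decryption are left out because they need a cryptography library. The tag value PSEN is taken from the example in the header description and is an assumption, as are the names EncryptedFile and split_encrypted_file.

```python
import struct
from dataclasses import dataclass
from typing import Optional

TAG = b"PSEN"           # assumption: the example tag from the header description
HEADER_SIZE = 44        # 4 (tag) + 4 (version) + 4 (type) + 32 (key hash)
WRAPPED_KEY_SIZE = 512  # RSA-wrapped AES key
IV_SIZE = 16


@dataclass
class EncryptedFile:
    format_version: Optional[int]     # None for legacy files
    encryption_type: Optional[bytes]  # None for legacy files
    key_hash: Optional[bytes]         # None for legacy files (default key)
    wrapped_key: bytes
    iv: bytes
    ciphertext: bytes


def split_encrypted_file(data: bytes) -> EncryptedFile:
    """Detect current vs legacy format and split the file into its parts."""
    if len(data) < 4:
        raise ValueError("file too short")
    if data[:4] != TAG:
        # Legacy format: no header, the payload starts at byte 0.
        version = enc_type = key_hash = None
        payload = data
    else:
        if len(data) < HEADER_SIZE:
            raise ValueError("truncated header")
        (version,) = struct.unpack_from("<I", data, 4)
        enc_type = data[8:12]
        key_hash = data[12:44]
        payload = data[HEADER_SIZE:]
    if len(payload) < WRAPPED_KEY_SIZE + IV_SIZE:
        raise ValueError("truncated payload")
    wrapped_key = payload[:WRAPPED_KEY_SIZE]
    iv = payload[WRAPPED_KEY_SIZE:WRAPPED_KEY_SIZE + IV_SIZE]
    ciphertext = payload[WRAPPED_KEY_SIZE + IV_SIZE:]
    return EncryptedFile(version, enc_type, key_hash, wrapped_key, iv, ciphertext)
```

A caller would use key_hash to pick the private key from its key map (or fall back to the default key when key_hash is None) before unwrapping the AES key.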

6. Config (.db/config.json)

A database stores optional configuration in .db/config.json, a JSON file under the .db directory. This file is created when the database is initialized (psi init) or when it is upgraded (psi upgrade); if it does not exist after upgrade, an empty object {} is written.

6.1 Fields

Field              Type                Description
origin             string (optional)   Path or URI of the database this copy was replicated from. Set on the replica when you run psi replicate (the replica’s origin is the source path). Can be set manually with psi set-origin <path>.
lastReplicatedAt   string (optional)   ISO 8601 date-time when this database was last replicated (i.e. when it was written as a replica from a source). Updated on the replica after each psi replicate.
lastSyncedAt       string (optional)   ISO 8601 date-time when this database was last synchronized with another. Updated on both sides after each psi sync.
lastModifiedAt     string (optional)   ISO 8601 date-time when this database was last modified locally (e.g. adding an asset, removing an asset, or editing metadata). Updated by psi add, psi remove, and by API writes.

6.2 Example

{
  "origin": "/path/to/source/database",
  "lastReplicatedAt": "2026-02-01T12:00:00.000Z",
  "lastSyncedAt": "2026-02-01T14:30:00.000Z",
  "lastModifiedAt": "2026-02-01T10:15:00.000Z"
}

6.3 Use of origin

The origin value is used as the default for:

  • psi sync — If --dest is omitted, the other database is taken from origin.
  • psi replicate — If --dest is omitted, the destination is taken from origin (e.g. replicate from a copy back to its source).
  • psi repair — If --source is omitted, the repair source is taken from origin.
  • psi compare — If --dest is omitted, the second database is taken from origin.

Commands psi origin and psi set-origin <path> display and set the origin field.

If the database was not created by replication and no origin has been set, origin may be absent; in that case, sync/replicate/repair/compare require an explicit --dest or --source.
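
The fallback described in this section could be sketched as follows: an explicit path wins, otherwise origin from .db/config.json is used, otherwise the command fails. The function name resolve_other_database is hypothetical, and the sketch assumes an unencrypted database (in an encrypted database, config.json would first have to be decrypted, see §5).

```python
import json
from pathlib import Path
from typing import Optional


def resolve_other_database(db_root: str, explicit: Optional[str] = None) -> str:
    """Pick the explicit --dest/--source value, falling back to the origin field."""
    if explicit:
        return explicit
    config_path = Path(db_root) / ".db" / "config.json"
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    origin = config.get("origin")
    if not origin:
        # No origin recorded: the caller must pass --dest or --source explicitly.
        raise ValueError("no origin set; an explicit path is required")
    return origin
```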

7. Partial vs full databases

A database can be full or partial. The layout and file formats are the same; the difference is which files are present on disk.

Full database: All asset files are stored: asset/, display/, and thumb/ each have one file per asset. The BSON database under .db/bson/ and the rest of .db/ are complete. This is the normal case after import or after a full replicate.

Partial database: Only thumb files and root-level files (e.g. README.md) are stored. The asset/ and display/ directories are missing or sparse. The BSON database under .db/bson/ is still complete (all asset records and indexes). Partial databases are created by replicating with the partial option.

The partial flag is stored in the files Merkle tree (.db/files.dat) in the database metadata: isPartial: true. Tools use this to treat missing asset/display files as expected (verify) and to only copy thumb and root-level files when syncing to a partial target. Missing files can be filled in lazily (e.g. download from the origin database as the user views photos in the gallery) or in bulk via a full replicate.

8. Asset record shape

The metadata collection stores asset records. Main fields:

  • _id — UUID string.
  • origFileName, origPath?, contentType, width, height, hash.
  • coordinates?, location?, duration?, fileDate, photoDate?, uploadDate.
  • properties?, labels?, description?, deleted?.
  • micro — base64 micro thumbnail.
  • color: [number, number, number] (e.g. dominant color).

Sort indexes used in practice: hash (asc), photoDate (desc).
