8 changes: 8 additions & 0 deletions src/current/molt/molt-replicator.md
Original file line number Diff line number Diff line change
@@ -678,6 +678,14 @@ MOLT Replicator metrics are not enabled by default. Enable Replicator metrics by
--metricsAddr :30005
~~~

Metrics can also be written to snapshot files at regular intervals. Metrics snapshotting is disabled by default. When metrics are enabled, you can enable snapshotting with the [`--metricsSnapshotPeriod`]({% link molt/replicator-flags.md %}#metrics-snapshot-period) flag. For example, the following flag writes a metrics snapshot every 15 seconds:

~~~
--metricsSnapshotPeriod 15s
~~~

Metrics snapshots provide access to metrics when the Prometheus server is unavailable, and you can send them to [CockroachDB support]({% link {{ site.current_cloud_version }}/support-resources.md %}) to help resolve an issue quickly.

For guidelines on using and interpreting replication metrics, refer to [Replicator Metrics]({% link molt/replicator-metrics.md %}).

### Logging
5 changes: 5 additions & 0 deletions src/current/molt/replicator-flags.md
@@ -19,6 +19,7 @@ This page lists all available flags for the [MOLT Replicator commands]({% link m
| <a id="claim"></a> `--claim` | `make-jwt` | `BOOL` | If `true`, print a minimal JWT claim instead of signing. |
| <a id="collapse-mutations"></a> `--collapseMutations` | `start`, `pglogical`, `mylogical` | `BOOL` | Combine multiple mutations on the same primary key within each batch into a single mutation.<br><br>**Default:** `true` |
| <a id="default-gtid-set"></a> `--defaultGTIDSet` | `mylogical` | `STRING` | **Required** the first time `replicator` is run. The default GTID set, in the format `source_uuid:min(interval_start)-max(interval_end)`, which provides a replication marker for streaming changes. |
| <a id="data-dir"></a> `--dataDir` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Base directory for replicator data (for example, metrics snapshots).<br><br>**Default:** `"replicator-data"` |
| <a id="disable-authentication"></a> `--disableAuthentication` | `start` | `BOOL` | Disable authentication of incoming Replicator requests; not recommended for production. |
| <a id="discard"></a> `--discard` | `start` | `BOOL` | **Dangerous:** Discard all incoming HTTP requests; useful for changefeed throughput testing. Not intended for production. |
| <a id="discard-delay"></a> `--discardDelay` | `start` | `DURATION` | Adds additional delay in discard mode; useful for gauging the impact of changefeed round-trip time (RTT). |
@@ -38,6 +39,10 @@ This page lists all available flags for the [MOLT Replicator commands]({% link m
| <a id="log-format"></a> `--logFormat` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Choose log output format: `"fluent"`, `"text"`.<br><br>**Default:** `"text"` |
| <a id="max-retries"></a> `--maxRetries` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `INT` | Maximum number of times to retry a failed mutation on the target (for example, due to contention or a temporary unique constraint violation) before treating it as a hard failure.<br><br>**Default:** `10` |
| <a id="metrics-addr"></a> `--metricsAddr` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | A `:port` or `host:port` on which to serve metrics and diagnostics. The metrics endpoint is `http://{host}:{port}/_/varz`. |
| <a id="metrics-snapshot-compression"></a> `--metricsSnapshotCompression` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Compression for snapshot files: `"gzip"` or `"none"`.<br><br>**Default:** `"gzip"` |
| <a id="metrics-snapshot-period"></a> `--metricsSnapshotPeriod` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `DURATION` | How often to write a metrics snapshot to a file (for example, `15s`, `1m`). Set to `0` to disable.<br><br>**Default:** `0` |
| <a id="metrics-snapshot-retention-size"></a> `--metricsSnapshotRetentionSize` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Delete the oldest snapshots when the total size of the `metrics-snapshots` directory in the [`--dataDir`](#data-dir) exceeds this value (for example, `100MB`, `1GiB`). Either this flag or `--metricsSnapshotRetentionTime` (or both) must be enabled.<br><br>**Default:** `""` |
| <a id="metrics-snapshot-retention-time"></a> `--metricsSnapshotRetentionTime` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `DURATION` | Delete snapshots older than this duration (for example, `24h`, `168h`). Set to `0` to disable. Either this flag or `--metricsSnapshotRetentionSize` (or both) must be enabled.<br><br>**Default:** `168h` |
| <a id="ndjson-buffer-size"></a> `--ndjsonBufferSize` | `start` | `INT` | The maximum amount of data to buffer while reading a single line of `ndjson` input; increase when source cluster has large blob values.<br><br>**Default:** `65536` |
| <a id="oracle-application-users"></a> `--oracle-application-users` | `oraclelogminer` | `STRING` | List of Oracle usernames responsible for DML transactions in the PDB schema. Enables replication from the latest-possible starting point. Usernames are case-sensitive and must match the internal Oracle usernames (e.g., `PDB_USER`). |
| <a id="out"></a> `-o`, `--out` | `make-jwt` | `STRING` | A file to write the token to. |
200 changes: 200 additions & 0 deletions src/current/molt/replicator-metrics.md
@@ -307,6 +307,206 @@ For checkpoint terminology, refer to the [MOLT Replicator documentation]({% link

[Read more about userscript metrics]({% link molt/userscript-metrics.md %}).

## Metrics snapshots

When enabled, the metrics snapshotter periodically writes out a point-in-time snapshot of Replicator's Prometheus metrics to a file in the [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir). Metrics snapshots can help with debugging when direct access to the Prometheus server is not available, and you can [bundle snapshots and send them to CockroachDB support](#bundle-and-send-metrics-snapshots) to help resolve an issue. A metrics snapshot includes all of the metrics on this page.

Metrics snapshotting is disabled by default, and can be enabled with the [`--metricsSnapshotPeriod`]({% link molt/replicator-flags.md %}#metrics-snapshot-period) Replicator flag. [Replicator metrics must be enabled](#set-up-metrics) (with the [`--metricsAddr`]({% link molt/replicator-flags.md %}#metrics-addr) flag) in order for metrics snapshotting to work.

If snapshotting is enabled, the snapshot period must be at least 15 seconds; the recommended range is 15-60 seconds. The retention policy for metrics snapshot files can be based on [time]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-time) and on the [total size]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-size) of the snapshot data subdirectory. At least one retention policy must be configured. Snapshot files are [compressed with gzip]({% link molt/replicator-flags.md %}#metrics-snapshot-compression) by default.
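
As a sketch of how the snapshot flags fit together, the following fragment enables snapshotting every 30 seconds and sets both a time-based and a size-based retention policy. The values (`30s`, `24h`, `1GiB`) are illustrative assumptions that follow the formats shown in the flag reference; they are not recommendations:

~~~
--metricsSnapshotPeriod 30s \
--metricsSnapshotRetentionTime 24h \
--metricsSnapshotRetentionSize 1GiB
~~~

Append these flags to a `replicator` command that also sets `--metricsAddr`.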

Changing the snapshotter's configuration requires restarting the Replicator binary with different flags.

### Enable metrics snapshotting

#### Step 1. Run Replicator with the snapshot flags

The following is an example of a `replicator` command where snapshotting is configured:

<div class="filters filters-big clearfix">
<button class="filter-button" data-scope="postgres">PostgreSQL</button>
<button class="filter-button" data-scope="mysql">MySQL</button>
<button class="filter-button" data-scope="oracle">Oracle</button>
<button class="filter-button" data-scope="cockroachdb">CockroachDB</button>
</div>

<section class="filter-content" markdown="1" data-scope="postgres">
{% include_cached copy-clipboard.html %}
~~~shell
replicator pglogical \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--slotName molt_slot \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

<section class="filter-content" markdown="1" data-scope="mysql">
{% include_cached copy-clipboard.html %}
~~~shell
replicator mylogical \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--defaultGTIDSet '4c658ae6-e8ad-11ef-8449-0242ac140006:1-29' \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

<section class="filter-content" markdown="1" data-scope="oracle">
{% include_cached copy-clipboard.html %}
~~~shell
replicator oraclelogminer \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--scn 26685786 \
--backfillFromSCN 26685444 \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

<section class="filter-content" markdown="1" data-scope="cockroachdb">
{% include_cached copy-clipboard.html %}
~~~shell
replicator start \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

If successful, Replicator starts, and the console output indicates that the snapshotter has also started:

~~~
INFO [Feb 2 10:20:32] Replicator starting
...
INFO [Feb 2 10:20:32] metrics snapshotter started, writing to replicator-data/metrics-snapshots every 15s, retaining 168h0m0s
~~~

When Replicator is interrupted, the snapshotter stops:

~~~
INFO [Feb 2 10:26:45] Interrupted
INFO [Feb 2 10:26:45] metrics snapshotter stopped
INFO [Feb 2 10:26:45] Server shutdown complete
~~~

#### Step 2. Find the snapshot files in the data directory

You can find the snapshot files in the [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir):

{% include_cached copy-clipboard.html %}
~~~shell
cd replicator-data/metrics-snapshots && ls . | tail -n 5
~~~

~~~
snapshot-20260202T152405.737Z.txt.gz
snapshot-20260202T152420.736Z.txt.gz
snapshot-20260202T152435.736Z.txt.gz
snapshot-20260202T152450.735Z.txt.gz
snapshot-20260202T152505.735Z.txt.gz
~~~
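
Because the timestamp in each filename is fixed-width, snapshots can be selected by time window with a plain string comparison. The following is a minimal sketch, assuming filenames follow the `snapshot-YYYYMMDDTHHMMSS.mmmZ` pattern shown in the preceding output; `snapshots_between` is a hypothetical helper name, not part of Replicator:

{% include_cached copy-clipboard.html %}
~~~shell
# List snapshot files whose filename timestamp falls within [start, end].
# Timestamps are given as YYYYMMDDTHHMMSS (UTC), matching the filenames.
snapshots_between() {
  dir=$1
  start=$2
  end=$3
  for f in "$dir"/snapshot-*; do
    [ -e "$f" ] || continue      # skip if the glob matched nothing
    name=${f##*/}                # strip the directory part
    ts=${name#snapshot-}         # strip the "snapshot-" prefix
    ts=${ts%%.*}                 # keep YYYYMMDDTHHMMSS, drop ".mmmZ.txt.gz"
    # Fixed-width, most-significant-first fields sort correctly as strings.
    if [ ! "$ts" \< "$start" ] && [ ! "$ts" \> "$end" ]; then
      printf '%s\n' "$name"
    fi
  done
}
~~~

For example, `snapshots_between replicator-data/metrics-snapshots 20260202T152400 20260202T152500` would print the names of the snapshots written in that one-minute window.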

Each decompressed file lists the metrics collected at that snapshot:

{% include_cached copy-clipboard.html %}
~~~shell
gzip -dc snapshot-20260202T152505.735Z.txt.gz | head -n 3
~~~

~~~
# HELP cdc_resolved_timestamp_buffer_size Current size of the resolved timestamp buffer channel which is yet to be processed by Pebble Stager
# TYPE cdc_resolved_timestamp_buffer_size gauge
cdc_resolved_timestamp_buffer_size 0.0 1.770045905735e+09
~~~
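
To look at a single metric rather than the whole file, the snapshot can be filtered by metric name. The following is a minimal sketch, assuming the line layout shown in the preceding output (comment lines begin with `#`, and sample lines begin with the metric name); `metric_samples` is a hypothetical helper name:

{% include_cached copy-clipboard.html %}
~~~shell
# Print the sample lines for one metric from a gzip-compressed snapshot file.
metric_samples() {
  file=$1
  metric=$2
  gzip -dc "$file" | awk -v m="$metric" '
    /^#/ { next }                # skip HELP and TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)     # drop a {label="..."} suffix, if present
      if (name == m) print
    }'
}
~~~

For example, `metric_samples snapshot-20260202T152505.735Z.txt.gz cdc_resolved_timestamp_buffer_size` prints only the sample lines for that gauge.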

### Bundle and send metrics snapshots

The following steps require a Linux system with bash.

#### Step 1. Download the export script

Download the [metrics snapshot export script](https://replicator.cockroachdb.com/export-metrics-snapshots.sh). Ensure it's accessible and can be run by the current user.
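
One way to do this from a terminal is sketched below. It assumes `curl` is installed; `fetch_export_script` is a hypothetical helper name used only for illustration:

{% include_cached copy-clipboard.html %}
~~~shell
# Download the export script into the current directory and make it executable.
fetch_export_script() {
  url="https://replicator.cockroachdb.com/export-metrics-snapshots.sh"
  # -f: fail on HTTP errors; -sS: quiet but show errors; -L: follow redirects;
  # -O: save under the remote filename.
  curl -fsSLO "$url" || return 1
  chmod u+x export-metrics-snapshots.sh
  # Succeed only if the current user can actually run the script.
  [ -x export-metrics-snapshots.sh ]
}
~~~

Calling `fetch_export_script` returns a non-zero status if the download fails or the script is not executable afterward.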

#### Step 2. Run a snapshot export

Run an export, passing the path to the `metrics-snapshots` directory within your [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir). You can also provide start and end timestamps to bundle only a subset of the snapshots. Specify times in UTC, in the format `YYYYMMDDTHHMMSS`.

Running the script without timestamps bundles all of the data in the snapshot directory. For example:

{% include_cached copy-clipboard.html %}
~~~shell
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots
~~~

Running the script with one timestamp bundles all of the data in the snapshot directory beginning at that timestamp. For example:

{% include_cached copy-clipboard.html %}
~~~shell
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000
~~~

Running the script with two timestamps bundles all of the data in the snapshot directory between the two timestamps. For example:

{% include_cached copy-clipboard.html %}
~~~shell
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000 20260115T140000
~~~

The resulting output is a `.tar.gz` file placed in the directory from which you ran the script (or at a path specified as an optional argument).
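
Before uploading, it can be useful to confirm that the bundle contains the snapshots you expect. The following is a minimal sketch that counts snapshot entries without extracting the archive. It assumes the bundle stores snapshot files under their original `snapshot-...` names, which is not confirmed here; `bundle_snapshot_count` is a hypothetical helper name:

{% include_cached copy-clipboard.html %}
~~~shell
# Count the snapshot files inside a .tar.gz bundle without extracting it.
bundle_snapshot_count() {
  # -t lists archive members, -z reads gzip, -f names the archive file.
  tar -tzf "$1" | grep -c 'snapshot-[0-9]'
}
~~~

Pass the path of the `.tar.gz` file produced by the export script as the only argument.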

#### Step 3. Attach the output file to a support ticket

Attach the bundled metrics snapshot file to a [support ticket]({% link {{ site.current_cloud_version }}/support-resources.md %}) to give the support team metrics that are relevant to your issue.

## See also

- [MOLT Replicator]({% link molt/molt-replicator.md %})
2 changes: 1 addition & 1 deletion src/current/molt/userscript-metrics.md
@@ -9,7 +9,7 @@ To improve observability and debugging in the field, [MOLT Replicator]({% link m

All userscript metrics include a `script_` prefix and are automatically labeled with the relevant schema or table for each configured handler (for example, `schema="target.public"`). If a userscript defines both schema-level and table-level handlers, separate label values will be created for each.

These metrics are part of the default [Replicator Prometheus metrics]({% link molt/replicator-metrics.md %}) set and can be visualized immediately using the provided [`replicator.json` Grafana dashboard file](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json).
These metrics are part of the default [Replicator Prometheus metrics]({% link molt/replicator-metrics.md %}) set and can be visualized immediately using the provided [`replicator.json` Grafana dashboard file](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json). They are also included in [Replicator metrics snapshots]({% link molt/replicator-metrics.md %}#metrics-snapshots).

Consider using these metrics to:
