Troubleshooting

Bob Vawter edited this page May 17, 2024 · 12 revisions

Questions

  • Are there errors in the Replicator logs?
    • Using --logFormat fluent and --logDestination path/to/replicator.log is a good choice for setting up log aggregation, especially if Replicator is being run as a replicated network service.
  • Is the source changefeed able to deliver data to Replicator?
    • Check the output of SHOW CHANGEFEED JOBS
    • Are there (retryable) error messages reported by the changefeed senders?
    • Is the resolved timestamp advancing?
    • Are webhook flush options set to enable bulk delivery?
  • Is Replicator actively staging data?
    • Look at the row counts for the staging tables in the _replicator database.
  • Is Replicator receiving resolved timestamps?
    • Look at the row count of _replicator.checkpoints
    • Performing a transactionally-consistent backfill is not recommended, since this would require the initial state of the database to be applied in a single transaction. Use --immediate or --backfillWindow modes, or bootstrap the destination from a BACKUP or EXPORT.
  • Are resolved timestamps falling behind?
    • Run SELECT now() - MAX(target_applied_at) FROM _replicator.checkpoints to see how long ago a resolved window was last processed.
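
The checks above can be scripted. The following is a minimal sketch, assuming the cockroach CLI is on the PATH. SOURCE_URL, STAGING_URL, run_sql and the DRY_RUN switch are hypothetical names introduced for illustration, and the connection strings are placeholders. STAGING_URL should point at whichever cluster holds the _replicator database.

```shell
# Placeholder connection strings (assumptions); point these at your clusters.
SOURCE_URL=${SOURCE_URL:-"postgresql://root@localhost:26257/?sslmode=disable"}
STAGING_URL=${STAGING_URL:-"postgresql://root@localhost:26258/?sslmode=disable"}

# run_sql URL STATEMENT: execute one statement, or only print it when DRY_RUN=1.
run_sql() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$2"
  else
    cockroach sql --url "$1" --format=table -e "$2"
  fi
}

# Source side: job status, error messages, and the high-water timestamp.
check_source() {
  run_sql "$SOURCE_URL" "SHOW CHANGEFEED JOBS"
}

# Staging side: staging activity and resolved-timestamp progress.
check_staging() {
  # Lists the staging tables together with their estimated row counts.
  run_sql "$STAGING_URL" "SHOW TABLES FROM _replicator"
  run_sql "$STAGING_URL" "SELECT count(*) FROM _replicator.checkpoints"
  run_sql "$STAGING_URL" "SELECT now() - max(target_applied_at) AS since_last_apply FROM _replicator.checkpoints"
}
```

Running check_source and check_staging a few times in a row shows whether the high-water timestamp and the checkpoint count are moving. Setting DRY_RUN=1 prints the statements instead of executing them.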

Useful queries

Discard mode

The Replicator server can be run with the --discard option to receive and discard HTTP payloads. This is useful when attempting to tune changefeed delivery for maximum throughput.

Latency reports

This query shows the commit-to-receipt latency for changefeed resolved timestamps, versus the amount of time it took to process the mutations after being received. This data is available from the Prometheus cdc_target_lag_... variables, too.

SELECT now() - source_wall_time AS age,
       first_seen - source_wall_time AS delivery_lag,
       target_applied_at - first_seen AS apply_time,
       target_applied_at - source_wall_time AS oall_lag
  FROM resolved_timestamps AS OF SYSTEM TIME follower_read_timestamp()
 WHERE (now() - source_wall_time) < '10 minutes'
   AND target_applied_at IS NOT NULL
 ORDER BY 2 DESC
 LIMIT 10;

        age       |  delivery_lag   |   apply_time    |    oall_lag
------------------+-----------------+-----------------+------------------
  00:06:28.113321 | 00:03:04.300841 | 00:00:23.002403 | 00:03:27.303244
  00:06:59.362791 | 00:02:50.482577 | 00:00:22.669238 | 00:03:13.151815
  00:05:45.162184 | 00:02:35.870601 | 00:01:09.308704 | 00:03:45.179305
  00:06:29.642173 | 00:02:34.868554 | 00:00:52.007338 | 00:03:26.875892
  00:04:37.958192 | 00:02:30.637819 | 00:01:22.906329 | 00:03:53.544148
  00:05:49.978872 | 00:02:30.297588 | 00:01:12.689824 | 00:03:42.987412
  00:04:26.906991 | 00:02:22.1765   | 00:01:32.945101 | 00:03:55.121601
  00:07:11.622545 | 00:02:21.574144 | 00:00:08.603641 | 00:02:30.177785
  00:04:15.491594 | 00:02:12.363109 | 00:01:43.361881 | 00:03:55.72499
  00:07:16.211537 | 00:01:44.072479 | 00:00:02.645284 | 00:01:46.717763

Actions

Internal diagnostic endpoint

Replicator provides introspection of its internal data structures through a diagnostic endpoint at /_/diag, available from the changefeed server or the --metricsAddr bind address. This endpoint returns a JSON blob describing many of Replicator's internal data structures. The payload may contain sensitive information. If Replicator authentication is enabled, the requestor must have access permissions to a schema named _.diag in order to make the request.

This same information can be sent to Replicator's logger by sending a SIGUSR1 to the Replicator process.

This data is for support purposes only and does not constitute a stable API.
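
As a rough sketch of both approaches, the helpers below fetch the diagnostic payload with curl and signal a running process. The address, the REPLICATOR_ADDR and REPLICATOR_TOKEN variables, and the function names are hypothetical. Sending the token as a Bearer authorization header is an assumption about how authentication is configured, so adjust it to match your deployment.

```shell
# Hypothetical address (assumption): use the changefeed server or --metricsAddr bind address.
REPLICATOR_ADDR=${REPLICATOR_ADDR:-"http://127.0.0.1:30004"}

# diag_url BASE: build the diagnostic URL, tolerating a trailing slash on BASE.
diag_url() {
  printf '%s/_/diag\n' "${1%/}"
}

# fetch_diag [FILE]: save the JSON payload to FILE, readable only by the current user.
fetch_diag() (
  umask 077   # the payload may contain sensitive information
  out=${1:-replicator-diag.json}
  url=$(diag_url "$REPLICATOR_ADDR")
  if [ -n "${REPLICATOR_TOKEN:-}" ]; then
    # Assumption: authentication accepts a Bearer token header.
    curl --fail --silent --show-error \
      -H "Authorization: Bearer $REPLICATOR_TOKEN" -o "$out" "$url"
  else
    curl --fail --silent --show-error -o "$out" "$url"
  fi
)

# log_diag PID: ask a running Replicator process to write the same data to its log.
log_diag() {
  kill -USR1 "$1"
}
```

For example, fetch_diag diag.json writes the payload to diag.json, and log_diag with the Replicator process ID sends the report to the logger instead.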

Reset Replicator

  • Cancel all source changefeeds
  • DROP and re-CREATE the _replicator database.
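
The two steps above might look like the following sketch, which is destructive and therefore only prints the statements unless DRY_RUN=0 is set. The connection strings, run_sql and reset_replicator are hypothetical names. Note that the first statement cancels every running or paused changefeed on the source cluster, not only those that feed Replicator, so narrow the filter if other changefeeds must survive.

```shell
# Placeholder connection strings (assumptions); point these at your clusters.
SOURCE_URL=${SOURCE_URL:-"postgresql://root@localhost:26257/?sslmode=disable"}
STAGING_URL=${STAGING_URL:-"postgresql://root@localhost:26258/?sslmode=disable"}

# run_sql URL STATEMENT: prints the statement by default; executes it only when DRY_RUN=0.
run_sql() {
  if [ "${DRY_RUN:-1}" = 0 ]; then
    cockroach sql --url "$1" -e "$2"
  else
    printf '%s\n' "$2"
  fi
}

reset_replicator() {
  # Cancels ALL running or paused changefeeds on the source cluster.
  run_sql "$SOURCE_URL" "CANCEL JOBS (WITH x AS (SHOW CHANGEFEED JOBS) SELECT job_id FROM x WHERE status IN ('running', 'paused'))"
  # Discards all staged mutations, checkpoints, and memo entries.
  run_sql "$STAGING_URL" "DROP DATABASE IF EXISTS _replicator CASCADE"
  run_sql "$STAGING_URL" "CREATE DATABASE _replicator"
}
```

Review the printed statements first, then rerun with DRY_RUN=0 once they match what you intend to do.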

Decode memo table entries

The information in the _replicator.memo table is typically available from the diagnostic bundle. To decode it by hand, select the rows of interest with SELECT * FROM memo WHERE key LIKE 'changefeed-%', copy the hex-encoded value to the clipboard, and decode it using pbpaste | xxd -r -p - (pbpaste is the macOS clipboard reader).
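
On systems without a clipboard tool, the hex decoding can be done from a pipe instead. This is a minimal sketch, assuming the value is printed as hex digits, optionally with a leading \x as the SQL shell shows byte columns. decode_hex is a hypothetical helper name, and it falls back to plain shell when xxd is not installed.

```shell
# decode_hex: read hex digits on stdin and write the decoded bytes to stdout.
decode_hex() {
  # Drop an optional leading \x and all whitespace.
  hex=$(sed -e 's/^[[:space:]]*\\x//' | tr -d '[:space:]')
  if command -v xxd >/dev/null 2>&1; then
    printf '%s' "$hex" | xxd -r -p
  else
    # Fallback when xxd is missing: emit one byte per pair of hex digits.
    while [ "${#hex}" -ge 2 ]; do
      rest=${hex#??}          # everything after the first two digits
      pair=${hex%"$rest"}     # the first two digits
      printf '%b' "\\0$(printf '%03o' "0x$pair")"
      hex=$rest
    done
  fi
}

# Example with a literal value (the hex encoding of the text "hello"):
#   printf '\\x68656c6c6f\n' | decode_hex
```

A value copied from the query output can be passed the same way, for example by pasting it into a file and running decode_hex < value.txt.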

Override apply templates

The SQL queries in the apply package are generated from Go templates embedded into the binary. It's possible to load the templates from disk for temporary troubleshooting or experimentation. The templates are tightly coupled with internal data structures, so we cannot provide any long-term guarantees about compatibility.

# Write templates to the current directory
replicator dumptemplates --path .
# Edit ./queries/<db>/query.tmpl
CDC_SINK_TEMPLATES=. replicator start ...