Troubleshooting
- Is the cluster subject to a technical advisory that impacts changefeeds?
  - Any cluster impacted by A104309 or A123371 is unsupportable.
  - Also check release notes for other changefeed-impacting bugs in the cluster's major version.
- Are there errors in the Replicator logs?
  - Using --logFormat fluent and --logDestination path/to/replicator.log is a good choice for
    setting up log aggregation, especially if Replicator is being run as a replicated network
    service.
- Is the source changefeed able to deliver data to Replicator?
  - Check the output of SHOW CHANGEFEED JOBS.
  - Are there (retryable) error messages reported by the changefeed senders?
  - Is the resolved timestamp advancing?
  - Are webhook flush options set to enable bulk delivery?
- Are any sql.defaults.* cluster settings affecting Replicator?
  - SELECT * FROM system.settings will show non-default cluster settings.
  - For instance, setting sql.defaults.reorder_joins_limit=0 will completely break Replicator's
    SQL queries.
- Is Replicator actively staging data?
  - Look at the row counts for the staging tables in the _replicator database.
- Is Replicator receiving resolved timestamps?
  - Look at the row count of _replicator.checkpoints.
  - Performing a transactionally-consistent backfill is not recommended, since this would require
    the initial state of the database to be applied in a single transaction. Use --immediate or
    --backfillWindow modes, or bootstrap the destination from a BACKUP or EXPORT.
- Are resolved timestamps falling behind?
  - Run SELECT now() - MAX(target_applied_at) FROM _replicator.checkpoints to show when a
    resolved window was last processed.
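
To judge whether the resolved timestamp is advancing, it can help to compare two samples of the changefeed's high-water mark taken a short time apart. The following is a minimal sketch, not part of Replicator. It assumes the value is reported as a decimal HLC timestamp (wall-clock nanoseconds, followed by a logical counter after the decimal point), as in the high_water_timestamp column of SHOW CHANGEFEED JOBS. The function names and the sample values are illustrative.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def hlc_to_datetime(hlc):
    """Convert a decimal HLC string (nanoseconds.logical) to a UTC datetime.

    Assumption: the integer part is wall-clock nanoseconds since the Unix epoch.
    The logical counter after the decimal point is dropped.
    """
    nanos = int(Decimal(hlc))  # exact; truncates the logical component
    seconds, rem_nanos = divmod(nanos, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=rem_nanos // 1000
    )


def is_advancing(earlier, later):
    """Return True if the later sample is strictly newer than the earlier one."""
    return Decimal(later) > Decimal(earlier)


# Two hypothetical samples, taken ten seconds apart.
first = "1536242855577149065.0000000000"
second = "1536242865577149065.0000000000"

print("advancing:", is_advancing(first, second))
print("latest resolved wall time:", hlc_to_datetime(second).isoformat())
```

If the second sample is not newer than the first over a reasonable interval, the changefeed is likely stalled or retrying, and the error messages reported by the job are the next place to look.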
The Replicator server can be run with the --discard option to receive and discard HTTP payloads.
This is useful when attempting to tune changefeed delivery for maximum throughput.
This query shows the commit-to-receipt latency for changefeed resolved timestamps, versus the amount
of time it took to process the mutations after being received. This data is available from the
Prometheus cdc_target_lag_... variables, too.
SELECT now() - source_wall_time AS age,
       first_seen - source_wall_time AS delivery_lag,
       target_applied_at - first_seen AS apply_time,
       target_applied_at - source_wall_time AS oall_lag
  FROM resolved_timestamps AS OF SYSTEM TIME follower_read_timestamp()
 WHERE (now() - source_wall_time) < '10 minutes'
   AND target_applied_at IS NOT NULL
 ORDER BY 2 DESC
 LIMIT 10;
age | delivery_lag | apply_time | oall_lag
------------------+-----------------+-----------------+------------------
00:06:28.113321 | 00:03:04.300841 | 00:00:23.002403 | 00:03:27.303244
00:06:59.362791 | 00:02:50.482577 | 00:00:22.669238 | 00:03:13.151815
00:05:45.162184 | 00:02:35.870601 | 00:01:09.308704 | 00:03:45.179305
00:06:29.642173 | 00:02:34.868554 | 00:00:52.007338 | 00:03:26.875892
00:04:37.958192 | 00:02:30.637819 | 00:01:22.906329 | 00:03:53.544148
00:05:49.978872 | 00:02:30.297588 | 00:01:12.689824 | 00:03:42.987412
00:04:26.906991 | 00:02:22.1765 | 00:01:32.945101 | 00:03:55.121601
00:07:11.622545 | 00:02:21.574144 | 00:00:08.603641 | 00:02:30.177785
00:04:15.491594 | 00:02:12.363109 | 00:01:43.361881 | 00:03:55.72499
00:07:16.211537 | 00:01:44.072479 | 00:00:02.645284 | 00:01:46.717763
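
The columns in this output are related by simple arithmetic: oall_lag is the sum of delivery_lag and apply_time, so each row shows how much of the total lag is spent in changefeed delivery versus applying mutations. The sketch below checks that relationship against a few rows copied from the output above and reports the delivery share. It is an illustration only; the helper names are not part of Replicator.

```python
from datetime import timedelta


def parse_interval(text):
    """Parse an HH:MM:SS[.fraction] interval as printed by the SQL shell."""
    hours, minutes, seconds = text.strip().split(":")
    whole, _, frac = seconds.partition(".")
    # Pad short fractions such as "1765" to microseconds ("176500").
    micros = int(frac.ljust(6, "0")[:6]) if frac else 0
    return timedelta(
        hours=int(hours), minutes=int(minutes), seconds=int(whole), microseconds=micros
    )


# (delivery_lag, apply_time, oall_lag) copied from the sample output.
ROWS = [
    ("00:03:04.300841", "00:00:23.002403", "00:03:27.303244"),
    ("00:02:50.482577", "00:00:22.669238", "00:03:13.151815"),
    ("00:02:22.1765", "00:01:32.945101", "00:03:55.121601"),
]


def delivery_share(row):
    """Fraction of the overall lag that was spent delivering the changefeed."""
    delivery, _, overall = (parse_interval(col) for col in row)
    return delivery / overall


for row in ROWS:
    delivery, apply_time, overall = (parse_interval(col) for col in row)
    consistent = delivery + apply_time == overall
    print(f"{row[2]}: delivery share {delivery_share(row):.0%}, consistent={consistent}")
```

A high delivery share points at the source changefeed or the network path; a high apply share points at Replicator or the target database.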
Replicator provides introspection of its internal data structures through a diagnostic endpoint at
/_/diag, available from the changefeed server or the --metricsAddr bind address.
This endpoint returns a JSON blob describing many of Replicator's internal data structures. The
payload may contain sensitive information. If Replicator authentication is enabled, the requester
must have access permissions to a schema named _.diag in order to make the request.
This same information can be sent to Replicator's logger by sending a SIGUSR1 to the Replicator
process.
This data is for support purposes only and does not constitute a stable API.
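
As a rough illustration of pulling the diagnostic payload, the sketch below fetches /_/diag and lists the top-level keys. It assumes the payload is a JSON object and that authentication is either disabled or supplied by the caller through the optional headers argument; the function names and the example address are hypothetical. Because the payload is not a stable API, the sketch only inspects key names and value types.

```python
import json
import urllib.request


def diag_url(base_url):
    """Build the diagnostic URL from a server base URL."""
    return base_url.rstrip("/") + "/_/diag"


def fetch_diag(base_url, headers=None, timeout=10.0):
    """Fetch and decode the diagnostic payload.

    Assumption: the endpoint answers a plain GET with a JSON object.
    Pass any credentials your deployment requires via `headers`.
    """
    request = urllib.request.Request(diag_url(base_url), headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Return one 'key: type' line per top-level entry, sorted by key."""
    return [f"{key}: {type(value).__name__}" for key, value in sorted(payload.items())]
```

For example, `print("\n".join(summarize(fetch_diag("http://127.0.0.1:30005"))))` would print the top-level sections, where the address is a placeholder for the changefeed server or --metricsAddr bind address.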
Replicator serves the standard pprof endpoints at /debug/pprof.
- Cancel all source changefeeds.
- DROP and re-CREATE the _replicator database.
The information in the _replicator.memo table is typically available from the diagnostic bundle. The
data in the table can often be decoded by using SELECT key, convert_from(value, 'utf8') FROM memo.
Decoding entries in the staging table is similar, although they may be gzipped:
convert_from(decompress(mut, 'gzip'), 'utf8')
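
When only exported bytes are at hand (for example, values copied out of a diagnostic bundle), the same decoding can be done offline. The following is a minimal sketch that mirrors the two SQL expressions above: it decompresses the value if it starts with the gzip magic bytes and then decodes it as UTF-8. It assumes BYTES values were printed in the hex form `\x...`; the helper names are illustrative.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def from_sql_hex(text):
    """Convert a hex-formatted BYTES literal such as '\\x7b22...' to raw bytes.

    Assumption: the value was printed in hex output format.
    """
    text = text.strip()
    if text.startswith("\\x"):
        text = text[2:]
    return bytes.fromhex(text)


def decode_value(raw):
    """Decode a memo value or staged mutation, gunzipping it first if needed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Hypothetical value, shown both plain and gzipped.
plain = from_sql_hex("\\x7b2261223a317d")
print(decode_value(plain))
print(decode_value(gzip.compress(plain)))
```

Checking for the magic bytes avoids having to know in advance whether a given staging entry was compressed.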
The SQL queries in the apply package are generated from Go templates embedded into the binary.
It's possible to load the templates from disk for temporary troubleshooting or experimentation. The
templates are tightly coupled with internal data structures, so we cannot provide any long-term
guarantees about compatibility.
# Write templates to the current directory
replicator dumptemplates --path .
# Edit ./queries/<db>/query.tmpl
REPLICATOR_TEMPLATES=. replicator start ...