migrations: Support AWS DMS as a source #84505

Open
4 of 14 tasks
ajstorm opened this issue Jul 15, 2022 · 8 comments
Labels
A-migrations Migrating to CRDB from another database vendor A-tools-aws-dms Blocking support for AWS Database Migration Service C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-migrations
Comments

@ajstorm
Collaborator

ajstorm commented Jul 15, 2022

#34766 provided support for CRDB to impersonate PG and act as a target for migrations. There is still work remaining if we ever want to support DMS with CRDB as a source. A working prototype is available in #93404.

Initial Load

To get the initial load to work, we need the following:

Replication

  • Implement the postgres streaming replication slot protocol, including the pg_catalog tables and associated builtins to create a replication slot. In the prototype, we used a global buffer populated by CDC via a newly added replication:// source URI, served over the "normal connection protocol"; this is wrong, as the replication slot protocol has its own parser.
    • This involves aligning on an adequate WAL mapping and some form of ordering guarantee (see cdc: support Debezium (via Postgres WAL) #68333). Similar to "sinkless changefeeds", but not quite the same.
    • we only need to implement the test_decoding plugin, but others may work too with DMS. Note this means setting PluginName as an additional parameter on the source endpoint in DMS.
  • Implement database level CDC (changefeedccl: add targets for changefeeds beyond tables #73435)
    • We may be able to get away with a cheat here too. In the prototype we set one up for all tables manually. There is some work required to productize this, namely:
      • Ensuring that as new tables get created, we create changefeeds for them
      • Gracefully handle table renames
  • Implement the plpgsql to create event triggers - this is required for DMS to recognise and apply schema changes during replication.
    • Alternatively, add a session variable that fakes the existence of the event trigger and function as we did in the prototype (or instruct the user to create these functions / triggers with nothing in them). The caveat: schema changes don't get properly propagated on the target DB.
  • Correctly populate relreplident in pg_catalog.pg_class.
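To make the test_decoding requirement above concrete, here is a minimal sketch of the textual row-change format the plugin emits (e.g. `table public.t: INSERT: a[integer]:1`), which a CRDB shim would need to reproduce. The `format_change` helper is hypothetical, and the exact type names and quoting rules would need to be verified against postgres:

```python
def format_change(schema, table, op, cols):
    """Render one row change in test_decoding's textual style, e.g.
    table public.t: INSERT: a[integer]:1 b[text]:'x'"""
    def render(v):
        # test_decoding prints NULLs as "null" and quotes string-ish values.
        if v is None:
            return "null"
        if isinstance(v, str):
            return "'" + v + "'"
        return str(v)
    parts = " ".join("%s[%s]:%s" % (name, typ, render(val))
                     for name, typ, val in cols)
    return "table %s.%s: %s: %s" % (schema, table, op, parts)
```

Each change line would be framed between `BEGIN`/`COMMIT` lines for its transaction when streamed to DMS.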

Requisite plpgsql and event trigger

CREATE OR REPLACE FUNCTION objects_schema.awsdms_intercept_ddl()
  RETURNS event_trigger
LANGUAGE plpgsql
SECURITY DEFINER
  AS $$
DECLARE
  _qry text;
BEGIN
  IF (tg_tag = 'CREATE TABLE' OR tg_tag = 'ALTER TABLE' OR tg_tag = 'DROP TABLE') THEN
    SELECT current_query() INTO _qry;
    INSERT INTO objects_schema.awsdms_ddl_audit
    VALUES
    (
      default, current_timestamp, current_user,
      cast(TXID_CURRENT() AS varchar(16)), tg_tag, 0, '', current_schema, _qry
    );
    DELETE FROM objects_schema.awsdms_ddl_audit;
  END IF;
END;
$$;

CREATE EVENT TRIGGER awsdms_intercept_ddl ON ddl_command_end 
EXECUTE PROCEDURE objects_schema.awsdms_intercept_ddl();

Jira issue: CRDB-17695

Epic CC-8911

@ajstorm ajstorm added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-tools-aws-dms Blocking support for AWS Database Migration Service A-migrations Migrating to CRDB from another database vendor labels Jul 15, 2022
@ajstorm ajstorm added this to To do in Migrations via automation Jul 15, 2022
@otan
Contributor

otan commented Dec 8, 2022

Investigation from the logs

Full log here, involves creating and dropping replication slots multiple times. (anything calling SELECT * was me, anything using DECLARE CURSOR was DMS)

From logging PG statements, we have to support the following queries:

For the initial load, it looks like we're mostly missing WITH HOLD cursor support:

 BEGIN;declare "SQL_CUR0x146eb80ff1a0" cursor with hold for SELECT "a","b"  FROM "public"."good_table";fetch 10000 in "SQL_CUR0x146eb80ff1a0"
 close "SQL_CUR0x146eb80ff1a0";commit
 BEGIN;declare "SQL_CUR0x146ebc0e4090" cursor with hold for select CAST (version() as varchar(512));fetch 10000 in "SQL_CUR0x146ebc0e4090"
 close "SQL_CUR0x146ebc0e4090";commit
 BEGIN;declare "SQL_CUR0x146ebc0e4090" cursor with hold for select cast(setting as integer) from pg_settings where name = 'server_version_num';fetch 10000 in "SQL_CUR0x146ebc0e4090"
 close "SQL_CUR0x146ebc0e4090";commit
 BEGIN;declare "SQL_CUR0x146ebc0e4090" cursor with hold for SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database();fetch 10000 in "SQL_CUR0x146ebc0e4090"
 close "SQL_CUR0x146ebc0e4090";commit
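The queries above all lean on holdable cursors: `DECLARE ... WITH HOLD` materializes the result set so that fetches keep working after the enclosing transaction commits. A toy Python model of that fetch semantics (class name hypothetical, just to illustrate what DMS's `fetch 10000` loop relies on):

```python
class HoldableCursor:
    """Toy model of a WITH HOLD cursor: the result set is materialized up
    front, so batched fetches keep working after the declaring transaction
    commits."""
    def __init__(self, rows):
        self._rows = list(rows)  # materialize, as PG does for holdable cursors
        self._pos = 0

    def fetch(self, n):
        # Return the next batch of up to n rows; empty list once exhausted.
        batch = self._rows[self._pos:self._pos + n]
        self._pos += len(batch)
        return batch
```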

for cdc, PG uses replication slots (examples):

 BEGIN;declare "SQL_CUR0x146ed00bada0" cursor with hold for SELECT pg_drop_replication_slot('kl3yh77fl5mhvcyz_00016402_6d5a24d1_0a92_4efc_80e7_9d0c0d409dbc');fetch 10000 in "SQL_CUR0x146ed00bada0"
 BEGIN;declare "SQL_CUR0x146ed00bada0" cursor with hold for SELECT lsn FROM pg_create_logical_replication_slot('kl3yh77fl5mhvcyz_00016402_6d5a24d1_0a92_4efc_80e7_9d0c0d409dbc', 'test_decoding');fetch 10000 in "SQL_CUR0x146ed00bada0"

BEGIN;declare "SQL_CUR0x146ed00bada0" cursor with hold for
        select cast(tn.is_dst as varchar(8))
        from pg_timezone_names tn, pg_settings s
         where tn.name = s.setting
         and   s.name = 'log_timezone'
        ;fetch 10000 in "SQL_CUR0x146ed00bada0"
BEGIN;declare "SQL_CUR0x146ed00bada0" cursor with hold for
        select restart_lsn from pg_replication_slots
          where slot_name='kl3yh77fl5mhvcyz_00016402_6d5a24d1_0a92_4efc_80e7_9d0c0d409dbc'
          and database   ='replicationload'
          and plugin     ='test_decoding'
        ;fetch 10000 in "SQL_CUR0x146ed00bada0"

we would then need to implement this stream: https://www.postgresql.org/docs/current/protocol-replication.html
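As a sketch of what that stream involves: each logical-decoding payload travels in an XLogData CopyData message (tag byte 'w', two big-endian 64-bit LSNs, a send timestamp in microseconds since the PostgreSQL epoch of 2000-01-01, then the payload). A minimal Python framing helper, with the function name hypothetical:

```python
import struct
import time

PG_EPOCH = 946_684_800  # 2000-01-01 00:00:00 UTC, in Unix seconds

def xlogdata(start_lsn, end_lsn, payload, now=None):
    """Frame a logical-decoding payload as an XLogData CopyData message:
    tag byte 'w', two big-endian uint64 LSNs, a signed 64-bit send time
    in microseconds since the PostgreSQL epoch, then the payload bytes."""
    if now is None:
        now = time.time()
    ts = int((now - PG_EPOCH) * 1_000_000)
    return b"w" + struct.pack(">QQq", start_lsn, end_lsn, ts) + payload
```

The server would interleave these with keepalive ('k') messages and consume standby status updates ('r') from the client to advance the slot's confirmed position.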

DMS uses the replication protocol mentioned above to listen for updates (instead of pg_logical_slot_get_changes which is much nicer and recommended by PG):

2022-12-08 19:04:35 UTC:172.31.55.39(37356):postgres@replicationload:[672]:LOG:  starting logical decoding for slot "kl3yh77fl5mhvcyz_00016402_6d5a24d1_0a92_4efc_80e7_9d0c0d409dbc"
2022-12-08 19:04:35 UTC:172.31.55.39(37356):postgres@replicationload:[672]:DETAIL:  Streaming transactions committing after 1CE/A00005A0, reading WAL from 1CE/A0000568.
2022-12-08 19:04:35 UTC:172.31.55.39(37356):postgres@replicationload:[672]:STATEMENT:  START_REPLICATION SLOT "kl3yh77fl5mhvcyz_00016402_6d5a24d1_0a92_4efc_80e7_9d0c0d409dbc" LOGICAL 000001CE/A0000568 ("include-timestamp" 'on')
2022-12-08 19:04:35 UTC:172.31.55.39(37356):postgres@replicationload:[672]:LOG:  logical decoding found consistent point at 1CE/A0000568
2022-12-08 19:04:35 UTC:172.31.55.39(37356):postgres@replicationload:[672]:DETAIL:  There are no running transactions.
2022-12-08 19:04:35 UTC:172.31.55.39(37356):postgres@replicationload:[672]:STATEMENT:  START_REPLICATION SLOT "kl3yh77fl5mhvcyz_00016402_6d5a24d1_0a92_4efc_80e7_9d0c0d409dbc" LOGICAL 000001CE/A0000568 ("include-timestamp" 'on')

which is super ew. can't even debug it using psql:

 psql "postgres://postgres:bobby-123@otan-pg-test.cpa6lrp2ahsc.us-east-1.rds.amazonaws.com:5432/replicationload?replication=database" -c 'START_REPLICATION SLOT "otan_test" LOGICAL 1CE/B8000810;'
unexpected PQresultStatus: 8

@ajstorm
Collaborator Author

ajstorm commented Dec 12, 2022

Awesome start! After wrapping my head around what you've done, I'm hoping to get the first cut of INSERT/DELETE working today.

@otan
Contributor

otan commented Dec 12, 2022

Some discussion happened in an internal slack thread regarding the replication slot work.

@otan
Contributor

otan commented Dec 13, 2022

The replication now gets to START_REPLICATION on DMS, but it seems to exit for a reason I haven't figured out yet.

Instructions:

  • Create a cluster, and create a table test_table in a source database. Run SET CLUSTER SETTING kv.rangefeed.enabled = true. As the decoding function doesn't work yet, ensure all columns are ints.
  • On DMS, create a source endpoint. Set PluginName to test_decoding on the CockroachDB source endpoint. Without this you may see a weird error about querying pglogical.node, which I'm not sure how to resolve.


  • Create a target endpoint, set up DMS as normal, replicating only test_table.
  • Ensure all changefeed jobs are currently cancelled. You may have to stop and restart the server after cancelling them, due to the global state we inherit.
  • Command succeeded: "START_REPLICATION SLOT "o67mdfyzzca6l5zr_00000104_9cf6388a_6451_48f9_908d_4915f0f7bc1a" LOGICAL AAAAAAAA/AA000020 ("include-timestamp" 'on')" with status code: "" (postgres_test_decoding.c:167) is not normal, and I'm not sure why. Probably because my LSNs are screwed up. For some reason cdc sink seems to replay every change from the start, which is not normal, so I reckon it's cursor related. Cursors have to start with A. (resolved)
  • The feed will last up to 24 hours.

After you set it up, whenever you restart CockroachDB, do the following for everything to work:

  • drop all changefeed jobs
  • restart the database

There are a lot more hacks here imitating certain triggers / functions.

If you want a faster iteration cycle, you can use https://github.com/otan-cockroach/repltest (follow the readme) to inspect the replication stream.
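One recurring sharp edge above is LSN handling: the textual form is two hex halves of a 64-bit position (e.g. "1CE/B8000810", or the prototype's fake "AAAAAAAA/AA000020" cursor). Hypothetical Python helpers for converting between the two forms:

```python
def parse_lsn(text):
    """Parse PostgreSQL's textual LSN form, e.g. "1CE/B8000810",
    into a single 64-bit integer position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(lsn):
    """Format a 64-bit LSN back into the hi/lo hex form PostgreSQL prints."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)
```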

@otan
Contributor

otan commented Dec 13, 2022

It works now; the fix was to include the schema name for the table in the replication log. The branch is now up to date, and I've updated the issue with a write-up.

@ajstorm
Collaborator Author

ajstorm commented Dec 14, 2022

Added a few changes to avoid some of the hard-coding, handle multiple tables, and fix the types. Branch has been updated.

craig bot pushed a commit that referenced this issue Dec 22, 2022
94110: roachprod: include storage workload metadata on snapshot r=jbowens a=coolcom200

Currently, when a snapshot is taken of a volume that has been used for storage workload collection, the snapshot only contains the user-provided information: the name of the snapshot and a description. This could omit data about which cluster the workload ran on, the machine type, the crdb version, etc.

As a result, we encode this metadata in the labels / tags when we create a snapshot allowing the user to provide both a name and a description while also capturing metadata that can be used for searching and further reference. There are some limitations with the maximum length of the labels (aws key: 128 chars value: 256 chars; gcp: both 63 chars) and which characters are allowed to be used (gcp: lowercase, digits, _, -; aws: letters, digits, spaces, ., :, +, =, `@,` _, /, -)

Alternatively, the metadata could be encoded into the description field which would allow for more data to be saved at the cost of it being harder to search / filter.

Fixes: #94075

Release note: None

94123: sql: implement the `log_timezone` session variable r=rafiss a=otan

Informs #84505

Release note (sql change): Add the `log_timezone` session variable, which is read only and always UTC.

94154: cloud: set orchestration version updated to 22.2.1 r=absterr08 a=absterr08

links Epic: https://cockroachlabs.atlassian.net/browse/REL-228

Release note: none

94178: descs: remove GetAllTableDescriptorsInDatabase r=postamar a=postamar

Recent changes in #93543 had modified the contract of this method (it no longer returns dropped tables) and made it unsuitable for its main use case, the SQLTranslator.

This commit fixes this regression by removing this deprecated method entirely and using correct alternatives instead.

Fixes #93614.

Release note: None

Co-authored-by: Leon Fattakhov <leon.fattakhov@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Abby Hersh <abby@cockroachlabs.com>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 23, 2022
93757: trigram: support multi-byte string trigrams; perf improvements r=jordanlewis a=jordanlewis

Fixes #93744 
Related to #93830

- Add multi-byte character support
- Improve performance

```
name           old time/op    new time/op    delta
Similarity-32    1.72µs ± 0%    0.60µs ± 3%  -64.98%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
Similarity-32    1.32kB ± 0%    0.37kB ± 0%  -72.10%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-32      15.0 ± 0%       6.0 ± 0%  -60.00%  (p=0.000 n=10+10)
```

Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.

94122: sql: implement the pg_timezone_names table r=rafiss a=otan

Informs #84505

Release note (sql change): Implement the `pg_timezone_names` pg_catalog table, which lists all supported timezones.

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 10, 2023
106242: pg_class: populate pg_class.relreplident r=rafiss a=otan

Release note (sql change): pg_class's relreplident field was previously unpopulated. It is now populated with `d` (default) for tables with a primary key, which every CockroachDB table has, and `n` otherwise.

Informs: #84505

106546: flowinfra: clean up flow stats propagation in row-based flows r=yuzefovich a=yuzefovich

Previously, we would attach `FlowStats` (like max memory usage) to the "stream component" stats object. I don't really understand why that was the case, probably it was due to misunderstanding how tracing works (in particular, the TODOs that are now removed mentioned "flow level span", but we don't need to attach the metadata to a particular tracing span).

This simplifies the code a bit but also simplifies the work on adding region information to `ComponentID` object.

Epic: None

Release note: None

Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@cucxabong

Hi guys
Do we have any update on this feature?

@ajstorm
Copy link
Collaborator Author

ajstorm commented Oct 11, 2023

Hi @cucxabong. We currently don't have an update as to when this feature will be completed. We do have a working prototype, but it'll take some effort to get it over the line.

Is there a particular reason why you're interested in this feature? To where would you be hoping to migrate the data?
