
Support table and column blacklisting #21

xstevens opened this issue Jul 6, 2015 · 6 comments

xstevens commented Jul 6, 2015

A feature that's useful to financial institutions (and probably others too) is a configurable blacklist of tables and/or columns that should never be sent to Kafka. Usually this is because the data is considered sensitive and subject to strict compliance requirements, but it could also simply be that the data isn't useful for analytical purposes.

I'm not aware of a way to achieve this with Postgres itself, but I'm open to that solution too if one exists.

ept commented Jul 9, 2015

Yes, this would be a good feature to add to Bottled Water. I think we'd have to do it in the Bottled Water extension. Pull requests welcome :)

bchazalet commented

Being able to choose the tables would be awesome indeed!

samstokes commented

One extra benefit from doing this: if you're using Bottled Water with a database generated by ActiveRecord migrations, you currently have to run with --allow-unkeyed, since the schema_migrations table that ActiveRecord uses to track migration state does not have a primary key.

It's hard to see a use case for replicating schema_migrations to Kafka, so you could just blacklist it and run with the default mode.

bchazalet commented

@ept say I want to implement whitelisting or blacklisting of table names. For the snapshot part, I think I can filter in get_table_list by giving it the right table pattern. But I don't see where I would filter the tables in the logical decoding part. Could you give me a hint?

In particular, I'm confused about the transaction begin/commit functions (output_avro_begin_txn and output_avro_commit_txn): do those need to filter consistently too, or do you think updating the output_avro_change function would be enough?

ept commented Jan 10, 2016

@bchazalet Would be great if you want to try making a patch.

For snapshot: yes, get_table_list is the place to go. It already has rudimentary filtering support via the table_pattern parameter, which snapshot_start currently hard-codes to '%' but which could be exposed as a command-line parameter. It might be better, though, to instead allow an explicit whitelist or blacklist of table names (or table name patterns with wildcards).
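
To sketch the idea (this is not the actual get_table_list, and the blacklist parameter is hypothetical), a client-side catalog query over libpq could combine the existing pattern with an exclusion list:

```c
#include <stdio.h>
#include <libpq-fe.h>

/* Sketch: list ordinary tables matching table_pattern, excluding any whose
 * name appears in the blacklist, which is passed as a Postgres array
 * literal such as "{schema_migrations,audit_log}". */
int list_tables(PGconn *conn, const char *table_pattern, const char *blacklist)
{
    const char *params[2] = { table_pattern, blacklist };
    PGresult *res = PQexecParams(conn,
        "SELECT c.oid, n.nspname, c.relname"
        "  FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace"
        " WHERE c.relkind = 'r'"
        "   AND n.nspname NOT IN ('pg_catalog', 'information_schema')"
        "   AND c.relname LIKE $1"
        "   AND c.relname <> ALL ($2::name[])",
        2, NULL, params, NULL, NULL, 0);

    if (PQresultStatus(res) != PGRES_TUPLES_OK) {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }

    for (int i = 0; i < PQntuples(res); i++)
        printf("will snapshot %s.%s (oid %s)\n",
               PQgetvalue(res, i, 1), PQgetvalue(res, i, 2),
               PQgetvalue(res, i, 0));

    PQclear(res);
    return 0;
}
```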

For log decoding: I think you only need to update output_avro_change. If you have a transaction that only modifies tables whose changes you're filtering out, you'll get a transaction-begin event followed by commit event with nothing in between, but that's ok. (That pattern already occurs on DDL transactions, which produce begin and commit events, but no data change events.)
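
As a rough illustration (the plugin_state fields are hypothetical, but the callback signature is the standard logical decoding one), the filter would be an early return at the top of output_avro_change:

```c
#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "utils/rel.h"

/* Hypothetical plugin state: the blacklist resolved to table Oids. */
typedef struct {
    Oid *excluded_oids;
    int  num_excluded;
} plugin_state;

static bool oid_is_excluded(plugin_state *state, Oid relid)
{
    for (int i = 0; i < state->num_excluded; i++)
        if (state->excluded_oids[i] == relid)
            return true;
    return false;
}

static void output_avro_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                               Relation rel, ReorderBufferChange *change)
{
    plugin_state *state = ctx->output_plugin_private;

    /* Drop changes to blacklisted tables. The surrounding begin/commit
     * events are still emitted, mirroring what already happens for
     * DDL-only transactions. */
    if (oid_is_excluded(state, RelationGetRelid(rel)))
        return;

    /* ... existing Avro encoding and output of the change ... */
}
```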

To filter on tables, I would suggest first translating the whitelist/blacklist into a list of Oids identifying the tables to be included/excluded, by querying the catalog from the client (similar to what get_table_list does). Then you should be able to pass that list of Oids to the logical replication plugin via options on the START_REPLICATION command, and pick those options up in output_avro_startup and add them to the plugin state. You might need to look at the Postgres source to figure out exactly how those options get passed around, as I don't think the docs cover it.
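
A sketch of what that pickup could look like in the startup callback, reusing the hypothetical plugin_state above and a made-up option name "exclude-oids" (a comma-separated list of table Oids):

```c
#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"

/* Sketch: parse the hypothetical "exclude-oids" option supplied with
 * START_REPLICATION into the plugin state. */
static void output_avro_startup(LogicalDecodingContext *ctx,
                                OutputPluginOptions *opt, bool is_init)
{
    plugin_state *state = palloc0(sizeof(plugin_state));
    ListCell *lc;

    foreach(lc, ctx->output_plugin_options)
    {
        DefElem *elem = (DefElem *) lfirst(lc);

        if (strcmp(elem->defname, "exclude-oids") == 0 && elem->arg != NULL)
        {
            char *list = pstrdup(strVal(elem->arg));
            int   capacity = 8;

            state->excluded_oids = palloc(capacity * sizeof(Oid));
            for (char *tok = strtok(list, ","); tok; tok = strtok(NULL, ","))
            {
                if (state->num_excluded == capacity)
                {
                    capacity *= 2;
                    state->excluded_oids =
                        repalloc(state->excluded_oids, capacity * sizeof(Oid));
                }
                state->excluded_oids[state->num_excluded++] = atooid(tok);
            }
        }
    }

    ctx->output_plugin_private = state;
}
```

The client would then append the option to its command, along the lines of START_REPLICATION SLOT ... LOGICAL ... ("exclude-oids" '16418,16422'), where the Oids here are just placeholders.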

Let me know if I can help further!

samstokes commented

As a workaround, if your Kafka cluster is configured not to create topics automatically (auto.create.topics.enable=false), you can achieve the same goal by manually creating Kafka topics for only those tables you want to sync, and then running Bottled Water with the --on-error=log command-line flag. Bottled Water will then ignore updates to the other tables, because it can't produce to the corresponding topics. (--on-error=log is needed because Bottled Water's default policy is to stop syncing and exit on error.)

(The keen-eyed will notice that this workaround actually simulates a whitelist, not a blacklist :))
