DBZ-5572: Specify table.include.list during VStream subscription #96

Merged
merged 2 commits into from Sep 2, 2022

Conversation

HenryCaiHaiying
Contributor

When table.include.list is used to subscribe to changes for only a subset of tables in the database, the filtering is currently done in the Debezium VM. This is less efficient than filtering at the VtTablet level. During VStream subscription, you can specify which tables you are interested in, and the VtTablet will send you only the changes to those tables. Filtering at the VtTablet level has two benefits:

  1. Far fewer bytes are sent over the network.
  2. It avoids errors related to tables you are not interested in. For example, we have seen errors caused by schema mismatches on gh-ost (online schema migration) metadata tables, which we do not care about. By subscribing only to the data tables we are interested in, we avoid those problems as well.

The fix is relatively straightforward: use Binlogdata.Filter when building the VStreamRequest. The details of those filters can be found in: https://github.com/vitessio/vitess/blob/release-14.0/go/vt/vttablet/tabletserver/vstreamer/planbuilder.go#L316 . Note that when you provide a table name for filtering, you also need to provide a SELECT query, which can further restrict the rows/columns from that table; we are currently not filtering at the row/column level.
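As a rough illustration of the rule shape described above, here is a minimal sketch that pairs each table name with a SELECT query that does no row/column restriction. The class `VStreamFilterSketch` and the method `buildRules` are hypothetical names, and the plain map is only a stand-in for the match/filter pairs of a Binlogdata.Filter; it is not the connector's actual code, and `select * from <table>` is an assumption about what "no row/column filtering" looks like.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VStreamFilterSketch {

    /**
     * Builds one (match, filter) pair per table. The key stands in for a
     * rule's table match and the value for its SELECT query. Both the map
     * representation and the "select *" query are assumptions made for
     * illustration only.
     */
    static Map<String, String> buildRules(List<String> tables) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String table : tables) {
            // Backtick-quote the table name, doubling any embedded backtick.
            String quoted = "`" + table.replace("`", "``") + "`";
            // No WHERE clause and no column list: every row and column is kept.
            rules.put(table, "select * from " + quoted);
        }
        return rules;
    }

    public static void main(String[] args) {
        buildRules(List.of("orders", "customers"))
                .forEach((match, filter) -> System.out.println(match + " -> " + filter));
    }
}
```

In the real connector each pair would presumably be turned into a rule on the filter attached to the VStreamRequest; tables absent from the map would then never be streamed by the VtTablet.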

New integration tests are added to cover the table.include.list filtering.

The change only applies when you specify the table.include.list config param.

For reference, we sometimes saw errors unrelated to our tables, such as the following (the _ghc table is the temp table created during the gh-ost migration process):
VStream streaming onError. Status: Status{code=UNKNOWN, description=target: byuser.-4000.replica: vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 08fb1cf3-0ce5-11ed-b921-0a8939501751:1-1443715: unknown table _mentions_unread_ghc in schema, cause=null}

@HenryCaiHaiying
Contributor Author

@jpechane @shichao-an One more fix for vitess connector, please take a look.

@jpechane
Contributor

jpechane commented Sep 1, 2022

@HenryCaiHaiying Thanks for the PR, this is a neat idea. The issue is that the table list is not a list of table names but a list of regular expressions, so the correct implementation should obtain the list of tables and apply the regex list (using the Predicates class) to obtain the list of matching tables.

@HenryCaiHaiying
Contributor Author

@jpechane I pushed the 2nd commit, which addresses your concern. table.include.list is now processed as a list of patterns instead of direct table names.

Per review comment, table.include.list contains comma-separated patterns instead of direct table names. Modify the code to retrieve the list of tables from the keyspace and match each table against the patterns to find out which tables fit table.include.list.

The loop is very similar to the table.include.list processing in https://github.com/debezium/debezium/blob/main/debezium-core/src/main/java/io/debezium/relational/RelationalSnapshotChangeEventSource.java#L171
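The matching step described above can be sketched roughly as follows. This is not the connector's code: `TableIncludeListSketch` and its methods are hypothetical, it uses plain `java.util.regex` rather than Debezium's Predicates class, and it assumes that patterns are matched case-insensitively against the full `keyspace.table` identifier. Splitting on every comma is also a simplification that would break a pattern such as `t{1,3}`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TableIncludeListSketch {

    /** Compiles a comma-separated include list into regex patterns. */
    static List<Pattern> compile(String includeList) {
        List<Pattern> patterns = new ArrayList<>();
        for (String part : includeList.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
            }
        }
        return patterns;
    }

    /**
     * Returns the tables of the keyspace whose "keyspace.table" identifier
     * fully matches at least one pattern of the include list.
     */
    static List<String> selectTables(String keyspace, List<String> allTables, String includeList) {
        List<Pattern> patterns = compile(includeList);
        List<String> selected = new ArrayList<>();
        for (String table : allTables) {
            String id = keyspace + "." + table;
            for (Pattern pattern : patterns) {
                if (pattern.matcher(id).matches()) {
                    selected.add(table);
                    break; // one matching pattern is enough
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tables = List.of("mentions", "mentions_unread", "_mentions_unread_ghc", "users");
        System.out.println(selectTables("byuser", tables, "byuser\\.mentions.*, byuser\\.users"));
    }
}
```

With an include list like the one in `main`, a gh-ost temp table such as `_mentions_unread_ghc` is not selected, so it would not be part of the VStream subscription.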
@jpechane
Contributor

jpechane commented Sep 2, 2022

@HenryCaiHaiying Good job! Applied

@HenryCaiHaiying HenryCaiHaiying deleted the DBZ-5572 branch September 2, 2022 18:45