Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl/schemafeed: ignore unwatched column drops #84674

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented Jul 19, 2022

Closes #80982

Release note (enterprise change): Previously, if you dropped
a column with the schema_change_policy 'stop', the changefeed
would stop. Dropping a column with a different policy would
result in previous rows being retransmitted with the
dropped column omitted.

In some cases, a changefeed may target specific columns
(a column family) of a table. In these cases, if a non-target
column is dropped, it does not make sense to stop the changefeed
or retransmit values because the column was not visible to
a consumer sink to begin with.

With this change, dropping an non-target column from a
table will not stop the changefeed when the
schema_change_policy is 'stop'. With any other policy,
dropping a non-target column will not trigger a backfill.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the changefeed-ignore-drop-col-outside-family branch 11 times, most recently from a3b59d6 to 851f402 Compare July 20, 2022 21:51
@jayshrivastava jayshrivastava changed the title changefeedccl/kvfeed: ignore unwatched column drops with stop policy changefeedccl/schemafeed: ignore unwatched column drops Jul 20, 2022
@jayshrivastava jayshrivastava force-pushed the changefeed-ignore-drop-col-outside-family branch 2 times, most recently from e235a9e to 31266fe Compare July 21, 2022 14:52
@jayshrivastava jayshrivastava marked this pull request as ready for review July 21, 2022 15:36
@jayshrivastava jayshrivastava requested a review from a team as a code owner July 21, 2022 15:36
@jayshrivastava jayshrivastava requested review from livlobo, HonoreDB and samiskin and removed request for a team and livlobo July 21, 2022 15:36
Copy link
Contributor

@HonoreDB HonoreDB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great commit message and a very fast first PR, love to see it.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @samiskin)


pkg/ccl/changefeedccl/changefeedbase/target.go line 88 at r1 (raw file):

// HasNonDefaultColumnFamily returns true if the Targets include least
// one watched column family which is not the default column family.
func (ts *Targets) HasNonDefaultColumnFamily() (bool, error) {

This method needs to take a TableID. It's possible to create a changefeed like this: "CREATE CHANGEFEED FOR TABLE foo, TABLE bar FAMILY baz", so that we have one target with a specified column family and another without.

Also, a nit: "NonDefault" isn't quite how I'd describe what this is checking for. It's possible for a table to have only one column family but it's not the default family, it's one that was created explicitly. I'd maybe call this HasSpecifiedColumnFamilies().

Or you could just call EachHavingTableID in the filtering logic rather than add a method here.

If you keep this method implemented here, you could optimize it by implementing it more low-level, something like ts.m[tableID].wholeTable == nil should be equivalent if we know the tableID is in the map.


pkg/ccl/changefeedccl/schemafeed/table_event_filter.go line 203 at r1 (raw file):

		}

		for _, watchedColID := range watchedCols {

I'm a little worried about scalability here...people do silly things like create tables with thousands of columns, so even though this method won't get called very often it'd still be a shame to be anything worse than O(number of columns) in complexity. If you look at how the filter is determining that a column was dropped in the first place, dropVisibleColumnMutationExists, it's able to iterate over the changes and check if it cares about each individual column. What if that function were family-aware? Like for example if the signature were dropVisibleColumnMutationExists(desc catalog.TableDescriptor, relevancyChecker func(catalog.Column) bool),
and it called relevancyChecker on each dropped column. Then if the whole table is being watched, you could pass in a function that just returns true, while if it's not, you could create a function with a closure over a set of column ids that checks whether the column is in that set.

func (irs *IndexRecommendationSet) getColOrdSet(
cols opt.ColSet, tabID opt.TableID,
) util.FastIntSet {
var colsOrdSet util.FastIntSet
cols.ForEach(func(col opt.ColumnID) {
table := irs.md.ColumnMeta(col).Table
// Do not add columns from other tables.
if table != tabID {
return
}
colOrd := table.ColumnOrdinal(col)
colsOrdSet.Add(colOrd)
})
return colsOrdSet
has an example of creating a set of columns that can be efficiently checked whether it contains a specific column. If you could find a way to just create that once by iterating over the watched families, I think you'd be in more efficient territory.


pkg/ccl/changefeedccl/changefeed_test.go line 1001 at r1 (raw file):

// If the schema_change_policy is 'stop' and we drop columns which are not
// targeted by the changefeed, it should not stop.
func TestNoStopAfterNonTargetColumnDrop(t *testing.T) {

Great test!


pkg/ccl/changefeedccl/changefeed_test.go line 1005 at r1 (raw file):

	testFn := func(t *testing.T, s TestServer, f cdctest.TestFeedFactory) {
		sqlDB := sqlutils.MakeSQLRunner(s.DB)
		disableDeclarativeSchemaChangesForTest(t, sqlDB)

Was this copied from another test or is it necessary? The declarative schema changer is on by default, so I'd rather not add any new tests that disable it--this feature will need to work when it's on. If it's an issue with the test framework, though, no need to fix it.


pkg/ccl/changefeedccl/changefeed_test.go line 1034 at r1 (raw file):

}

// If we drop columns which are not targeted by the changefeed, it should not backfill.

I'd suggest adding tests with a single changefeed that has multiple targets. Three important cases are

  1. TABLE hasfams FAMILY b_and_c, TABLE nofams(checking behavior for each)
  2. TABLE hasfams FAMILY b_and_c, TABLE alsohasfams FAMILY only id_a (with the same family names on both tables).
  3. TABLE hasfams FAMILY id_a, TABLE hasfams FAMILY b_and_c(I wouldn't worry about optimizing away the case where somebody has listed all the families in a table, but it's important that a table that's referenced multiple times with different families works correctly).

pkg/ccl/changefeedccl/schemafeed/helpers_test.go line 44 at r1 (raw file):

}

func CreateChangefeedTargets(tableID descpb.ID) changefeedbase.Targets {

Nice!

@jayshrivastava jayshrivastava force-pushed the changefeed-ignore-drop-col-outside-family branch from 10e5ed8 to c6f667c Compare July 22, 2022 20:51
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @samiskin)


pkg/ccl/changefeedccl/changefeedbase/target.go line 88 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

This method needs to take a TableID. It's possible to create a changefeed like this: "CREATE CHANGEFEED FOR TABLE foo, TABLE bar FAMILY baz", so that we have one target with a specified column family and another without.

Also, a nit: "NonDefault" isn't quite how I'd describe what this is checking for. It's possible for a table to have only one column family but it's not the default family, it's one that was created explicitly. I'd maybe call this HasSpecifiedColumnFamilies().

Or you could just call EachHavingTableID in the filtering logic rather than add a method here.

If you keep this method implemented here, you could optimize it by implementing it more low-level, something like ts.m[tableID].wholeTable == nil should be equivalent if we know the tableID is in the map.

I agree with this. Adding the table ID makes it easier while filtering schema changes. I didn't realize I could do the nil check until I read how Targets.add(...) works. I added some commentary to the Target struct in pkg/ccl/changefeedccl/changefeedbase/target.go to make it more clear.


pkg/ccl/changefeedccl/schemafeed/table_event_filter.go line 203 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

I'm a little worried about scalability here...people do silly things like create tables with thousands of columns, so even though this method won't get called very often it'd still be a shame to be anything worse than O(number of columns) in complexity. If you look at how the filter is determining that a column was dropped in the first place, dropVisibleColumnMutationExists, it's able to iterate over the changes and check if it cares about each individual column. What if that function were family-aware? Like for example if the signature were dropVisibleColumnMutationExists(desc catalog.TableDescriptor, relevancyChecker func(catalog.Column) bool),
and it called relevancyChecker on each dropped column. Then if the whole table is being watched, you could pass in a function that just returns true, while if it's not, you could create a function with a closure over a set of column ids that checks whether the column is in that set.

func (irs *IndexRecommendationSet) getColOrdSet(
cols opt.ColSet, tabID opt.TableID,
) util.FastIntSet {
var colsOrdSet util.FastIntSet
cols.ForEach(func(col opt.ColumnID) {
table := irs.md.ColumnMeta(col).Table
// Do not add columns from other tables.
if table != tabID {
return
}
colOrd := table.ColumnOrdinal(col)
colsOrdSet.Add(colOrd)
})
return colsOrdSet
has an example of creating a set of columns that can be efficiently checked whether it contains a specific column. If you could find a way to just create that once by iterating over the watched families, I think you'd be in more efficient territory.

Great comment :)

I updated the function. It returns immediately if the whole table is being watched. At the moment, we don't store column IDs in Targets, so I have to look them up from the descriptor. There's also no way to get a family by name in the descriptor. I iterate over the families in the descriptor, skipping them if they are not watched and collecting the column ids in the FastIntSet for later. Then the function checks all the mutations (I would imagine that the number of mutations in a single schema feed event descriptor should be small) for a watched column that is being dropped, returning immediately if it finds one.


pkg/ccl/changefeedccl/changefeed_test.go line 1005 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Was this copied from another test or is it necessary? The declarative schema changer is on by default, so I'd rather not add any new tests that disable it--this feature will need to work when it's on. If it's an issue with the test framework, though, no need to fix it.

Initially, I copied another test to write mine. I agree that it's better to enable declarative schema changes since those are what we are working towards.

One issue I ran into with enabling them is that it causes errors to pop up when using sinkless/core changefeeds. When the primary index swap occurs as a part of the column drop schema change, the sinkless changefeed doesn't get restarted in the same way that other ones do. The next write after the schema change causes this error. Thus, I've omitted the sinkless changefeed from these tests.

(error encountered after some results were delivered)
ERROR: schema change occurred at 1658520726642164000.0000000000

Some other tests (ex. TestChangefeedEachColumnFamilySchemaChanges) fail on sinkless for the same reason if disableDeclarativeSchemaChangesForTest() is removed.

Code quote:

feedTestOmitSinks("sinkless"))

pkg/ccl/changefeedccl/changefeed_test.go line 1034 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

I'd suggest adding tests with a single changefeed that has multiple targets. Three important cases are

  1. TABLE hasfams FAMILY b_and_c, TABLE nofams(checking behavior for each)
  2. TABLE hasfams FAMILY b_and_c, TABLE alsohasfams FAMILY only id_a (with the same family names on both tables).
  3. TABLE hasfams FAMILY id_a, TABLE hasfams FAMILY b_and_c(I wouldn't worry about optimizing away the case where somebody has listed all the families in a table, but it's important that a table that's referenced multiple times with different families works correctly).

Ack! I added more tests for these cases.

Copy link
Contributor

@HonoreDB HonoreDB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 7 files at r1, 1 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, and @samiskin)


pkg/ccl/changefeedccl/changefeedbase/target.go line 92 at r2 (raw file):

// GetSpecifiedColumnFamilies returns a set of watched families
// belonging to the table.
func (ts *Targets) GetSpecifiedColumnFamilies(tableID descpb.ID) map[string]*struct{} {

I usually see this done as map[string]struct{}, curious what making it a map to pointers does.


pkg/ccl/changefeedccl/schemafeed/table_event_filter.go line 203 at r1 (raw file):

Previously, jayshrivastava wrote…

Great comment :)

I updated the function. It returns immediately if the whole table is being watched. At the moment, we don't store column IDs in Targets, so I have to look them up from the descriptor. There's also no way to get a family by name in the descriptor. I iterate over the families in the descriptor, skipping them if they are not watched and collecting the column ids in the FastIntSet for later. Then the function checks all the mutations (I would imagine that the number of mutations in a single schema feed event descriptor should be small) for a watched column that is being dropped, returning immediately if it finds one.

Nice!

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, and @samiskin)


pkg/ccl/changefeedccl/changefeedbase/target.go line 92 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

I usually see this done as map[string]struct{}, curious what making it a map to pointers does.

I'd go so far as to say that it's a mistake. I think for funny reasons, it ends up not really being less efficient, but it's unidiomatic. If I see map[string]struct{} I know it's a set of strings, if I see this, I feel unsure.

Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, and @samiskin)


pkg/ccl/changefeedccl/changefeedbase/target.go line 92 at r2 (raw file):

Previously, ajwerner wrote…

I'd go so far as to say that it's a mistake. I think for funny reasons, it ends up not really being less efficient, but it's unidiomatic. If I see map[string]struct{} I know it's a set of strings, if I see this, I feel unsure.

Agreed. I'll change it before I merge.

See release note.

Fixes cockroachdb#80982

Release note (enterprise change): Previously, if you dropped
a column with the schema_change_policy 'stop', the changefeed
would stop. Dropping a column with a different policy would
result in previous rows being retransmitted with the
dropped column omitted.

In some cases, a changefeed may target specific columns
(a column family) of a table. In these cases, if a non-target
column is dropped, it does not make sense to stop the changefeed
or retransmit values because the column was not visible to
a consumer sink to begin with.

With this change, dropping an non-target column from a
table will not stop the changefeed when the
schema_change_policy is 'stop'. With any other policy,
dropping a non-target column will not trigger a backfill.
@jayshrivastava jayshrivastava force-pushed the changefeed-ignore-drop-col-outside-family branch from c6f667c to 74bccc7 Compare July 25, 2022 19:08
@jayshrivastava
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 26, 2022

Build succeeded:

@craig craig bot merged commit d7b901d into cockroachdb:master Jul 26, 2022
@jayshrivastava jayshrivastava deleted the changefeed-ignore-drop-col-outside-family branch July 27, 2022 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changefeedccl: filter schema changes that don't affect any watched families
4 participants