Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: add nil checking for avroDataRecord.refreshTypeMetadata #119639

Merged
merged 1 commit into from Feb 28, 2024

Conversation

wenyihu6
Copy link
Contributor

Previously, the avro encoder could call refreshTypeMetadata on
avroDataRecord without proper nil checking. This could lead to node panics
because avroDataRecord could sometimes be nil. For example,
registered.schema.after is set only when using the wrapped envelope. Thus,
avro encoder could lead to panics when using with other envelope formats. This
patch addresses this issue by adding a defensive nil check when invoking
refreshTypeMetadata.

Fixes: #119428
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6
Copy link
Contributor Author

Do we prefer to add nil check before calling refreshMetadata instead? That seems to be the convention way of handling.

@wenyihu6 wenyihu6 marked this pull request as ready for review February 26, 2024 17:25
@wenyihu6 wenyihu6 requested a review from a team as a code owner February 26, 2024 17:25
@wenyihu6 wenyihu6 requested review from jayshrivastava and rharding6373 and removed request for a team February 26, 2024 17:25
@wenyihu6 wenyihu6 self-assigned this Feb 26, 2024
Copy link
Contributor

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if nil checking is the best approach here. I believe the bug in the issue was caused by regional by row tables and the hidden region column getting encoded. We should investigate why it happens more in depth and avoid the code path which causes the NPE if possible

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373)

Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/ccl/changefeedccl/avro.go line 1055 at r1 (raw file):

// The only user-defined type is enum, so this is usually a no-op.
func (r *avroDataRecord) refreshTypeMetadata(row cdcevent.Row) error {
	if r == nil {

I would do the alternative you identified and move the nil check to the call site.

@wenyihu6
Copy link
Contributor Author

pkg/ccl/changefeedccl/avro.go line 1055 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

I would do the alternative you identified and move the nil check to the call site.

Done.

@wenyihu6
Copy link
Contributor Author

I agree we should keep the investigation going on. My patch here just adds the nil check. The issue being closed here is only for adding the nil check. I think registered.schema.after is expected to be nil here since it is only set for wrapped envelope

opts = avroEnvelopeOpts{afterField: true, beforeField: e.beforeField, updatedField: e.updatedField}
. But we can dig more on why using a hidden column would invoke the code path for using cacheKey whereas not using a hidden column would not. I'm handing this off to @andyyang890 as the current l2.

@wenyihu6
Copy link
Contributor Author

I looked into this more with @andyyang890.

The reason why we hit this cache is because three parallel encoders were started in

numWorkers = defaultNumWorkers()
. When the tests were running without crdb_regions, three encoders all emit one row without using the cache. If we change the test to insert more rows, they start using the cache. When the tests were running with crdb_regions, the enocders now start using the cache. We see that encodeAndEmitMessage is being called a bit later than when crdb_regions is not used, and it looks like one of the encoder got the cacheKey and triggered the code path.

We thought it might be useful to add another test case here to force one encoder to make sure we are triggering the code path involving the cache.

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Wenyi that after being nil seems to be expected here since it's only expected to be not nil when the envelope type is wrapped. From some git archaeology, it looks like initially the after field was populated for all envelope types and when the change was made to move the data from after into the record field for non-wrapped envelope types, that assumption wasn't corrected. Here's the relevant commit: c00bf0f

I think adding the nil check should be enough to resolve this issue.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @wenyihu6)

@rharding6373 rharding6373 added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Feb 27, 2024
Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks for digging into the bottom of this issue Wenyi and Andy! And thanks for fixing it so quickly. Added backport labels.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rharding6373 and @wenyihu6)


pkg/ccl/changefeedccl/encoder_avro.go line 223 at r2 (raw file):

	if ok {
		registered = v.(confluentRegisteredEnvelopeSchema)
		if registered.schema.after != nil {

I think we'll also want to add a check for registered.schema.record since that seems to match the original intent of this code. Something like:

-               if err := registered.schema.after.refreshTypeMetadata(updatedRow); err != nil {
-                       return nil, err
+               if registered.schema.after != nil {
+                       if err := registered.schema.after.refreshTypeMetadata(updatedRow); err != nil {
+                               return nil, err
+                       }
+               }
+               if registered.schema.record != nil {
+                       if err := registered.schema.record.refreshTypeMetadata(updatedRow); err != nil {
+                               return nil, err
+                       }
                }

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rharding6373 and @wenyihu6)


pkg/ccl/changefeedccl/encoder_test.go line 1269 at r3 (raw file):

				cluster, db, cleanup := startTestCluster(t)
				defer cleanup()
				if rand.Intn(2) == 0 {

I think I'd prefer to see something like testutils.RunTrueAndFalse instead so that we always run both cases.

@wenyihu6
Copy link
Contributor Author

pkg/ccl/changefeedccl/encoder_avro.go line 223 at r2 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

I think we'll also want to add a check for registered.schema.record since that seems to match the original intent of this code. Something like:

-               if err := registered.schema.after.refreshTypeMetadata(updatedRow); err != nil {
-                       return nil, err
+               if registered.schema.after != nil {
+                       if err := registered.schema.after.refreshTypeMetadata(updatedRow); err != nil {
+                               return nil, err
+                       }
+               }
+               if registered.schema.record != nil {
+                       if err := registered.schema.record.refreshTypeMetadata(updatedRow); err != nil {
+                               return nil, err
+                       }
                }

Done.

@wenyihu6
Copy link
Contributor Author

pkg/ccl/changefeedccl/encoder_test.go line 1269 at r3 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

I think I'd prefer to see something like testutils.RunTrueAndFalse instead so that we always run both cases.

Done.

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jayshrivastava, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/encoder_avro.go line 150 at r4 (raw file):

	if ok {
		registered = v.(confluentRegisteredKeySchema)
		if registered.schema != nil {

I'm not sure we should have this check. This field should never be nil and if it somehow was, the code later in this function will panic anyway.


pkg/ccl/changefeedccl/encoder_test.go line 1266 at r4 (raw file):

		}
		for _, test := range tests {
			testutils.RunTrueAndFalse(t, test.format, func(t *testing.T, overrideWithSingleWorker bool) {

You probably still want an outer t.Run(test.format ... and then this inner call should be testutils.RunTrueAndFalse(t, "overrideWithSingleWorker", ....

@wenyihu6
Copy link
Contributor Author

pkg/ccl/changefeedccl/encoder_avro.go line 150 at r4 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

I'm not sure we should have this check. This field should never be nil and if it somehow was, the code later in this function will panic anyway.

Ack. Added as a safe check. I will remove it.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Feb 27, 2024

pkg/ccl/changefeedccl/encoder_test.go line 1266 at r4 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

You probably still want an outer t.Run(test.format ... and then this inner call should be testutils.RunTrueAndFalse(t, "overrideWithSingleWorker", ....

I think RunTrueAndFalse lets you pass the test name here

func RunTrueAndFalse[T testingTB[T]](t T, name string, fn func(t T, b bool)) {
? Do you want something else other than test.format here? It also prints out things like below:

    --- PASS: TestAvroWithRegionalTable/wrapped=false (2.00s)
    --- PASS: TestAvroWithRegionalTable/wrapped=true (1.74s)
    --- PASS: TestAvroWithRegionalTable/bare=false (1.88s)
    --- PASS: TestAvroWithRegionalTable/bare=true (1.73s)
    --- PASS: TestAvroWithRegionalTable/key_only=false (1.88s)
    --- PASS: TestAvroWithRegionalTable/key_only=true (1.73s)

@wenyihu6
Copy link
Contributor Author

pkg/ccl/changefeedccl/encoder_avro.go line 150 at r4 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Ack. Added as a safe check. I will remove it.

Done.

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jayshrivastava, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/encoder_test.go line 1266 at r4 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I think RunTrueAndFalse lets you pass the test name here

func RunTrueAndFalse[T testingTB[T]](t T, name string, fn func(t T, b bool)) {
? Do you want something else other than test.format here?

The subtest name passed to RunTrueAndFalse should be the thing you're toggling. You're not toggling wrapped to true and false. If you do what I suggested, the subtest names will be something like TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=false and TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=true.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Feb 28, 2024

pkg/ccl/changefeedccl/encoder_test.go line 1266 at r4 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

The subtest name passed to RunTrueAndFalse should be the thing you're toggling. You're not toggling wrapped to true and false. If you do what I suggested, the subtest names will be something like TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=false and TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=true.

Ack. I changed the names to the following. Is this better?

--- PASS: TestAvroWithRegionalTable (10.57s)
    --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false (5.21s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false/wrapped (1.69s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false/bare (1.76s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false/key_only (1.76s)
    --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true (5.25s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true/wrapped (1.72s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true/bare (1.80s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true/key_only (1.73s)

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jayshrivastava, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/encoder_test.go line 1266 at r4 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Ack. I changed it to the following. Is this better?

--- PASS: TestAvroWithRegionalTable (10.57s)
    --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false (5.21s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false/wrapped (1.69s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false/bare (1.76s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=false/key_only (1.76s)
    --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true (5.25s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true/wrapped (1.72s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true/bare (1.80s)
        --- PASS: TestAvroWithRegionalTable/overrideWithSingleWorker=true/key_only (1.73s)

At the risk of excessive bikshedding, I'd prefer to see envelope type first (e.g. TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=false) because to me that feels like the natural hierarchical order of things (i.e. each test case is testing a type of envelope and the overrideWithSingleWorker argument is a modifier of that test).

This also has the positive side effect of allowing the for loop to be closer (in nesting depth) to the data structure it's iterating over. In fact, you could have something like:

for _, test := range []struct {
    envelope string
    payload  []string
}{
    // test cases
}{
    t.Run(test.envelope, ...
}

pkg/ccl/changefeedccl/encoder_test.go line 1236 at r6 (raw file):

	tests := []struct {
		format  string

nit: rename this to envelope/envelopeType to match the name of the option it's setting


pkg/ccl/changefeedccl/encoder_test.go line 1270 at r6 (raw file):

				defer cleanup()
				if overrideWithSingleWorker {
					// Run the test with one and three(default) workers to test both the

nit: this comment should be moved to be on top of the testutils.RunTrueAndFalse line

Previously, the avro encoder could call `refreshTypeMetadata` on
`avroDataRecord` without proper nil checking. This could lead to node panics
because `avroDataRecord` could sometimes be nil. For example,
`registered.schema.after` is set only when using the wrapped envelope. Thus,
avro encoder could lead to panics when using with other envelope formats. This
patch addresses this issue by adding a defensive nil check when invoking
`refreshTypeMetadata`.

Fixes: cockroachdb#119428
Release note: None
Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andyyang890, @jayshrivastava, and @rharding6373)


pkg/ccl/changefeedccl/encoder_test.go line 1266 at r4 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

At the risk of excessive bikshedding, I'd prefer to see envelope type first (e.g. TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=false) because to me that feels like the natural hierarchical order of things (i.e. each test case is testing a type of envelope and the overrideWithSingleWorker argument is a modifier of that test).

This also has the positive side effect of allowing the for loop to be closer (in nesting depth) to the data structure it's iterating over. In fact, you could have something like:

for _, test := range []struct {
    envelope string
    payload  []string
}{
    // test cases
}{
    t.Run(test.envelope, ...
}

Ack. It now prints the following. Is this better?

--- PASS: TestAvroWithRegionalTable (10.83s)
    --- PASS: TestAvroWithRegionalTable/wrapped (3.69s)
        --- PASS: TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=false (1.94s)
        --- PASS: TestAvroWithRegionalTable/wrapped/overrideWithSingleWorker=true (1.75s)
    --- PASS: TestAvroWithRegionalTable/bare (3.50s)
        --- PASS: TestAvroWithRegionalTable/bare/overrideWithSingleWorker=false (1.79s)
        --- PASS: TestAvroWithRegionalTable/bare/overrideWithSingleWorker=true (1.71s)
    --- PASS: TestAvroWithRegionalTable/key_only (3.53s)
        --- PASS: TestAvroWithRegionalTable/key_only/overrideWithSingleWorker=false (1.78s)
        --- PASS: TestAvroWithRegionalTable/key_only/overrideWithSingleWorker=true (1.75s)

pkg/ccl/changefeedccl/encoder_test.go line 1236 at r6 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: rename this to envelope/envelopeType to match the name of the option it's setting

Done.


pkg/ccl/changefeedccl/encoder_test.go line 1270 at r6 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: this comment should be moved to be on top of the testutils.RunTrueAndFalse line

Done.

@wenyihu6
Copy link
Contributor Author

I think I've resolved all comments above. I will go ahead and merge this, but lmk if you have any other comments. Thanks for the thorough PR review!

bors r=andyyang890, rharding6373

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jayshrivastava and @rharding6373)

@craig
Copy link
Contributor

craig bot commented Feb 28, 2024

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 28, 2024

Build succeeded:

@craig craig bot merged commit b777cef into cockroachdb:master Feb 28, 2024
14 of 17 checks passed
Copy link

blathers-crl bot commented Feb 28, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e33ed5d to blathers/backport-release-23.1-119639: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changefeedccl: set or defend registered.schema.after for all envelope types
5 participants