Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: fix target column populating for AVRO #52867

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Anzoteh96
Copy link

@Anzoteh96 Anzoteh96 commented Aug 16, 2020

Target columns for AVRO are not properly populated when starting a job for IMPORT INTO. This hinders the ability for default expressions to be supported, and therefore this PR does that by extracting these information out from the AVRO schema to be populated into importCtx. It also extracts the columns present in the table but missing from the AVRO schema so that an error could be returned if strict validation is turned on, but when some columns in the table are not specified in the AVRO schema.

Fixed #52160

The default values for each column follows the precendence:

  1. Data of that column specified in an AVRO row;
  2. Default value specified in AVRO schema:
  3. Default value of a table.

For example, say we have
CREATE TABLE t (a INT, b INT DEFAULT 42)
Consider, also, an Avro file myfile.avro. Then for each row:

  1. If a value is present in column (b), then the value is used.
  2. If the value is not present in column (b) but the AVRO schema specifies
    schema: (a, "INT", b, "INT", default: 20)
    then the default value of 20 is automatically populated into the Avro values, and we'll therefore use this default value 20 instead of 42.
  3. If none of (1) and (2) holds, then the value 42 is used.

Release note: None

@miretskiy
Copy link
Contributor

I'm curious where did you see the use of "default" setting in avro schema? Is it a standard setting or something we decided on?

@Anzoteh96
Copy link
Author

I'm curious where did you see the use of "default" setting in avro schema? Is it a standard setting or something we decided on?

It's over here: https://avro.apache.org/docs/1.8.1/spec.html#schema_complex. Also notice that their default expressions seem to limit to only constant expressions.

@Anzoteh96 Anzoteh96 changed the title importccl: [WIP] fix target column populating for AVRO importccl: fix target column populating for AVRO Aug 17, 2020
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@Anzoteh96 Anzoteh96 requested review from a team and miretskiy and removed request for a team August 17, 2020 14:32
@Anzoteh96 Anzoteh96 marked this pull request as ready for review August 17, 2020 14:33
@pbardea pbardea self-requested a review August 17, 2020 20:54
Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @miretskiy, and @pbardea)


pkg/ccl/importccl/read_import_avro.go, line 162 at r1 (raw file):

		colName, ok := col["name"].(string)
		if !ok {
			return []tree.Name{}, []string{}, errors.New("bad field name description")

This error message and the one above could be more descriptive. Why would casting fail? Let's include the column and also what about that column is invalid.


pkg/ccl/importccl/read_import_avro.go, line 456 at r1 (raw file):

		strict:         avro.opts.StrictMode,
	}
	// TODO: need to check strict mode.

Are you planning on handling this TODO in this PR?

Copy link
Author

@Anzoteh96 Anzoteh96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/read_import_avro.go, line 162 at r1 (raw file):

Previously, pbardea (Paul Bardea) wrote…

This error message and the one above could be more descriptive. Why would casting fail? Let's include the column and also what about that column is invalid.

At these stages the column names are not obtained yet. So how it works was:

colMap is schema["fields"], which is, in runtime, an array of map[string]interface{}. But at compile time it's simply an interface, which is why casting is needed.

There's a two-step casting process here:
interface -> array of interface and for each element, interface -> map[string]interface{}

Finally, for the last one (casting below) the column name is given as col["name"], and cast that from an interface{} to a string.


pkg/ccl/importccl/read_import_avro.go, line 456 at r1 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Are you planning on handling this TODO in this PR?

Ah, this TODO was a note for myself (when I was struggling to find a way to handle strict mode to pass the existing tests). I should have removed it even before putting this PR up 🤦‍♂️

@Anzoteh96 Anzoteh96 force-pushed the import-avro-fix branch 2 times, most recently from cea198e to e0864f3 Compare August 19, 2020 15:07
@pbardea pbardea self-requested a review August 19, 2020 15:09
Copy link
Author

@Anzoteh96 Anzoteh96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3431 at r3 (raw file):

				data:       avroData,
				format:     "AVRO",
			},

There used to be a TODO to create a test case for AVRO (under the random() default expression PR), now I've added this test case. Not too sure if this is intended though since there has been another test case on AVRO import (with constant default expression) earlier on.

Target columns for AVRO are not properly populated when
starting a job for IMPORT INTO. This hinders the ability
for default expressions to be supported, and therefore this
PR does that by extracting these information out from the AVRO
schema to be populated into `importCtx`. It also extracts the
columns present in the table but missing from the AVRO schema
so that an error could be returned if strict validation is turned
on, but when some columns in the table are not specified in the
AVRO schema.

The default values for each column follows the precendence:
1. Data of that column specified in an AVRO row;
2. Default value specified in AVRO schema:
3. Default value of a table.

For example, say we have
CREATE TABLE t (a INT, b INT DEFAULT 42)
Consider, also, an Avro file myfile.avro. Then for each row:

1. If a value is present in column (b), then the value is used.
2. If the value is not present in column (b) but the AVRO schema specifies
schema: (a, "INT", b, "INT", default: 20) then the default value of 20
is automatically populated into the Avro values, and we'll therefore
use this default value 20 instead of 42.
3. If none of (1) and (2) holds, then the value 42 is used.

Release note: None
Copy link
Author

@Anzoteh96 Anzoteh96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I had discussion with @pbardea : we agree that this is close to being done except on the way the schema is read in getTargetColFromCodec. More details below.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/read_import_avro.go, line 162 at r1 (raw file):

Previously, Anzoteh96 (Anzo Teh) wrote…

At these stages the column names are not obtained yet. So how it works was:

colMap is schema["fields"], which is, in runtime, an array of map[string]interface{}. But at compile time it's simply an interface, which is why casting is needed.

There's a two-step casting process here:
interface -> array of interface and for each element, interface -> map[string]interface{}

Finally, for the last one (casting below) the column name is given as col["name"], and cast that from an interface{} to a string.

Following up from the general comment above: here we're assuming that codec.Schema() is in the form:
column_name = schema["fields"][column_number]["name"]. So far this assumption has gone well with the test cases we have, but is this always the case?

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@cockroach-teamcity
Copy link
Member

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


anzoteh96 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@adityamaru adityamaru removed the request for review from pbardea May 11, 2022 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-noremind Bots won't notify about PRs with X-noremind
Projects
None yet
Development

Successfully merging this pull request may close these issues.

importccl: IMPORT INTO AVRO file gives NULL for default expressions not specified in the AVRO schema
5 participants