New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importccl: fix target column populating for AVRO #52867
base: master
Are you sure you want to change the base?
Conversation
I'm curious where did you see the use of "default" setting in avro schema? Is it a standard setting or something we decided on? |
It's over here: https://avro.apache.org/docs/1.8.1/spec.html#schema_complex. Also notice that their default expressions seem to limit to only constant expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @miretskiy, and @pbardea)
pkg/ccl/importccl/read_import_avro.go, line 162 at r1 (raw file):
colName, ok := col["name"].(string) if !ok { return []tree.Name{}, []string{}, errors.New("bad field name description")
This error message and the one above could be more descriptive. Why would casting fail? Let's include the column and also what about that column is invalid.
pkg/ccl/importccl/read_import_avro.go, line 456 at r1 (raw file):
strict: avro.opts.StrictMode, } // TODO: need to check strict mode.
Are you planning on handling this TODO in this PR?
909c510
to
d4ecb0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)
pkg/ccl/importccl/read_import_avro.go, line 162 at r1 (raw file):
Previously, pbardea (Paul Bardea) wrote…
This error message and the one above could be more descriptive. Why would casting fail? Let's include the column and also what about that column is invalid.
At these stages the column names are not obtained yet. So how it works was:
colMap is schema["fields"]
, which is, in runtime, an array of map[string]interface{}. But at compile time it's simply an interface, which is why casting is needed.
There's a two-step casting process here:
interface -> array of interface and for each element, interface -> map[string]interface{}
Finally, for the last one (casting below) the column name is given as col["name"], and cast that from an interface{} to a string.
pkg/ccl/importccl/read_import_avro.go, line 456 at r1 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Are you planning on handling this TODO in this PR?
Ah, this TODO was a note for myself (when I was struggling to find a way to handle strict mode to pass the existing tests). I should have removed it even before putting this PR up 🤦♂️
cea198e
to
e0864f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)
pkg/ccl/importccl/import_stmt_test.go, line 3431 at r3 (raw file):
data: avroData, format: "AVRO", },
There used to be a TODO to create a test case for AVRO (under the random()
default expression PR), now I've added this test case. Not too sure if this is intended though since there has been another test case on AVRO import (with constant default expression) earlier on.
e0864f3
to
623e885
Compare
Target columns for AVRO are not properly populated when starting a job for IMPORT INTO. This hinders the ability for default expressions to be supported, and therefore this PR does that by extracting these information out from the AVRO schema to be populated into `importCtx`. It also extracts the columns present in the table but missing from the AVRO schema so that an error could be returned if strict validation is turned on, but when some columns in the table are not specified in the AVRO schema. The default values for each column follows the precendence: 1. Data of that column specified in an AVRO row; 2. Default value specified in AVRO schema: 3. Default value of a table. For example, say we have CREATE TABLE t (a INT, b INT DEFAULT 42) Consider, also, an Avro file myfile.avro. Then for each row: 1. If a value is present in column (b), then the value is used. 2. If the value is not present in column (b) but the AVRO schema specifies schema: (a, "INT", b, "INT", default: 20) then the default value of 20 is automatically populated into the Avro values, and we'll therefore use this default value 20 instead of 42. 3. If none of (1) and (2) holds, then the value 42 is used. Release note: None
623e885
to
f3d9e37
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I had discussion with @pbardea : we agree that this is close to being done except on the way the schema is read in getTargetColFromCodec
. More details below.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)
pkg/ccl/importccl/read_import_avro.go, line 162 at r1 (raw file):
Previously, Anzoteh96 (Anzo Teh) wrote…
At these stages the column names are not obtained yet. So how it works was:
colMap is
schema["fields"]
, which is, in runtime, an array of map[string]interface{}. But at compile time it's simply an interface, which is why casting is needed.There's a two-step casting process here:
interface -> array of interface and for each element, interface -> map[string]interface{}Finally, for the last one (casting below) the column name is given as col["name"], and cast that from an interface{} to a string.
Following up from the general comment above: here we're assuming that codec.Schema()
is in the form:
column_name = schema["fields"][column_number]["name"]
. So far this assumption has gone well with the test cases we have, but is this always the case?
anzoteh96 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Target columns for AVRO are not properly populated when starting a job for IMPORT INTO. This hinders the ability for default expressions to be supported, and therefore this PR does that by extracting these information out from the AVRO schema to be populated into
importCtx
. It also extracts the columns present in the table but missing from the AVRO schema so that an error could be returned if strict validation is turned on, but when some columns in the table are not specified in the AVRO schema.Fixed #52160
The default values for each column follows the precendence:
For example, say we have
CREATE TABLE t (a INT, b INT DEFAULT 42)
Consider, also, an Avro file
myfile.avro
. Then for each row:schema: (a, "INT", b, "INT", default: 20)
then the default value of 20 is automatically populated into the Avro values, and we'll therefore use this default value 20 instead of 42.
Release note: None