Support struct column reading with different schemas #5962
velox/dwio/common/Options.h
/**
 * Get the output type of row reader.
 */
const RowTypePtr& getOutputType() const {
The requested type is already available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep two copies of the same thing.
for (auto i = 0; i < childSpecs.size(); ++i) {
  if (childSpecs[i]->isConstant()) {
    continue;
  }
  auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
  const auto& fieldName = childSpecs[i]->fieldName();
  if (outputType && !fileType_->containsChild(fieldName)) {
We need to decide what schema evolution strategy we want here. In our data warehouse, columns are matched not by name but by position, so any extra fields added need to be at the end of the children list. This allows column renaming. If we match by name here, we will lose the renaming functionality, and that seems quite important in most data warehouses.
Thanks for your comment. Does that mean for a row(a, c) struct schema in Parquet, the expected output can only be like row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.
Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way: matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
How is field renaming handled in the data warehouse you mentioned? In Spark, for a query like select a as b, it adds a projection node with an Alias expression after the scan.
And what do you suggest for the design? Should I add some notes to this PR, or something else?
With matching by name, you need to know all the old field names (a in your query) across all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and first design the right way to support matching columns in these different ways.
Thanks. That sounds good to me. Converting this PR to a draft for now.
When reading a struct type column, the user-specified output schema can differ from the actual data schema in the Parquet file. For the missing fields, null occupies the position. The table below summarizes the cases and the expected results. This PR supports the above cases and adds unit tests for them.