Support struct column reading with different schemas #5962

Draft: rui-mo wants to merge 1 commit into base: main

rui-mo (Collaborator) commented Aug 2, 2023

When reading a struct-type column, the user-specified output schema can differ from the actual data schema in the Parquet file. Fields that are missing from the file are filled with null. The table below summarizes the cases and the expected results.

| Parquet column schema | User-specified output schema | Result |
| --- | --- | --- |
| row({"a", "c"}) | row({"a", "b", "c"}) | row(a_val, null, c_val) |
| row({"a", "c"}) | row({"b"}) | row(null) |
| row({"a", "c"}) | row({"b", "d"}) | null |
| row({"a", "c"}) | row({}) | empty |

This PR supports the above cases and adds unit tests for them.
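
As a rough illustration of the per-field null filling described above, here is a minimal sketch that matches requested fields to file fields by name. It is not the PR's implementation: `FileRow` and `projectByName` are hypothetical names, values are plain `int`s, and `std::nullopt` stands in for null. It models only the per-field behavior of the first two table rows and the empty output schema of the last row; it does not model the case where the whole struct is reported as null.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one struct value read from a file:
// field name -> value. This is not a Velox type.
using FileRow = std::map<std::string, int>;

// For each requested field name, return the file's value if the file has a
// field with that name; otherwise return std::nullopt (standing in for null).
std::vector<std::optional<int>> projectByName(
    const FileRow& fileRow,
    const std::vector<std::string>& outputNames) {
  std::vector<std::optional<int>> result;
  result.reserve(outputNames.size());
  for (const auto& name : outputNames) {
    auto it = fileRow.find(name);
    if (it == fileRow.end()) {
      result.push_back(std::nullopt);  // Field is missing in the file.
    } else {
      result.push_back(it->second);
    }
  }
  return result;
}
```

With a file row holding `a` and `c`, requesting `{"a", "b", "c"}` yields the value of `a`, a null, and the value of `c`; requesting `{"b"}` yields a single null; requesting no fields yields an empty result.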

@facebook-github-bot added the CLA Signed label Aug 2, 2023
@Yuhta self-requested a review August 2, 2023 17:44
/**
 * Get the output type of row reader.
 */
const RowTypePtr& getOutputType() const {
Contributor

Requested type is available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep 2 copies of the same thing.

for (auto i = 0; i < childSpecs.size(); ++i) {
  if (childSpecs[i]->isConstant()) {
    continue;
  }
  auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
  const auto& fieldName = childSpecs[i]->fieldName();
  if (outputType && !fileType_->containsChild(fieldName)) {
Contributor

We need to decide which schema evolution strategy we want here. In our data warehouse, columns are matched by position, not by name, so any extra fields must be added at the end of the children list. This allows column renaming. If we match by name here, we lose the renaming functionality, which seems quite important in most data warehouses.
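
To make the trade-off concrete, here is a minimal sketch contrasting the two strategies on a file written before a rename. It is an illustration only, not Velox code: `FileField`, `matchByPosition`, and `matchByName` are hypothetical names, values are plain `int`s, and `std::nullopt` stands in for null.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one child of a struct as stored in a file.
struct FileField {
  std::string name;
  int value;
};

// Positional matching: output child i reads file child i, whatever its name.
// Output children beyond the file's child count stay null (std::nullopt).
std::vector<std::optional<int>> matchByPosition(
    const std::vector<FileField>& fileFields,
    std::size_t numOutputFields) {
  std::vector<std::optional<int>> result(numOutputFields);
  for (std::size_t i = 0; i < numOutputFields && i < fileFields.size(); ++i) {
    result[i] = fileFields[i].value;
  }
  return result;
}

// Name matching: output child i reads the file child with the same name,
// or stays null if the file has no child with that name.
std::vector<std::optional<int>> matchByName(
    const std::vector<FileField>& fileFields,
    const std::vector<std::string>& outputNames) {
  std::vector<std::optional<int>> result(outputNames.size());
  for (std::size_t i = 0; i < outputNames.size(); ++i) {
    for (const auto& field : fileFields) {
      if (field.name == outputNames[i]) {
        result[i] = field.value;
        break;
      }
    }
  }
  return result;
}
```

Suppose an old file stores `{a, c}` and the table schema has since renamed `a` to `x` and appended `d`, giving `{x, c, d}`. Positional matching still reads the old `a` value into `x` and returns null only for `d`. Name matching finds no `x` in the old file, so the renamed field comes back null.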

Collaborator Author

Thanks for your comment. Does that mean that for a row(a, c) struct schema in Parquet, the expected output can only be of the form row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.

Contributor

Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way: matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
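
For the third option, here is a minimal sketch of matching by field ID, assuming every field carries a stable integer ID that survives renames and reordering. The names `FileFieldsById` and `matchByFieldId` are hypothetical, values are plain `int`s, and `std::nullopt` stands in for null; this is not how any particular reader implements it.

```cpp
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical model: the file's struct children keyed by a stable field ID
// (field ID -> value). Names are irrelevant to the lookup.
using FileFieldsById = std::unordered_map<int, int>;

// Field-ID matching: each requested ID reads the file child with the same ID,
// or yields null (std::nullopt) if the file has no child with that ID.
std::vector<std::optional<int>> matchByFieldId(
    const FileFieldsById& fileFields,
    const std::vector<int>& outputIds) {
  std::vector<std::optional<int>> result;
  result.reserve(outputIds.size());
  for (int id : outputIds) {
    auto it = fileFields.find(id);
    if (it == fileFields.end()) {
      result.push_back(std::nullopt);
    } else {
      result.push_back(it->second);
    }
  }
  return result;
}
```

Because the lookup key is the ID, this approach tolerates both renames and fields added in the middle of the schema, at the cost of requiring IDs to be recorded in the file and table metadata.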

Collaborator Author
@rui-mo Aug 7, 2023

How is field renaming done in the data warehouse you mentioned? In Spark, for a query like select a as b, a projection node with an Alias expression is added after the scan.
And what do you suggest for the design: should I add some notes to this PR, or do something else?

Contributor

With matching by name you need to know all the old field names (a in your query) in all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and design the right way to allow matching columns in different ways first.

Collaborator Author

Thanks. That sounds good to me. Converting this PR to draft for now.

@rui-mo changed the title from "Support struct column reading with different schemas" to "[GLUTEN] Support struct column reading with different schemas" Aug 4, 2023
@rui-mo marked this pull request as draft August 9, 2023 02:52
@rui-mo changed the title from "[GLUTEN] Support struct column reading with different schemas" to "Support struct column reading with different schemas" Aug 28, 2023
@rui-mo rui-mo force-pushed the wip_struct branch 3 times, most recently from c03152b to c8c5132 Compare September 5, 2023 05:28
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from 2168dc9 to fda6ff8 Compare October 13, 2023 02:33
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from a8174d3 to 7abb820 Compare November 7, 2023 01:44
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from d307831 to 0364f89 Compare January 26, 2024 02:54
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from e7eab9e to 1021b22 Compare April 2, 2024 05:13
marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 2, 2024
marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 3, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 4, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 5, 2024
zhouyuan pushed a commit to oap-project/velox that referenced this pull request May 10, 2024
zhouyuan pushed a commit to oap-project/velox that referenced this pull request May 11, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 12, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 14, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 15, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 16, 2024
rui-mo added a commit to oap-project/velox that referenced this pull request May 17, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 18, 2024
rui-mo added a commit to oap-project/velox that referenced this pull request May 18, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 19, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 21, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request May 22, 2024
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from e50d52d to f9e493e Compare May 30, 2024 01:29
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from 155830e to 2a1569b Compare June 17, 2024 02:41
FelixYBW pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 26, 2024
zhztheplayer pushed a commit to zhztheplayer/velox that referenced this pull request Jul 27, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 29, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 30, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 31, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 1, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 2, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 3, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 4, 2024