Support struct column reading with different schemas #5962
velox/dwio/common/Options.h
/**
 * Get the output type of row reader.
 */
const RowTypePtr& getOutputType() const {
The requested type is already available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep two copies of the same thing.
for (auto i = 0; i < childSpecs.size(); ++i) {
  if (childSpecs[i]->isConstant()) {
    continue;
  }
  auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
  const auto& fieldName = childSpecs[i]->fieldName();
  if (outputType && !fileType_->containsChild(fieldName)) {
We need to decide what schema evolution strategy we want here. In our data warehouse, columns are matched not by name but by position, so any extra fields added need to be at the end of the children list. This allows column renaming. If we match by name here, we will lose the renaming functionality, and that seems quite important in most data warehouses.
Thanks for your comment. Does that mean for a row(a, c) struct schema in Parquet, the expected output can only be like row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.
Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way: matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
How is field renaming handled in the data warehouse you mentioned? In Spark, for a query like select a as b, it adds a projection node with an Alias expression after the scan.
And what do you suggest for the design? Should I add some notes to this PR, or something else?
With matching by name, you need to know all the old field names (a in your query) across all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and first design the right way to support matching columns in these different ways.
Thanks. That sounds good to me. Converting this PR to a draft for now.
When reading a struct type column, the user-specified output schema can differ from the actual data schema in the Parquet file. For the missing fields, null occupies the position. The table below summarizes the cases and the expected results. This PR supports the above cases and adds unit tests for them.