Remove partition columns before writing partitioned file #8089

JkSelf · 2023-12-18T10:11:28Z

Spark distinguishes between data columns and partition columns, and it only writes the data columns into the file. However, Velox writes both the data columns and the partition columns into the file. This PR removes the partition columns from the row vector before writing the file.
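To illustrate the idea, here is a minimal sketch of dropping partition columns from a batch before it reaches the file writer. It does not use Velox types: `Column`, `RowBatch`, `dataChannels`, and `dropPartitionColumns` are hypothetical names made up for this example, and the actual change in `HiveDataSink` may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a row vector: one named column of values per entry.
struct Column {
  std::string name;
  std::vector<int> values;
};
using RowBatch = std::vector<Column>;

// Returns the indices of the columns that are not partition keys,
// preserving their original order.
std::vector<size_t> dataChannels(
    size_t numColumns,
    const std::vector<size_t>& partitionChannels) {
  std::unordered_set<size_t> partitions(
      partitionChannels.begin(), partitionChannels.end());
  std::vector<size_t> channels;
  for (size_t i = 0; i < numColumns; ++i) {
    if (partitions.count(i) == 0) {
      channels.push_back(i);
    }
  }
  return channels;
}

// Builds the batch that would be handed to the file writer:
// data columns only, partition columns left out.
RowBatch dropPartitionColumns(
    const RowBatch& input,
    const std::vector<size_t>& partitionChannels) {
  RowBatch output;
  for (size_t channel : dataChannels(input.size(), partitionChannels)) {
    assert(channel < input.size());
    output.push_back(input[channel]);
  }
  return output;
}
```

With columns `id`, `ds`, `amount` and `ds` (index 1) as the partition key, `dropPartitionColumns(batch, {1})` keeps only `id` and `amount`, in that order.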

netlify · 2023-12-18T10:11:33Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`7b9ab80`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/65a1c6be8cd6ba0009f053d1

JkSelf · 2023-12-18T10:11:50Z

@mbasmanova Can you help to review? Thanks.

mbasmanova

CC: @gggrace14

I wonder if this is a bug in the query planner. Ge, would you help take a look?

gggrace14 · 2023-12-18T20:22:28Z

> CC: @gggrace14
>
> I wonder if this is a bug in the query planner. Ge, would you help take a look?

Looking at it

velox/connectors/hive/HiveDataSink.cpp

gggrace14 · 2023-12-19T06:41:47Z

> CC: @gggrace14
>
> I wonder if this is a bug in the query planner. Ge, would you help take a look?

Hi @mbasmanova Masha, I checked the Presto code, and it does look like a bug in the original code. That is, Presto today does not include partition columns when writing a file under a partition directory. So I think we should take this change, and not only for Parquet files.

I will also run Presto Verifier before or after this change is merged, since it is on a critical path.

It would be good if you could also take a look to help confirm.

https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L390
https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L509-L517
https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L163
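
For context on why the values do not need to be repeated inside the file: in a Hive-style layout the partition values are already encoded in the directory name as `key=value` segments. The sketch below only illustrates that naming convention. `partitionDirectory` is a hypothetical helper, not Presto or Velox code, and it skips the escaping of special characters that a real implementation would need.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Builds a Hive-style partition directory name, e.g. "ds=2023-12-18/country=US".
// Assumption: names and values contain no characters that need escaping.
std::string partitionDirectory(
    const std::vector<std::pair<std::string, std::string>>& partitionKeys) {
  std::string path;
  for (const auto& key : partitionKeys) {
    assert(!key.first.empty());
    if (!path.empty()) {
      path += '/';
    }
    path += key.first + "=" + key.second;
  }
  return path;
}
```

A reader can recover `ds` and `country` from the path alone, so the data file under that directory only needs the remaining columns.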

mbasmanova · 2023-12-19T11:41:51Z

@gggrace14 Ge, thank you for looking and commenting. Sounds like this is a generic bug not specific to Parquet. Indeed, partition keys should not be written into files, not only for Parquet, but for all formats.

@JkSelf This is a nice finding. Let's update the PR description to explain that this is a generic bug. You may mention that it was discovered by running Spark tests on Gluten. I'll take a look at the code. Thanks.

mbasmanova

@JkSelf Thank you for identifying this issue and working on a fix. Some comments below.

mbasmanova · 2023-12-19T11:44:13Z

velox/exec/tests/TableWriteTest.cpp

@@ -1985,6 +1988,49 @@ TEST_P(PartitionedWithoutBucketTableWriterTest, fromSinglePartitionToMultiple) {
      "SELECT * FROM tmp");
 }

+TEST_P(PartitionedTableWriterTest, removePartitionColumns) {


Do we need a new test, or could we modify existing tests to verify the list of columns written into the file? I assume there are existing tests that write partitioned tables. These tests would start failing once we add verification for the list of columns in the file, no?

JkSelf · 2023-12-25

@mbasmanova Removed.

velox/connectors/hive/HiveDataSink.cpp

JkSelf · 2023-12-25T07:11:14Z

@mbasmanova @gggrace14 Thanks for your review. I have updated all your comments. Can you help to review again? Thanks.

gggrace14

@JkSelf The structure looks a lot better now and we're very close. Left some comments.

velox/connectors/hive/HiveDataSink.cpp

velox/connectors/hive/HiveDataSink.h

velox/connectors/hive/HiveDataSink.cpp

JkSelf · 2024-01-03T02:34:19Z

@gggrace14 Thanks for your review. Can you help to review again? Thanks.

JkSelf · 2024-01-09T03:06:15Z

@mbasmanova Can you help to review again? Thanks.

mbasmanova · 2024-01-09T03:23:41Z

@gggrace14 Ge will help review and merge this PR.

gggrace14

@JkSelf, thank you for revising! Left a few more comments. Also, I see this PR contains 7 commits. Would you be able to fold them into one? Not sure how they will end up in the master branch after merge.

velox/connectors/hive/HiveDataSink.cpp

JkSelf · 2024-01-10T06:40:49Z

@gggrace14 Resolved all your comments. Can you help to review again? Thanks.

gggrace14

It looks great! Thank you for making the change, @JkSelf !

facebook-github-bot · 2024-01-10T21:04:09Z

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-01-10T21:38:14Z

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kewang1024

Thanks @JkSelf for fixing the problem!

Overall PR lgtm, proposed some nits

velox/connectors/hive/HiveDataSink.cpp

velox/exec/tests/TableWriteTest.cpp

JkSelf · 2024-01-11T23:27:23Z

@kewang1024 Resolved your comments. Can you help to review again? Thanks.

kewang1024

Thank you @JkSelf, accepted with one NIT:
the functions can be static since they don't access any member variables (getNonPartitionTypes / getNonPartitionChannels / getNonPartitionsColumns / makeDataInput)
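
A rough sketch of what this nit amounts to: helpers whose result depends only on their arguments can be declared `static` and called without an instance. The class `SinkSketch` and the signatures below are assumptions made for illustration. They are not the actual `HiveDataSink` declarations, and types are represented as plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class, used only to show the static-helper pattern.
class SinkSketch {
 public:
  // Static: reads no member state, only the argument.
  static std::vector<size_t> nonPartitionChannels(
      const std::vector<bool>& isPartitionKey) {
    std::vector<size_t> channels;
    for (size_t i = 0; i < isPartitionKey.size(); ++i) {
      if (!isPartitionKey[i]) {
        channels.push_back(i);
      }
    }
    return channels;
  }

  // Static: picks the types of the data columns from the full type list.
  static std::vector<std::string> nonPartitionTypes(
      const std::vector<size_t>& dataChannels,
      const std::vector<std::string>& allTypes) {
    std::vector<std::string> types;
    for (size_t channel : dataChannels) {
      assert(channel < allTypes.size());
      types.push_back(allTypes[channel]);
    }
    return types;
  }
};
```

Callers write `SinkSketch::nonPartitionChannels(...)` directly, which also makes it clear at the call site that no object state is involved.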

JkSelf · 2024-01-12T23:10:31Z

@kewang1024 @gggrace14 Updated. Can you help to review again? Thanks.

gggrace14 · 2024-01-15T19:18:40Z

The revision looks good to me! Thank you, @JkSelf , for making the change! I'm merging the PR

facebook-github-bot · 2024-01-15T19:34:12Z

@gggrace14 merged this pull request in 8f0dd8b.

conbench-facebook · 2024-01-15T19:56:00Z

Conbench analyzed the 1 benchmark run on commit 8f0dd8b7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…ubator#8089)

Summary: Spark distinguishes between [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L583) and [partition columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L584), and it only writes the [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L298) into the file. However, Velox writes both the data columns and partition columns into the file. This PR removes the partition columns from the row vector before writing file.

Pull Request resolved: facebookincubator#8089

Reviewed By: kewang1024

Differential Revision: D52670985

Pulled By: gggrace14

fbshipit-source-id: 11e5ef4cd99903adaf191b650318f195ef1be62d

This is cherry-picked from facebookincubator#8089 and is the delta of that PR on top of the main branch. When we merged the PR, we failed to merge the latest version. Everywhere is consistent though. This change renames the functions getDataType and getDataChannels in HiveDataSink and makes them static.

Summary: This is cherry-picked from #8089 and is the delta of that PR on top of the main branch. When we merged the PR, we failed to merge the latest version. Everywhere is consistent with version though. This change renames the functions getDataType and getDataChannels in HiveDataSink and makes them static.

Pull Request resolved: #8404

Reviewed By: xiaoxmeng, kewang1024

Differential Revision: D52822889

Pulled By: gggrace14

fbshipit-source-id: d2fb6fd8cb87fd77d9897555f89e35ec197e3c7f

facebook-github-bot added the CLA Signed label Dec 18, 2023

JkSelf force-pushed the remove-partition-columns branch from cdf0ec0 to 18bb57d Compare December 18, 2023 10:18

mbasmanova reviewed Dec 18, 2023

gggrace14 reviewed Dec 19, 2023

JkSelf mentioned this pull request Dec 19, 2023

[GLUTEN-3547][CORE] [VL] Add native parquet writer in spark 3.4 apache/incubator-gluten#3690

Merged

mbasmanova requested a review from kewang1024 December 19, 2023 11:42

mbasmanova reviewed Dec 19, 2023

JkSelf changed the title ~~Remove partition columns before writing partitioned parquet file~~ Remove partition columns before writing partitioned file Dec 21, 2023

JkSelf force-pushed the remove-partition-columns branch 2 times, most recently from 48decd5 to c36db67 Compare December 25, 2023 07:08

JkSelf force-pushed the remove-partition-columns branch 2 times, most recently from 2e53f5a to a160a7b Compare December 27, 2023 01:49

JkSelf mentioned this pull request Dec 27, 2023

Pass spark 3.4 unit test when enabling native parquet write oap-project/velox#466

Merged

gggrace14 reviewed Jan 2, 2024

JkSelf force-pushed the remove-partition-columns branch from 2a4ef01 to 160767c Compare January 3, 2024 02:04

gggrace14 reviewed Jan 10, 2024

JkSelf force-pushed the remove-partition-columns branch from 3989089 to 994965b Compare January 10, 2024 06:39

gggrace14 self-requested a review January 10, 2024 20:59

gggrace14 approved these changes Jan 10, 2024

kewang1024 reviewed Jan 11, 2024

marin-ma pushed a commit to oap-project/velox that referenced this pull request Jan 11, 2024

Remove partition columns before writing partitioned file (facebookincubator#8089)

aa61172

JkSelf force-pushed the remove-partition-columns branch from 431139d to 6e6e2c2 Compare January 11, 2024 23:24

GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 12, 2024

Remove partition columns before writing partitioned file (facebookincubator#8089)

b17674a

kewang1024 approved these changes Jan 12, 2024

JkSelf force-pushed the remove-partition-columns branch from 32a54db to 7b9ab80 Compare January 12, 2024 23:09

GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 13, 2024

Remove partition columns before writing partitioned file (facebookincubator#8089)

2c76020

Skip the partition columns in table write

7b9ab80

GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 14, 2024

Remove partition columns before writing partitioned file (facebookincubator#8089)

a72f301

GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 15, 2024

Remove partition columns before writing partitioned file (facebookincubator#8089)

476dc65

rui-mo pushed a commit to oap-project/velox that referenced this pull request Jan 15, 2024

Remove partition columns before writing partitioned file (facebookincubator#8089)

67f5004

gggrace14 approved these changes Jan 15, 2024

facebook-github-bot closed this in 8f0dd8b Jan 15, 2024

facebook-github-bot added the Merged label Jan 15, 2024

gggrace14 mentioned this pull request Jan 17, 2024

Rename getDataType and getDataChannels funcs in HiveDataSink #8404

Closed
