Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove partition columns before writing partitioned file #8089

Closed

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Dec 18, 2023

Spark distinguishes between data columns and partition columns, and it only writes the data columns into the file. However, Velox writes both the data columns and partition columns into the file. This PR removes the partition columns from the row vector before writing file.

Copy link

netlify bot commented Dec 18, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 7b9ab80
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65a1c6be8cd6ba0009f053d1

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 18, 2023
@JkSelf
Copy link
Contributor Author

JkSelf commented Dec 18, 2023

@mbasmanova Can you help to review? Thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC: @gggrace14

I wonder if this is a bug in the query planner. Ge, would you help take a look?

@gggrace14
Copy link
Contributor

CC: @gggrace14

I wonder if this is a bug in the query planner. Ge, would you help take a look?

Looking at it

velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
@gggrace14
Copy link
Contributor

gggrace14 commented Dec 19, 2023

CC: @gggrace14

I wonder if this is a bug in the query planner. Ge, would you help take a look?

Hi @mbasmanova Masha, I checked the Presto code, and it does look like a bug in the original code. That is, Presto does not include partition columns today when writing to a file under a partition directory. So I think we should have this change (and not for only parquet file).

Also I will run Presto Verifier after or before this change is merged, as it is on a critical path.

Better to ask you to also take a look to help confirm.

https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L390
https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L509-L517
https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L163

@mbasmanova
Copy link
Contributor

@gggrace14 Ge, thank you for looking and commenting. Sounds like this is a generic bug not specific to Parquet. Indeed, partition keys should not be written in to files, not only for Parquet, but for all formats.

@JkSelf This is a nice finding. Let's update PR description to explain that this is a generic bug. You may mention that it was discovered by running Spark tests on Gluten. I'll take a look at the code. Thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf Thank you for identifying this issue and working on a fix. Some comments below.

velox/exec/tests/TableWriteTest.cpp Outdated Show resolved Hide resolved
@@ -1985,6 +1988,49 @@ TEST_P(PartitionedWithoutBucketTableWriterTest, fromSinglePartitionToMultiple) {
"SELECT * FROM tmp");
}

TEST_P(PartitionedTableWriterTest, removePartitionColumns) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a new test or could we modify existing tests to verify the list of columns written into the file? I assume there are existing tests that write partitioned tables. These test would start failing once we add verification for list of columns in the file, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Removed.

velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
@JkSelf JkSelf changed the title Remove partition columns before writing partitioned parquet file Remove partition columns before writing partitioned file Dec 21, 2023
@JkSelf JkSelf force-pushed the remove-partition-columns branch 2 times, most recently from 48decd5 to c36db67 Compare December 25, 2023 07:08
@JkSelf
Copy link
Contributor Author

JkSelf commented Dec 25, 2023

@mbasmanova @gggrace14 Thanks for your review. I have updated all your comments. Can you help to review again? Thanks.

Copy link
Contributor

@gggrace14 gggrace14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf The structure looks a lot better now and we're very close. Left some comments.

velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.h Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Jan 3, 2024

@gggrace14 Thanks for your review. Can you help to review again? Thanks.

@JkSelf
Copy link
Contributor Author

JkSelf commented Jan 9, 2024

@mbasmanova Can you help to review again? Thanks.

@mbasmanova
Copy link
Contributor

@gggrace14 Ge will help review and merge this PR.

Copy link
Contributor

@gggrace14 gggrace14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf , thank you for revising! Left a bit more comments. Also I see this PR contains 7 commits. Would you be able to fold them into one? Not sure how they will end up in the master branch after merge.

velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Jan 10, 2024

@gggrace14 Resolved all your comments. Can you help to review again? Thanks.

@gggrace14 gggrace14 self-requested a review January 10, 2024 20:59
Copy link
Contributor

@gggrace14 gggrace14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great! Thank you for making the change, @JkSelf !

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@kewang1024 kewang1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JkSelf for fixing the problem!

Overall PR lgtm, proposed some nits

velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
velox/exec/tests/TableWriteTest.cpp Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Jan 11, 2024

@kewang1024 Resolved your comments. Can you help to review again? Thanks.

GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 12, 2024
Copy link
Contributor

@kewang1024 kewang1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JkSelf, accepted with one NIT:
the functions can be static since they don't access any member variable (getNonPartitionTypes / getNonPartitionChannels / getNonPartitionsColumns / makeDataInput)

@JkSelf
Copy link
Contributor Author

JkSelf commented Jan 12, 2024

@kewang1024 @gggrace14 Updated. Can you help to review again? Thanks.

GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 13, 2024
GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 14, 2024
GlutenPerfBot pushed a commit to GlutenPerfBot/velox that referenced this pull request Jan 15, 2024
rui-mo pushed a commit to oap-project/velox that referenced this pull request Jan 15, 2024
@gggrace14
Copy link
Contributor

The revision looks good to me! Thank you, @JkSelf , for making the change! I'm merging the PR

@facebook-github-bot
Copy link
Contributor

@gggrace14 merged this pull request in 8f0dd8b.

Copy link

Conbench analyzed the 1 benchmark run on commit 8f0dd8b7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

liujiayi771 pushed a commit to liujiayi771/velox that referenced this pull request Jan 16, 2024
…ubator#8089)

Summary:
Spark distinguishes between [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L583
) and [partition columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L584), and it only writes the [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L298) into the file. However, Velox writes both the data columns and partition columns into the file. This PR removes the partition columns from the row vector before writing file.

Pull Request resolved: facebookincubator#8089

Reviewed By: kewang1024

Differential Revision: D52670985

Pulled By: gggrace14

fbshipit-source-id: 11e5ef4cd99903adaf191b650318f195ef1be62d
gggrace14 added a commit to gggrace14/velox that referenced this pull request Jan 17, 2024
This is from cherry-picking of
facebookincubator#8089, and is delta
of the PR on top of main branch. When we merged the PR, we failed
to merge the latest version. Everywhere is consistent though.

This change renames the func getDataType and getDataChannels
in HiveDataSink, and makes the static.
mbasmanova pushed a commit to mbasmanova/velox-1 that referenced this pull request Jan 17, 2024
…ubator#8089)

Summary:
Spark distinguishes between [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L583
) and [partition columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L584), and it only writes the [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L298) into the file. However, Velox writes both the data columns and partition columns into the file. This PR removes the partition columns from the row vector before writing file.

Pull Request resolved: facebookincubator#8089

Reviewed By: kewang1024

Differential Revision: D52670985

Pulled By: gggrace14

fbshipit-source-id: 11e5ef4cd99903adaf191b650318f195ef1be62d
facebook-github-bot pushed a commit that referenced this pull request Jan 17, 2024
Summary:
This is from cherry-picking of
#8089, and is delta
of the PR on top of main branch. When we merged the PR, we failed
to merge the latest version. Everywhere is consistent with version though.

This change renames the func getDataType and getDataChannels
in HiveDataSink, and makes the static.

Pull Request resolved: #8404

Reviewed By: xiaoxmeng, kewang1024

Differential Revision: D52822889

Pulled By: gggrace14

fbshipit-source-id: d2fb6fd8cb87fd77d9897555f89e35ec197e3c7f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants