Remove partition columns before writing partitioned file #8089
Conversation
@mbasmanova Can you help review? Thanks.
CC: @gggrace14
I wonder if this is a bug in the query planner. Ge, would you help take a look?
Looking at it.
Hi @mbasmanova Masha, I checked the Presto code, and this does look like a bug in the original code. That is, Presto today does not include partition columns when writing a file under a partition directory, so I think we should make this change (and not only for Parquet files). Since this is on a critical path, I will also run Presto Verifier before or after this change is merged. It would be good if you could take a look as well to help confirm. https://github.com/prestodb/presto/blob/9104d718332913be6dc1f461917cd1b8594b04ec/presto-hive/src/main/java/com/facebook/presto/hive/HivePageSink.java#L390
@gggrace14 Ge, thank you for looking and commenting. Sounds like this is a generic bug, not specific to Parquet. Indeed, partition keys should not be written into files for any format, not only Parquet. @JkSelf This is a nice finding. Let's update the PR description to explain that this is a generic bug. You may mention that it was discovered by running Spark tests on Gluten. I'll take a look at the code. Thanks.
@JkSelf Thank you for identifying this issue and working on a fix. Some comments below.
velox/exec/tests/TableWriteTest.cpp (Outdated)
@@ -1985,6 +1988,49 @@ TEST_P(PartitionedWithoutBucketTableWriterTest, fromSinglePartitionToMultiple) {
      "SELECT * FROM tmp");
}

TEST_P(PartitionedTableWriterTest, removePartitionColumns) {
Do we need a new test, or could we modify existing tests to verify the list of columns written into the file? I assume there are existing tests that write partitioned tables. Those tests would start failing once we add verification of the list of columns in the file, no?
@mbasmanova Removed.
@mbasmanova @gggrace14 Thanks for your review. I have addressed all your comments. Could you take another look? Thanks.
@JkSelf The structure looks a lot better now and we're very close. Left some comments.
@gggrace14 Thanks for your review. Could you take another look? Thanks.
@mbasmanova Could you take another look? Thanks.
@gggrace14 Ge will help review and merge this PR.
@JkSelf, thank you for revising! Left a few more comments. Also, I see this PR contains 7 commits. Would you be able to squash them into one? I'm not sure how they will end up in the master branch after the merge.
@gggrace14 Resolved all your comments. Could you take another look? Thanks.
It looks great! Thank you for making the change, @JkSelf!
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks @JkSelf for fixing the problem!
Overall the PR LGTM; proposed some nits.
@kewang1024 Resolved your comments. Could you take another look? Thanks.
Thank you @JkSelf, accepted with one nit: the functions can be static since they don't access any member variables (getNonPartitionTypes / getNonPartitionChannels / getNonPartitionsColumns / makeDataInput).
@kewang1024 @gggrace14 Updated. Could you take another look? Thanks.
The revision looks good to me! Thank you, @JkSelf, for making the change! I'm merging the PR.
@gggrace14 merged this pull request in 8f0dd8b.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
…ubator#8089) Summary: Spark distinguishes between [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L583) and [partition columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L584), and it only writes the [data columns](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L298) into the file. However, Velox writes both the data columns and the partition columns into the file. This PR removes the partition columns from the row vector before writing the file. Pull Request resolved: facebookincubator#8089 Reviewed By: kewang1024 Differential Revision: D52670985 Pulled By: gggrace14 fbshipit-source-id: 11e5ef4cd99903adaf191b650318f195ef1be62d
Summary: This is from cherry-picking #8089, and is the delta of that PR on top of the main branch. When we merged the PR, we failed to merge the latest version; everything is consistent with that version, though. This change renames the functions getDataType and getDataChannels in HiveDataSink, and makes them static. Pull Request resolved: #8404 Reviewed By: xiaoxmeng, kewang1024 Differential Revision: D52822889 Pulled By: gggrace14 fbshipit-source-id: d2fb6fd8cb87fd77d9897555f89e35ec197e3c7f
Spark distinguishes between data columns and partition columns, and it only writes the data columns into the file. However, Velox writes both the data columns and the partition columns into the file. This PR removes the partition columns from the row vector before writing the file.