Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema to parquet writer options #6074

Closed
wants to merge 10 commits into from

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Aug 10, 2023

Here is the schema information for the TPCH supplier table.

root
 |-- s_suppkey: long (nullable = true)  
 |-- s_name: string (nullable = true) 
 |-- s_address: string (nullable = true)  
 |-- s_nationkey: long (nullable = true)  
 |-- s_phone: string (nullable = true)  
 |-- s_acctbal: double (nullable = true)  
 |-- s_comment: string (nullable = true)  

When we write the supplier table into a parquet file using velox parquet writer in Gluten, we encounter the incorrect schema issue.

root
 |-- n0_0: long (nullable = true)
 |-- n0_1: string (nullable = true)
 |-- n0_2: string (nullable = true)
 |-- n0_3: long (nullable = true)
 |-- n0_4: string (nullable = true)
 |-- n0_5: double (nullable = true)
 |-- n0_6: string (nullable = true)

The reason for this is that the schema is inferred from the record batch. To address this, this PR introduces a schema into the WriterOption. This allows us to pass the appropriate schema to the Parquet writer.

@netlify
Copy link

netlify bot commented Aug 10, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5cec6cf
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6580e83cdb740a0008820fe1

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 10, 2023
@JkSelf JkSelf force-pushed the parquet-writer-schema branch 3 times, most recently from 6928163 to 5cc5e09 Compare August 10, 2023 08:54
Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriterOptions has the Velox schema. We should be able to convert that to Arrow schema in the Parquet Writer.

@JkSelf
Copy link
Contributor Author

JkSelf commented Aug 10, 2023

WriterOptions has the Velox schema. We should be able to convert that to Arrow schema in the Parquet Writer.

@majetideepak It appears that the WriterOptions in the DWRF writer contains the Velox schema, while the Parquet writer does not.

@majetideepak
Copy link
Collaborator

majetideepak commented Aug 11, 2023

@JkSelf Please add a Velox schema to Parquet writer Options.

@JkSelf JkSelf changed the title [GLUTEN] Make the file schema of written by parquet writer can be configurable Make the file schema of written by parquet writer can be configurable Aug 24, 2023
@JkSelf
Copy link
Contributor Author

JkSelf commented Aug 28, 2023

@JkSelf Please add a Velox schema to Parquet writer Options.

@majetideepak Updated. Can you help to review again?

@JkSelf JkSelf changed the title Make the file schema of written by parquet writer can be configurable Add velox schema to parquet writer options Aug 29, 2023
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/tests/reader/E2EFilterTest.cpp Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Aug 30, 2023

@majetideepak Thanks for your review. I have fixed your comments. And can you help to review again? Thanks.

@JkSelf
Copy link
Contributor Author

JkSelf commented Aug 30, 2023

@majetideepak Can you help to review again?

@JkSelf
Copy link
Contributor Author

JkSelf commented Sep 5, 2023

@majetideepak Please help to review again. Thanks.

@JkSelf
Copy link
Contributor Author

JkSelf commented Sep 6, 2023

@majetideepak @mbasmanova Please help to review again. Thanks.

velox/dwio/parquet/writer/Writer.h Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf Sorry, I missed your comments. I think the PR description should clarify that this is an improvement from the current approach of inferring the schema from the input data.

velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Sep 12, 2023

@majetideepak I have fixed your comments. Can you help to review again?

@JkSelf
Copy link
Contributor Author

JkSelf commented Dec 15, 2023

Seeing test failures:

Note: Google Test filter = E2EFilterTest.shortDecimalDirect
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from E2EFilterTest
[ RUN      ] E2EFilterTest.shortDecimalDirect

=================================================================
==10221==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f02d037f808 at pc 0x7f02da58211a bp 0x7ffcf1082290 sp 0x7ffcf1082288
READ of size 8 at 0x7f02d037f808 thread T0
SCARINESS: 33 (8-byte-read-heap-buffer-overflow-far-from-bounds)
    #0 0x7f02da582119 in facebook::velox::parquet::arrow::FixedLenByteArray facebook::velox::parquet::arrow::SerializeFunctor<facebook::velox::parquet::arrow::PhysicalType<(facebook::velox::parquet::arrow::Type::type)7>, arrow::Decimal128Type, void>::FixDecimalEndianess<16>(unsigned char const*, long) fbcode/velox/dwio/parquet/writer/arrow/ColumnWriter.cpp:2848
    #1 0x7f02da581759 in facebook::velox::parquet::arrow::SerializeFunctor<facebook::velox::parquet::arrow::PhysicalType<(facebook::velox::parquet::arrow::Type::type)7>, arrow::Decimal128Type, void>::Serialize(arrow::Decimal128Array const&, facebook::velox::parquet::arrow::ArrowWriteContext*, facebook::velox::parquet::arrow::FixedLenByteArray*) fbcode/velox/dwio/parquet/writer/arrow/ColumnWriter.cpp:2806

In this pull request, we only modify the schema. I'm not sure why this issue is occurring. Could it be a bug in Arrow? @majetideepak @mbasmanova Do you have any input? Thanks for your help.

@mbasmanova @majetideepak The issue seems to be caused by not using dictionary encoding in the vector in Writer::Writer(vector) API to get the arrow schema if creating a empty vector. Currently, I have placed the schema update inside the Writer::Writer(vector) API so that we can retrieve the Arrow schema based on the vector. Then, update the name in the Arrow schema based on the name in the schema_. And also added complex type unit tests. Please help to review again. Thanks.

velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf Thank you for iterating. Looks good % couple of nits.

velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Dec 19, 2023

@mbasmanova Resolved all your comments. Can you help to review again? Thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 2ada2ec.

yingsu00 added a commit to yingsu00/velox that referenced this pull request Jan 5, 2024
ParquetReaderBenchmark was broken due to facebookincubator#6074. This commit fixes the
issue.
yingsu00 added a commit to yingsu00/velox that referenced this pull request Jan 5, 2024
ParquetReaderBenchmark was broken due to facebookincubator#6074. This commit fixes the
issue.
facebook-github-bot pushed a commit that referenced this pull request Jan 12, 2024
Summary:
ParquetReaderBenchmark was broken due to #6074. This commit fixes the issue.

Pull Request resolved: #8274

Reviewed By: xiaoxmeng

Differential Revision: D52703288

Pulled By: kewang1024

fbshipit-source-id: fc8aede66ff2d2dda7b8d2777b5e18316296aaf6
liujiayi771 pushed a commit to liujiayi771/velox that referenced this pull request Jan 16, 2024
Summary:
ParquetReaderBenchmark was broken due to facebookincubator#6074. This commit fixes the issue.

Pull Request resolved: facebookincubator#8274

Reviewed By: xiaoxmeng

Differential Revision: D52703288

Pulled By: kewang1024

fbshipit-source-id: fc8aede66ff2d2dda7b8d2777b5e18316296aaf6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants