
pass down the scan table schema type to parquet column reader #6404

Conversation

gaoyangxiaozhu
Contributor

@gaoyangxiaozhu gaoyangxiaozhu commented Sep 4, 2023

pass down the scan table schema to parquet column reader

Details:

Currently, the requestedType that is available in ParquetColumnReader.cpp is set based on the schema present in the Parquet file (the file data type) instead of the scan table schema.

The issue occurs when the expected output of the TableScan differs from the schema of the Parquet file. Spark's data format for some types differs from Parquet's format. Similar to schema evolution, when the types differ, Spark performs an implicit conversion. The conversions that Spark performs can be seen in ParquetVectorUpdaterFactory.java.

This PR fixes the issue by passing the scan table schema's data type as requestedType to the Parquet column readers.

It is a follow-up to PR #5786 and addresses the issue following the comments from @Yuhta.

Please see #5786 for detailed context.

This update is one of the modifications necessary for issue #5770.
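
As a rough illustration of the intended behavior, here is a minimal, self-contained sketch with hypothetical types and class names (TypeNode, ColumnReaderSketch), not Velox's actual reader API: the column reader keeps both the file schema node and the requested table schema node, and the requested type decides the output type, for example widening an INT32 column in the file to BIGINT when the table schema asks for BIGINT.

#include <cstdint>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for schema nodes; Velox uses TypeWithId instead.
enum class Kind { INT32, BIGINT };
struct TypeNode { Kind kind; };

// Sketch of a column reader that keeps both types: fileType drives decoding,
// requestedType (from the scan table schema) drives the output type.
class ColumnReaderSketch {
 public:
  ColumnReaderSketch(
      std::shared_ptr<const TypeNode> requestedType,
      std::shared_ptr<const TypeNode> fileType)
      : requestedType_(std::move(requestedType)),
        fileType_(std::move(fileType)) {}

  // Reads INT32 file values and widens them to int64_t when the table schema
  // requests BIGINT, mimicking the implicit conversion Spark performs.
  std::vector<int64_t> read(const std::vector<int32_t>& fileValues) const {
    std::vector<int64_t> out;
    out.reserve(fileValues.size());
    for (int32_t v : fileValues) {
      if (fileType_->kind == Kind::INT32 &&
          requestedType_->kind == Kind::BIGINT) {
        out.push_back(static_cast<int64_t>(v)); // widen per the requested type
      } else {
        out.push_back(v); // types already match; no conversion needed
      }
    }
    return out;
  }

 private:
  std::shared_ptr<const TypeNode> requestedType_; // from the scan table schema
  std::shared_ptr<const TypeNode> fileType_;      // from the Parquet file schema
};

int main() {
  auto tableType = std::make_shared<TypeNode>(TypeNode{Kind::BIGINT});
  auto fileType = std::make_shared<TypeNode>(TypeNode{Kind::INT32});
  ColumnReaderSketch reader(tableType, fileType);
  for (int64_t v : reader.read({1, 2, 3})) {
    std::cout << v << "\n"; // values produced as 64-bit per the table schema
  }
  return 0;
}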

@netlify

netlify bot commented Sep 4, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a0cbec0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/64f591385e351a000841d762

@facebook-github-bot
Contributor

Hi @gaoyangxiaozhu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Sep 4, 2023
@gaoyangxiaozhu
Contributor Author

gaoyangxiaozhu commented Sep 4, 2023

hey @Yuhta could you help review the PR? It addresses the requestedType issue of ParquetColumnReader, following your comments in PR #5786.

@gaoyangxiaozhu gaoyangxiaozhu changed the title from "pass down the scan table schema to parquet column reader" to "pass down the scan table schema type to parquet column reader" Sep 4, 2023
@gaoyangxiaozhu
Contributor Author

hey @Yuhta could you help review the PR? It addresses the requestedType issue of ParquetColumnReader, following your comments in PR #5786.

can anyone help review, @Yuhta?

params,
scanSpec) {
DWIO_ENSURE_EQ(fileType_->id(), dataType->id(), "working on the same node");
Collaborator

I suggest renaming fileType_ and the dataType parameter to either "dataFileType" or "fileType" so that the code is easier to understand. As it stands, fileType_ comes from the dataType parameter but they are named differently, making the code less straightforward than it should be.

Contributor

We can rename it to fileType_ in the future. For now dataType and fileType are used interchangeably, and requestedType is distinct, so the situation is not too bad.

@@ -108,21 +108,23 @@ void ensureRepDefs(
}

MapColumnReader::MapColumnReader(
std::shared_ptr<const dwio::common::TypeWithId> requestedType,
const std::shared_ptr<const dwio::common::TypeWithId>& requestedType,
const std::shared_ptr<const dwio::common::TypeWithId>& dataType,
ParquetParams& params,
common::ScanSpec& scanSpec)
: dwio::common::SelectiveMapColumnReader(
Collaborator

This is not in this PR, but could you please check SelectiveMapColumnReader's constructor?

SelectiveMapColumnReader::SelectiveMapColumnReader(
    const std::shared_ptr<const dwio::common::TypeWithId>& requestedType,
    const std::shared_ptr<const dwio::common::TypeWithId>& dataType,
    FormatParams& params,
    velox::common::ScanSpec& scanSpec)
    : SelectiveRepeatedColumnReader(
          dataType->type(),
          params,
          scanSpec,
          dataType),
      requestedType_{requestedType} {}

I see 2 issues (see the sketch after this list):

  1. the types were not passed in correctly to SelectiveRepeatedColumnReader;
  2. SelectiveMapColumnReader defines its own requestedType_, which may shadow the one in SelectiveColumnReader. Is this necessary?
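
For readers less familiar with the shadowing concern in point 2, here is a minimal self-contained C++ sketch (hypothetical BaseReader/MapReader classes, not the actual Velox readers) showing how re-declaring a member in a derived class hides the base-class member, so base-class code never sees the value the derived class stored:

#include <iostream>
#include <memory>
#include <string>
#include <utility>

struct BaseReader {
  // The member that generic base-class code reads.
  std::shared_ptr<const std::string> requestedType_;

  void printRequested() const {
    std::cout << (requestedType_ ? *requestedType_ : std::string("<null>"))
              << "\n";
  }
};

struct MapReader : BaseReader {
  // Re-declaring the member shadows BaseReader::requestedType_;
  // BaseReader::printRequested() still reads the (empty) base member.
  std::shared_ptr<const std::string> requestedType_;

  explicit MapReader(std::shared_ptr<const std::string> t)
      : requestedType_(std::move(t)) {}
};

int main() {
  MapReader reader(std::make_shared<const std::string>("MAP<INT, BIGINT>"));
  reader.printRequested(); // prints "<null>": the base member was never set
  return 0;
}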

Contributor

We can do this in a different PR. Let's keep this one for Parquet only.

Contributor

@Yuhta Yuhta Sep 15, 2023

@kevinwilfong Do you remember any blockers if we try to fix these 2 types in map column reader?

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@gaoyangxiaozhu
Contributor Author

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hey @Yuhta thank you, I saw you imported the PR. Is there anything else I can do for this PR, or will you continue to merge the change from your private branch?

@facebook-github-bot
Contributor

@Yuhta merged this pull request in b803a26.

codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…okincubator#6404)

Summary:
pass down the scan table schema to parquet column reader

Details:

currently, the requestedType, which is available in [ParquetColumnReader.cpp](https://github.com/facebookincubator/velox/blob/517e3e3a0c8308c96ca068444dfeee37204f7773/velox/dwio/parquet/reader/ParquetColumnReader.cpp#L37C60-L37C68), is set based on the schema present in the Parquet file (the file data type) instead of the scan table schema.

The issue occurs when the expected output of the TableScan differs from the schema of the Parquet file. Spark's data format for some types differs from Parquet's format. Similar to schema evolution, when the types differ, Spark performs an implicit conversion. The conversions that Spark performs can be seen in [ParquetVectorUpdaterFactory.java](https://github.com/apache/spark/blob/6ca45c52b7416e7b3520dc902cb24f060c7c72dd/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L67C3-L185C6).

This PR fixes the issue by passing the scan table schema's data type as requestedType to the Parquet column readers.

It is a follow-up to PR facebookincubator#5786 and addresses the issue following the comments from Yuhta.

Please see facebookincubator#5786 for detailed context.

This update is one of the modifications necessary for issue facebookincubator#5770.

Pull Request resolved: facebookincubator#6404

Reviewed By: pedroerp

Differential Revision: D49330580

Pulled By: Yuhta

fbshipit-source-id: bd56bda6efd708691ee35b5b66d5ba9536df525f
codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…okincubator#6404)
codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…okincubator#6404)
ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
…okincubator#6404)
@yingsu00
Collaborator

yingsu00 commented Dec 8, 2023

@gaoyangxiaozhu Hi, I just noticed this PR doesn't have a test. Will you be able to add a test for this? Or share with us a simple file that reproduces the issue?
