
pass down the scan table schema type to parquet column reader #6404

Conversation

gaoyangxiaozhu
Contributor

@gaoyangxiaozhu gaoyangxiaozhu commented Sep 4, 2023

pass down the scan table schema to parquet column reader

Details:

Currently, the requestedType that is available in ParquetColumnReader.cpp is set based on the schema present in the Parquet file (the file data type) instead of the scan table schema.

The issue occurs when the expected output of the TableScan differs from the schema of the Parquet file. Spark's data format for some types differs from Parquet's format. Similar to schema evolution, when the types differ, Spark performs an implicit conversion. The conversions that Spark performs can be seen in ParquetVectorUpdaterFactory.java.

This PR fixes the issue by passing the scan table schema's data type as requestedType to the Parquet column readers.

It is a follow-up to PR #5786 and addresses the issue following the comments from @Yuhta.

Please see #5786 for detailed context.

This update is one of the modifications necessary for issue #5770.
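
As a rough illustration of the intended behavior, here is a minimal, self-contained sketch with hypothetical types and class names (TypeNode, ColumnReaderSketch), not Velox's actual reader API: the column reader keeps both the file schema node and the requested table schema node, and the requested type decides the output type, for example widening an INT32 column in the file to BIGINT when the table schema asks for BIGINT.

#include <cstdint>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for schema nodes; Velox uses TypeWithId instead.
enum class Kind { INT32, BIGINT };
struct TypeNode { Kind kind; };

// Sketch of a column reader that keeps both types: fileType drives decoding,
// requestedType (from the scan table schema) drives the output type.
class ColumnReaderSketch {
 public:
  ColumnReaderSketch(
      std::shared_ptr<const TypeNode> requestedType,
      std::shared_ptr<const TypeNode> fileType)
      : requestedType_(std::move(requestedType)),
        fileType_(std::move(fileType)) {}

  // Reads INT32 file values and widens them to int64_t when the table schema
  // requests BIGINT, mimicking the implicit conversion Spark performs.
  std::vector<int64_t> read(const std::vector<int32_t>& fileValues) const {
    std::vector<int64_t> out;
    out.reserve(fileValues.size());
    for (int32_t v : fileValues) {
      if (fileType_->kind == Kind::INT32 &&
          requestedType_->kind == Kind::BIGINT) {
        out.push_back(static_cast<int64_t>(v)); // widen per the requested type
      } else {
        out.push_back(v); // types already match; no conversion needed
      }
    }
    return out;
  }

 private:
  std::shared_ptr<const TypeNode> requestedType_; // from the scan table schema
  std::shared_ptr<const TypeNode> fileType_;      // from the Parquet file schema
};

int main() {
  auto tableType = std::make_shared<TypeNode>(TypeNode{Kind::BIGINT});
  auto fileType = std::make_shared<TypeNode>(TypeNode{Kind::INT32});
  ColumnReaderSketch reader(tableType, fileType);
  for (int64_t v : reader.read({1, 2, 3})) {
    std::cout << v << "\n"; // values produced as 64-bit per the table schema
  }
  return 0;
}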

@netlify

netlify bot commented Sep 4, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a0cbec0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/64f591385e351a000841d762

@facebook-github-bot
Contributor

Hi @gaoyangxiaozhu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Sep 4, 2023
@gaoyangxiaozhu
Contributor Author

gaoyangxiaozhu commented Sep 4, 2023

hey @Yuhta could you help review the PR? It addresses the requestedType issue of ParquetColumnReader, following your comments in PR #5786.

@gaoyangxiaozhu gaoyangxiaozhu changed the title from "pass down the scan table schema to parquet column reader" to "pass down the scan table schema type to parquet column reader" Sep 4, 2023
@gaoyangxiaozhu
Contributor Author

hey @Yuhta could you help review the PR? It addresses the requestedType issue of ParquetColumnReader, following your comments in PR #5786.

can anyone help review, @Yuhta?

params,
scanSpec) {
DWIO_ENSURE_EQ(fileType_->id(), dataType->id(), "working on the same node");
Collaborator

I suggest renaming fileType_ and the dataType parameter to either "dataFileType" or "fileType" so that the code is easier to understand. As it stands, fileType_ comes from the dataType parameter but they are named differently, making the code less straightforward than it should be.

Contributor

We can rename it to fileType_ in the future. For now dataType and fileType are used interchangeably, and requestedType is distinct, so the situation is not too bad.

@@ -108,21 +108,23 @@ void ensureRepDefs(
}

MapColumnReader::MapColumnReader(
std::shared_ptr<const dwio::common::TypeWithId> requestedType,
const std::shared_ptr<const dwio::common::TypeWithId>& requestedType,
const std::shared_ptr<const dwio::common::TypeWithId>& dataType,
ParquetParams& params,
common::ScanSpec& scanSpec)
: dwio::common::SelectiveMapColumnReader(
Collaborator

This is not in this PR, but could you please check SelectiveMapColumnReader's constructor?

SelectiveMapColumnReader::SelectiveMapColumnReader(
    const std::shared_ptr<const dwio::common::TypeWithId>& requestedType,
    const std::shared_ptr<const dwio::common::TypeWithId>& dataType,
    FormatParams& params,
    velox::common::ScanSpec& scanSpec)
    : SelectiveRepeatedColumnReader(
          dataType->type(),
          params,
          scanSpec,
          dataType),
      requestedType_{requestedType} {}

I see 2 issues (see the sketch after this list):

  1. the types were not passed in correctly to SelectiveRepeatedColumnReader;
  2. SelectiveMapColumnReader defines its own requestedType_, which may shadow the one in SelectiveColumnReader. Is this necessary?
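
For readers less familiar with the shadowing concern in point 2, here is a minimal self-contained C++ sketch (hypothetical BaseReader/MapReader classes, not the actual Velox readers) showing how re-declaring a member in a derived class hides the base-class member, so base-class code never sees the value the derived class stored:

#include <iostream>
#include <memory>
#include <string>
#include <utility>

struct BaseReader {
  // The member that generic base-class code reads.
  std::shared_ptr<const std::string> requestedType_;

  void printRequested() const {
    std::cout << (requestedType_ ? *requestedType_ : std::string("<null>"))
              << "\n";
  }
};

struct MapReader : BaseReader {
  // Re-declaring the member shadows BaseReader::requestedType_;
  // BaseReader::printRequested() still reads the (empty) base member.
  std::shared_ptr<const std::string> requestedType_;

  explicit MapReader(std::shared_ptr<const std::string> t)
      : requestedType_(std::move(t)) {}
};

int main() {
  MapReader reader(std::make_shared<const std::string>("MAP<INT, BIGINT>"));
  reader.printRequested(); // prints "<null>": the base member was never set
  return 0;
}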

Contributor

We can do this in a different PR. Let's keep this one for Parquet only.

Contributor

@Yuhta Yuhta Sep 15, 2023

@kevinwilfong Do you remember any blockers if we try to fix these 2 types in map column reader?

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@gaoyangxiaozhu
Contributor Author

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hey @Yuhta thank you, I saw you imported the PR. Is there anything else I can do for this PR, or will you continue to merge the change from your private branch?

@facebook-github-bot
Contributor

@Yuhta merged this pull request in b803a26.

codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…okincubator#6404)

Summary:
pass down the scan table schema to parquet column reader

Details:

currently, the requestedType, which is available in [ParquetColumnReader.cpp](https://github.com/facebookincubator/velox/blob/517e3e3a0c8308c96ca068444dfeee37204f7773/velox/dwio/parquet/reader/ParquetColumnReader.cpp#L37C60-L37C68), is set based on the schema present in the Parquet file (the file data type) instead of the scan table schema.

The issue occurs when the expected output of the TableScan differs from the schema of the Parquet file. Spark's data format for some types differs from Parquet's format. Similar to schema evolution, when the types differ, Spark performs an implicit conversion. The conversions that Spark performs can be seen in [ParquetVectorUpdaterFactory.java](https://github.com/apache/spark/blob/6ca45c52b7416e7b3520dc902cb24f060c7c72dd/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L67C3-L185C6).

This PR fixes the issue by passing the scan table schema's data type as requestedType to the Parquet column readers.

It is a follow-up to PR facebookincubator#5786 and addresses the issue following the comments from Yuhta.

Please see facebookincubator#5786 for detailed context.

This update is one of the modifications necessary for issue facebookincubator#5770.

Pull Request resolved: facebookincubator#6404

Reviewed By: pedroerp

Differential Revision: D49330580

Pulled By: Yuhta

fbshipit-source-id: bd56bda6efd708691ee35b5b66d5ba9536df525f
codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…okincubator#6404)
codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…okincubator#6404)
ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
…okincubator#6404)
@yingsu00
Collaborator

yingsu00 commented Dec 8, 2023

@gaoyangxiaozhu Hi, I just noticed this PR doesn't have a test. Will you be able to add a test for this? Or share with us a simple file that reproduces the issue?
