pass down the scan table schema type to parquet column reader #6404
Conversation
Hi @gaoyangxiaozhu! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
    params,
    scanSpec) {
  DWIO_ENSURE_EQ(fileType_->id(), dataType->id(), "working on the same node");
I suggest renaming fileType_ and the dataType parameter to a single name, either "dataFileType" or "fileType", so that the code is easier to understand. Currently fileType_ is initialized from the dataType parameter, but the two are named differently, which makes the code less straightforward than it should be.
We can rename it to fileType_ in the future. For now, dataType and fileType are used interchangeably, and requestedType is different, so the situation is not too bad.
@@ -108,21 +108,23 @@ void ensureRepDefs(
 }

 MapColumnReader::MapColumnReader(
-    std::shared_ptr<const dwio::common::TypeWithId> requestedType,
+    const std::shared_ptr<const dwio::common::TypeWithId>& requestedType,
+    const std::shared_ptr<const dwio::common::TypeWithId>& dataType,
     ParquetParams& params,
     common::ScanSpec& scanSpec)
     : dwio::common::SelectiveMapColumnReader(
This is not in this PR, but could you please check SelectiveMapColumnReader's constructor?
SelectiveMapColumnReader::SelectiveMapColumnReader(
const std::shared_ptr<const dwio::common::TypeWithId>& requestedType,
const std::shared_ptr<const dwio::common::TypeWithId>& dataType,
FormatParams& params,
velox::common::ScanSpec& scanSpec)
: SelectiveRepeatedColumnReader(
dataType->type(),
params,
scanSpec,
dataType),
requestedType_{requestedType} {}
I see 2 issues:
- the types were not passed in correctly to SelectiveRepeatedColumnReader;
- SelectiveMapColumnReader defines its own requestedType_, which may shadow the one in SelectiveColumnReader. Is this necessary?
We can do this in a different PR. Let's keep this one for parquet only.
@kevinwilfong Do you remember any blockers if we try to fix these 2 types in map column reader?
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
hey @Yuhta thank you, I saw you imported the PR. Is there anything else I can do for this PR, or will you continue merging the change from your private branch?
…okincubator#6404)

Summary: pass down the scan table schema to the parquet column reader.

Details: currently, the requestedType, which is available in [ParquetColumnReader.cpp](https://github.com/facebookincubator/velox/blob/517e3e3a0c8308c96ca068444dfeee37204f7773/velox/dwio/parquet/reader/ParquetColumnReader.cpp#L37C60-L37C68), is set based on the schema present in the parquet file (the file data type) instead of the scan table schema. The issue occurs when the expected output of the TableScan differs from the schema of the parquet file. Spark's data format for some types differs from Parquet's format. Similar to schema evolution, when the types differ, Spark performs an implicit conversion. The conversions that Spark performs can be seen in [ParquetVectorUpdaterFactory.java](https://github.com/apache/spark/blob/6ca45c52b7416e7b3520dc902cb24f060c7c72dd/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L67C3-L185C6).

This PR fixes the issue by setting requestedType to the scan table schema type in the parquet column reader. It is a follow-up to facebookincubator#5786, addressing the issue per Yuhta's comments; see facebookincubator#5786 for detailed context. This update is one of the modifications necessary for issue facebookincubator#5770.

Pull Request resolved: facebookincubator#6404
Reviewed By: pedroerp
Differential Revision: D49330580
Pulled By: Yuhta
fbshipit-source-id: bd56bda6efd708691ee35b5b66d5ba9536df525f
@gaoyangxiaozhu Hi, I just noticed this PR doesn't have a test. Will you be able to add a test for this? Or share with us a simple file that reproduces the issue?
pass down the scan table schema to parquet column reader

Details:
Currently, the requestedType, which is available in ParquetColumnReader.cpp, is set based on the schema present in the parquet file (the file data type) instead of the scan table schema.

The issue occurs when the expected output of the TableScan differs from the schema of the parquet file. Spark's data format for some types differs from Parquet's format. Similar to schema evolution, when the types differ, Spark performs an implicit conversion. The conversions that Spark performs can be seen in ParquetVectorUpdaterFactory.java.

This PR fixes the issue by setting requestedType to the scan table schema type in the parquet column reader.

It is a follow-up to #5786, addressing the issue per @Yuhta's comments. See #5786 for detailed context.

This update is one of the modifications necessary for issue #5770.