Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support date-type partition column #7084

Closed
wants to merge 1 commit into from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Oct 17, 2023

When the partition column is of date type, below error occurs when trying to
convert date string as int. This PR fixes this exception by using
castFromDateString with isIso8601 set as true. With this fix, the date
partition value formatted as string should be in the [+-](YYYY-MM-DD) format.
The callers like Gluten need to generate compatible format of strings to be
partition values.

Exceptions.h:69] Line: ../../velox/exec/Driver.cpp:545, Function:runInternal, Expression:  Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27", Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27"

@netlify
Copy link

netlify bot commented Oct 17, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a74b5b5
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65bcf49c4b4406000873642b

@rui-mo rui-mo marked this pull request as draft October 17, 2023 02:24
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 17, 2023
@rui-mo rui-mo marked this pull request as ready for review October 27, 2023 06:27
@rui-mo
Copy link
Collaborator Author

rui-mo commented Oct 31, 2023

@mbasmanova Could you spare some time to review this PR? Thanks.

@mbasmanova mbasmanova requested a review from Yuhta October 31, 2023 11:01
@@ -307,6 +307,14 @@ class ConnectorQueryCtx {
return planNodeId_;
}

void setCastStringToDateIsIso8601(bool flag) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right. Shouldn't it be part of the config?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transfer chain of this value is as queryConfig -> createDataSource -> ReaderOptions -> HiveDataSource -> splitReader_->prepareSplit -> adaptColumns -> setPartitionValue.
I want to get it from queryConfig and set it to connectorQueryCtx->config() before createDataSource, but found Config does not allow adding properties at runtime.
https://github.com/facebookincubator/velox/pull/7084/files#diff-b04232f155cc15faca93b69cf34dd80e29b534a4e54dd774bf8bdbedbfc66d6bR127-R128

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't feel right. HiveConnector should define its own configuration properties and use these. We should not assume that castStringToDateIsIso8601_ config applies to all connectors.

I'm still trying to understand why do you need this. I don't see tests for both cases when castStringToDateIsIso8601_ is true and when it is false. Am I missing something?

Should we add a configuration property to velox/connectors/hive/HiveConfig.h?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config was introduced by 8d6c296, so I assume it can be reused to control the cast behavior in setPartitionValue. Tests for castStringToDateIsIso8601_ being true and false are already covered by that commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo At a high level, engine configs are separate from connector config. We should not leak engine-specific logic into connectors. I wonder why do we pass partition values in HiveConnectorSplit as string. Should we pass these as values of proper types (e.g. ConstantVectors)? Will this remove the need to converting from strings to proper types?

I also see that applyPartitionFilter may need to be updated to handle DATE type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option might be for the connector to access cast logic in the engine via ExpressionEvaluator, i.e. construct and evaluate a CAST expression.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Masha, thank you for the suggestion. For option 1, Velox now uses a map as below to store partition keys in HiveConnectorSplit.h.

const std::unordered_map<std::string, std::optionalstd::string> partitionKeys

Seems const std::unordered_map<std::string, core::ConstantTypedExpr> partitionKeys can help remove the cast process. But I'm not sure if there's any special consideration in Velox to use string type. In Gluten, we need to convert literals to string type to create partition keys, and in Velox strings are converted to variants. If ConstantTypedExpr is used, the intermediate string could be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::unordered_map<std::string, core::ConstantTypedExpr> partitionKeys

This could work. Would you like to prototype this change? CC: @Yuhta

Also, wondering if any changes are needed to support writing to partitioned tables with DATE partition keys.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions.

@Yuhta Jimmy, would you help review this PR?

CC: @kevinwilfong

@Yuhta
Copy link
Contributor

Yuhta commented Nov 2, 2023

Is partition just a directory name in Hive? If yes it is always a string right? Or it can be stored as some typed metadata?

If the original source of the partition value is a string, I think casting it in connector makes sense. Otherwise we can try passing in the typed value directly.

@rui-mo
Copy link
Collaborator Author

rui-mo commented Nov 3, 2023

Is partition just a directory name in Hive? If yes it is always a string right? Or it can be stored as some typed metadata?

@mbasmanova @Yuhta Partition keys are needed when creating HiveConnectorSplit. In Gluten, typed constant can be got from Spark so we don't need string, but I'm not sure about other engines.

@mbasmanova
Copy link
Contributor

Partition keys can be any type and queries may need to return these as columns. Hence, all engines should be able to specify typed values for partition keys.

@Yuhta
Copy link
Contributor

Yuhta commented Nov 3, 2023

@mbasmanova I understand that we need to return partition value as a typed column in query result. But how is it stored in Hive? Is it type + string, or type + type specific blob? In the first case we can do the conversion in Velox, in the second case Velox should get the typed constant vector I think.

@rui-mo rui-mo marked this pull request as draft January 30, 2024 01:16
@rui-mo rui-mo force-pushed the wip_date branch 3 times, most recently from b3ed5ed to 37bfc06 Compare February 1, 2024 07:53
@rui-mo rui-mo changed the title Support date as partition value type Support date-type partition column Feb 1, 2024
@rui-mo rui-mo marked this pull request as ready for review February 1, 2024 08:55
@rui-mo
Copy link
Collaborator Author

rui-mo commented Feb 1, 2024

Hi @mbasmanova @Yuhta, it has been a long time since last discussion. Apology for the delay and I have made some updates. Would you like to take a review again?

I checked how partitioned table was stored in Hive, and found sub-directory was used. I assume that's why partitions values are passed as string to HiveConnectorSplit, and the conversion from string to typed variant is needed.

Hive internally splits the records based on the partition key and stores each partition data into a sub-directory of tables directory on HDFS. The name of the directory would be partition key and it's value.

Previously, to support the conversion from string to date, I used castFromDateString method, but got confused on how to distinguish isIso8601 flag for Presto and Spark. At present, this issue can be solved by constructing and evaluating a cast expression, as its semantics will follow the registered CastExpr.

expressionEvaluator->compile(callExpr);
VectorPtr result;
const SelectivityVector rows(1);
expressionEvaluator->evaluate(exprSet.get(), rows, *input, result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using expression evaluation here feels a bit heavy weight. Why can't we use a lower-level utility? Do we not have access to query config?

Copy link
Collaborator Author

@rui-mo rui-mo Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review. These was a deprecated query config isIso8601, but could not be leaked to connector as previous discussion #7084 (comment) pointed out. Do you think it makes sense to add a configuration in HiveConfig? BTW, the semantics of cast string as timestamp are also different between Presto and Spark. Do we intend to use configurations to solve them? (Currently timestamp is kept as is to use Converter because of a workaround to handle timezone.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use a lower-level utility?

Hi @mbasmanova, I took a further look here. The format of partitionKeys can be controlled by Gluten, so if we ensure that compatible strings are generated as partition values, maybe we don't need to bring more complexity here. For date format YYYY-MM-DD, castFromDateString with true for isIso8601 should be enough. Do you think it makes sense?

auto filePath = TempFilePath::create();
writeToFile(filePath->path, vectors);
createDuckDbTable(vectors);
testPartitionedTable(filePath->path, DATE(), "2023-10-27");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These was a deprecated query config isIso8601, but could not be leaked to connector as previous discussion #7084 (comment) pointed out

I'm remembering now. What other date formats are supported? Can we add test cases to make sure all supported formats work and different format are supported based on what version of CAST is registered?

Copy link
Collaborator Author

@rui-mo rui-mo Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice we can control in Gluten to generate well-formatted strings as partition values. Maybe we can keep using light-weight conversion here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can keep using light-weight conversion here.

That would be great.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Could you help review again? Thanks.

@rui-mo rui-mo force-pushed the wip_date branch 2 times, most recently from 2fad518 to 7e9aabe Compare February 2, 2024 10:40
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Looks great % one nit.

Would you update PR description to provide fuller context about the issue and the solution, including the logic you are planning to add to Gluten?

if (value.has_value()) {
if constexpr (ToKind == TypeKind::VARCHAR) {
return velox::variant(value.value());
}
if constexpr (ToKind == TypeKind::VARBINARY) {
return velox::variant::binary((value.value()));
}
if (toType->isDate()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add documentation to HiveConnectorSplit::partitionKeys to clarify what values / types are supported and how they are converted from the std::string?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Updated PR description and documentation.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -27,6 +27,10 @@ struct HiveConnectorSplit : public connector::ConnectorSplit {
dwio::common::FileFormat fileFormat;
const uint64_t start;
const uint64_t length;

/// A map stores partition keys and values in string. Typed partition values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps,

Mapping from partition keys to values. Values are specified as strings formatted the same way as CAST(x as VARCHAR). Null values are specified as std::nullopt. Date values must be formatted using ISO 8601 as YYYY-MM-DD.

All scalar types and date type are supported.

Is this so? Do we support real, double and timestamp? What's the format expected for these?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, real, double and timestamp are supported in Converter (real and double, timestamp). The expected format should be the same as cast expects real and double, timestamp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering whether Spark support real, double, timestamp as partition keys.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are supported in Spark, but boolean type is not. There is one test on partition column types in Spark link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Would it be possible to add tests for all supported types in a follow-up ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will follow up in a separate PR. Through first glance, misssing tests are on decimal and timestamp.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova I double-checked and found only decimal was missing. Opened #8670 to add.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 78aa782.

Copy link

Conbench analyzed the 1 benchmark run on commit 78aa7825.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

FelixYBW pushed a commit to FelixYBW/velox that referenced this pull request Feb 12, 2024
Summary:
When the partition column is of date type, below error occurs when trying to
convert date string as int. This PR fixes this exception by using
`castFromDateString` with `isIso8601` set as true. With this fix, the date
partition value formatted as string should be in the `[+-](YYYY-MM-DD)` format.
The callers like Gluten need to generate compatible format of strings to be
partition values.

```
Exceptions.h:69] Line: ../../velox/exec/Driver.cpp:545, Function:runInternal, Expression:  Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27", Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27"
```

Pull Request resolved: facebookincubator#7084

Reviewed By: pedroerp

Differential Revision: D53348851

Pulled By: mbasmanova

fbshipit-source-id: 38f9ae484a9976ddf08f1805fa7d10baa16c2fc5
facebook-github-bot pushed a commit that referenced this pull request May 29, 2024
Summary:
When the partition column is of date type and a filter is applied on it, below
error occurs when trying to convert date string as int. This PR fixes this
exception by utilizing castFromDateString with standard cast behavior.
```
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27"
```
#7084

Pull Request resolved: #9937

Reviewed By: pedroerp

Differential Revision: D57913463

Pulled By: Yuhta

fbshipit-source-id: 7956de4ae122b080ecf80a96cfcf546f8aae7efe
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
Summary:
When the partition column is of date type and a filter is applied on it, below
error occurs when trying to convert date string as int. This PR fixes this
exception by utilizing castFromDateString with standard cast behavior.
```
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27"
```
facebookincubator#7084

Pull Request resolved: facebookincubator#9937

Reviewed By: pedroerp

Differential Revision: D57913463

Pulled By: Yuhta

fbshipit-source-id: 7956de4ae122b080ecf80a96cfcf546f8aae7efe
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
Summary:
When the partition column is of date type and a filter is applied on it, below
error occurs when trying to convert date string as int. This PR fixes this
exception by utilizing castFromDateString with standard cast behavior.
```
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27"
```
facebookincubator#7084

Pull Request resolved: facebookincubator#9937

Reviewed By: pedroerp

Differential Revision: D57913463

Pulled By: Yuhta

fbshipit-source-id: 7956de4ae122b080ecf80a96cfcf546f8aae7efe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants