-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support date-type partition column #7084
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@mbasmanova Could you spare some time to review this PR? Thanks. |
velox/connectors/Connector.h
Outdated
@@ -307,6 +307,14 @@ class ConnectorQueryCtx { | |||
return planNodeId_; | |||
} | |||
|
|||
void setCastStringToDateIsIso8601(bool flag) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right. Shouldn't it be part of the config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The transfer chain of this value is as queryConfig -> createDataSource -> ReaderOptions -> HiveDataSource -> splitReader_->prepareSplit -> adaptColumns -> setPartitionValue
.
I want to get it from queryConfig
and set it to connectorQueryCtx->config()
before createDataSource
, but found Config does not allow adding properties at runtime.
https://github.com/facebookincubator/velox/pull/7084/files#diff-b04232f155cc15faca93b69cf34dd80e29b534a4e54dd774bf8bdbedbfc66d6bR127-R128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't feel right. HiveConnector should define its own configuration properties and use these. We should not assume that castStringToDateIsIso8601_ config applies to all connectors.
I'm still trying to understand why do you need this. I don't see tests for both cases when castStringToDateIsIso8601_ is true and when it is false. Am I missing something?
Should we add a configuration property to velox/connectors/hive/HiveConfig.h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config was introduced by 8d6c296, so I assume it can be reused to control the cast behavior in setPartitionValue
. Tests for castStringToDateIsIso8601_ being true and false are already covered by that commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo At a high level, engine configs are separate from connector config. We should not leak engine-specific logic into connectors. I wonder why do we pass partition values in HiveConnectorSplit as string. Should we pass these as values of proper types (e.g. ConstantVectors)? Will this remove the need to converting from strings to proper types?
I also see that applyPartitionFilter may need to be updated to handle DATE type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option might be for the connector to access cast logic in the engine via ExpressionEvaluator, i.e. construct and evaluate a CAST expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Masha, thank you for the suggestion. For option 1, Velox now uses a map as below to store partition keys in HiveConnectorSplit.h.
const std::unordered_map<std::string, std::optionalstd::string> partitionKeys
Seems const std::unordered_map<std::string, core::ConstantTypedExpr> partitionKeys
can help remove the cast process. But I'm not sure if there's any special consideration in Velox to use string type. In Gluten, we need to convert literals to string type to create partition keys, and in Velox strings are converted to variants. If ConstantTypedExpr is used, the intermediate string could be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::unordered_map<std::string, core::ConstantTypedExpr> partitionKeys
This could work. Would you like to prototype this change? CC: @Yuhta
Also, wondering if any changes are needed to support writing to partitioned tables with DATE partition keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is partition just a directory name in Hive? If yes it is always a string right? Or it can be stored as some typed metadata? If the original source of the partition value is a string, I think casting it in connector makes sense. Otherwise we can try passing in the typed value directly. |
@mbasmanova @Yuhta Partition keys are needed when creating HiveConnectorSplit. In Gluten, typed constant can be got from Spark so we don't need string, but I'm not sure about other engines. |
Partition keys can be any type and queries may need to return these as columns. Hence, all engines should be able to specify typed values for partition keys. |
@mbasmanova I understand that we need to return partition value as a typed column in query result. But how is it stored in Hive? Is it type + string, or type + type specific blob? In the first case we can do the conversion in Velox, in the second case Velox should get the typed constant vector I think. |
b3ed5ed
to
37bfc06
Compare
Hi @mbasmanova @Yuhta, it has been a long time since last discussion. Apology for the delay and I have made some updates. Would you like to take a review again? I checked how partitioned table was stored in Hive, and found sub-directory was used. I assume that's why partitions values are passed as string to HiveConnectorSplit, and the conversion from string to typed variant is needed.
Previously, to support the conversion from string to date, I used |
expressionEvaluator->compile(callExpr); | ||
VectorPtr result; | ||
const SelectivityVector rows(1); | ||
expressionEvaluator->evaluate(exprSet.get(), rows, *input, result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using expression evaluation here feels a bit heavy weight. Why can't we use a lower-level utility? Do we not have access to query config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review. These was a deprecated query config isIso8601
, but could not be leaked to connector as previous discussion #7084 (comment) pointed out. Do you think it makes sense to add a configuration in HiveConfig? BTW, the semantics of cast string as timestamp are also different between Presto and Spark. Do we intend to use configurations to solve them? (Currently timestamp is kept as is to use Converter because of a workaround to handle timezone.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use a lower-level utility?
Hi @mbasmanova, I took a further look here. The format of partitionKeys can be controlled by Gluten, so if we ensure that compatible strings are generated as partition values, maybe we don't need to bring more complexity here. For date format YYYY-MM-DD
, castFromDateString
with true for isIso8601
should be enough. Do you think it makes sense?
auto filePath = TempFilePath::create(); | ||
writeToFile(filePath->path, vectors); | ||
createDuckDbTable(vectors); | ||
testPartitionedTable(filePath->path, DATE(), "2023-10-27"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These was a deprecated query config isIso8601, but could not be leaked to connector as previous discussion #7084 (comment) pointed out
I'm remembering now. What other date formats are supported? Can we add test cases to make sure all supported formats work and different format are supported based on what version of CAST is registered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice we can control in Gluten to generate well-formatted strings as partition values. Maybe we can keep using light-weight conversion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can keep using light-weight conversion here.
That would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Could you help review again? Thanks.
2fad518
to
7e9aabe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Looks great % one nit.
Would you update PR description to provide fuller context about the issue and the solution, including the logic you are planning to add to Gluten?
if (value.has_value()) { | ||
if constexpr (ToKind == TypeKind::VARCHAR) { | ||
return velox::variant(value.value()); | ||
} | ||
if constexpr (ToKind == TypeKind::VARBINARY) { | ||
return velox::variant::binary((value.value())); | ||
} | ||
if (toType->isDate()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add documentation to HiveConnectorSplit::partitionKeys to clarify what values / types are supported and how they are converted from the std::string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Updated PR description and documentation.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@@ -27,6 +27,10 @@ struct HiveConnectorSplit : public connector::ConnectorSplit { | |||
dwio::common::FileFormat fileFormat; | |||
const uint64_t start; | |||
const uint64_t length; | |||
|
|||
/// A map stores partition keys and values in string. Typed partition values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps,
Mapping from partition keys to values. Values are specified as strings formatted the same way as CAST(x as VARCHAR). Null values are specified as std::nullopt. Date values must be formatted using ISO 8601 as YYYY-MM-DD.
All scalar types and date type are supported.
Is this so? Do we support real, double and timestamp? What's the format expected for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, real, double and timestamp are supported in Converter (real and double, timestamp). The expected format should be the same as cast expects real and double, timestamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering whether Spark support real, double, timestamp as partition keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are supported in Spark, but boolean type is not. There is one test on partition column types in Spark link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Would it be possible to add tests for all supported types in a follow-up ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will follow up in a separate PR. Through first glance, misssing tests are on decimal and timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova I double-checked and found only decimal was missing. Opened #8670 to add.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova merged this pull request in 78aa782. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: When the partition column is of date type, below error occurs when trying to convert date string as int. This PR fixes this exception by using `castFromDateString` with `isIso8601` set as true. With this fix, the date partition value formatted as string should be in the `[+-](YYYY-MM-DD)` format. The callers like Gluten need to generate compatible format of strings to be partition values. ``` Exceptions.h:69] Line: ../../velox/exec/Driver.cpp:545, Function:runInternal, Expression: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27", Source: RUNTIME, ErrorCode: INVALID_STATE unknown file: Failure C++ exception with description "Exception: VeloxRuntimeError Error Source: RUNTIME Error Code: INVALID_STATE Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27" ``` Pull Request resolved: facebookincubator#7084 Reviewed By: pedroerp Differential Revision: D53348851 Pulled By: mbasmanova fbshipit-source-id: 38f9ae484a9976ddf08f1805fa7d10baa16c2fc5
Summary: When the partition column is of date type and a filter is applied on it, below error occurs when trying to convert date string as int. This PR fixes this exception by utilizing castFromDateString with standard cast behavior. ``` C++ exception with description "Exception: VeloxRuntimeError Error Source: RUNTIME Error Code: INVALID_STATE Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27" ``` #7084 Pull Request resolved: #9937 Reviewed By: pedroerp Differential Revision: D57913463 Pulled By: Yuhta fbshipit-source-id: 7956de4ae122b080ecf80a96cfcf546f8aae7efe
Summary: When the partition column is of date type and a filter is applied on it, below error occurs when trying to convert date string as int. This PR fixes this exception by utilizing castFromDateString with standard cast behavior. ``` C++ exception with description "Exception: VeloxRuntimeError Error Source: RUNTIME Error Code: INVALID_STATE Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27" ``` facebookincubator#7084 Pull Request resolved: facebookincubator#9937 Reviewed By: pedroerp Differential Revision: D57913463 Pulled By: Yuhta fbshipit-source-id: 7956de4ae122b080ecf80a96cfcf546f8aae7efe
Summary: When the partition column is of date type and a filter is applied on it, below error occurs when trying to convert date string as int. This PR fixes this exception by utilizing castFromDateString with standard cast behavior. ``` C++ exception with description "Exception: VeloxRuntimeError Error Source: RUNTIME Error Code: INVALID_STATE Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-whitespace character found after end of conversion: "-10-27" ``` facebookincubator#7084 Pull Request resolved: facebookincubator#9937 Reviewed By: pedroerp Differential Revision: D57913463 Pulled By: Yuhta fbshipit-source-id: 7956de4ae122b080ecf80a96cfcf546f8aae7efe
When the partition column is of date type, below error occurs when trying to
convert date string as int. This PR fixes this exception by using
castFromDateString
withisIso8601
set as true. With this fix, the datepartition value formatted as string should be in the
[+-](YYYY-MM-DD)
format.The callers like Gluten need to generate compatible format of strings to be
partition values.