Add Spark cast expr based on cast hooks #7377

Closed
wants to merge 2 commits into from

rui-mo
Collaborator

@rui-mo rui-mo commented Nov 2, 2023

There are several corner cases where cast semantics differ between Presto
and Spark. According to the implementation of
registerFunctionCallToSpecialForm, a previously registered special form can be
overridden by one registered later under the same name. This PR implements
SparkCastExpr, which is registered as a special form and can be customized with
the help of cast hooks compatible with Spark's semantics. The hooks below are added
to resolve several of these semantic differences.

  • castStringToTimestamp
  • castStringToDate
  • castTimestampToString
  • legacy
  • removeWhiteSpaces
  • truncate

Two configurations kCastToIntByTruncate and kCastStringToDateIsIso8601 are
replaced by cast hooks. These configurations are no longer used and will be removed
by subsequent PRs.

#4876
Fixes #8121.
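
The override behaviour described above (a later registration under the same name replaces the earlier one) can be illustrated with a minimal sketch. Everything here is hypothetical: `SpecialForm`, `registerSpecialForm`, and `makeSpecialForm` are stand-ins invented for illustration and are not the Velox API.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a special form; not a Velox class.
struct SpecialForm {
  virtual ~SpecialForm() = default;
  virtual std::string flavor() const = 0;
};

struct PrestoCast : SpecialForm {
  std::string flavor() const override { return "presto"; }
};

struct SparkCast : SpecialForm {
  std::string flavor() const override { return "spark"; }
};

using Factory = std::function<std::unique_ptr<SpecialForm>()>;

// Name -> factory map. A function-local static avoids
// static-initialization-order problems.
std::unordered_map<std::string, Factory>& registry() {
  static std::unordered_map<std::string, Factory> forms;
  return forms;
}

// The last registration under a given name wins, because
// insert_or_assign overwrites any existing entry.
void registerSpecialForm(const std::string& name, Factory factory) {
  registry().insert_or_assign(name, std::move(factory));
}

// Returns nullptr when nothing is registered under the name.
std::unique_ptr<SpecialForm> makeSpecialForm(const std::string& name) {
  auto it = registry().find(name);
  if (it == registry().end()) {
    return nullptr;
  }
  return it->second();
}
```

Registering a Presto-flavoured factory under "cast" and then a Spark-flavoured one under the same name leaves only the Spark one reachable, which is the property this PR relies on.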

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 2, 2023
Contributor

@mbasmanova mbasmanova left a comment

@rui-mo Rui, overall I think it makes sense for Velox to allow applications to provide custom implementations for special forms and to register new special forms as well. We need to figure out how to add proper APIs to do that.

Overriding CAST requires also customizing this code in ExprCompiler. CC: @bikramSingh91

  } else if (auto cast = dynamic_cast<const core::CastTypedExpr*>(expr.get())) {
    VELOX_CHECK(!compiledInputs.empty());
    if (FOLLY_UNLIKELY(*resultType == *compiledInputs[0]->type())) {
      result = compiledInputs[0];
    } else {
      result = std::make_shared<CastExpr>(
          resultType,
          std::move(compiledInputs[0]),
          trackCpuUsage,
          cast->nullOnFailure());
    }

@rui-mo
Collaborator Author

rui-mo commented Nov 3, 2023

@mbasmanova Thank you for the review. Can I prototype with kCastStringToDateIsIso8601 in this PR? That is to move the Spark-specific behavior controlled by this config into SparkCastExpr, and remove this config.

@mbasmanova
Contributor

@rui-mo Rui, please, go ahead.

@rui-mo
Collaborator Author

rui-mo commented Nov 7, 2023

@mbasmanova Masha, I made the prototype with IsIso8601. Could you help review? Thanks.

Overriding CAST requires also customizing this code in ExprCompiler.

It seems this is not necessary if we don't need to parse cast from a string. In this PR, SparkCastExpr can be created from a CallTypedExpr, but if we want to parse from a string such as cast (c0 as int), more changes are required. Besides what is included in f8eb923, we need to decide whether to create CastExpr or SparkCastExpr in parseCastExpr (maybe through a parse option).

In this PR, parsing from string to SparkCastExpr is not included. If it is needed, I can continue to work on that. It also seems to be mainly a matter of unit-test convenience.

Contributor

@mbasmanova mbasmanova left a comment

@rui-mo Thank you for prototyping Spark-specific CAST. Overall looks good. A few things that need to be taken care of.

  • Move SparkCastExpr out of velox/expression. Perhaps into the functions/sparksql/specialforms folder I saw you adding in another PR.
  • Make sure that CastTypedExpr is equivalent to CallTypedExpr("cast"). Currently, these are not the same because ExprCompiler explicitly converts CastTypedExpr to CastExpr. Perhaps this logic can be replaced with getSpecialForm("cast").
  • Figure out all the differences between the Spark and Presto CAST implementations to see whether it is feasible to continue using CastExpr as a base class.

@rui-mo rui-mo force-pushed the wip_spark_cast branch 2 times, most recently from c428aa7 to d8b917d Compare November 10, 2023 06:58
@rui-mo rui-mo force-pushed the wip_spark_cast branch 3 times, most recently from 2743676 to 6caec00 Compare December 4, 2023 08:48
@rui-mo
Collaborator Author

rui-mo commented Dec 4, 2023

@mbasmanova Thank you for the suggestions. I have addressed them; apologies for the delay.

Figure out all the differences between the Spark and Presto CAST implementations to see whether it is feasible to continue using CastExpr as a base class.

Summarized them in this table: #4876 (comment).

@mbasmanova
Contributor

Summarized them in this table: #4876 (comment).

@rui-mo Thank you, Rui. What's your assessment of these differences? Is it feasible to continue using CastExpr as a base class?

@rui-mo
Collaborator Author

rui-mo commented Dec 5, 2023

What's your assessment of these differences? Is it feasible to continue using CastExpr as a base class?

@mbasmanova These cases can be divided into the categories below.

  1. cast_to_int_by_truncate behaviors.
  2. overflow when casting to integral types.
  3. cast from string to other type.
  4. cast from other type to string.
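
As a rough illustration of category 1, the sketch below contrasts rounding with truncation when converting a floating-point value to a 32-bit integer. It assumes the config switches between those two modes; `castToInt32` is a hypothetical helper, not Velox code, and its choice to return an empty optional for NaN or out-of-range input is an assumption that does not model either engine's overflow behaviour (category 2).

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not Velox code. Converts a double to int32_t either by
// rounding to nearest (ties away from zero) or by truncating toward zero.
// Returns std::nullopt for NaN and for values outside the int32_t range; real
// engines differ in how they treat those cases.
std::optional<int32_t> castToInt32(double value, bool truncate) {
  if (std::isnan(value)) {
    return std::nullopt;
  }
  const double adjusted = truncate ? std::trunc(value) : std::round(value);
  if (adjusted < std::numeric_limits<int32_t>::min() ||
      adjusted > std::numeric_limits<int32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(adjusted);
}
```

With this helper, 12.5 becomes 13 when rounding and 12 when truncating, and -12.5 becomes -13 and -12 respectively.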

As discussed with @PHILO-HE, we tend to use CastExpr as a base class, which helps reuse existing code and avoids duplication. Take the case below as an example.

| Conversion | Cases | Presto result | Spark result |
| --- | --- | --- | --- |
| string -> integral/floating-point/decimal/date/timestamp | SELECT cast('\t\n\u001F 123\u000B' as int); | invalid | 123 |

We can fix some of them by overriding a getCastFromStringOutput method of the base class and removing the whitespace (which Spark requires when casting from string) before calling the cast function.

auto output = util::Converter<ToKind, void, Truncate, LegacyCast>::cast(
inputRowValue);

As for #5307 (comment), we don't need to implement duplicate cast functions in CastExpr and SparkCastExpr. Overriding a small part of the code in SparkCastExpr to remove the whitespace before the cast should be enough.
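
The idea of trimming before delegating to a shared, strict parser can be sketched as follows. This is an illustration only: `trimWhiteSpaces`, `parseInt32`, and `castStringToInt32` are hypothetical names, and treating every byte at or below 0x20 as trimmable is an assumption chosen to cover the example string above, not a statement of Spark's exact rule.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// ASSUMPTION for illustration: every byte <= 0x20 (space and the ASCII control
// characters below it) is trimmable. Spark's real rule may differ.
bool isTrimmable(char c) {
  return static_cast<unsigned char>(c) <= 0x20;
}

// Strips trimmable bytes from both ends without copying the string.
std::string_view trimWhiteSpaces(std::string_view s) {
  while (!s.empty() && isTrimmable(s.front())) {
    s.remove_prefix(1);
  }
  while (!s.empty() && isTrimmable(s.back())) {
    s.remove_suffix(1);
  }
  return s;
}

// Strict parse shared by both flavours: the whole input must be consumed.
std::optional<int32_t> parseInt32(std::string_view s) {
  int32_t value = 0;
  const char* end = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), end, value);
  if (ec != std::errc() || ptr != end) {
    return std::nullopt;
  }
  return value;
}

// Only the pre-processing step differs between the two behaviours.
std::optional<int32_t> castStringToInt32(
    std::string_view s,
    bool removeWhiteSpaces) {
  return parseInt32(removeWhiteSpaces ? trimWhiteSpaces(s) : s);
}
```

With trimming enabled, the input "\t\n\x1F 123\v" parses to 123; with trimming disabled the same input is rejected, matching the two result columns in the table above.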

Glad to hear your opinions, thanks.

Contributor

@mbasmanova mbasmanova left a comment

@rui-mo Rui, overall looks good. I have some questions about the design: a separate hook class vs. virtual methods in CastExpr. I also wonder how many hooks we would need to fully implement Spark semantics. Would it make sense to prototype this to see what it looks like and whether the set of hooks is manageable?
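
For readers weighing the two designs, here is a minimal sketch of the separate-hook-class option. The hook name `truncate` comes from the PR description, but the signatures, the `PrestoLikeHooks` and `SparkLikeHooks` classes, and the `HookedCastExpr` wrapper are invented for illustration and are not the interface that was merged.

```cpp
#include <cmath>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical hook interface: each engine-specific decision is one virtual
// method, so the cast expression itself needs no subclass per engine.
class CastHooks {
 public:
  virtual ~CastHooks() = default;
  // True if fractional values are truncated rather than rounded.
  virtual bool truncate() const = 0;
};

class PrestoLikeHooks : public CastHooks {
 public:
  bool truncate() const override { return false; }
};

class SparkLikeHooks : public CastHooks {
 public:
  bool truncate() const override { return true; }
};

// Hypothetical cast expression that owns shared conversion logic and asks the
// injected hooks only where behaviour differs.
class HookedCastExpr {
 public:
  explicit HookedCastExpr(std::shared_ptr<const CastHooks> hooks)
      : hooks_(std::move(hooks)) {}

  // Range checks are omitted to keep the sketch short; callers must pass
  // values that fit in int64_t.
  int64_t castDoubleToBigint(double value) const {
    const double adjusted =
        hooks_->truncate() ? std::trunc(value) : std::round(value);
    return static_cast<int64_t>(adjusted);
  }

 private:
  std::shared_ptr<const CastHooks> hooks_;
};
```

The virtual-methods alternative would put `truncate()` directly on the cast expression class and override it in a subclass. The hook object keeps engine policy separate from the expression type, at the cost of one more class to maintain per engine.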

facebook-github-bot pushed a commit that referenced this pull request Dec 8, 2023
Summary:
Moves `testCast` function from CastExprTest.cpp to CastBaseTest.h for other
test files to use. Separates `testCast` function into `testCast` (for valid
output test), `testTryCast` (for try_cast test), and `testInvalidCast` (for
exception test).

#7377 (comment)
#7377 (comment)

Pull Request resolved: #7912

Reviewed By: xiaoxmeng

Differential Revision: D51968059

Pulled By: mbasmanova

fbshipit-source-id: 26fe9896766ce5b1ed175d3b85c930e7a0b2a1c8
@rui-mo rui-mo force-pushed the wip_spark_cast branch 2 times, most recently from 79a821d to ef5b1ea Compare December 12, 2023 05:24
Contributor

@mbasmanova mbasmanova left a comment

@rui-mo Rui, thank you for iterating on this. I have a few questions about the hooks API.

@facebook-github-bot
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kagamiori
Contributor

kagamiori commented Jan 10, 2024

FYI, I replaced

  static void SetUpTestCase() {
    functions::sparksql::registerFunctions("");
  }

in the SparkCastExprTest class with

  SparkCastExprTest() {
    functions::sparksql::registerFunctions("");
  }

to resolve some internal test failures. Also made some format changes.

@facebook-github-bot
Contributor

@kagamiori merged this pull request in 51dc97b.

Conbench analyzed the 1 benchmark run on commit 51dc97b1.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@kagamiori
Contributor

@rui-mo, this change turns out to break some string-related unit tests in Spark, and hence the Linux build job in CI (https://app.circleci.com/pipelines/github/facebookincubator/velox/42248/workflows/18d750f7-eed8-466f-94d8-63958efad94a/jobs/291869/tests), so we're going to revert it. Could you take a look at the failed unit tests?

kagamiori added a commit to kagamiori/velox that referenced this pull request Jan 11, 2024
Summary: The PR facebookincubator#7377 broke string-related unit tests in velox/functions/sparksql/tests/StringTest.cpp

Reviewed By: kgpai

Differential Revision: D52681222
@rui-mo
Collaborator Author

rui-mo commented Jan 11, 2024

@kagamiori Thank you for helping land this PR. I know the cause of this failure, and please check #8121. Could you tell me more details about the internal failures as #7377 (comment) mentioned?

@kagamiori
Contributor

kagamiori commented Jan 11, 2024

@kagamiori Thank you for helping land this PR. I know the cause of this failure, and please check #8121. Could you tell me more details about the internal failures as #7377 (comment) mentioned?

Hi @rui-mo, we're seeing that the CI signal "ci/circleci: linux-build" has been failing on the main branch since #7377 landed (https://github.com/facebookincubator/velox/commits/main/).

cc @kgpai

@rui-mo
Collaborator Author

rui-mo commented Jan 11, 2024

we're seeing the CI signal "ci/circleci: linux-build" started failing on the main branch

Hi @kagamiori, I checked the log and found that StringTest.sha1 is failing. As I found previously, it is due to #8123 (comment). Could you tell me more about why SetUpTestCase() was removed and replaced as described in #7377 (comment)? Thanks.

@kagamiori
Contributor

kagamiori commented Jan 11, 2024

@kagamiori Thank you for helping land this PR. I know the cause of this failure, and please check #8121. Could you tell me more details about the internal failures as #7377 (comment) mentioned?

OK, after reading the context in #8121, I think I know the reason for the failure. It was likely the replacement of SetUpTestCase() with SparkCastExprTest(). This change, however, didn't break the internal test of velox_functions_spark_test. Let me try making a fix.

cc @kgpai @mbasmanova

@kagamiori
Contributor

kagamiori commented Jan 11, 2024

  static void SetUpTestCase() {
    functions::sparksql::registerFunctions("");
  }

SetUpTestCase() was replaced because parse::registerTypeResolver(); and memory::MemoryManager::testingSetInstance({}); also need to be called to avoid breaking internal test signals. Those are called in FunctionBaseTest::SetUpTestCase(), but when SparkCastExprTest::SetUpTestCase() is defined, the method in the parent class is no longer called. I'm adding SparkCastExprTest::SetUpTestCase() back and adding these two calls to it.
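
The hiding effect described here is plain C++ name lookup: a static member function declared in a derived class hides the base-class function of the same name. The sketch below reproduces it without any test framework. The fixture names and the strings pushed into the log are hypothetical placeholders for the real registration calls, and `runSuiteSetup` only mimics a framework that calls the fixture's static setup once per suite.

```cpp
#include <string>
#include <vector>

// Records which setup steps ran; stands in for real global registration.
std::vector<std::string>& setupLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical base fixture, playing the role of the shared test base class.
struct BaseFixture {
  static void SetUpTestCase() {
    setupLog().push_back("registerPrestoFunctions");
    setupLog().push_back("registerTypeResolver");
    setupLog().push_back("initMemoryManager");
  }
};

// Declaring a static member with the same name hides the base version, so the
// type resolver and memory manager steps are silently skipped.
struct HidingFixture : BaseFixture {
  static void SetUpTestCase() {
    setupLog().push_back("registerSparkFunctions");
  }
};

// The fix described in the comment above: keep the override (to avoid the
// Presto registration) but repeat the two steps that are still required.
struct FixedFixture : BaseFixture {
  static void SetUpTestCase() {
    setupLog().push_back("registerSparkFunctions");
    setupLog().push_back("registerTypeResolver");
    setupLog().push_back("initMemoryManager");
  }
};

// Mimics a framework that calls Fixture::SetUpTestCase() once per suite and
// returns the steps that actually ran.
template <typename Fixture>
std::vector<std::string> runSuiteSetup() {
  setupLog().clear();
  Fixture::SetUpTestCase();
  return setupLog();
}
```

Calling `BaseFixture::SetUpTestCase()` from the derived override would also restore the missing steps, but it would bring back the Presto registration that the override was introduced to avoid.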

@rui-mo
Collaborator Author

rui-mo commented Jan 11, 2024

@kagamiori Thank you. I understand. In this PR, SetUpTestCase() was added to avoid registering Presto functions, but two other function calls were missing. Sorry for the inconvenience, and please feel free to contact me if any other work is needed.

@kagamiori
Contributor

Here is a fix: #8346.

kagamiori added a commit to kagamiori/velox that referenced this pull request Jan 12, 2024
Summary: Previously emptyOutput was not called in trimUnicodeWhiteSpace after the refactoring in facebookincubator#7377. This diff fixes it.

Differential Revision: D52737293
facebook-github-bot pushed a commit that referenced this pull request Jan 12, 2024
Summary:
Pull Request resolved: #8368

Previously emptyOutput was not called in trimUnicodeWhiteSpace after the refactoring in #7377. This diff fixes it.

Reviewed By: amitkdutta

Differential Revision: D52737293

fbshipit-source-id: ecaece86172646b1e3fd910555e859de91636b53
liujiayi771 pushed a commit to liujiayi771/velox that referenced this pull request Jan 16, 2024
Pull Request resolved: facebookincubator#7377

Reviewed By: kgpai

Differential Revision: D52566119

Pulled By: kagamiori

fbshipit-source-id: 34577133550e112eddb7f8080b9d897c45ee1fec
@rui-mo
Collaborator Author

rui-mo commented Jan 17, 2024

@mbasmanova @kagamiori Thanks for your continued support on this pull request. I'm working on the tasks below, which were discovered while implementing Spark cast. Could you spare some time to review them? Thanks!

  1. Remove deprecated configs kCastToIntByTruncate and kCastStringToDateIsIso8601 #8352
  2. Combine testComplexCast with testCast #8254
    Mentioned in Add Spark cast expr based on cast hooks #7377 (comment).
  3. Optimize CAST(timestamp as varchar) #8385
    Mentioned in Add Spark cast expr based on cast hooks #7377 (comment).

facebook-github-bot pushed a commit that referenced this pull request Jan 19, 2024
…teIsIso8601` (#8352)

Summary:
After #7377, kCastToIntByTruncate and kCastStringToDateIsIso8601 are no longer in use.

Pull Request resolved: #8352

Reviewed By: mbasmanova

Differential Revision: D52737612

Pulled By: kagamiori

fbshipit-source-id: c82d55a592b85a16bce5c29b20d22f7a19c6c9a3
Development

Successfully merging this pull request may close these issues.

Tests failure for MD5 and sha1 spark functions
5 participants