
Coverage: Add a manual test to show what Spark built in expression the DF can support directly #331

Merged
merged 9 commits into apache:main on May 21, 2024

Conversation

comphead (Contributor)

Which issue does this PR close?

Closes #.
Related to #240

Rationale for this change

Adds a tool that generates up-to-date coverage of which Spark built-in expressions DataFusion supports directly. It is meant to help a developer decide whether a Spark built-in function can be backed by an existing DataFusion implementation or needs a custom function in the Comet native core.

What changes are included in this PR?

How are these changes tested?

A test that generates doc/spark_builtin_expr_df_coverage.txt.

@comphead comphead changed the title Coverage: Add a manual test for DF supporting Spark expressions directly Coverage: Add a manual test to show what Spark built in expression the DF can support directly Apr 26, 2024
advancedxy (Contributor)

Hi @comphead, is this ready for review?

comphead (PR author)

> Hi @comphead, is this ready for review?

Thanks @advancedxy, I think so.

kazuyukitanimura (Contributor) left a comment

Just to understand: being backed by DataFusion does not automatically mean Spark compatibility?

@@ -43,12 +45,14 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH

private val rawCoverageFilePath = "doc/spark_builtin_expr_coverage.txt"
private val aggCoverageFilePath = "doc/spark_builtin_expr_coverage_agg.txt"
private val rawCoverageFileDatafusionPath = "doc/spark_builtin_expr_df_coverage.txt"
Contributor:

nit: now we have a docs dir. Let's organize the output location...

* Manual test to calculate Spark builtin expressions coverage support by the Datafusion
* directly
*
* The test will update files doc/spark_builtin_expr_df_coverage.txt,
Contributor:

I think it would be better to add the new column to the previous file (spark_builtin_expr_coverage.txt), so that we can clearly see which functions have already been implemented, which have not, and whether a function would be supported directly by the DataFusion kernel, without looking at two files.

comphead (PR author):

good point. let me think about it

comphead (PR author):

done

*
* The test will update files doc/spark_builtin_expr_df_coverage.txt,
*
* Requires to set DATAFUSIONCLI_PATH env variable to valid path to datafusion-cli
Contributor:

If we are going to build the coverage file directly in the CI pipeline, I think this is fine. Otherwise, I think we may need to update the development.md file or other documentation to indicate how to run this coverage suite.

In my opinion, I would prefer to set up the CI pipeline earlier so others don't have to figure out how to run this file.

comphead (PR author):

It will be added to CI in follow-up PRs.

@@ -123,7 +134,7 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH
Seq(("", "No examples found in spark.sessionState.functionRegistry"))))
}

-    // TODO: convert results into HTML
+    // TODO: convert results into HTML or .md file
Contributor:

Perhaps log an issue for this and refer to that instead?
It would be great to link the output directly into the documentation.

comphead (PR author):

Yes, in the next PR it's expected to generate an .md file in docs rather than .txt.

Member:

Can you remove the commented-out code from this PR?

comphead (PR author) commented May 9, 2024

> Just to understand: being backed by DataFusion does not automatically mean Spark compatibility?

I hope in most cases yes, since we have a generic wrapper. This file is more of a reference for which built-in functions should probably be added to DataFusion.

comphead (PR author):

Changes:

  • only one Spark coverage file: docs/spark_builtin_expr_coverage.txt
  • the file contains a table with the Spark function name, query, result, cometMessage, and datafusionMessage (see the sketch below)

cometMessage is the message obtained from running the query in Comet.
datafusionMessage is the message obtained from running the query in datafusion-cli.

The test is optional now, meaning it can be run manually to update what is supported, what is not, and the reason.

comphead (PR author):

@parthchandra @advancedxy @kazuyukitanimura @viirya @andygrove if you have time, please take a second look.

andygrove (Member):

> Just to understand: being backed by DataFusion does not automatically mean Spark compatibility?

I have a similar question, and I'm not sure that I really understand the motivation for the tool.

As we add support for more Spark expressions, I can see that it makes sense to see if these expressions are already implemented in DataFusion, but we would still have to implement Spark-specific tests in Comet and then determine if the DataFusion expression is sufficient. If not, we then have to implement our own custom version.

I don't think there is a scenario where we see that DataFusion has an expression and we just enable it in Comet.

comphead (PR author):

> As we add support for more Spark expressions, I can see that it makes sense to see if these expressions are already implemented in DataFusion, but we would still have to implement Spark-specific tests in Comet and then determine if the DataFusion expression is sufficient. If not, we then have to implement our own custom version.

Thanks @andygrove for the review. Basically, the idea of this test is to show how much coverage we currently have in Comet for Spark built-in functions.
I suppose it would be useful to have the following info for each built-in function:

  • Is the function supported by Comet?
  • If it's not, is the function supported by DataFusion directly, or do we need to send a PR to DataFusion?
  • Sometimes a function is supported by DataFusion but gives a wrong result in Comet compared to Spark.

Having this info, we can later generate the coverage .md file dynamically instead of a static one.

advancedxy (Contributor) left a comment

Sorry for the late response; I was quite busy the last two weeks.

LGTM except one minor comment about the javadoc.

Comment on lines +72 to +73
* The test will update files doc/spark_builtin_expr_coverage.txt,
* doc/spark_builtin_expr_coverage_agg.txt
advancedxy:

It seems like the latest code only generates spark_builtin_expr_coverage.txt? Maybe these two lines should be updated.

advancedxy (Contributor):

> I don't think there is a scenario where we see that DataFusion has an expression and we just enable it in Comet.

I agree. However, I believe the value of the generated expr coverage is that it can provide developers with insights about whether the related function has already been implemented in DataFusion or not. If so, developers can use that as a reference and don't have to reinvent the whole thing from scratch.

comphead (PR author):

If we don't need to expose the datafusion-cli message, it's possible to just hide it in the output file.

andygrove (Member) left a comment

Thanks @comphead. It would be good to remove the commented-out code and maybe add a link to any follow-up issues.

@comphead comphead merged commit 9ca63a2 into apache:main May 21, 2024
40 checks passed
comphead (PR author):

Thanks everyone for the review

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…e DF can support directly (apache#331)

* Coverage: Add a manual test for DF supporting Spark expressions directly

Co-authored-by: advancedxy <xianjin@apache.org>
(cherry picked from commit 9ca63a2)
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…pache#331)

* chore: Update version to 0.2.0 and add 0.1.0 changelog (apache#696)

* Generate changelog

* fix error in release verification script

* Change version from 0.1.0 to 0.2.0

* add changelog page and update release instructions

* address feedback

* update version to 0.2.0-SNAPSHOT

* fix: do not overwrite withInfo data

* update test