
[SPARK-29065][SQL][TEST] Extend EXTRACT benchmark #25772

Closed
wants to merge 11 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Sep 12, 2019

What changes were proposed in this pull request?

In the PR, I propose to extend ExtractBenchmark and add new benchmarks for:

  • EXTRACT and DATE as input column
  • the DATE_PART function and DATE/TIMESTAMP input column

Why are the changes needed?

The EXTRACT expression is rebased on the DATE_PART expression by the PR #25410, where some of the sub-expressions take a DATE column as input (Millennium, Year, etc.) but others require a TIMESTAMP column (Hour, Minute). Separate benchmarks for DATE should exclude the overhead of implicit DATE <-> TIMESTAMP conversions.
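The conversion overhead mentioned above can be modeled without Spark. This is a minimal, hypothetical sketch (names are illustrative, not Spark's API): a DATE is stored as days since the epoch (Int), a TIMESTAMP as microseconds since the epoch (Long), so widening every DATE row to TIMESTAMP pays at least a multiplication per row; the real Spark conversion is also time-zone-aware.

```scala
// Hypothetical model of the implicit DATE -> TIMESTAMP widening.
// Real Spark code is time-zone-aware; this only shows the per-row cost shape.
object ConversionSketch {
  val MICROS_PER_DAY: Long = 24L * 60 * 60 * 1000 * 1000

  // DATE (days since 1970-01-01) -> TIMESTAMP (microseconds since epoch)
  def daysToMicros(days: Int): Long = days * MICROS_PER_DAY
}
```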

Does this PR introduce any user-facing change?

No, it doesn't.

How was this patch tested?

  • Regenerated results of ExtractBenchmark

@MaxGekk
Member Author

MaxGekk commented Sep 12, 2019

@dongjoon-hyun @cloud-fan @maropu Could you take a look at the PR? Should I extract the cast-related code into a separate PR? WDYT?

@maropu
Member

maropu commented Sep 12, 2019

Ur, I see. I didn't notice we already accept explicit casts from long to timestamp. Is there any reason to support this explicit cast? It seems PostgreSQL doesn't accept it:

postgres=# select cast(0::bigint as timestamp);
ERROR:  cannot cast type bigint to timestamp without time zone
LINE 1: select cast(0::bigint as timestamp);
               ^
postgres=# select cast(0::int as date);
ERROR:  cannot cast type integer to date
LINE 1: select cast(0::int as date);
               ^

cc: @gengliangwang
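For context, the legacy Spark behavior being questioned, CAST(bigint AS TIMESTAMP), interprets the long as seconds since the epoch and widens it to Spark's internal microsecond representation. A minimal sketch (the helper name is hypothetical; overflow handling is ignored here):

```scala
// Sketch of legacy CAST(bigint AS TIMESTAMP): the long is taken as
// seconds since the epoch and scaled to internal microseconds.
object LegacyCastSketch {
  def longToTimestampMicros(seconds: Long): Long = seconds * 1000000L
}
```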

Member

@dongjoon-hyun dongjoon-hyun left a comment


Sorry, @MaxGekk. I'm a little negative on this PR.

  1. First of all, this is not a TEST PR because it touches SQL behavior. You need to split this PR in two.
  2. In addition, the new behavior is not consistent with PostgreSQL either.
postgres=# select cast(1 as date);
ERROR:  cannot cast type integer to date
LINE 1: select cast(1 as date);

Could you compare this new behavior with the other DBMSs, please?

cc @maropu

@dongjoon-hyun
Member

Oops. I'm late. :)

@maropu
Member

maropu commented Sep 12, 2019

hahaha, I was faster

@cloud-fan
Contributor

cloud-fan commented Sep 12, 2019

I think long to timestamp is a legacy behavior and we shouldn't go further in this direction. If we do want to convert integrals to date in the benchmark, we can create a UDF/expression to do this in the test package, and use it in the benchmark.
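The test-only helper suggested above might look roughly like this. The name intToDate is hypothetical; the real version would live in the test package, likely registered as a UDF, so that Cast itself stays untouched:

```scala
import java.time.LocalDate

// Hypothetical test-only conversion: an integral "days since epoch"
// value becomes a date without extending Cast.
object IntToDateSketch {
  def intToDate(days: Int): LocalDate = LocalDate.ofEpochDay(days.toLong)
}
```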

@MaxGekk
Member Author

MaxGekk commented Sep 12, 2019

@maropu @dongjoon-hyun @cloud-fan Thank you for your quick feedback. I reverted the changes in Cast and did the conversion to DATE via TIMESTAMP in the benchmark. Please take a look when you have time.

    case other => throw new IllegalArgumentException(
      s"Unsupported function '$other'. Valid functions are 'extract' and 'date_part'.")
  }
  codegenBenchmark(s"$field of $from", cardinality) {
Contributor


do we really need to test with whole-stage-codegen on and off? The benchmark is just a SELECT query and I don't think whole-stage-codegen can help here. We can also see from the benchmark result that there is no difference.

Member Author


The code is the same, so we could benchmark it with the default setting only: spark.sql.codegen.wholeStage = true.

Contributor


Yea, I know the code is the same, but since we are refining this benchmark, we can fix this issue as well.

Member Author


I left only the codegen benchmarks. From my point of view, it is better to have all cases in one table.

Contributor

@cloud-fan cloud-fan left a comment


LGTM except one comment

@SparkQA

SparkQA commented Sep 12, 2019

Test build #110502 has finished for PR 25772 at commit 6b9ad66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in 8e9fafb Sep 12, 2019
@maropu
Member

maropu commented Sep 12, 2019

Thanks! Merged to master.

@SparkQA

SparkQA commented Sep 12, 2019

Test build #110509 has finished for PR 25772 at commit e54c41a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 12, 2019

Test build #110511 has finished for PR 25772 at commit 766110a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName(this.getClass.getCanonicalName)
  .getOrCreate()
Member


Ur, is there any other reason not to use SqlBasedBenchmark? You can override def getSparkSession from SqlBasedBenchmark like DatasetBenchmark does.
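A rough sketch of the suggested shape. SparkSession is stubbed here as a plain case class so the snippet compiles without Spark; in Spark itself the benchmark would extend the real SqlBasedBenchmark trait and only override getSparkSession, as DatasetBenchmark does:

```scala
// Stub standing in for org.apache.spark.sql.SparkSession (assumption:
// only master and appName matter for this sketch).
case class SparkSession(master: String, appName: String)

// Stub standing in for Spark's SqlBasedBenchmark trait.
trait SqlBasedBenchmark {
  def getSparkSession: SparkSession
}

// The benchmark overrides the session hook instead of building its own.
object ExtractBenchmarkSketch extends SqlBasedBenchmark {
  override def getSparkSession: SparkSession =
    SparkSession(master = "local[1]", appName = String.valueOf(getClass.getCanonicalName))
}
```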

Member Author


Not having an unused dependency is better, from my point of view. If you think it makes sense, I could change that in a separate PR and override getSparkSession in other benchmarks like FilterPushdownBenchmark, OrcReadBenchmark, and PrimitiveArrayBenchmark.

Member


Maybe later. For now, never mind. We have more important things to do~ 3.0.0-preview is coming! 😄

Member Author


> We have more important things to do~ 3.0.0-preview is coming!

Please, ping me if you think I could help.

Member Author


Here is the PR #25828 , just in case.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
Closes apache#25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
@MaxGekk MaxGekk deleted the date_part-benchmark branch October 5, 2019 19:17