[SPARK-29065][SQL][TEST] Extend EXTRACT benchmark (#25772)
Conversation
@dongjoon-hyun @cloud-fan @maropu Could you take a look at the PR? Should I extract the cast-related code to a separate PR? WDYT?
Ur, I see. I didn't notice we've already accepted explicit casts from long to timestamp, though. Any reason to support this explicit cast? It seems pgSQL doesn't accept that.
cc: @gengliangwang
Sorry, @MaxGekk . I'm a little negative with this PR.
- First of all, this is not a TEST PR because it touches SQL behavior. You need to split this PR into two.
- In addition, the new behavior is not consistent with PostgreSQL, too.
postgres=# select cast(1 as date);
ERROR: cannot cast type integer to date
LINE 1: select cast(1 as date);
Could you compare this new behavior with the other DBMSs, please?
cc @maropu
Oops. I'm late. :)
hahaha, I was faster
I think long to timestamp is a legacy behavior and we shouldn't go further in this direction. If we do want to convert integrals to date in the benchmark, we can create a UDF/expression to do this in the test package, and use it in the benchmark.
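The conversion such a test-only helper could perform might look like the sketch below: interpret an integral value as a number of days since the epoch and turn it into a date. This is plain `java.time`, not Spark; the helper name and the idea of registering it as a UDF (e.g. via `spark.udf.register`) are assumptions for illustration, not code from this PR.

```scala
import java.time.LocalDate

// Hypothetical test-package helper: treat an integral value as a count of
// days since 1970-01-01 and convert it to a date, avoiding any cast from
// long to timestamp in the benchmark queries themselves.
def toDateFromDays(days: Long): LocalDate = LocalDate.ofEpochDay(days)

println(toDateFromDays(0L))     // 1970-01-01
println(toDateFromDays(18262L)) // 2020-01-01
```

In a benchmark's test package this function could then be wrapped as a Spark UDF, keeping the questionable integral-to-date cast out of user-facing SQL semantics.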
@maropu @dongjoon-hyun @cloud-fan Thank you for your quick feedback. I reverted changes in
case other => throw new IllegalArgumentException(
  s"Unsupported function '$other'. Valid functions are 'extract' and 'date_part'.")
}
codegenBenchmark(s"$field of $from", cardinality) {
do we really need to test with whole-stage-codegen on and off? The benchmark is just a SELECT query and I don't think whole-stage-codegen can help here. We can also see from the benchmark result that there is no difference.
The code is the same, so we could benchmark it with the default settings only: `spark.sql.codegen.wholeStage = true`.
Yea I know the code is the same, but since we are refining this benchmark, we can fix this issue as well.
I left only the codegen benchmarks. From my point of view, it became better with all cases in one table.
LGTM except one comment
Test build #110502 has finished for PR 25772 at commit
Thanks! Merged to master.
Test build #110509 has finished for PR 25772 at commit
Test build #110511 has finished for PR 25772 at commit
private val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName(this.getClass.getCanonicalName)
  .getOrCreate()
Ur, is there any other reason not to use `SqlBasedBenchmark`? You can `override def getSparkSession` from `SqlBasedBenchmark` like `DatasetBenchmark` does.
Not having an unused dependency is better, from my point of view. If you think it makes sense, I could change that in a separate PR, and override `getSparkSession` in other benchmarks like `FilterPushdownBenchmark`, `OrcReadBenchmark`, `PrimitiveArrayBenchmark`.
Maybe later. For now, never mind. We have more important things to do~ 3.0.0-preview is coming! 😄
> We have more important things to do~ 3.0.0-preview is coming!
Please, ping me if you think I could help.
Here is the PR #25828, just in case.
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR apache#25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) but others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?

- Regenerated results of `ExtractBenchmark`

Closes apache#25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
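The motivation above — that some fields exist on a `DATE` directly while others need a `TIMESTAMP` — can be illustrated with plain `java.time` (not Spark; a sketch for intuition only): year-like fields can be read off a date as-is, whereas hour/minute only exist after the date is promoted to a timestamp, which is exactly the conversion overhead the separate `DATE` benchmarks are meant to exclude.

```scala
import java.time.{LocalDate, LocalDateTime}

val d: LocalDate = LocalDate.of(2019, 9, 12)

// A YEAR-like field can be extracted from the date directly.
val year: Int = d.getYear

// An HOUR-like field requires a timestamp; the date is first promoted
// to the start of the day, which models the implicit DATE -> TIMESTAMP
// conversion whose cost the benchmark isolates.
val ts: LocalDateTime = d.atStartOfDay()
val hour: Int = ts.getHour

println(s"year=$year hour=$hour") // year=2019 hour=0
```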