[SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source #41091
Conversation
Do the benchmarks CSVBenchmark and JsonBenchmark show any improvements? Could you regenerate the results JsonBenchmark.*.txt and CSVBenchmark.*.txt, please?
I'm doing benchmarks, but I found a problem, like @gengliangwang said in #36562 (comment): the benchmarks have no case for invalid Timestamp values.
Yep. Let's do that.
…e invalid value

What changes were proposed in this pull request?
While trying to speed up Timestamp type inference with format-based parsing (PRs #36562, #41078, #41091), there was no way to judge whether those changes actually improved inference speed. We need a benchmark to measure whether the optimization of Timestamp type inference is useful. A benchmark for valid Timestamp values already exists, but there is no benchmark for invalid Timestamp values during Timestamp type inference.

Why are the changes needed?
Adds a new benchmark for Timestamp type inference over invalid values, to make sure the speed-up PRs work as intended.

Does this PR introduce any user-facing change?
No

How was this patch tested?
The benchmarks themselves are test code.

Closes #41131 from Hisoka-X/add_banchmarks.
Authored-by: Hisoka <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
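For reference, an invalid-value case in such a benchmark might look roughly like the sketch below. This is not the actual code from #41131: it assumes a SparkSession named spark, the org.apache.spark.benchmark.Benchmark harness, and an illustrative JSON payload.

```scala
import org.apache.spark.benchmark.Benchmark

// Assumes a SparkSession named `spark` is already in scope.
import spark.implicits._

val rowsNum = 1000 * 1000
// Timestamp-like strings that cannot be parsed, so schema inference
// must walk the formatter's failure path for every row.
val invalidRows = spark.range(rowsNum).map(_ => """{"ts": "2022-99-99 99:99:99"}""")

val benchmark = new Benchmark("infer invalid timestamps", rowsNum)
benchmark.addCase("from JSON with inferTimestamp") { _ =>
  spark.read.option("inferTimestamp", true).json(invalidRows).count()
}
benchmark.run()
```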
@Hisoka-X Please resolve the conflicts and rebase on the recent master.
Ok, I will add benchmarks for this too. Please wait. Thanks!
# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala
Benchmark                                                                Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
from_json(date)                                                                   3553            3574          19         0.3        3553.1       0.1X
infer error timestamps from Dataset[String] with default format                   2590            2609          19         0.4        2589.9       0.1X
infer error timestamps from Dataset[String] with user-provided format             2517            2551          30         0.4        2516.8       0.1X
infer error timestamps from Dataset[String] with legacy format                    6836            6876          63         0.1        6836.1       0.0X
@MaxGekk Hi, I updated the benchmark; the speedup is now visible.
+1, LGTM. Merging to master.
Thanks @MaxGekk
What changes were proposed in this pull request?
Follow-up to #36562: performance improvement for Timestamp type inference with the legacy format.
In the current implementation of the CSV/JSON data sources, schema inference with the legacy format relies on methods that throw exceptions when a field cannot be converted to a given data type.
Throwing and catching exceptions can be slow. We can improve this by creating methods that return optional results instead.
The optimization of DefaultTimestampFormatter was implemented in #36562; this PR applies the same optimization to the legacy format. The basic idea is to keep the formatter from throwing exceptions, signaling parse failures through an optional result instead of relying on catch blocks, as in the sketch below.
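A minimal sketch of the pattern, using java.text.SimpleDateFormat directly rather than Spark's internal formatters (the names LegacyParser and parseOptional here are illustrative, not the PR's actual API):

```scala
import java.text.{ParsePosition, SimpleDateFormat}

// Illustrative optional-result wrapper around a legacy SimpleDateFormat
// parser; Spark's legacy timestamp formatters apply the same idea.
class LegacyParser(pattern: String) {
  private val format = new SimpleDateFormat(pattern)
  // Strict parsing, so out-of-range fields fail instead of rolling over.
  format.setLenient(false)

  // SimpleDateFormat.parse(String, ParsePosition) returns null on failure
  // instead of throwing, so the result maps onto an Option without any
  // try/catch on the hot path.
  def parseOptional(s: String): Option[Long] = {
    val pos = new ParsePosition(0)
    val date = format.parse(s, pos)
    // Reject partial matches: the whole input must be consumed.
    if (date != null && pos.getIndex == s.length) Some(date.getTime) else None
  }
}
```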
Why are the changes needed?
Performance improvement for Timestamp type inference with the legacy format.
With the JSON data source, the speedup is about 67%. The CSV data source also gets faster, though the improvement is less pronounced.

Does this PR introduce any user-facing change?
No
How was this patch tested?
Added new tests; a sketch of the kind of check involved follows.
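For illustration, a check against the parseOptional sketch above (hypothetical; the actual tests live in TimestampFormatterSuite and target Spark's real formatters) could look like:

```scala
// Valid input parses; invalid input yields None instead of an exception.
val parser = new LegacyParser("yyyy-MM-dd HH:mm:ss")
assert(parser.parseOptional("2022-01-02 03:04:05").isDefined)
assert(parser.parseOptional("2022-99-99 99:99:99").isEmpty)
```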