Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source #41091

Closed
wants to merge 3 commits into from

Conversation

Hisoka-X
Copy link
Member

@Hisoka-X Hisoka-X commented May 8, 2023

What changes were proposed in this pull request?

Follow up #36562 , performance improvement when Timestamp type inference with legacy format.

In the current implementation of CSV/JSON data source, the schema inference with legacy format relies on methods that will throw exceptions if the fields can't convert as some data types .

Throwing and catching exceptions can be slow. We can improve it by creating methods that return optional results instead.

The optimization of DefaultTimestampFormatter has been implemented in #36562 , this PR adds the optimization of legacy format. The basic logic is to prevent the formatter from throwing exceptions, and then use catch to determine whether the parsing is successful.

Why are the changes needed?

Performance improvement when Timestamp type inference with legacy format.

When use JSON datasource, the speed up 67%. CSV datasource speed also up, but not obvious.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add new test

@github-actions github-actions bot added the SQL label May 8, 2023
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the benchmarks CSVBenchmark and JsonBenchmark show any improvements? Could you regenerate the results JsonBenchmark.*.txt and CSVBenchmark.*.txt, please.

@Hisoka-X
Copy link
Member Author

Do the benchmarks CSVBenchmark and JsonBenchmark show any improvements? Could you regenerate the results JsonBenchmark.*.txt and CSVBenchmark.*.txt, please.

I'm doing benchmarks, but I found a problem, like @gengliangwang said in #36562 (comment) .The benchmarks no case for the string inputs are not valid timestamps. The speed up only work when string input are not valid timestamps. I'm worry about the benchmarks can't prove anything. Can I create a PR for add benchmarks for type inference when string input are not valid timestamps

@MaxGekk
Copy link
Member

MaxGekk commented May 10, 2023

Can I create a PR for add benchmarks for type inference when string input are not valid timestamps

Yep. Let's do that.

MaxGekk pushed a commit that referenced this pull request May 11, 2023
…e invalid value

### What changes were proposed in this pull request?

When we try to speed up Timestamp type inference with format (PR: #36562 #41078 #41091). There is no way to judge whether the change has improved the speed for Timestamp type inference.

So we need a benchmark to measure whether our optimization of Timestamp type inference is useful, we have valid Timestamp value benchmark at now, but don't have invalid Timestamp value benchmark when use Timestamp type inference.

### Why are the changes needed?

Add new benchmark for Timestamp type inference when use invalid value, to make sure our speed up PR work normally.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
benchmarks already are test code.

Closes #41131 from Hisoka-X/add_banchmarks.

Authored-by: Hisoka <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Copy link
Member

MaxGekk commented May 14, 2023

@Hisoka-X Please, resolve the conflicts and rebase on the recent master.

@Hisoka-X
Copy link
Member Author

@Hisoka-X Please, resolve the conflicts and rebase on the recent master.

Ok, I will add benchmarks for this too. Please wait. Thanks!

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala
from_json(date) 3553 3574 19 0.3 3553.1 0.1X
infer error timestamps from Dataset[String] with default format 2590 2609 19 0.4 2589.9 0.1X
infer error timestamps from Dataset[String] with user-provided format 2517 2551 30 0.4 2516.8 0.1X
infer error timestamps from Dataset[String] with legacy format 6836 6876 63 0.1 6836.1 0.0X
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk Hi, I updated the benchmark, the speed already up.

@MaxGekk
Copy link
Member

MaxGekk commented May 16, 2023

+1, LGTM. Merging to master.
Thank you, @Hisoka-X.

@MaxGekk MaxGekk closed this in 3192bbd May 16, 2023
@Hisoka-X
Copy link
Member Author

Thanks @MaxGekk

@Hisoka-X Hisoka-X deleted the SPARK-39281_legacy_format branch May 18, 2023 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants