[SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source #41091

Hisoka-X · 2023-05-08T12:45:36Z

What changes were proposed in this pull request?

Follow-up to #36562: a performance improvement for Timestamp type inference with the legacy format.

In the current implementation of the CSV/JSON data sources, schema inference with the legacy format relies on methods that throw exceptions when a field cannot be converted to a given data type.

Throwing and catching exceptions can be slow. We can improve it by creating methods that return optional results instead.

The optimization for DefaultTimestampFormatter was implemented in #36562; this PR adds the same optimization for the legacy format. The basic idea is to keep the formatter from throwing exceptions and then check the returned result to determine whether parsing succeeded.
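
To illustrate the idea, here is a minimal sketch that contrasts an exception-based parse with an optional-returning one. It uses plain `java.text.SimpleDateFormat` and is not the code from this PR; the object name `LegacyParseSketch` and both method names are hypothetical, and the UTC time zone and non-lenient mode are assumptions made to keep the result deterministic.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.util.{Locale, TimeZone}
import scala.util.Try

// Hypothetical helper for illustration only; not Spark's implementation.
object LegacyParseSketch {
  // Assumption: UTC and non-lenient parsing, so results do not depend on the host.
  private def newFormat(pattern: String): SimpleDateFormat = {
    val sdf = new SimpleDateFormat(pattern, Locale.US)
    sdf.setTimeZone(TimeZone.getTimeZone("UTC"))
    sdf.setLenient(false)
    sdf
  }

  // Exception-based: parse(String) throws java.text.ParseException on bad input,
  // so every invalid value pays for creating and catching an exception.
  def parseWithException(pattern: String, s: String): Option[Long] =
    Try(newFormat(pattern).parse(s).getTime).toOption

  // Exception-free: parse(String, ParsePosition) returns null on failure,
  // so the failure case is an ordinary null check.
  def parseOptional(pattern: String, s: String): Option[Long] = {
    val date = newFormat(pattern).parse(s, new ParsePosition(0))
    Option(date).map(_.getTime)
  }

  def main(args: Array[String]): Unit = {
    println(parseOptional("yyyy-MM-dd", "1970-01-02"))
    println(parseOptional("yyyy-MM-dd", "not a date"))
  }
}
```

Both methods return the same results; the difference is only in how a failed parse is signalled, which is where the cost saving would come from.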

Why are the changes needed?

Performance improvement for Timestamp type inference with the legacy format.

With the JSON data source, the speed-up is 67%. The CSV data source is also faster, but the improvement is less noticeable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.

MaxGekk

Do the benchmarks CSVBenchmark and JsonBenchmark show any improvements? Could you regenerate the results JsonBenchmark.*.txt and CSVBenchmark.*.txt, please.

Hisoka-X · 2023-05-10T14:48:18Z

Do the benchmarks CSVBenchmark and JsonBenchmark show any improvements? Could you regenerate the results JsonBenchmark.*.txt and CSVBenchmark.*.txt, please.

I'm running the benchmarks, but I found a problem, as @gengliangwang said in #36562 (comment): the benchmarks have no case where the string inputs are not valid timestamps. The speed-up only applies when the string inputs are not valid timestamps, so I'm worried that the benchmarks can't prove anything. Can I create a PR to add benchmarks for type inference when the string inputs are not valid timestamps?
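
The point about valid inputs can be sketched as follows. This is a simplified, hypothetical model of per-column inference (the names `InferenceSketch`, `TimestampLike` and `StringLike` are made up for illustration and are not Spark types): it reports the inferred type together with the number of failed parse attempts. With only valid timestamps the failure path is never taken, so a benchmark over such data cannot show the effect of making failures cheaper.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.util.Locale

// Hypothetical result types for this sketch; not Spark's DataType hierarchy.
sealed trait InferredType
case object TimestampLike extends InferredType
case object StringLike extends InferredType

object InferenceSketch {
  // Returns the inferred type and how many values failed to parse.
  // Assumption: a column is timestamp-like only if every value parses.
  def infer(pattern: String, values: Seq[String]): (InferredType, Int) = {
    val sdf = new SimpleDateFormat(pattern, Locale.US)
    sdf.setLenient(false)
    // parse(String, ParsePosition) returns null instead of throwing on failure.
    val failures = values.count(v => sdf.parse(v, new ParsePosition(0)) == null)
    val tpe: InferredType = if (failures == 0) TimestampLike else StringLike
    (tpe, failures)
  }

  def main(args: Array[String]): Unit = {
    println(infer("yyyy-MM-dd", Seq("2020-01-01", "2021-02-03")))
    println(infer("yyyy-MM-dd", Seq("abc", "2020-01-01")))
  }
}
```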

MaxGekk · 2023-05-10T15:41:54Z

Can I create a PR to add benchmarks for type inference when the string inputs are not valid timestamps?

Yep. Let's do that.

…e invalid value

### What changes were proposed in this pull request?
When we try to speed up Timestamp type inference with a format (PRs: #36562, #41078, #41091), there is no way to judge whether a change has actually improved the speed of Timestamp type inference. We need a benchmark to measure whether our optimizations of Timestamp type inference are useful. We currently have a benchmark for valid Timestamp values, but none for invalid Timestamp values in Timestamp type inference.

### Why are the changes needed?
Add a new benchmark for Timestamp type inference with invalid values, to make sure our speed-up PRs work as intended.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The benchmarks are already test code.

Closes #41131 from Hisoka-X/add_banchmarks.

Authored-by: Hisoka <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
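
As a rough illustration of what such a benchmark measures, the sketch below times exception-based and null-checking parsing over a batch of invalid strings. It is a hand-rolled timing loop, not Spark's benchmark framework; the object name `InvalidInputTimingSketch`, the pattern and the input size are assumptions, and timings from a loop like this are only indicative.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.util.Locale
import scala.util.Try

// Hypothetical micro-timing sketch; not Spark's JsonBenchmark or CSVBenchmark.
object InvalidInputTimingSketch {
  private val pattern = "yyyy-MM-dd'T'HH:mm:ss"

  // Runs the body once and returns the elapsed wall-clock time in milliseconds.
  def timeMs(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  // Counts successful parses; each failure throws and catches a ParseException.
  def countParsedWithException(inputs: Seq[String]): Int = {
    val sdf = new SimpleDateFormat(pattern, Locale.US)
    inputs.count(s => Try(sdf.parse(s)).isSuccess)
  }

  // Counts successful parses; each failure is signalled by a null result.
  def countParsedOptional(inputs: Seq[String]): Int = {
    val sdf = new SimpleDateFormat(pattern, Locale.US)
    inputs.count(s => sdf.parse(s, new ParsePosition(0)) != null)
  }

  def main(args: Array[String]): Unit = {
    val invalid = Seq.tabulate(200000)(i => s"not-a-timestamp-$i")
    // Warm-up pass so that JIT compilation is less likely to skew the timings.
    countParsedWithException(invalid)
    countParsedOptional(invalid)

    var viaException = 0
    var viaOptional = 0
    val exceptionMs = timeMs { viaException = countParsedWithException(invalid) }
    val optionalMs = timeMs { viaOptional = countParsedOptional(invalid) }
    println(s"exception-based: $exceptionMs ms ($viaException parsed)")
    println(s"optional-based:  $optionalMs ms ($viaOptional parsed)")
  }
}
```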

MaxGekk · 2023-05-14T07:25:47Z

@Hisoka-X Please, resolve the conflicts and rebase on the recent master.

Hisoka-X · 2023-05-14T07:30:20Z

@Hisoka-X Please, resolve the conflicts and rebase on the recent master.

Ok, I will add benchmarks for this too. Please wait. Thanks!

# Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala

Hisoka-X · 2023-05-14T12:34:58Z

sql/core/benchmarks/JsonBenchmark-results.txt

-from_json(date)                                                                 3553           3574          19          0.3        3553.1       0.1X
-infer error timestamps from Dataset[String] with default format                 2590           2609          19          0.4        2589.9       0.1X
-infer error timestamps from Dataset[String] with user-provided format           2517           2551          30          0.4        2516.8       0.1X
-infer error timestamps from Dataset[String] with legacy format                  6836           6876          63          0.1        6836.1       0.0X


@MaxGekk Hi, I updated the benchmark results; they already show the speed-up.

MaxGekk · 2023-05-16T12:58:19Z

+1, LGTM. Merging to master.
Thank you, @Hisoka-X.

Hisoka-X · 2023-05-16T13:00:41Z

Thanks @MaxGekk

[SPARK-39281][SQL] Fasten Timestamp type inference with legacy format in JSON/CSV data source

22740f3

github-actions bot added the SQL label May 8, 2023

MaxGekk requested changes May 10, 2023

MaxGekk mentioned this pull request May 10, 2023

[SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source #41078

Closed

Hisoka-X mentioned this pull request May 11, 2023

[SPARK-43443][SQL] Add benchmark for Timestamp type inference when use invalid value #41131

Closed

Hisoka-X added 2 commits May 14, 2023 15:54

Merge branch 'master_' into SPARK-39281_legacy_format

c78deec

# Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala

update benchmark

b2bb137

Hisoka-X commented May 14, 2023

MaxGekk approved these changes May 16, 2023

MaxGekk closed this in 3192bbd May 16, 2023

Hisoka-X deleted the SPARK-39281_legacy_format branch May 18, 2023 12:00
