[SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing #28181
Closed
yaooqinn wants to merge 7 commits into apache:master from yaooqinn:SPARK-31414
Conversation
Test build #121087 has finished for PR 28181 at commit
Test build #121091 has finished for PR 28181 at commit
Test build #121094 has finished for PR 28181 at commit
Test build #121095 has finished for PR 28181 at commit
Member: cc @MaxGekk
Member (Author): cc @cloud-fan @maropu @dongjoon-hyun, thanks
Contributor: good catch! LGTM, merging to master/3.0
cloud-fan pushed a commit that referenced this pull request on Apr 13, 2020:
[SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing
### What changes were proposed in this pull request?
With the original benchmark, where the timestamp values are valid for the new parser, the result is:
```
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5781 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 44764 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 93764 ms
[info] Running case: from_json(timestamp)
[info] Stopped after 3 iterations, 59021 ms
```
When we modify the benchmark to
```scala
def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    // `i % 100` yields 1- and 2-digit fractional seconds, which the new
    // strict parser rejects, so every record falls back to the legacy parser.
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
  }.select($"value".as("timestamp")).as[String]
}

readBench.addCase("timestamp strings", numIters) { _ =>
  timestampStr.noop()
}

readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
  spark.read.schema(tsSchema).json(timestampStr).noop()
}

readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
  spark.read.json(timestampStr).noop()
}
```
where the timestamp values are invalid for the new parser, which causes a fallback to the legacy (2.4) parser, the result is:
```
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```
This is roughly a 10x performance regression.
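A plausible illustration of where such overhead comes from (a hypothetical `FallbackParser`, not Spark's actual `TimestampFormatter` internals): when no record matches the new pattern, every value pays the cost of a thrown exception before the legacy path runs, and any work done inside that failure path, such as constructing a fallback formatter, is multiplied by the row count.

```scala
import java.text.SimpleDateFormat
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.control.NonFatal

// Hypothetical sketch of exception-driven fallback parsing; names and
// structure are illustrative only.
class FallbackParser(pattern: String, legacyPattern: String) {
  private val formatter = DateTimeFormatter.ofPattern(pattern)

  // Cache the legacy formatter once; building a new SimpleDateFormat per
  // failed record would add allocation and pattern-compilation cost on
  // top of each exception.
  private lazy val legacyFormatter = new SimpleDateFormat(legacyPattern)

  def parse(s: String): Long =
    try {
      // Fast path: strict java.time parser.
      LocalDateTime.parse(s, formatter).toInstant(ZoneOffset.UTC).toEpochMilli
    } catch {
      // Slow path: taken for every record when none match the pattern.
      case NonFatal(_) => legacyFormatter.parse(s).getTime
    }
}
```

When every record takes the catch branch, as in the modified benchmark, the per-record exception cost alone can account for a slowdown of this magnitude.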
But if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]`, which makes all timestamp values valid for the new parser and prevents the fallback, the result is:
```
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```
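For reference, the square brackets in that pattern are optional sections in `java.time.format.DateTimeFormatter`: each bracketed section matches zero or one occurrence, so a single pattern accepts values both with and without fractional seconds or a zone offset. A small standalone sketch (note that `SSS` still requires exactly three fraction digits when the section is present):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// [.SSS] and [XXX] are optional sections: present or absent, both parse.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")

LocalDateTime.parse("1970-01-01T01:02:03.123", fmt) // fraction present
LocalDateTime.parse("1970-01-01T01:02:03", fmt)     // fraction absent, still valid
```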
### Why are the changes needed?
Fix the performance regression.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New tests added.
Closes #28181 from yaooqinn/SPARK-31414.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d65f534)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request on Apr 15, 2020:
[SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing