
[SPARK-31111][SQL][Tests] Fix interval output issue in ExtractBenchmark #27867

Closed · wants to merge 4 commits

Conversation

yaooqinn (Member) commented Mar 10, 2020

What changes were proposed in this pull request?

Fix the error caused by interval output in ExtractBenchmark.

Why are the changes needed?

Fix a bug in the test:

```
[info]   Running case: cast to interval
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;;
[error] OverwriteByExpression RelationV2[] noop-table, true, true
[error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2]
[error]    +- Range (1262304000, 1272304000, step=1, splits=Some(1))
[error]
[error] 	at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106)
[error] 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389)
[error] 	at org.a
```
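The failure is reproducible outside the benchmark: in Spark 3.0, subtracting two timestamps yields a CalendarIntervalType column, and CheckAnalysis rejects interval types in any output schema at write time, even for the side-effect-free noop sink. A minimal reproduction sketch (assumes an active SparkSession named `spark`):

```scala
// Writing an interval-typed column fails analysis with
// "Cannot use interval type in the table schema", as in the log above.
spark.range(1)
  .selectExpr("cast(id as timestamp) - timestamp'2000-01-01 00:00:00' as i")
  .write.format("noop").mode("overwrite").save()
```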

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Re-ran the benchmark.

yaooqinn (Member Author) commented

cc @cloud-fan, thanks.

SparkQA commented Mar 10, 2020

Test build #119625 has finished for PR 27867 at commit f4501cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
MILLISECONDS of interval                           1435           1449          16         7.0         143.5       0.9X
MICROSECONDS of interval                           1304           1314           9         7.7         130.4       1.0X
EPOCH of interval                                  1440           1453          19         6.9         144.0       0.9X
cast to interval                                   3984           4034          79         2.5         398.4       1.0X
```
Member

So, this slowdown came from the change, right?

Member Author

Hold on a bit... something is wrong here.

Member Author

I fixed the problem, thanks, @dongjoon-hyun.

The `cast to interval` benchmark is slower because of the extra cast-to-string logic.

Member Author

The extract-related benchmark tests are OK now.

case "timestamp" => "cast(id as timestamp)"
case "date" => "cast(cast(id as timestamp) as date)"
case "interval" if toStr => "cast((cast(cast(id as timestamp) as date) - date'0001-01-01') + " +
Contributor

Why do we need to cast to string?

yaooqinn (Member Author) commented Mar 11, 2020

Hmm, in this benchmark there are two kinds of tests: one is extract/date_part, and the other is cast-to-interval; the latter fails type checking when saving.
Or shall we just remove the cast-to-interval case here: https://github.com/apache/spark/pull/27867/files#diff-b89131adf146fef7cd6e3db381fd0807L111

Contributor

Why is it not a problem for date/timestamp?

Member Author

We forbid interval in the output schema only.

Contributor

Maybe we should not use the noop sink here, but `df.queryExecution.toRdd.foreach(_ => ())`.
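A minimal sketch of this suggestion (assumes an active SparkSession named `spark`): `queryExecution.toRdd` materializes the physical plan directly, so the write-side analysis check that rejects interval columns never runs.

```scala
// Evaluate every row without a sink: toRdd exposes the RDD[InternalRow]
// of the physical plan, bypassing the output-schema check entirely.
val df = spark.range(10)
  .selectExpr("cast(id as timestamp) - timestamp'2000-01-01 00:00:00' as i")
df.queryExecution.toRdd.foreach(_ => ())
```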

Member Author

Good advice, thanks!

SparkQA commented Mar 11, 2020

Test build #119644 has finished for PR 27867 at commit e5be851.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 11, 2020

Test build #119650 has finished for PR 27867 at commit 14722a1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

yaooqinn (Member Author) commented

retest this please

SparkQA commented Mar 11, 2020

Test build #119654 has finished for PR 27867 at commit 14722a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented

thanks, merging to master/3.0!

cloud-fan closed this in 2b46662 on Mar 11, 2020
cloud-fan pushed a commit that referenced this pull request Mar 11, 2020
Closes #27867 from yaooqinn/SPARK-31111.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2b46662)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```
@@ -42,7 +42,9 @@ object ExtractBenchmark extends SqlBasedBenchmark {
    spark
      .range(sinceSecond, sinceSecond + cardinality, 1, 1)
      .selectExpr(exprs: _*)
      .noop()
```
Member

FYI, @MaxGekk, too.
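Per the hunk header above (7 lines become 9) and the review discussion, the merged version presumably replaces the `.noop()` call with direct RDD materialization. A sketch of the resulting helper (the method name `doBenchmark` is an assumption; `spark` and `sinceSecond` come from the enclosing benchmark object, as in the hunk):

```scala
// Sketch: evaluate all projected expressions without writing to a sink,
// so interval-typed results are not rejected by the output-schema check.
private def doBenchmark(cardinality: Long, exprs: String*): Unit = {
  spark
    .range(sinceSecond, sinceSecond + cardinality, 1, 1)
    .selectExpr(exprs: _*)
    .queryExecution
    .toRdd
    .foreach(_ => ())
}
```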

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020