
[SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node #28425

Closed
wants to merge 11 commits

Conversation

dilipbiswal (Contributor)

What changes were proposed in this pull request?

Improve the EXPLAIN FORMATTED output of file-based DSV2 scan nodes.

Before

== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)


(1) BatchScan
Output [2]: [value#7, id#8]
Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession@17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@a6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;@40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMap@feca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2)))
(2) ...
(3) ...
(4) ...

After

== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)


(1) BatchScan
Output [2]: [value#7, id#8]
DataFilters: [isnotnull(value#7), (value#7 > 2)]
Format: parquet
Location: InMemoryFileIndex[....]
PartitionFilters: [isnotnull(id#8), (id#8 > 1)]
PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]
ReadSchema: struct<value:int>
(2) ...
(3) ...
(4) ...

Why are the changes needed?

The old format is not very readable. This improves the readability of the plan.

Does this PR introduce any user-facing change?

Yes. The EXPLAIN FORMATTED output will be different.

How was this patch tested?

Added a test case in ExplainSuite.
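
To make this concrete, a hedged sketch of reproducing the output above from a test or shell (path is a placeholder for a parquet dataset partitioned by id, and getting a v2 BatchScan also assumes the v2 file source is enabled):

    // Hypothetical repro; "path" points at a parquet dataset partitioned by "id".
    val df = spark.read.parquet(path)
      .where("id > 1 AND value > 2")
      .select("value")
    df.explain("formatted")  // prints the "(1) BatchScan" block shown above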

@dilipbiswal dilipbiswal changed the title [SPARK-31480] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node Apr 30, 2020
SparkQA commented Apr 30, 2020

Test build #122146 has finished for PR 28425 at commit 688db37.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SupportsMetadata

SparkQA commented May 1, 2020

Test build #122147 has finished for PR 28425 at commit 121770c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

cc @gatorsmile @cloud-fan @maropu

|PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
|ReadSchema: struct\\<value:int\\>
|""".stripMargin.trim

@beliefer (Contributor) commented May 1, 2020:

Seq(("parquet", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
        ("orc", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
        ("csv", "\\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]"),
        ("json", "")).foreach { (format, pushedFilters) =>
       val expected_plan_fragment =
           s"""
              |\\(1\\) BatchScan
              |Output \\[2\\]: \\[value#x, id#x\\]
              |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\]
              |Format: $format
              |Location: InMemoryFileIndex\\[.*\\]
              |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
              |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
              ${if (pushedFilters.nonEmpty) "|PushedFilers..."}
              |ReadSchema: struct\\<value:int\\>
              |""".stripMargin.trim

|Format: $format
|Location: InMemoryFileIndex\\[.*\\]
|PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
|PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
beliefer (Contributor):

Could we extract this line as a variable?

dilipbiswal (Contributor Author):

@beliefer Thanks.. I have updated.

beliefer (Contributor):

Looks good to me.

@@ -65,4 +65,8 @@ case class AvroScan(
}

override def hashCode(): Int = super.hashCode()

override def getMetaData(): Map[String, String] = {
super.metaData ++ Map("Format" -> "avro")
@maropu (Member) commented May 1, 2020:

Could we move all the Format metadata into the FileScan.metadata side?

 "Format" -> s"${this.getClass.getSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)}"

maropu (Member):

Also, could we check the explain output for the Avro V2 scan?

dilipbiswal (Contributor Author):

@maropu Actually, I had tried to test out Avro, but I get the following error:

"Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide"

maropu (Member):

Ah, I see... @gengliangwang could you help with this?

gengliangwang (Member):

@dilipbiswal where do you run the test? I think we have to test it under the external/avro module.

dilipbiswal (Contributor Author):

@maropu Did you want an explain suite created in the avro external module?

maropu (Member):

Yea, if we don't have any other suitable place for adding the test. At least, I think it's better to add tests for it somewhere.

dilipbiswal (Contributor Author):

@maropu OK.. added a test.


trait SupportsMetadata {
def getMetaData(): Map[String, String]
}
maropu (Member):

Don't we need to move this file to the Java side along with Batch and SupportsReportStatistics?

dilipbiswal (Contributor Author):

@maropu I don't see the need to make it part of the external V2 contract. We are using it for explain now, so I thought of keeping it internal, just like we use the Logging trait.

Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text)
}


Member:

super nit: remove the single blank line.

import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils

trait FileScan extends Scan with Batch with SupportsReportStatistics with Logging {
trait FileScan extends Scan
with Batch with SupportsReportStatistics with Logging with SupportsMetadata {
@maropu (Member) commented May 1, 2020:

super nit: better to put Logging at the end, i.e., with Batch with SupportsReportStatistics with SupportsMetadata with Logging {? I personally think we'd better group them by similar features.

*/
package org.apache.spark.sql.internal.connector

trait SupportsMetadata {
Member:

Please add a comment about what this class is used for.

case (_, _) => false
}.map {
case (key, value) => s"$key: ${redact(value)}"
}
Member:

nit: format

    val metaDataStr = scan match {
      case s: SupportsMetadata =>
        s.getMetaData().toSeq.sorted.flatMap {
          case (_, value) if value.isEmpty || value.equals("[]") =>
            None
          case (key, value) =>
            Some(s"$key: ${redact(value)}")
        }
      case _ =>
        Seq(scan.description())
    }

dilipbiswal (Contributor Author):

@maropu Thanks. Looks much better :-)

SparkQA commented May 1, 2020

Test build #122166 has finished for PR 28425 at commit abd7277.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -65,4 +65,6 @@ case class AvroScan(
}

override def hashCode(): Int = super.hashCode()

override def getMetaData(): Map[String, String] = super.metaData
maropu (Member):

Do we need this? It seems FileScan already has the implementation.

dilipbiswal (Contributor Author):

@maropu I get a compile error that forces me to implement it here. Let me know if you have any suggestions.

maropu (Member):

I changed the code in FileScan and the compilation passed:

-  protected def getMetadata(): Map[String, String] = {
+  override def getMetaData(): Map[String, String] = {

dilipbiswal (Contributor Author):

@maropu Thanks a lot :-) I will make a change.

* A mix in interface for {@link FileScan}. This can be used to report metadata
* for a file based scan operator. This is currently used for supporting formatted
* explain.
*/
maropu (Member):

Do we need @Evolving here?

dilipbiswal (Contributor Author):

@maropu Will add.

dilipbiswal (Contributor Author):

@maropu On second thought, this is not an external interface, right? So I don't think we need any annotations here.

maropu (Member):

Not sure, but if we expose this, developers could improve the explain output for their custom scans? cc: @cloud-fan
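
If it were exposed, a third-party scan could hook in along these lines (a hypothetical sketch; MyCustomScan and its metadata keys are made up):

    import org.apache.spark.sql.connector.read.Scan
    import org.apache.spark.sql.internal.connector.SupportsMetadata
    import org.apache.spark.sql.types.StructType

    // Hypothetical custom scan; only the metadata hook is shown in full.
    class MyCustomScan(schema: StructType) extends Scan with SupportsMetadata {
      override def readSchema(): StructType = schema
      // These key/value pairs would surface in EXPLAIN FORMATTED's scan node.
      override def getMetaData(): Map[String, String] = Map(
        "Format" -> "my-format",
        "Location" -> "s3://bucket/path")
    }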

@@ -93,6 +93,11 @@ case class ParquetScan(
super.description() + ", PushedFilters: " + seqToString(pushedFilters)
}

override def getMetaData(): Map[String, String] = {
super.metaData ++ Map("PushedFilers" -> seqToString(pushedFilters))

Member:

nit: remove the blank here.

SparkQA commented May 2, 2020

Test build #122187 has finished for PR 28425 at commit 3d6040a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 2, 2020

Test build #122199 has finished for PR 28425 at commit be3bbe4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 2, 2020

Test build #122191 has finished for PR 28425 at commit 02f230b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroV2Suite extends AvroSuite with ExplainSuiteHelper

beliefer (Contributor) commented May 2, 2020:

retest this please

SparkQA commented May 2, 2020

Test build #122205 has finished for PR 28425 at commit be3bbe4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* for a file based scan operator. This is currently used for supporting formatted
* explain.
*/
@Evolving
cloud-fan (Contributor):

If it's internal, we don't need the evolving annotation.

dilipbiswal (Contributor Author):

@cloud-fan Thanks, will remove.

@@ -343,6 +343,54 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite
assert(getNormalizedExplain(df1, FormattedMode) === getNormalizedExplain(df2, FormattedMode))
}
}

test("Explain formatted output for scan operator for datasource V2") {
cloud-fan (Contributor):

Can we add a table-scan-explain.sql to test it? It's easier to see the result.

dilipbiswal (Contributor Author):

@cloud-fan Agreed. Actually, I had tried, but I could not get the V2 scan set up through SQL. Could you please tell me how to do it?

cloud-fan (Contributor):

Oh I see. Currently DS v2 scan is enabled only in DataFrameReader, so we can't get it through pure SQL. Then this is fine.
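
For background, a hedged sketch of how a DS v2 file scan is typically exercised from code (spark.sql.sources.useV1SourceList controls the V1 fallback; the path is a placeholder):

    // Clearing the V1 fallback list makes DataFrameReader plan a DSv2 BatchScan
    // for parquet instead of the V1 FileSourceScanExec.
    spark.conf.set("spark.sql.sources.useV1SourceList", "")
    spark.read.parquet("/tmp/data").where("value > 2").explain("formatted")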

dilipbiswal (Contributor Author):

+1. Thank you.

Member:

Yea, I think so...

SparkQA commented May 6, 2020

Test build #122340 has finished for PR 28425 at commit 7468251.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

SparkQA commented May 6, 2020

Test build #122346 has finished for PR 28425 at commit 7468251.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -46,6 +47,31 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result)
}

/**
* Shorthand for calling redactString() without specifying redacting rules
Ngone51 (Member):

nit: I don't see a function named redactString() around here.

dilipbiswal (Contributor Author):

@Ngone51 Thanks.. have changed it.

SparkQA commented May 7, 2020

Test build #122394 has finished for PR 28425 at commit 3e61908.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

"csv" ->
"|PushedFilers: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
"json" ->
"|remove_marker"
dongjoon-hyun (Member):

Can we simply put ""?

dilipbiswal (Contributor Author):

@dongjoon-hyun I had tried that and it didn't work for me. Perhaps there is a better way to do this. Basically, for JSON, I don't want a line printed for pushedFilters. Putting a "" results in the following as the expected output. Here I wanted to get rid of the empty line between PartitionFilters and ReadSchema:

\(1\) BatchScan
Output \[2\]: \[value#x, id#x\]
DataFilters: \[isnotnull\(value#x\), \(value#x > 2\)\]
Format: json
Location: InMemoryFileIndex\[.*\]
PartitionFilters: \[isnotnull\(id#x\), \(id#x > 1\)\]

ReadSchema: struct\<value:int\>

|PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
${pushFilterMaps.get(fmt).get}
|ReadSchema: struct\\<value:int\\>
|""".stripMargin.replaceAll("\nremove_marker", "").trim
dongjoon-hyun (Member):

It seems that we can remove .replaceAll("\nremove_marker", "") if we fix line 376. WDYT?

dilipbiswal (Contributor Author):

@dongjoon-hyun Please see my response above.
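
For reference, a sketch of what the remove_marker trick above is doing (pushFilterMaps is the map referenced in the snippet; this illustrates the mechanism rather than quoting the exact suite code):

    // For formats with no pushed filters, the map supplies "|remove_marker" so
    // the template still has a well-formed margin line; stripMargin turns it
    // into "\nremove_marker", and the trailing replaceAll deletes that whole
    // line instead of leaving a blank one behind.
    val pushFilterMaps = Map(
      "csv" -> "|PushedFilers: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
      "json" -> "|remove_marker")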

@dongjoon-hyun (Member) left a review comment:

+1, LGTM (except a few minor comments)

SparkQA commented Jul 12, 2020

Test build #125705 has finished for PR 28425 at commit e177c2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu commented Jul 12, 2020

retest this please

SparkQA commented Jul 12, 2020

Test build #125708 has finished for PR 28425 at commit e177c2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dilipbiswal commented Jul 14, 2020

@dongjoon-hyun I have addressed most of your comments except a couple, where I have left my responses. Please let me know if you are okay with it. If so, I will go ahead and merge this.

maropu commented Jul 15, 2020

@dilipbiswal Yea, I checked the latest commit and it looks okay.

@dilipbiswal (Contributor Author)

retest this please

SparkQA commented Jul 15, 2020

Test build #125869 has finished for PR 28425 at commit e177c2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

maropu commented Jul 15, 2020

@dilipbiswal Seems like the tests in GitHub Actions passed. I think our current policy for merging PRs is:

I do believe PRs can be merged in most general cases once the Jenkins PR
builder or Github Actions build passes when we expect the successful test results from
the default Jenkins PR builder.

http://apache-spark-developers-list.1001551.n3.nabble.com/PSA-Apache-Spark-uses-GitHub-Actions-to-run-the-tests-tp29785.html

@dilipbiswal (Contributor Author)

@maropu Thanks for the info.
I merged the PR. However, the script didn't update the JIRA, as I didn't have the Python package installed. Do we have to manually edit the JIRA?

maropu commented Jul 15, 2020

Do we have to manually edit JIRA ?

Yea, you need to update it manually.

@dongjoon-hyun (Member)

Congratulations on your first merge, @dilipbiswal. :)
The last commit looks good since it works. You can ignore my previous comment.

@dongjoon-hyun (Member)

BTW, sorry for late reply~

@dilipbiswal (Contributor Author)

@dongjoon-hyun

BTW, sorry for late reply

Hey, no problem :-)

Thank you
