
[SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node #28425

Closed
wants to merge 11 commits

Conversation

dilipbiswal (Contributor)

What changes were proposed in this pull request?

Improve the EXPLAIN FORMATTED output of file-based DSV2 scan nodes.

Before

== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)


(1) BatchScan
Output [2]: [value#7, id#8]
Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession@17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@a6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;@40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMap@feca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2)))
(2) ...
(3) ...
(4) ...

After

== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)


(1) BatchScan
Output [2]: [value#7, id#8]
DataFilters: [isnotnull(value#7), (value#7 > 2)]
Format: parquet
Location: InMemoryFileIndex[....]
PartitionFilters: [isnotnull(id#8), (id#8 > 1)]
PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]
ReadSchema: struct<value:int>
(2) ...
(3) ...
(4) ...

Why are the changes needed?

The old format is not very readable. This improves the readability of the plan.

Does this PR introduce any user-facing change?

Yes. The EXPLAIN FORMATTED output will be different.

How was this patch tested?

Added a test case in ExplainSuite.
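
To make this concrete, a hedged sketch of reproducing the output above from a test or shell (path is a placeholder for a parquet dataset partitioned by id, and getting a v2 BatchScan also assumes the v2 file source is enabled):

    // Hypothetical repro; "path" points at a parquet dataset partitioned by "id".
    val df = spark.read.parquet(path)
      .where("id > 1 AND value > 2")
      .select("value")
    df.explain("formatted")  // prints the "(1) BatchScan" block shown above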

@dilipbiswal dilipbiswal changed the title [SPARK-31480] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node Apr 30, 2020
SparkQA commented Apr 30, 2020

Test build #122146 has finished for PR 28425 at commit 688db37.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SupportsMetadata

SparkQA commented May 1, 2020

Test build #122147 has finished for PR 28425 at commit 121770c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

cc @gatorsmile @cloud-fan @maropu

|PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
|ReadSchema: struct\\<value:int\\>
|""".stripMargin.trim

@beliefer (Contributor) commented May 1, 2020:

Seq(("parquet", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
        ("orc", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
        ("csv", "\\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]"),
        ("json", "")).foreach { (format, pushedFilters) =>
       val expected_plan_fragment =
           s"""
              |\\(1\\) BatchScan
              |Output \\[2\\]: \\[value#x, id#x\\]
              |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\]
              |Format: $format
              |Location: InMemoryFileIndex\\[.*\\]
              |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
              |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
              ${if (pushedFilters.nonEmpty) "|PushedFilers..."}
              |ReadSchema: struct\\<value:int\\>
              |""".stripMargin.trim

|Format: $format
|Location: InMemoryFileIndex\\[.*\\]
|PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
|PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
beliefer (Contributor):

Could we extract this line as a variable?

dilipbiswal (Contributor Author):

@beliefer Thanks.. I have updated.

beliefer (Contributor):

Looks good to me.

@@ -65,4 +65,8 @@ case class AvroScan(
}

override def hashCode(): Int = super.hashCode()

override def getMetaData(): Map[String, String] = {
super.metaData ++ Map("Format" -> "avro")
@maropu (Member) commented May 1, 2020:

Could we move all the Format metadata into the FileScan.metadata side?

 "Format" -> s"${this.getClass.getSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)}"

maropu (Member):

Also, could we check the explain output for the Avro V2 scan?

dilipbiswal (Contributor Author):

@maropu Actually, I had tried to test out Avro, but I get the following error:

"Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide"

maropu (Member):

Ah, I see... @gengliangwang could you help with this?

gengliangwang (Member):

@dilipbiswal where do you run the test? I think we have to test it under the external/avro module.

dilipbiswal (Contributor Author):

@maropu Did you want an explain suite created in the avro external module?

maropu (Member):

Yea, if we don't have any other suitable place for adding the test. At least, I think it's better to add tests for it somewhere.

dilipbiswal (Contributor Author):

@maropu OK.. added a test.


trait SupportsMetadata {
def getMetaData(): Map[String, String]
}
maropu (Member):

Don't we need to move this file to the Java side along with Batch and SupportsReportStatistics?

dilipbiswal (Contributor Author):

@maropu I don't see the need to make it part of the external V2 contract. We are using it for explain now, so I thought of keeping it internal, just like we use the Logging trait.

Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text)
}


Member:

super nit: remove the single blank line.

import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils

trait FileScan extends Scan with Batch with SupportsReportStatistics with Logging {
trait FileScan extends Scan
with Batch with SupportsReportStatistics with Logging with SupportsMetadata {
@maropu (Member) commented May 1, 2020:

super nit: better to put Logging at the end, i.e., with Batch with SupportsReportStatistics with SupportsMetadata with Logging {? I personally think we'd better group them by similar features.

*/
package org.apache.spark.sql.internal.connector

trait SupportsMetadata {
Member:

Please add a comment about what this class is used for.

case (_, _) => false
}.map {
case (key, value) => s"$key: ${redact(value)}"
}
Member:

nit: format

    val metaDataStr = scan match {
      case s: SupportsMetadata =>
        s.getMetaData().toSeq.sorted.flatMap {
          case (_, value) if value.isEmpty || value.equals("[]") =>
            None
          case (key, value) =>
            Some(s"$key: ${redact(value)}")
        }
      case _ =>
        Seq(scan.description())
    }

dilipbiswal (Contributor Author):

@maropu Thanks. Looks much better :-)

SparkQA commented May 1, 2020

Test build #122166 has finished for PR 28425 at commit abd7277.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -65,4 +65,6 @@ case class AvroScan(
}

override def hashCode(): Int = super.hashCode()

override def getMetaData(): Map[String, String] = super.metaData
maropu (Member):

Do we need this? It seems FileScan already has the implementation.

dilipbiswal (Contributor Author):

@maropu I get a compile error that forces me to implement it here. Let me know if you have any suggestions.

maropu (Member):

I changed the code in FileScan and the compilation passed:

-  protected def getMetadata(): Map[String, String] = {
+  override def getMetaData(): Map[String, String] = {

dilipbiswal (Contributor Author):

@maropu Thanks a lot :-) I will make a change.

* A mix in interface for {@link FileScan}. This can be used to report metadata
* for a file based scan operator. This is currently used for supporting formatted
* explain.
*/
maropu (Member):

Do we need @Evolving here?

dilipbiswal (Contributor Author):

@maropu Will add.

dilipbiswal (Contributor Author):

@maropu On second thought, this is not an external interface, right? So I don't think we need any annotations here.

maropu (Member):

Not sure, but if we expose this, developers could improve the explain output for their custom scans? cc: @cloud-fan
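
If it were exposed, a third-party scan could hook in along these lines (a hypothetical sketch; MyCustomScan and its metadata keys are made up):

    import org.apache.spark.sql.connector.read.Scan
    import org.apache.spark.sql.internal.connector.SupportsMetadata
    import org.apache.spark.sql.types.StructType

    // Hypothetical custom scan; only the metadata hook is shown in full.
    class MyCustomScan(schema: StructType) extends Scan with SupportsMetadata {
      override def readSchema(): StructType = schema
      // These key/value pairs would surface in EXPLAIN FORMATTED's scan node.
      override def getMetaData(): Map[String, String] = Map(
        "Format" -> "my-format",
        "Location" -> "s3://bucket/path")
    }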

@@ -93,6 +93,11 @@ case class ParquetScan(
super.description() + ", PushedFilters: " + seqToString(pushedFilters)
}

override def getMetaData(): Map[String, String] = {
super.metaData ++ Map("PushedFilers" -> seqToString(pushedFilters))

Member:

nit: remove the blank here.

SparkQA commented May 2, 2020

Test build #122187 has finished for PR 28425 at commit 3d6040a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 2, 2020

Test build #122199 has finished for PR 28425 at commit be3bbe4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 2, 2020

Test build #122191 has finished for PR 28425 at commit 02f230b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroV2Suite extends AvroSuite with ExplainSuiteHelper

beliefer (Contributor) commented May 2, 2020:

retest this please

SparkQA commented May 2, 2020

Test build #122205 has finished for PR 28425 at commit be3bbe4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* for a file based scan operator. This is currently used for supporting formatted
* explain.
*/
@Evolving
cloud-fan (Contributor):

If it's internal, we don't need the evolving annotation.

dilipbiswal (Contributor Author):

@cloud-fan Thanks, will remove.

@@ -343,6 +343,54 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite
assert(getNormalizedExplain(df1, FormattedMode) === getNormalizedExplain(df2, FormattedMode))
}
}

test("Explain formatted output for scan operator for datasource V2") {
cloud-fan (Contributor):

Can we add a table-scan-explain.sql to test it? It's easier to see the result.

dilipbiswal (Contributor Author):

@cloud-fan Agreed. Actually, I had tried, but I could not get the V2 scan set up through SQL. Could you please tell me how to do it?

cloud-fan (Contributor):

Oh I see. Currently DS v2 scan is enabled only in DataFrameReader, so we can't get it through pure SQL. Then this is fine.
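
For background, a hedged sketch of how a DS v2 file scan is typically exercised from code (spark.sql.sources.useV1SourceList controls the V1 fallback; the path is a placeholder):

    // Clearing the V1 fallback list makes DataFrameReader plan a DSv2 BatchScan
    // for parquet instead of the V1 FileSourceScanExec.
    spark.conf.set("spark.sql.sources.useV1SourceList", "")
    spark.read.parquet("/tmp/data").where("value > 2").explain("formatted")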

dilipbiswal (Contributor Author):

+1. Thank you.

Member:

Yea, I think so...

SparkQA commented May 6, 2020

Test build #122340 has finished for PR 28425 at commit 7468251.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

SparkQA commented May 6, 2020

Test build #122346 has finished for PR 28425 at commit 7468251.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -46,6 +47,31 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result)
}

/**
* Shorthand for calling redactString() without specifying redacting rules
Ngone51 (Member):

nit: I don't see a function named redactString() around here.

dilipbiswal (Contributor Author):

@Ngone51 Thanks.. have changed it.

SparkQA commented May 7, 2020

Test build #122394 has finished for PR 28425 at commit 3e61908.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

"csv" ->
"|PushedFilers: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
"json" ->
"|remove_marker"
dongjoon-hyun (Member):

Can we simply put ""?

dilipbiswal (Contributor Author):

@dongjoon-hyun I had tried that and it didn't work for me. Perhaps there is a better way to do this. Basically, for JSON, I don't want a line printed for pushedFilters. Putting a "" results in the following as the expected output. Here I wanted to get rid of the empty line between PartitionFilters and ReadSchema:

\(1\) BatchScan
Output \[2\]: \[value#x, id#x\]
DataFilters: \[isnotnull\(value#x\), \(value#x > 2\)\]
Format: json
Location: InMemoryFileIndex\[.*\]
PartitionFilters: \[isnotnull\(id#x\), \(id#x > 1\)\]

ReadSchema: struct\<value:int\>

|PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
${pushFilterMaps.get(fmt).get}
|ReadSchema: struct\\<value:int\\>
|""".stripMargin.replaceAll("\nremove_marker", "").trim
dongjoon-hyun (Member):

It seems that we can remove .replaceAll("\nremove_marker", "") if we fix line 376. WDYT?

dilipbiswal (Contributor Author):

@dongjoon-hyun Please see my response above.
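
For reference, a sketch of what the remove_marker trick above is doing (pushFilterMaps is the map referenced in the snippet; this illustrates the mechanism rather than quoting the exact suite code):

    // For formats with no pushed filters, the map supplies "|remove_marker" so
    // the template still has a well-formed margin line; stripMargin turns it
    // into "\nremove_marker", and the trailing replaceAll deletes that whole
    // line instead of leaving a blank one behind.
    val pushFilterMaps = Map(
      "csv" -> "|PushedFilers: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
      "json" -> "|remove_marker")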

@dongjoon-hyun (Member) left a review comment:

+1, LGTM (except a few minor comments)

SparkQA commented Jul 12, 2020

Test build #125705 has finished for PR 28425 at commit e177c2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu commented Jul 12, 2020

retest this please

SparkQA commented Jul 12, 2020

Test build #125708 has finished for PR 28425 at commit e177c2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dilipbiswal commented Jul 14, 2020

@dongjoon-hyun I have addressed most of your comments except a couple, where I have left my responses. Please let me know if you are okay with it. If so, I will go ahead and merge this.

maropu commented Jul 15, 2020

@dilipbiswal Yea, I checked the latest commit and it looks okay.

@dilipbiswal (Contributor Author)

retest this please

SparkQA commented Jul 15, 2020

Test build #125869 has finished for PR 28425 at commit e177c2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

maropu commented Jul 15, 2020

@dilipbiswal Seems like the tests in GitHub Actions passed. I think our current policy for merging PRs is:

I do believe PRs can be merged in most general cases once the Jenkins PR
builder or Github Actions build passes when we expect the successful test results from
the default Jenkins PR builder.

http://apache-spark-developers-list.1001551.n3.nabble.com/PSA-Apache-Spark-uses-GitHub-Actions-to-run-the-tests-tp29785.html

@dilipbiswal (Contributor Author)

@maropu Thanks for the info.
I merged the PR. However, the script didn't update the JIRA, as I didn't have the Python package installed. Do we have to manually edit the JIRA?

maropu commented Jul 15, 2020

Do we have to manually edit JIRA ?

Yea, you need to update it manually.

@dongjoon-hyun (Member)

Congratulations on your first merge, @dilipbiswal. :)
The last commit looks good since it works. You can ignore my previous comment.

@dongjoon-hyun (Member)

BTW, sorry for late reply~

@dilipbiswal (Contributor Author)

@dongjoon-hyun

BTW, sorry for late reply

Hey, no problem :-)

Thank you
