[SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node #28425
Conversation
Test build #122146 has finished for PR 28425 at commit
Test build #122147 has finished for PR 28425 at commit
|PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
|ReadSchema: struct\\<value:int\\>
|""".stripMargin.trim
Seq(("parquet", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
  ("orc", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
  ("csv", "\\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]"),
  ("json", "")).foreach { case (format, pushedFilters) =>
  val expected_plan_fragment =
    s"""
       |\\(1\\) BatchScan
       |Output \\[2\\]: \\[value#x, id#x\\]
       |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\]
       |Format: $format
       |Location: InMemoryFileIndex\\[.*\\]
       |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
       ${if (pushedFilters.nonEmpty) s"|PushedFilers: $pushedFilters" else ""}
       |ReadSchema: struct\\<value:int\\>
       |""".stripMargin.trim
|Format: $format
|Location: InMemoryFileIndex\\[.*\\]
|PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
|PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
Could we extract this line as a variable?
@beliefer Thanks.. I have updated.
Looks good to me.
@@ -65,4 +65,8 @@ case class AvroScan(
  }

  override def hashCode(): Int = super.hashCode()

  override def getMetaData(): Map[String, String] = {
    super.metaData ++ Map("Format" -> "avro")
Could we move all the Format metadata into the FileScan.metadata side?
"Format" -> s"${this.getClass.getSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)}"
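For reference, the suggestion above can be sketched standalone. This is a hypothetical, simplified model (the trait and stub names here are illustrative, not Spark's actual classes): derive the `Format` metadata entry once in the base trait from the concrete class name, instead of overriding it in every format-specific scan.

```scala
import java.util.Locale

// Hypothetical base trait standing in for FileScan: the "Format" entry is
// computed from the concrete class name, e.g. "AvroScan" -> "avro".
trait FileScanLike {
  def metaData: Map[String, String] =
    Map("Format" ->
      this.getClass.getSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT))
}

// Stub standing in for the real format-specific scan; it no longer needs
// its own "Format" override.
case class AvroScan() extends FileScanLike
```

With this shape, adding a new format-specific scan picks up its `Format` metadata for free, which is the motivation behind the review comment.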
Also, could we check the explain output for the Avro V2 scan?
@maropu Actually I had tried to test out avro, but I get the following error:
"Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide""
Ah, I see... @gengliangwang could you help with this?
@dilipbiswal where do you run the test? I think we have to test it under the external/avro module.
@maropu Did you want an explain suite created in the avro external module?
Yea, if we don't have any other suitable place for adding the test. At least, I think it's better to add tests for it somewhere.
@maropu OK.. added a test.
trait SupportsMetadata {
  def getMetaData(): Map[String, String]
}
We don't need to move this file into the java side along with Batch and SupportsReportStatistics?
@maropu I don't see the need to make it a part of the external V2 contract. We are using it for explain now, so I thought of keeping it internal, just like we use the Logging trait.
  Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text)
}
super nit: remove the single blank line.
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils

-trait FileScan extends Scan with Batch with SupportsReportStatistics with Logging {
+trait FileScan extends Scan
+  with Batch with SupportsReportStatistics with Logging with SupportsMetadata {
super nit: better to put Logging in the end? with Batch with SupportsReportStatistics with SupportsMetadata with Logging {. I personally think we'd better group them by similar features.
*/
package org.apache.spark.sql.internal.connector

trait SupportsMetadata {
Please add a comment about what this class is used for?
  case (_, _) => false
}.map {
  case (key, value) => s"$key: ${redact(value)}"
}
nit: format

val metaDataStr = scan match {
  case s: SupportsMetadata =>
    s.getMetaData().toSeq.sorted.flatMap {
      case (key, value) if value.nonEmpty && !value.equals("[]") =>
        Some(s"$key: ${redact(value)}")
      case _ =>
        None
    }
  case _ =>
    Seq(scan.description())
}
@maropu Thanks. Looks much better :-)
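The metadata-rendering logic discussed above can be exercised as a standalone sketch (redaction omitted, function name hypothetical): entries whose value is empty or the placeholder `"[]"` are dropped, so they don't produce noise lines in the formatted explain output.

```scala
// Simplified model of the flatMap in the review suggestion: skip entries
// with empty or "[]" values, render the rest as "key: value" lines in
// sorted key order.
def renderMetaData(meta: Map[String, String]): Seq[String] =
  meta.toSeq.sorted.flatMap {
    case (_, value) if value.isEmpty || value == "[]" => None
    case (key, value) => Some(s"$key: $value")
  }

val rendered = renderMetaData(Map(
  "Format" -> "parquet",
  "PushedFilers" -> "[]", // dropped: placeholder-empty list
  "ReadSchema" -> "struct<value:int>"))
// rendered == Seq("Format: parquet", "ReadSchema: struct<value:int>")
```

Using `flatMap` with `Some`/`None` keeps the filter and the formatting in one pass, which is the shape the reviewer suggested.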
Test build #122166 has finished for PR 28425 at commit
@@ -65,4 +65,6 @@ case class AvroScan(
  }

  override def hashCode(): Int = super.hashCode()

  override def getMetaData(): Map[String, String] = super.metaData
We need this? It seems FileScan already has the implementation.
@maropu I get a compile error that forces me to implement it here. Let me know if you have any suggestion.
I changed the code in FileScan and the compilation passed:
- protected def getMetadata(): Map[String, String] = {
+ override def getMetaData(): Map[String, String] = {
@maropu Thanks a lot :-) I will make a change.
* A mix in interface for {@link FileScan}. This can be used to report metadata
* for a file based scan operator. This is currently used for supporting formatted
* explain.
*/
We need @Evolving here?
@maropu Will Add.
@maropu On second thought, this is not an external interface, right? So I don't think we need any annotations here.
Not sure, but if we expose this, developers could improve explain output for their custom scan? cc: @cloud-fan
@@ -93,6 +93,11 @@ case class ParquetScan(
  super.description() + ", PushedFilters: " + seqToString(pushedFilters)
}

override def getMetaData(): Map[String, String] = {
  super.metaData ++ Map("PushedFilers" -> seqToString(pushedFilters))
nit: remove the blank here.
Test build #122187 has finished for PR 28425 at commit
Test build #122199 has finished for PR 28425 at commit
Test build #122191 has finished for PR 28425 at commit
retest this please
Test build #122205 has finished for PR 28425 at commit
* for a file based scan operator. This is currently used for supporting formatted
* explain.
*/
@Evolving
if it's internal, we don't need the evolving annotation
@cloud-fan Thanks. will remove.
@@ -343,6 +343,54 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite
    assert(getNormalizedExplain(df1, FormattedMode) === getNormalizedExplain(df2, FormattedMode))
  }
}

test("Explain formatted output for scan operator for datasource V2") {
can we add a table-scan-explain.sql to test it? It's easier to see the result.
@cloud-fan Agree. Actually I had tried, but could not get the V2 scan set up through SQL. Could you please tell me how to do it?
Oh I see. Currently DS v2 scan is enabled only in DataFrameReader, so we can't get it through pure SQL. Then this is fine.
+1. Thank you.
Yea, I think so...
Test build #122340 has finished for PR 28425 at commit
retest this please
Test build #122346 has finished for PR 28425 at commit
@@ -46,6 +47,31 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
  Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result)
}

/**
 * Shorthand for calling redactString() without specifying redacting rules
nit: I don't see a function named redactString() around here.
@Ngone51 Thanks.. have changed it.
Test build #122394 has finished for PR 28425 at commit
"csv" ->
  "|PushedFilers: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
"json" ->
  "|remove_marker"
Can we simply put ""?
@dongjoon-hyun I had tried and it didn't work for me. Perhaps there is a better way to do this. Basically, for JSON, I don't want a line printed for pushedFilters. Putting a "" results in the following as the expected output. Here I wanted to get rid of the empty line between PartitionFilters and ReadSchema:

\(1\) BatchScan
Output \[2\]: \[value#x, id#x\]
DataFilters: \[isnotnull\(value#x\), \(value#x > 2\)\]
Format: json
Location: InMemoryFileIndex\[.*\]
PartitionFilters: \[isnotnull\(id#x\), \(id#x > 1\)\]

ReadSchema: struct\<value:int\>
|PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
${pushFilterMaps.get(fmt).get}
|ReadSchema: struct\\<value:int\\>
|""".stripMargin.replaceAll("\nremove_marker", "").trim
It seems that we can remove .replaceAll("\nremove_marker", "") if we fix line 376. WDYT?
@dongjoon-hyun Please see my response above.
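For reference, the remove_marker trick under discussion can be sketched standalone (simplified: regex escapes and most plan lines dropped, names as in the test code). Formats with no pushed filters map to a marker line, and replaceAll strips the marker together with its preceding newline after stripMargin, so no blank line is left in the expected fragment.

```scala
// Formats with pushed filters get a real line; "json" gets the marker.
val pushFilterMaps = Map(
  "csv" -> "|PushedFilers: [IsNotNull(value), GreaterThan(value,2)]",
  "json" -> "|remove_marker")

def expectedFragment(fmt: String): String =
  s"""
     |Format: $fmt
     ${pushFilterMaps(fmt)}
     |ReadSchema: struct<value:int>
     |""".stripMargin.replaceAll("\nremove_marker", "").trim
// expectedFragment("json") == "Format: json\nReadSchema: struct<value:int>"
```

The marker line survives stripMargin as plain `remove_marker`, so matching `"\nremove_marker"` removes both the line and the newline that would otherwise become the blank line dilipbiswal describes above.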
+1, LGTM (except a few minor comments)
Force-pushed from 7690442 to e177c2a (compare)
Test build #125705 has finished for PR 28425 at commit
retest this please
Test build #125708 has finished for PR 28425 at commit
@dongjoon-hyun I have addressed most of your comments except a couple; I have put my responses inline. Please let me know if you are okay with it. If so, I will go ahead and merge this.
@dilipbiswal Yea, I checked the latest commit and it looks okay.
retest this please
Test build #125869 has finished for PR 28425 at commit
retest this please
@dilipbiswal Seems like the tests in GitHub Actions passed. I think our current policy for merging PRs now is:
@maropu Thanks for the info.
Yea, you need to update it manually.
Congratulations on your first merge, @dilipbiswal. :)
BTW, sorry for the late reply~
Thank you |
What changes were proposed in this pull request?
Improve the EXPLAIN FORMATTED output of DSV2 Scan nodes (file based ones).
Before
After
Why are the changes needed?
The old format is not very readable. This improves the readability of the plan.
Does this PR introduce any user-facing change?
Yes. The explain output will be different.
How was this patch tested?
Added a test case in ExplainSuite.