Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions #28420

Closed
wants to merge 9 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Apr 30, 2020

What changes were proposed in this pull request?

The RuntimeReplaceable ones are runtime replaceable, thus, their original parameters are not going to be resolved to PrettyAttribute and remain debug style string if we directly implement their sql methods with their parameters' sql methods.

This PR is raised with suggestions by @maropu and @cloud-fan https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the sql methods of the RuntimeReplaceable ones with toPettySQL

Why are the changes needed?

Consistency of schema output between RuntimeReplaceable expressions and normal ones.

For example, date_format vs to_timestamp, before this PR, they output differently

Before

select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>

After

select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>



select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")

struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>

Does this PR introduce any user-facing change?

Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example

How was this patch tested?

regenerate all related tests

@@ -1206,7 +1207,7 @@ case class DatetimeSub(
interval: Expression,
child: Expression) extends RuntimeReplaceable {
override def toString: String = s"$start - $interval"
override def sql: String = s"${start.sql} - ${interval.sql}"
override def sql: String = s"${toPrettySQL(start)} - ${toPrettySQL(interval)}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means RuntimeReplaceable.sql always use pretty sql, while other expressions only use pretty sql when the caller side needs to (by calling toPrettySQL).

Is it possible to make them consistent? Like make start and interval as innerChildren and handle them in usePrettyExpression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let me see

@SparkQA
Copy link

SparkQA commented Apr 30, 2020

Test build #122122 has finished for PR 28420 at commit c0fd2a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 30, 2020

Test build #122123 has finished for PR 28420 at commit 3e66c25.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 30, 2020

Test build #122142 has finished for PR 28420 at commit 29c25b9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-31615][SQL] Pretty string output for sql method of RuntimeRepl… [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions Apr 30, 2020
Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the followup, @yaooqinn

override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" +
" innerChildren with the original parameters")

protected val sqlStrSeparator: String = ", "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this variable? It seems only Extract uses this.

@@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable {
// two `RuntimeReplaceable` are considered to be semantically equal if their "child" expressions
// are semantically equal.
override lazy val canonicalized: Expression = child.canonicalized

override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since compilers cannot catch the case where one forget to override this, how about it like this?

  // RuntimeReplaceable
  def exprsReplaced: Seq[Expression]
  override def innerChildren: Seq[Expression] = exprsReplaced

  // e.g., ParseToTimestamp
  override def exprsReplaced: Seq[Expression] = Seq(Some(left), format).flatten

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Point!

override def sql: String = RuntimeReplaceable.this.prettyName +
prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")")

protected var prettyChildren: Seq[Expression] = innerChildren
Copy link
Member

@maropu maropu Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its better not to use var where possible, so how about it like this?

  // RuntimeReplaceable
  def newInstance(innerChildren: Seq[Expression]): RuntimeReplaceable

  // e.g., ParseToTimestamp
  override def newInstance(innerChildren: Seq[Expression]): ParseToTimestamp = {
    innerChildren match {
      case Seq(left) => ParseToTimestamp(left, None, child)
      case Seq(left, format) => ParseToTimestamp(left, Some(format), child)
    }
  }

Then, update the toPrettySQL in utils.package;

  def toPrettySQL(e: Expression): String = e match {
    case r: RuntimeReplaceable =>
      r.newInstance(r.innerChildren.map(usePrettyExpression)).sql
    case _ => usePrettyExpression(e).sql
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newInstance seems not robust to make a clone with Seq[Expression]

I think we can just add def prettySQL: String in RuntimeReplaceable for utils.package to call, and override it wherever needed without creating new instances.

Seems a better and simpler idea that comes up based on your suggestions, WDYT?

Copy link
Member

@maropu maropu May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel having the 'pretty' functionality in RuntimeReplaceable is a bit intrusive. Dropping my idea above is okay, but I think we need a more general solution for getting pretty strings outside RuntimeReplaceable. WDYT? @cloud-fan

*/
def exprsReplaced: Seq[Expression]

override def sql: String = RuntimeReplaceable.this.prettyName +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems we can remove RuntimeReplaceable.this.?

override def prettyName: String = "extract"
override def exprsReplaced: Seq[Expression] = Seq(field, source)

override def sql: String = super.sql.replace(", ", " FROM ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tricky and hard to read, can we just generate the sql string?

@@ -142,7 +142,10 @@ package object util extends Logging {
"`" + name.replace("`", "``") + "`"
}

def toPrettySQL(e: Expression): String = usePrettyExpression(e).sql
def toPrettySQL(e: Expression): String = e match {
case r: RuntimeReplaceable => r.prettySQL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work if RuntimeReplaceable is not the root node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah.. no..

Copy link
Member Author

@yaooqinn yaooqinn May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c42a283#diff-6f4edc80e2cc973e748705e85a6053b4R786-R807

fixed and added some nested tests, thanks, good catch!

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122160 has finished for PR 28420 at commit f6bc861.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122172 has finished for PR 28420 at commit c42a283.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122176 has finished for PR 28420 at commit 128ca14.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 3, 2020

Test build #122224 has finished for PR 28420 at commit ac511ad.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

yaooqinn commented May 3, 2020

retest this please

@SparkQA
Copy link

SparkQA commented May 3, 2020

Test build #122225 has finished for PR 28420 at commit ac511ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented May 3, 2020

retest this please

@SparkQA
Copy link

SparkQA commented May 3, 2020

Test build #122228 has finished for PR 28420 at commit ac511ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

protected def prettySQL: String =
prettyName + exprsReplaced.map(toPrettySQL).mkString("(", ", ", ")")

def asPrettyAttribute(): PrettyAttribute = PrettyAttribute(prettySQL, dataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add a def withPrettyAttributes: this.type? The current way requires fewer changes as we can have a default implementation, but it's hacky to convert RuntimeReplaceable to a single PrettyAttribute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to move this to util.package

@SparkQA
Copy link

SparkQA commented May 5, 2020

Test build #122315 has finished for PR 28420 at commit e43d9ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
protected def exprsReplaced: Seq[Expression]

override def sql: String = prettyName + exprsReplaced.map(_.sql).mkString("(", ", ", ")")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another idea: maybe we can define a mkString method?

def exprsReplaced: Seq[Expression]

def mkString(childrenString: Seq[String]): String = childrenString.mkString("(", ", ", ")")

override def sql: String = prettyName + mkString(exprsReplaced.map(_.sql))

then in util

case r: RuntimeReplaceable => PrettyAttribute(r.mkString(r.exprsReplaced.map(toPrettySQL)), r.dataType)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, this way is much better. We should put the prettyName in mkString though.

@SparkQA
Copy link

SparkQA commented May 6, 2020

Test build #122363 has finished for PR 28420 at commit 5843ac6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

yaooqinn commented May 6, 2020

retest this please

@SparkQA
Copy link

SparkQA commented May 6, 2020

Test build #122375 has finished for PR 28420 at commit 5843ac6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented May 6, 2020

We need to backport this into branch-2.4, too? I personally think printing debug info looks like a minor bug. @cloud-fan @dongjoon-hyun

@cloud-fan
Copy link
Contributor

It's a minor bug, but the diff is not small. Maybe master-only is good enough?

@maropu
Copy link
Member

maropu commented May 7, 2020

Looks okay to me.

@maropu maropu closed this in b31ae7b May 7, 2020
@maropu
Copy link
Member

maropu commented May 7, 2020

Thanks! Merged to master.

@@ -245,54 +245,54 @@ true
-- !query
select to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52')
-- !query schema
struct<(to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52')):boolean>
struct<(to_date(2009-07-30 04:17:52) <= to_date(2009-07-30 04:17:52)):boolean>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I happened to leave a comment late but why was this changed? Seems the previous one looks better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check this

+-- !query
+select date_format('2019-10-06', 'yyyy-MM-dd uuuu')
+-- !query schema
+struct<date_format(CAST(2019-10-06 AS TIMESTAMP), yyyy-MM-dd uuuu):string>
+-- !query output
+2019-10-06 Sunday

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we discuss about changing ^ case instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet. This PR just makes the RuntimeReplaceable -> PrettyAttribute. I agree with you that the PrettyAttribute is not beautiful in itself.


def mkString(childrenString: Seq[String]): String = {
prettyName + childrenString.mkString("(", ", ", ")")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two extra interface methods just to make the column name pretty? Seems the output isn't even completely pretty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about pretty or not. It's about consistency. If we call a function with its alias name, then it will not use PrettyAttribute and the string representation is different from if we call the function with the original name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consistency is good but the readability here is the problem. See the implementation side:

  override def sql: String = s"$prettyName(${field.sql} FROM ${source.sql})"
  override def prettyName: String = "extract"

vs

  override def exprsReplaced: Seq[Expression] = Seq(field, source)

  override def mkString(childrenString: Seq[String]): String = {
    prettyName + childrenString.mkString("(", " FROM ", ")")
  }

Previously, it was fine to only know sql and prettyName but now you should know what exprsReplaced and mkString are. From the naming, it's confusing what that means too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just document that sql and prettyName should be properly overridden at RuntimeReplaceable just like we have the restriction about the child. We could have an util for it as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller side decides if he wants pretty string or not, by calling Expression.sql or toPrettySQL(Expression). So simply overridden sql doesn't work, unless we add a boolean parameter.

@@ -134,6 +134,8 @@ package object util extends Logging {
PrettyAttribute(usePrettyExpression(e.child).sql + "." + name, e.dataType)
case e: GetArrayStructFields =>
PrettyAttribute(usePrettyExpression(e.child) + "." + e.field.name, e.dataType)
case r: RuntimeReplaceable =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And strictly RuntimeReplaceable itself isn't an attribute ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already did the same to GetStructField :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants