Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast #19730

Closed
wants to merge 6 commits into from

Conversation

kiszk
Copy link
Member

@kiszk kiszk commented Nov 12, 2017

What changes were proposed in this pull request?

This PR changes cast code generation to place generated code for expression for fields of a structure into separated methods if these size could be large.

How was this patch tested?

Added new test cases into CastSuite

@@ -1015,7 +1015,9 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
}
val rowClass = classOf[GenericInternalRow].getName
val result = ctx.freshName("result")
ctx.addMutableState(s"$rowClass", result, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that since we not assigning it, we might keep it as a local variable and pass it to the generated methods. In this way we can avoid to introduce new global variables. Have you tried that?

val tmpRow = ctx.freshName("tmpRow")
ctx.addMutableState("InternalRow", tmpRow, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83750 has finished for PR 19730 at commit 9013671.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1039,13 +1039,19 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
}
}
"""
}.mkString("\n")
}
val fieldsEvalCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be ctx.currentVars != null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ctx.currentVars != null, we need to use mkString("\n").

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83790 has finished for PR 19730 at commit 94dfde2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83792 has finished for PR 19730 at commit 39ffaa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
val fieldsEvalCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == null) {
ctx.splitExpressions(fieldsEvalCode, "castStruct",
("InternalRow", ctx.INPUT_ROW) :: (rowClass, result) :: ("InternalRow", tmpRow) :: Nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about inner struct? We also need to pass in the ctx.INPUT_ROW?

Copy link
Member Author

@kiszk kiszk Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inner struct case in the following code and existing cases in CastSuite works well. Tomorrow, I will add similar test case with many fields.

    val struct = Literal.create(
      Row(
        UTF8String.fromString("123.4"),
        Seq("456", "true", "78.9"),
        Row(7)),
      StructType(Seq(
        StructField("i", StringType, nullable = true),
        StructField("a",
          ArrayType(StringType, containsNull = false), nullable = true),
        StructField("s",
          StructType(Seq(
            StructField("i", IntegerType, nullable = true)))))))

    val ret = cast(struct, StructType(Seq(
      StructField("d", DoubleType, nullable = true),
      StructField("a",
        ArrayType(IntegerType, containsNull = true), nullable = true),
      StructField("s",
        StructType(Seq(
          StructField("l", LongType, nullable = true)))))))

    assert(ret.resolved === true)
    checkEvaluation(ret, Row(123.4, Seq(456, null, 78), Row(7L)))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we don't need to pass in ctx.INPUT_ROW to the split functions.

checkEvaluation(cast(Literal.create(input1, from1), to1), output1)

val from2 = new StructType(
(1 to N).map(i => StructField(s"a$i", ArrayType(StringType, containsNull = false))).toArray)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect something like

val from2 = new StructType(
  (1 to N).map(i => StructField(s"s$i", from1)).toArray)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or just test this case.

@SparkQA
Copy link

SparkQA commented Nov 17, 2017

Test build #83970 has finished for PR 19730 at commit 83fef40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

ping @kiszk

@kiszk
Copy link
Member Author

kiszk commented Nov 21, 2017

Working for this. I met another problem. Since this code creates a lot of global variable, I met the problem due to 64K constant pool entries.
Since UTF8String.IntWrapper can be reused, I am writing the following code.

      val wrapper = "intWrapper"
      if (!ctx.mutableStates.exists(s => s._1 == wrapper)) {
        ctx.addMutableState("UTF8String.IntWrapper", wrapper,
          s"$wrapper = new UTF8String.IntWrapper();")
      }

@gatorsmile
Copy link
Member

Yeah. We can fix them in this PR.

BTW, could you check all the other calls of addMutableState and fix them too?

@cloud-fan
Copy link
Contributor

I think this is a different issue and should be fixed with another PR. @kiszk how about we change the test to cast int to long to avoid this issue?

@kiszk
Copy link
Member Author

kiszk commented Nov 21, 2017

@cloud-fan I see, I will create another PR to fix this global variable issue.
@gatorsmile I will check other calls.

(1 to M).map(i => StructField(s"s$i", toInner)).toArray)
val inputOuter = Row.fromSeq((1 to M).map(_ => inputInner))
val outputOuter = Row.fromSeq((1 to M).map(_ => outputInner))
checkEvaluation(cast(Literal.create(inputOuter, fromOuter), toOuter), outputOuter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this case is good enough to cover all the above cases?

@SparkQA
Copy link

SparkQA commented Nov 21, 2017

Test build #84064 has finished for PR 19730 at commit c46ef5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2017

Test build #84079 has finished for PR 19730 at commit 574691f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in ac10171 Nov 21, 2017
@cloud-fan
Copy link
Contributor

cloud-fan commented Nov 21, 2017

thanks, merging to master/2.2!

asfgit pushed a commit that referenced this pull request Nov 21, 2017
This PR changes `cast` code generation to place generated code for expression for fields of a structure into separated methods if these size could be large.

Added new test cases into `CastSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #19730 from kiszk/SPARK-22500.

(cherry picked from commit ac10171)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Dec 5, 2017
… whole stage codegen

## What changes were proposed in this pull request?

A followup of #19730, we can split the code for casting struct even with whole stage codegen.

This PR also has some renaming to make the code easier to read.

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19891 from cloud-fan/cast.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
This PR changes `cast` code generation to place generated code for expression for fields of a structure into separated methods if these size could be large.

Added new test cases into `CastSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#19730 from kiszk/SPARK-22500.

(cherry picked from commit ac10171)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants