
[SPARK-31115][SQL] Detect known Janino bug janino-compiler/janino#113 and apply workaround automatically as a fail-back via avoid using switch statement in generated code #27872

Closed
wants to merge 5 commits

Conversation

HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Mar 11, 2020

What changes were proposed in this pull request?

This patch proposes to detect whether the generated code hits the known Janino bug janino-compiler/janino#113, based on the exception thrown during codegen compilation, and to re-generate workaround code that avoids the switch statement, as a "fail-back".

The bug is triggered under certain circumstances when a switch statement is used (please refer to the Janino issue for details), but the switch statement is still tried first, as it makes the generated code more concise and also more efficient from a performance point of view. So if the generated code doesn't hit the Janino bug, the behavior is the same as before the patch.

The fail-back is applied when both of the following conditions hold:

  • The generated code contains a switch statement.
  • The exception message contains 'Operand stack inconsistent at offset xxx: Previous size 1, now 0'.
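These two conditions can be checked mechanically against the compilation failure. Below is a minimal sketch of such a check; the class and method names are hypothetical (not Spark's actual detection code), and the message pattern is taken from the description above, with `xxx` standing for the bytecode offset:

```java
import java.util.regex.Pattern;

public class JaninoBugDetector {
    // Message format observed for janino-compiler/janino#113; the "xxx"
    // placeholder in the PR description is the bytecode offset, matched as \d+.
    private static final Pattern OPERAND_STACK_MSG = Pattern.compile(
        "Operand stack inconsistent at offset \\d+: Previous size 1, now 0");

    // Both conditions must hold: the generated code uses a switch statement,
    // and the compile error carries the known Janino symptom.
    public static boolean isKnownJaninoBug(String generatedCode, Throwable compileError) {
        String msg = compileError.getMessage();
        return generatedCode.contains("switch")
            && msg != null
            && OPERAND_STACK_MSG.matcher(msg).find();
    }
}
```

Requiring both signals keeps unrelated compile errors (and switch-free code) on the normal failure path.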

From the new test, the generated code for expand_doConsume in the normal path is as follows:

/* 491 */   private void expand_doConsume_0(UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, UTF8String expand_expr_1_0, boolean expand_exprIsNull_1_0, int expand_expr_2_0, int expand_expr_3_0, int expand_expr_4_0, int expand_expr_5_0, int expand_expr_6_0, int expand_expr_7_0, int expand_expr_8_0, int expand_expr_9_0, int expand_expr_10_0, int expand_expr_11_0, int expand_expr_12_0, int expand_expr_13_0, int expand_expr_14_0, int expand_expr_15_0, int expand_expr_16_0, int expand_expr_17_0, int expand_expr_18_0, int expand_expr_19_0, int expand_expr_20_0, int expand_expr_21_0, int expand_expr_22_0, int expand_expr_23_0, int expand_expr_24_0, int expand_expr_25_0, int expand_expr_26_0, int expand_expr_27_0, int expand_expr_28_0, int expand_expr_29_0, int expand_expr_30_0, int expand_expr_31_0, int expand_expr_32_0, int expand_expr_33_0, int expand_expr_34_0, int expand_expr_35_0, int expand_expr_36_0, int expand_expr_37_0, int expand_expr_38_0, int expand_expr_39_0, int expand_expr_40_0, int expand_expr_41_0, int expand_expr_42_0, int expand_expr_43_0, int expand_expr_44_0, int expand_expr_45_0, int expand_expr_46_0, int expand_expr_47_0, int expand_expr_48_0, int expand_expr_49_0, int expand_expr_50_0, int expand_expr_51_0, int expand_expr_52_0, int expand_expr_53_0, int expand_expr_54_0, int expand_expr_55_0, int expand_expr_56_0, int expand_expr_57_0, int expand_expr_58_0, int expand_expr_59_0, int expand_expr_60_0, int expand_expr_61_0, int expand_expr_62_0, int expand_expr_63_0, int expand_expr_64_0, int expand_expr_65_0, int expand_expr_66_0, int expand_expr_67_0, int expand_expr_68_0, int expand_expr_69_0, int expand_expr_70_0, int expand_expr_71_0, int expand_expr_72_0, int expand_expr_73_0, int expand_expr_74_0, int expand_expr_75_0, int expand_expr_76_0, int expand_expr_77_0, int expand_expr_78_0, int expand_expr_79_0, int expand_expr_80_0, int expand_expr_81_0, int expand_expr_82_0, int expand_expr_83_0, int expand_expr_84_0, int 
expand_expr_85_0, int expand_expr_86_0, int expand_expr_87_0, int expand_expr_88_0, int expand_expr_89_0, int expand_expr_90_0, int expand_expr_91_0, int expand_expr_92_0, int expand_expr_93_0, int expand_expr_94_0, int expand_expr_95_0, int expand_expr_96_0, int expand_expr_97_0, int expand_expr_98_0, int expand_expr_99_0, int expand_expr_100_0) throws java.io.IOException {
/* 492 */     boolean expand_isNull_103 = true;
/* 493 */     int expand_value_103 =
/* 494 */     -1;
...
/* 792 */     for (int expand_i_0 = 0; expand_i_0 < 50; expand_i_0 ++) {
/* 793 */       switch (expand_i_0) {
/* 794 */       case 0:
/* 795 */         expand_isNull_103 = true;
/* 796 */         expand_value_103 = -1;
/* 797 */
...
/* 15592 */       case 49:
/* 15593 */         expand_isNull_103 = true;
/* 15594 */         expand_value_103 = -1;
/* 15595 */
...
/* 15892 */         break;
/* 15893 */       }
/* 15894 */
/* 15895 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[7] /* numOutputRows */).add(1);
/* 15896 */
/* 15897 */       agg_doConsume_0(expand_expr_0_0, expand_exprIsNull_0_0, expand_expr_1_0, expand_exprIsNull_1_0, expand_value_103, expand_isNull_103, expand_value_104, expand_isNull_104, expand_value_105, expand_isNull_105, expand_value_106, expand_isNull_106, expand_value_107, expand_isNull_107, expand_value_108, expand_isNull_108, expand_value_109, expand_isNull_109, expand_value_110, expand_isNull_110, expand_value_111, expand_isNull_111, expand_value_112, expand_isNull_112, expand_value_113, expand_isNull_113, expand_value_114, expand_isNull_114, expand_value_115, expand_isNull_115, expand_value_116, expand_isNull_116, expand_value_117, expand_isNull_117, expand_value_118, expand_isNull_118, expand_value_119, expand_isNull_119, expand_value_120, expand_isNull_120, expand_value_121, expand_isNull_121, expand_value_122, expand_isNull_122, expand_value_123, expand_isNull_123, expand_value_124, expand_isNull_124, expand_value_125, expand_isNull_125, expand_value_126, expand_isNull_126, expand_value_127, expand_isNull_127, expand_value_128, expand_isNull_128, expand_value_129, expand_isNull_129, expand_value_130, expand_isNull_130, expand_value_131, expand_isNull_131, expand_value_132, expand_isNull_132, expand_value_133, expand_isNull_133, expand_value_134, expand_isNull_134, expand_value_135, expand_isNull_135, expand_value_136, expand_isNull_136, expand_value_137, expand_isNull_137, expand_value_138, expand_isNull_138, expand_value_139, expand_isNull_139, expand_value_140, expand_isNull_140, expand_value_141, expand_isNull_141, expand_value_142, expand_isNull_142, expand_value_143, expand_isNull_143, expand_value_144, expand_isNull_144, expand_value_145, expand_isNull_145, expand_value_146, expand_isNull_146, expand_value_147, expand_isNull_147, expand_value_148, expand_isNull_148, expand_value_149, expand_isNull_149, expand_value_150, expand_isNull_150, expand_value_151, expand_isNull_151, expand_value_152, expand_value_153, expand_isNull_153, expand_value_154, 
expand_isNull_154, expand_value_155, expand_isNull_155, expand_value_156, expand_isNull_156, expand_value_157, expand_isNull_157, expand_value_158, expand_isNull_158, expand_value_159, expand_isNull_159, expand_value_160, expand_isNull_160, expand_value_161, expand_isNull_161, expand_value_162, expand_isNull_162, expand_value_163, expand_isNull_163, expand_value_164, expand_isNull_164, expand_value_165, expand_isNull_165, expand_value_166, expand_isNull_166, expand_value_167, expand_isNull_167, expand_value_168, expand_isNull_168, expand_value_169, expand_isNull_169, expand_value_170, expand_isNull_170, expand_value_171, expand_isNull_171, expand_value_172, expand_isNull_172, expand_value_173, expand_isNull_173, expand_value_174, expand_isNull_174, expand_value_175, expand_isNull_175, expand_value_176, expand_isNull_176, expand_value_177, expand_isNull_177, expand_value_178, expand_isNull_178, expand_value_179, expand_isNull_179, expand_value_180, expand_isNull_180, expand_value_181, expand_isNull_181, expand_value_182, expand_isNull_182, expand_value_183, expand_isNull_183, expand_value_184, expand_isNull_184, expand_value_185, expand_isNull_185, expand_value_186, expand_isNull_186, expand_value_187, expand_isNull_187, expand_value_188, expand_isNull_188, expand_value_189, expand_isNull_189, expand_value_190, expand_isNull_190, expand_value_191, expand_isNull_191, expand_value_192, expand_isNull_192, expand_value_193, expand_isNull_193, expand_value_194, expand_isNull_194, expand_value_195, expand_isNull_195, expand_value_196, expand_isNull_196, expand_value_197, expand_isNull_197, expand_value_198, expand_isNull_198, expand_value_199, expand_isNull_199, expand_value_200, expand_isNull_200, expand_value_201, expand_isNull_201, expand_value_202, expand_isNull_202);
/* 15898 */
/* 15899 */     }

and after the workaround is applied automatically as a fail-back, the newly generated code for expand_doConsume is as follows:

/* 491 */   private void expand_doConsume_0(UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, UTF8String expand_expr_1_0, boolean expand_exprIsNull_1_0, int expand_expr_2_0, int expand_expr_3_0, int expand_expr_4_0, int expand_expr_5_0, int expand_expr_6_0, int expand_expr_7_0, int expand_expr_8_0, int expand_expr_9_0, int expand_expr_10_0, int expand_expr_11_0, int expand_expr_12_0, int expand_expr_13_0, int expand_expr_14_0, int expand_expr_15_0, int expand_expr_16_0, int expand_expr_17_0, int expand_expr_18_0, int expand_expr_19_0, int expand_expr_20_0, int expand_expr_21_0, int expand_expr_22_0, int expand_expr_23_0, int expand_expr_24_0, int expand_expr_25_0, int expand_expr_26_0, int expand_expr_27_0, int expand_expr_28_0, int expand_expr_29_0, int expand_expr_30_0, int expand_expr_31_0, int expand_expr_32_0, int expand_expr_33_0, int expand_expr_34_0, int expand_expr_35_0, int expand_expr_36_0, int expand_expr_37_0, int expand_expr_38_0, int expand_expr_39_0, int expand_expr_40_0, int expand_expr_41_0, int expand_expr_42_0, int expand_expr_43_0, int expand_expr_44_0, int expand_expr_45_0, int expand_expr_46_0, int expand_expr_47_0, int expand_expr_48_0, int expand_expr_49_0, int expand_expr_50_0, int expand_expr_51_0, int expand_expr_52_0, int expand_expr_53_0, int expand_expr_54_0, int expand_expr_55_0, int expand_expr_56_0, int expand_expr_57_0, int expand_expr_58_0, int expand_expr_59_0, int expand_expr_60_0, int expand_expr_61_0, int expand_expr_62_0, int expand_expr_63_0, int expand_expr_64_0, int expand_expr_65_0, int expand_expr_66_0, int expand_expr_67_0, int expand_expr_68_0, int expand_expr_69_0, int expand_expr_70_0, int expand_expr_71_0, int expand_expr_72_0, int expand_expr_73_0, int expand_expr_74_0, int expand_expr_75_0, int expand_expr_76_0, int expand_expr_77_0, int expand_expr_78_0, int expand_expr_79_0, int expand_expr_80_0, int expand_expr_81_0, int expand_expr_82_0, int expand_expr_83_0, int expand_expr_84_0, int 
expand_expr_85_0, int expand_expr_86_0, int expand_expr_87_0, int expand_expr_88_0, int expand_expr_89_0, int expand_expr_90_0, int expand_expr_91_0, int expand_expr_92_0, int expand_expr_93_0, int expand_expr_94_0, int expand_expr_95_0, int expand_expr_96_0, int expand_expr_97_0, int expand_expr_98_0, int expand_expr_99_0, int expand_expr_100_0) throws java.io.IOException {
/* 492 */     boolean expand_isNull_103 = true;
/* 493 */     int expand_value_103 =
/* 494 */     -1;
...
/* 792 */     for (int expand_i_0 = 0; expand_i_0 < 50; expand_i_0 ++) {
/* 793 */       if (expand_i_0 == 0) {
/* 794 */         expand_isNull_103 = true;
/* 795 */         expand_value_103 = -1;
/* 796 */
...
/* 1095 */       else
/* 1096 */       if (expand_i_0 == 1) {
/* 1097 */         expand_isNull_103 = false;
/* 1098 */         expand_value_103 = expand_expr_58_0;
/* 1099 */
...
/* 15639 */       else
/* 15640 */       if (expand_i_0 == 49) {
/* 15641 */         expand_isNull_103 = true;
/* 15642 */         expand_value_103 = -1;
/* 15643 */
...
/* 15940 */       }
/* 15941 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[7] /* numOutputRows */).add(1);
/* 15942 */
/* 15943 */       agg_doConsume_0(expand_expr_0_0, expand_exprIsNull_0_0, expand_expr_1_0, expand_exprIsNull_1_0, expand_value_103, expand_isNull_103, expand_value_104, expand_isNull_104, expand_value_105, expand_isNull_105, expand_value_106, expand_isNull_106, expand_value_107, expand_isNull_107, expand_value_108, expand_isNull_108, expand_value_109, expand_isNull_109, expand_value_110, expand_isNull_110, expand_value_111, expand_isNull_111, expand_value_112, expand_isNull_112, expand_value_113, expand_isNull_113, expand_value_114, expand_isNull_114, expand_value_115, expand_isNull_115, expand_value_116, expand_isNull_116, expand_value_117, expand_isNull_117, expand_value_118, expand_isNull_118, expand_value_119, expand_isNull_119, expand_value_120, expand_isNull_120, expand_value_121, expand_isNull_121, expand_value_122, expand_isNull_122, expand_value_123, expand_isNull_123, expand_value_124, expand_isNull_124, expand_value_125, expand_isNull_125, expand_value_126, expand_isNull_126, expand_value_127, expand_isNull_127, expand_value_128, expand_isNull_128, expand_value_129, expand_isNull_129, expand_value_130, expand_isNull_130, expand_value_131, expand_isNull_131, expand_value_132, expand_isNull_132, expand_value_133, expand_isNull_133, expand_value_134, expand_isNull_134, expand_value_135, expand_isNull_135, expand_value_136, expand_isNull_136, expand_value_137, expand_isNull_137, expand_value_138, expand_isNull_138, expand_value_139, expand_isNull_139, expand_value_140, expand_isNull_140, expand_value_141, expand_isNull_141, expand_value_142, expand_isNull_142, expand_value_143, expand_isNull_143, expand_value_144, expand_isNull_144, expand_value_145, expand_isNull_145, expand_value_146, expand_isNull_146, expand_value_147, expand_isNull_147, expand_value_148, expand_isNull_148, expand_value_149, expand_isNull_149, expand_value_150, expand_isNull_150, expand_value_151, expand_isNull_151, expand_value_152, expand_value_153, expand_isNull_153, expand_value_154, 
expand_isNull_154, expand_value_155, expand_isNull_155, expand_value_156, expand_isNull_156, expand_value_157, expand_isNull_157, expand_value_158, expand_isNull_158, expand_value_159, expand_isNull_159, expand_value_160, expand_isNull_160, expand_value_161, expand_isNull_161, expand_value_162, expand_isNull_162, expand_value_163, expand_isNull_163, expand_value_164, expand_isNull_164, expand_value_165, expand_isNull_165, expand_value_166, expand_isNull_166, expand_value_167, expand_isNull_167, expand_value_168, expand_isNull_168, expand_value_169, expand_isNull_169, expand_value_170, expand_isNull_170, expand_value_171, expand_isNull_171, expand_value_172, expand_isNull_172, expand_value_173, expand_isNull_173, expand_value_174, expand_isNull_174, expand_value_175, expand_isNull_175, expand_value_176, expand_isNull_176, expand_value_177, expand_isNull_177, expand_value_178, expand_isNull_178, expand_value_179, expand_isNull_179, expand_value_180, expand_isNull_180, expand_value_181, expand_isNull_181, expand_value_182, expand_isNull_182, expand_value_183, expand_isNull_183, expand_value_184, expand_isNull_184, expand_value_185, expand_isNull_185, expand_value_186, expand_isNull_186, expand_value_187, expand_isNull_187, expand_value_188, expand_isNull_188, expand_value_189, expand_isNull_189, expand_value_190, expand_isNull_190, expand_value_191, expand_isNull_191, expand_value_192, expand_isNull_192, expand_value_193, expand_isNull_193, expand_value_194, expand_isNull_194, expand_value_195, expand_isNull_195, expand_value_196, expand_isNull_196, expand_value_197, expand_isNull_197, expand_value_198, expand_isNull_198, expand_value_199, expand_isNull_199, expand_value_200, expand_isNull_200, expand_value_201, expand_isNull_201, expand_value_202, expand_isNull_202);
/* 15944 */
/* 15945 */     }

Why are the changes needed?

We got reports of failures on a user's query where Janino throws an error while compiling the generated code. The issue is janino-compiler/janino#113; it contains the generated code, the symptom (error), and an analysis of the bug, so please refer to the link for details.

We provided a patch to Janino via janino-compiler/janino#114, and Janino 3.1.1 was released containing the patch, but we realized that lots of unit tests fail once we apply Janino 3.1.1.

We have asked the Janino maintainer about releasing Janino 3.0.16 (see janino-compiler/janino#115), but given there's no guarantee of getting it, we'd better have our own workaround.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT. Confirmed that changing ExpandExec to not use CodegenContext.disallowSwitchStatement made the new test fail.

// the query fails with switch statement, whereas it passes with if-else statement.
// Note that the value depends on the Spark logic as well - different Spark versions may
// require different value to ensure the test failing with switch statement.
val numNewFields = 100
Contributor Author

Originally I was crafting the patch against Spark 2.3 - in Spark 2.3, setting this to 100 throws an exception which is not from the Janino bug, but from either hitting the 64KB method limit or the parameter limit on the method signature. (That's why I added details on the exceptions thrown when the value exceeds the upper limit.)

For Spark 2.3, 70 is the value that makes the switch statement fail while the if ~ else if ~ else statement passes.

@SparkQA

SparkQA commented Mar 11, 2020

Test build #119662 has finished for PR 27872 at commit 23ec81b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Retest this, please

@SparkQA

SparkQA commented Mar 11, 2020

Test build #119656 has finished for PR 27872 at commit 90ef125.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2020

Test build #119670 has finished for PR 27872 at commit 23ec81b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (canBeComputedUsingSwitch && hset.size <= SQLConf.get.optimizerInSetSwitchThreshold) {
val sqlConf = SQLConf.get
if (canBeComputedUsingSwitch && hset.size <= sqlConf.optimizerInSetSwitchThreshold &&
sqlConf.codegenUseSwitchStatement) {
@squito (Contributor) commented Mar 11, 2020

instead of an added configuration, could this be a try / catch? first try with switch, if it fails, then log an error and use if-else? avoids the users having to set a config, and then unset it again when we have the right fix from janino
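The try/catch flow suggested here can be sketched as follows. All names (`Compiler`, `generate`) are hypothetical stand-ins, not Spark's actual codegen API; the matched message fragment is the symptom described in this PR:

```java
public class CompileWithFallback {
    // Stand-in for the codegen compiler; failures surface as RuntimeExceptions.
    interface Compiler { void compile(String code); }

    // Stand-in for code generation: switch-based first, if/else-if as fallback.
    static String generate(boolean useSwitch) {
        return useSwitch ? "switch (i) { /* ... */ }" : "if (i == 0) { /* ... */ }";
    }

    static String compileWithFallback(Compiler compiler) {
        String code = generate(true);   // first attempt: the concise switch form
        try {
            compiler.compile(code);
            return code;
        } catch (RuntimeException e) {
            String msg = String.valueOf(e.getMessage());
            // Only retry when the failure matches the known Janino symptom.
            if (code.contains("switch") && msg.contains("Operand stack inconsistent")) {
                String retry = generate(false);  // regenerate without switch
                compiler.compile(retry);
                return retry;
            }
            throw e;  // unrelated failure: propagate as before
        }
    }
}
```

On the success path nothing changes, so no configuration is needed and there is no extra cost unless the bug is actually hit.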

Contributor Author

Yeah, it would be ideal. Thanks! I overthought it a bit and assumed it would be hard to deal with, but it looks like it's not that complicated. Let me update the code and PR description.

@@ -178,27 +179,48 @@ case class ExpandExec(
|${ev.code}
|${outputColumns(col).isNull} = ${ev.isNull};
|${outputColumns(col).value} = ${ev.value};
""".stripMargin
""".stripMargin
Contributor Author

This looks to be the right indentation, so I fixed it.

@@ -647,7 +651,7 @@ case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: Int)
}

${ctx.registerComment(
s"""Codegend pipeline for stage (id=$codegenStageId)
s"""Codegen pipeline for stage (id=$codegenStageId)
Contributor Author

Fixed a typo as well.

@HeartSaVioR HeartSaVioR changed the title [SPARK-31115][SQL] Provide config to avoid using switch statement in generated code to avoid Janino bug [SPARK-31115][SQL] Detect known Janino bug #113 and apply workaround automatically via avoid using switch statement in generated code Mar 11, 2020
@HeartSaVioR HeartSaVioR changed the title [SPARK-31115][SQL] Detect known Janino bug #113 and apply workaround automatically via avoid using switch statement in generated code [SPARK-31115][SQL] Detect known Janino bug janino-compiler/janino#113 and apply workaround automatically via avoid using switch statement in generated code Mar 11, 2020
@HeartSaVioR
Contributor Author

cc. @dongjoon-hyun and @maropu as well, as they showed reactions in #27860

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 11, 2020

It may cause performance regression in some cases, @HeartSaVioR .

cc @kiszk and @rednaxelafx and @gatorsmile , too.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Mar 11, 2020

It would be appreciated if you could elaborate on the possible performance regression (even as "thinking out loud"); "may" and "some cases" are too ambiguous.

If the query doesn't hit the Janino issue, this shouldn't bring any perf regression, as the code is still generated with switch and compiled first. If there is a perf regression here, it means I made some unintended mistake in the fix, which should be corrected.

If the query does hit the Janino issue, it has either been failing or been falling back to non-codegen. Taking an alternative is definitely better than failing the query; the remaining question is whether using an if ~ else if chain (plus the cost of regenerating and recompiling the code) is worse than going through non-codegen.

@HeartSaVioR HeartSaVioR changed the title [SPARK-31115][SQL] Detect known Janino bug janino-compiler/janino#113 and apply workaround automatically via avoid using switch statement in generated code [SPARK-31115][SQL] Detect known Janino bug janino-compiler/janino#113 and apply workaround automatically as a fail-back via avoid using switch statement in generated code Mar 11, 2020
@HeartSaVioR
Contributor Author

I just enriched the PR title/description to make it clear that the workaround is applied as a fail-back. I'm sorry if anyone was confused by the lack of information.

@maropu
Member

maropu commented Mar 11, 2020

Hi @HeartSaVioR, thanks for the work! BTW, do you know what a user query that reproduces this issue looks like? If this issue occurs frequently on the user side, it might be worth adding the workaround. If it's a corner case, I personally think it's OK to just fall back to interpreter mode, to avoid the maintenance overhead of the workaround.

* is due to the known bug, it generates workaround code via touching flag in CodegenContext and
* compile again.
*/
private def doGenCodeAndCompile(): CompileResult = {
Member

How do we handle the non-whole stage codegen case? e.g., GenerateMutableProjection via CodeGeneratorWithInterpretedFallback?

Contributor Author

switch is now used only in ExpandExec and InSet; originally what I tracked was only ExpandExec, which doesn't fall into that case if I understand correctly. InSet has upper/lower limit configurations which likely wouldn't trigger the issue - maybe just apply this to ExpandExec only?

@maropu (Member) commented Mar 12, 2020

If so, I personally think we'd better fix the generated code of ExpandExec (I'm not sure now that it's worth fixing this issue).

@SparkQA

SparkQA commented Mar 11, 2020

Test build #119682 has finished for PR 27872 at commit c4375de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Mar 12, 2020

do you know what a user query that reproduces this issue looks like?

I have one, but I cannot share it since the query is from an actual customer. If you're OK with just the generated code, I've attached the file to the Janino issue janino-compiler/janino#113.

To share the factors that directly contribute to hitting the Janino bug: there are 70+ columns, grouped by 2 columns, with 60 aggregate functions applied (30 non-distinct, 30 distinct).

If this issue occurs frequently on the user side, it might be worth adding the workaround. If it's a corner case, I personally think it's OK to just fall back to interpreter mode, to avoid the maintenance overhead of the workaround.

I'm not sure whether this is an edge case that happens only very rarely; I'd like to hear more voices. If the consensus is to treat this as a rare case, I'd agree we should avoid the band-aid fix. Let's see what others say as well.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Mar 12, 2020

Test build #119681 has finished for PR 27872 at commit 5aa7ece.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CodegenContext(val disallowSwitchStatement: Boolean = false) extends Logging

@SparkQA

SparkQA commented Mar 12, 2020

Test build #119686 has finished for PR 27872 at commit c4375de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 12, 2020

Test build #119693 has finished for PR 27872 at commit 4bd12d8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Contributor

I have mixed feelings about this PR.

The good part:

It does do what it advertises, and should not result in performance regressions: code that would have passed compilation stays as-is, and only code that triggers the error goes into the fallback mode. The fallback does produce slightly bigger code, but that wouldn't really affect runtime performance as long as the generated code is still JIT-compiled by an optimizing JIT.

The less-good part:

I feel like the codegen system could make good use of a proper retry framework, so that it could do things it can't do right now.
For example, code splitting comes at the cost of extra function invocation overhead. If we generated code without splitting and still stayed within either 8000 bytes or 64KB (depending on which threshold you care about more), then it'd be better not to split. Right now Spark SQL's codegen just splits unconditionally, using the length of the code text as the trigger; if a retry framework were in place, we could try to generate without splitting first, and only go conservative when the thresholds are crossed.
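The retry-framework idea could be sketched as an ordered list of codegen strategies, tried from most aggressive to most conservative. This is a hypothetical illustration, not an existing Spark facility; `Strategy` and `firstThatCompiles` are made-up names:

```java
import java.util.Arrays;
import java.util.List;

public class CodegenRetry {
    // A strategy generates and compiles code in one particular style
    // (e.g. unsplit vs. split, switch vs. if/else-if) and throws on failure.
    interface Strategy { String generateAndCompile(); }

    // Try each strategy in order; return the result of the first that compiles.
    static String firstThatCompiles(List<Strategy> strategies) {
        RuntimeException last = null;
        for (Strategy s : strategies) {
            try {
                return s.generateAndCompile();
            } catch (RuntimeException e) {
                last = e;  // remember the failure, fall through to the next strategy
            }
        }
        throw last != null ? last : new IllegalArgumentException("no strategies given");
    }

    public static void main(String[] args) {
        // An aggressive strategy that fails, then a conservative one that works.
        String result = firstThatCompiles(Arrays.asList(
            () -> { throw new RuntimeException("code grows beyond threshold"); },
            () -> "split-method variant"));
        System.out.println(result);
    }
}
```

A design like this would subsume both the switch/if-else fail-back in this PR and the hypothetical "try without splitting first" case, rather than each being a one-off try/catch.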

The code presented here is a one-off attempt at retry, and I'm somewhat concerned about future PRs piling on top of it to implement retry for other scenarios, letting the code "evolve organically" without a good base design first.

If the main focus is to buy some time before we can get Janino to release 3.0.16, my hunch is that it's possible to slightly tune the codegen for ExpandExec to make it generate code just small enough to work around the customer issue bugging you right now.

e.g. if part of the problem was caused by the case body of the switch statement being too big, and if your customer was only hitting it in Spark 2.3.x but not in Spark 2.4.x / 3.0.x, then using this style of generating the code for Literal(null, dataType) may help, at least with the example code you showed in the Janino bug you reported:

ExprCode(code = EmptyBlock, isNull = TrueLiteral, JavaCode.defaultLiteral(dataType))

(from Spark 2.4 / 3.0 / master: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L69)

WDYT @HeartSaVioR ?

}
)

val aggExprs: Array[Column] = Range(1, numNewFields).map { idx =>
Member

nit: How about (1 to numNewFields)?

Contributor Author

I was using the step parameter and eventually removed it. Given we seem to be between neutral and negative on adopting this patch, I'll defer addressing the nit for now.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Mar 13, 2020

Thanks for spending your time analyzing the patch and providing detailed feedback, @rednaxelafx.

The rationale of this patch is to address the known issue "before" we reach the real solution, as the real solution is out of our control; no one can say when Janino 3.0.16 will be released, or even whether it will be released at all. The Janino maintainer refactored everything single-handedly in 3.1.x, which makes things unstable - recent versions of Janino make lots of Spark UTs fail, which makes me less confident that we can migrate to Janino 3.1.x.

I totally agree the patch is a band-aid fix and not well designed to extend (though I didn't intend it to be extendable), and there are better options available (from more to less preferable, personally):

  1. Participate actively in Janino - fix all the issues in 3.1.x that break Spark, and adopt Janino 3.1.x. It would be ideal if we could provide a set of regression tests so that the Janino community can maintain stability. For sure, I'm not an expert on the JVM, so I can't lead the effort; a couple of JVM experts would need to jump in and lead it.

  2. Fork Janino - create a main dev branch from v3.0.15, apply the patch from janino-compiler/janino#114 ("Issue #113: Grow the code for relocatables, and do fixup, and relocate"), port back some bugfixes after 3.0.15, and release v3.0.16 ourselves. The license seems to be the 3-clause BSD license, which doesn't appear to restrict redistribution. The question with this option is who/which group is willing to maintain the fork and release under their name.
    (I might not want to maintain the fork, but I may want to contribute to it, as I'm the author of the patch in janino-compiler/janino#114 and I'm interested in fixing the stack overflow issue in SPARK-25987, which is also an issue in Janino 3.0.x.)

  3. This patch. (try to compile and fail-back)

  4. Modify ExpandExec to check the number of operations in the for statement, and use if ~ else if when the number of operations exceeds a threshold. Ideally this would check the bytecode offset length, but it would be weird for Spark to do that, so just count the lines blindly. A performance regression may happen in cases where the code could run with switch but, due to the blind count, runs with if ~ else if; that case wouldn't be common, though.

  5. Give up addressing the issue - either let end users tune their query, or guide end users to enable the fallback option and run in interpreter mode.
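For reference, option 4 could look like the following: decide the dispatch style up front from a blind line count of the case bodies. All names and the threshold are illustrative (the real trigger would need tuning, since the Janino bug depends on bytecode offsets, not source lines):

```java
public class DispatchStyleChooser {
    // Illustrative cutoff, not a value from the patch.
    static final int MAX_LINES_FOR_SWITCH = 5000;

    // Emit a switch when the total body size is small enough, otherwise
    // an if / else-if chain that avoids the Janino switch bug.
    static String dispatch(String loopVar, String[] caseBodies) {
        int totalLines = 0;
        for (String body : caseBodies) {
            totalLines += body.split("\n", -1).length;  // blind line count
        }
        StringBuilder sb = new StringBuilder();
        if (totalLines <= MAX_LINES_FOR_SWITCH) {
            sb.append("switch (").append(loopVar).append(") {\n");
            for (int i = 0; i < caseBodies.length; i++) {
                sb.append("case ").append(i).append(":\n")
                  .append(caseBodies[i]).append("\nbreak;\n");
            }
            sb.append("}\n");
        } else {
            for (int i = 0; i < caseBodies.length; i++) {
                sb.append(i == 0 ? "if (" : "else if (")
                  .append(loopVar).append(" == ").append(i).append(") {\n")
                  .append(caseBodies[i]).append("\n}\n");
            }
        }
        return sb.toString();
    }
}
```

Unlike the fail-back in this patch, this never needs a second compilation pass, at the cost of occasionally choosing if ~ else if when switch would have worked.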

WDYT? Would it be better to raise this for discussion on the dev@ list?

@HeartSaVioR
Contributor Author

Btw, I'm happy if the approach of this patch (retry) sparks new ideas for improvement. That alone would be worthwhile.

@srowen
Member

srowen commented Mar 13, 2020

Sounds good, though yeah ideally we get a 3.0.16 update instead to fix it. Maybe worth waiting a beat but not holding up 3.0.

@kiszk
Member

kiszk commented Mar 13, 2020

My 2 cents: option 4, while we are waiting for 3.0.16. I am also neutral on the other options, except 2.

@HeartSaVioR
Contributor Author

UPDATE: I’ve managed to fix all of the Spark test failures we’ve seen with Janino 3.1.1 - see #27860. (That is, I dealt with option 1.) We still need to wait for these patches to be merged and released (so this patch is not outdated yet), but we're in a better situation, as both 3.1.2 and 3.0.16 would work; it's worth waiting a bit more to see how things go on the Janino side.

@viirya
Member

viirya commented Mar 14, 2020

Thanks for doing this fix!

Among the options, I'd prefer to tune ExpandExec as @rednaxelafx proposed to work around this issue in the short term. Then we can wait for either a future 3.0.16 release, or Janino 3.1.2, which fixes all test failures in Spark.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Mar 14, 2020

Honestly, I'm not sure about the details of the proposal I got from @rednaxelafx - the only way I can imagine working around this by touching just ExpandExec is option 4 from my list, and if he meant some other trick/technique then I don't get it. It would be nice if someone could help me understand by elaborating a bit.

@maropu
Member

maropu commented Mar 14, 2020

Option 4 looks fine to me. Btw, is splitting the large code in the switch into pieces a solution for this issue? Or do we also need to replace switch with if?

Modify ExpandExec to check the number of operations in the switch statement, and use if ~ else if when the number of operations exceeds a threshold. Ideally this would check the length of the bytecode offsets, but it would be weird for Spark to do that, so it counts the lines blindly. A performance regression may happen in cases that could run with switch but run with if ~ else if due to the blind count, though such cases shouldn't be common.

I just want to know the actual performance numbers of this approach. I think splitting large code into small parts might improve performance.

I have one, but I cannot share it since the query is from an actual customer. If you're OK with just the generated code, I've attached the file in the Janino issue janino-compiler/janino#113.

To reproduce the issue, could you build a simple query that you can show us, based on your private customer's query? I think such a query would help us understand the issue better.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Mar 14, 2020

btw, is splitting the large code in the switch into pieces a solution for this issue? Or do we also need to replace switch with if?
I just want to know the actual performance numbers for this approach. I think splitting large code into small parts might improve performance.

This patch is intended to add a workaround for the bug until the actual fix lands in Janino; it would be nice to develop the "improvement" ideas in a separate thread. I'm not a JVM expert, so I'm not sure how much JIT would help (if all we expect from making methods smaller is that they get inlined, that would be technically the same as before).

That's why I've also been investigating option 1; this patch can simply be abandoned if we are OK with #27860 (after an official Janino release, of course).

To reproduce the issue, could you build a simple query that you can show us, based on your private customer's query? I think such a query would help us understand the issue better.

I've added a UT which fails on the master branch, along with comments around it explaining the details. I don't think it requires a very complicated query; lots of columns and a sufficient number of distinct aggregation functions will trigger the bug.
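For illustration only (this is not the actual UT, and the table/column names are made up), a sketch of the query shape described above: many grouping columns plus several COUNT(DISTINCT ...) aggregates, which is the pattern that makes Spark plan an Expand node with one projection per distinct group and thus a huge switch in expand_doConsume:

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical generator for the "lots of columns + several distinct
// aggregations" query shape that triggers the Janino switch bug.
public class ReproQuerySketch {
    static String wideDistinctQuery(int numCols, int numDistinct) {
        // Grouping columns c1..cN (made-up names).
        String cols = IntStream.rangeClosed(1, numCols)
            .mapToObj(i -> "c" + i).collect(Collectors.joining(", "));
        // Distinct aggregates over separate columns d1..dM, so Spark must
        // rewrite the aggregation through an Expand node.
        String aggs = IntStream.rangeClosed(1, numDistinct)
            .mapToObj(i -> "COUNT(DISTINCT d" + i + ")")
            .collect(Collectors.joining(", "));
        return "SELECT " + cols + ", " + aggs + " FROM t GROUP BY " + cols;
    }
}
```

Scaling numCols and numDistinct up inflates the per-case body of the generated switch, which is what eventually hits the "Operand stack inconsistent" failure.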

@rednaxelafx
Contributor

rednaxelafx commented Mar 16, 2020

My suggestion for tuning codegen may have been too cryptic, sorry about not providing more background explanation.

Let me explain why I arrived at the Literal codegen improvement suggestion (targeted at the Spark 2.3.x branch -- anything starting from Spark 2.4.0 already has it). @viirya probably saw through this already ;-)

My assumptions:

  1. The problematic code from the actual customer was generated by Spark 2.3.x, which @HeartSaVioR is trying to help fix. So to unblock this specific customer, slightly improving the Spark 2.3.x branch may help.
  2. This PR targets master because the same Janino issue still shows up in master as well.
  3. But this specific customer isn't using Spark 2.4 yet (nor Spark 3.0, since it isn't even released yet).

From what @HeartSaVioR was able to share in janino-compiler/janino#113, specifically in the attachment https://github.com/janino-compiler/janino/files/4295191/error-codegen.log, we can see that a large portion of the switch...case body is filled with code like the following:

/* 4677 */       case 0:
/* 4678 */         final long expand_value129 = -1L;
/* 4679 */         expand_isNull66 = true;
/* 4680 */         expand_value66 = expand_value129;
/* 4681 */
/* 4682 */         final long expand_value130 = -1L;
/* 4683 */         expand_isNull67 = true;
/* 4684 */         expand_value67 = expand_value130;
/* 4685 */
/* 4686 */         final long expand_value131 = -1L;
/* 4687 */         expand_isNull68 = true;
/* 4688 */         expand_value68 = expand_value131;
/* 4689 */
/* 4690 */         final long expand_value132 = -1L;
/* 4691 */         expand_isNull69 = true;
/* 4692 */         expand_value69 = expand_value132;

If we zoom in on the 3 statements here:

/* 4678 */         final long expand_value129 = -1L;
/* 4679 */         expand_isNull66 = true;
/* 4680 */         expand_value66 = expand_value129;

javac would have followed the JLS and compiled it into two assignments (L4679 and L4680, with constant propagation), skipping L4678 because it's a Java language level constant expression initialization.
In pseudo-bytecode, that's:

load constant int 1  # constant true
store to local expand_isNull66
load constant long -1L
store to local expand_value66

Janino, on the other hand, doesn't handle all of Java's language-level constant expression constructs. Specifically, it does not support local constant declarations, so it compiles the code as 3 assignments:

load constant long -1L
store to local expand_value129
load constant int 1  # constant true
store to local expand_isNull66
load from local expand_value129
store to local expand_value66

So every "chunk" of code above would be a few more bytes of bytecode. It may be small for just one chunk, but the codegen for ExpandExec makes that problem really bad by generating a lot of them.

So where does this "value = -1L; isNull = true" thingy come from? We can infer that it's generated from Literal(null, LongType). That is, https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L282-L285

We can easily make Spark 2.3.x generate slightly better code, by improving Literal's codegen, replacing

    if (value == null) {
      ev.isNull = "true"
      ev.copy(s"final $javaType ${ev.value} = ${ctx.defaultValue(dataType)};")
    }

with something like:

    if (value == null) {
      ev.copy(
        isNull = "true",
        value = ctx.defaultValue(dataType)
      )
    }

(The value part may or may not have to be further adjusted, perhaps to value = s"(($javaType)${ctx.defaultValue(dataType)})" for maximal safety.)

which would improve the example above into:

/* 4678 */         // THIS IS GONE: final long expand_value129 = -1L;
/* 4679 */         expand_isNull66 = true;
/* 4680 */         expand_value66 = -1L;

so that Janino can generate less bytecode for this particular query.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Mar 17, 2020

Thanks for the detailed explanation, @rednaxelafx !

Btw, as I shared earlier, I have been investigating Janino itself - I fixed all of the bugs on 3.1.x which made Spark UTs fail, and happily the Janino maintainer made a branch for 3.0.x as well. Maybe we can get both the 3.0.16 & 3.1.2 releases this week (I've asked in janino-compiler/janino#115 (comment)), and then we can simply upgrade Janino and abandon this PR.

I'm running the UTs in #27860 for Janino 3.1.2 (custom jar via JitPack), as well as #27932 for Janino 3.0.16 (custom jar via JitPack). Once both builds run fine, I'll forward the results to the Janino maintainer.

@HeartSaVioR
Contributor Author

Both builds ran fine - I've shared the results with the Janino maintainer. It looks like he is open to releasing 3.0.16, but most likely as a one-time release; 3.0.x would never be LTS. So it would be ideal if we feel OK going with Janino 3.1.2, but if we're concerned about stability we could just go with 3.0.16.

@HeartSaVioR
Contributor Author

I've finally updated my two WIP PRs to use the official releases, which removes the need for this PR.

Let's move on, but first decide which version would be better: 3.1.2 vs 3.0.16. (The Janino maintainer clearly wants to see us adopt 3.1.x, but the decision is up to us.)

#27860 for Janino 3.1.2
#27932 for Janino 3.0.16

cc to everyone who commented here, to gather preferences on the Janino version.

@squito @dongjoon-hyun @maropu @kiszk @rednaxelafx @srowen @viirya

@srowen
Member

srowen commented Mar 19, 2020

I don't have a strong preference, but if 3.1.2 works, seems like the time to jump to it is in Spark 3.0, all else equal?

@rednaxelafx
Contributor

rednaxelafx commented Mar 19, 2020

@HeartSaVioR thank you so much for sorting it out! It's great to see the problems fixed on the Janino side.

My preference on which version to upgrade to is on the conservative side. It'd be nice if we could take a baby step at this stage of the Spark 3.0 release... so +1 on 3.0.16 from me for 3.0 (and maybe 2.4 too), and 3.1.2 on master.

@maropu
Member

maropu commented Mar 19, 2020

But if we use Janino v3.0.16 for Spark v3.0.0, won't our 3.0.x maintenance releases miss out on future Janino 3.1.x bugfix releases?

@HeartSaVioR HeartSaVioR deleted the SPARK-31115 branch March 19, 2020 08:45
@HeartSaVioR
Contributor Author

Looks like the Janino maintainer explicitly marked the 3.0.x line as deprecated in the Janino changelog.

Version 3.0.16, 2020-03-18 (the 3.0.x line is deprecated)

Upgrading Janino to 3.0.16 in Spark 2.4.x wouldn't be an issue, since I expect only a few more Spark 2.4.x releases, but for Spark 3.0.0 we expect many further bugfix releases, so @maropu's concern looks valid to me.

I'm not sure whether there's a policy on upgrading dependencies based on semver - e.g. bumping a dependency's minor version in a Spark bugfix release - but if that's considered bad practice in the community, we may be better off avoiding the deprecated version line. Let's collect more voices here.

@rednaxelafx
Contributor

rednaxelafx commented Mar 19, 2020

Hmm, that's a hard choice. Janino's builtin tests are far from having good coverage; having hacked my own fork of Janino once, and knowing the code fairly well, I feel somewhat uneasy about the huge refactorings...

Sticking to the "deprecated" Janino 3.0.x branch for Spark 3.0.x doesn't sound like that bad of a situation to me. Spark 3.0 took a long time, but Spark 3.1 shouldn't be that far away.

That said, I wouldn't be too sad if we end up taking Janino 3.1.2 into Spark 3.0.0.
We'll just release Spark patch version with a new Janino release if we hit a Janino bug...right?

@HeartSaVioR
Contributor Author

Janino's builtin tests are far from having good coverage

I tend to agree, though I also understand Janino's situation - only one maintainer, and it's hard to anticipate all usages. Spark's generated code is far, far from hand-written code; no one would have imagined it ;)

Maybe it would be ideal to build regression tests on the Spark side - either a set of generated code samples, or E2E tests.

We'll just release Spark patch version with a new Janino release if we hit a Janino bug...right?

I'll try to sort it out when it happens, but a 3rd party is outside of our control. A 3.0.17 release is unlikely to happen, whereas a 3.1.3 is likely; it would be easier to persuade the maintainer to cut a new release there.

@srowen
Member

srowen commented Mar 19, 2020

Semver still applies in that behavior changes - even in the shaded copy - can affect apps. But there's no issue w.r.t. semver in updating to a new minor release, especially in a major Spark release.
Of course, we have to evaluate how closely the dependency follows semver.

But yes, the issues are: how risky is the update in general? Breakage is bad in any Spark release. How much does it help? If the current version isn't maintained, when do you have to update and take that risk, and is that better at a major vs. minor release boundary?

That is, if there is any perceived risk, would you take it now or in a minor Spark release?
I don't feel strongly, but I think I'd try for 3.1.x in Spark 3.0 if there is no plausible risk of breakage that we know of.
