SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets #33
Conversation
…based on comments, and remove extra lines I missed in rebase from AkkaUtils
….9-with-client-rebase_rework Conflicts: core/src/main/scala/org/apache/spark/SparkEnv.scala core/src/main/scala/org/apache/spark/network/ConnectionManager.scala core/src/main/scala/org/apache/spark/ui/JettyUtils.scala repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
….ui to spark.ui.acls.enable, and fix up various other things from review comments
Build triggered.
Build started.
Build triggered.
Build finished.
All automated tests passed.
@tgravescs is this ready for another round of review, or are you still working on it?
It's ready for review. I believe I've addressed all the comments from the previous PR.
* Spark does not currently support encryption after authentication.
*
* At this point spark has multiple communication protocols that need to be secured and
* different underlying mechisms are used depending on the protocol:
mechanisms
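For context on the comment block above, here is a rough sketch of the per-protocol split it describes. This is illustrative only: the channel names come from the PR title, and the mechanism strings are assumptions about how this patch secures each channel, not a definitive statement of it.

```scala
// Hypothetical summary, not Spark's actual API: each communication channel
// touched by this PR is authenticated by a different underlying mechanism.
sealed trait Channel
case object Akka extends Channel              // driver <-> executor RPC
case object HttpServer extends Channel        // broadcast / file server
case object ConnectionManager extends Channel // block transfers
case object WebUI extends Channel             // served through servlets

def authMechanism(channel: Channel): String = channel match {
  case Akka              => "shared secret checked when a remote actor system connects"  // assumption
  case HttpServer        => "HTTP digest authentication derived from the shared secret"  // assumption
  case ConnectionManager => "SASL handshake exchanged via SecurityMessage"               // assumption
  case WebUI             => "javax servlet filters plus spark.ui.acls.enable view ACLs"  // assumption
}
```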
These were causing errors on the configuration page. Author: Patrick Wendell <pwendell@gmail.com> Closes #111 from pwendell/master and squashes the following commits: 8467a86 [Patrick Wendell] Fix markup errors introduced in #33 (SPARK-1189)
…key_change Fixing SPARK-602: PythonPartitioner Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark. (cherry picked from commit 0864193) Signed-off-by: Reynold Xin <rxin@apache.org>
val securityMsg = SecurityMessage.fromBufferMessage(bufferMessage)
val connectionId = ConnectionId.createConnectionIdFromString(securityMsg.getConnectionId)

connectionsAwaitingSasl.get(connectionId) match {
Is this prone to race conditions if both connection managers simultaneously try to initiate connections with each other? Here's the scenario I'm worried about:
- ConnectionManagers Alice and Bob, both newly-created, simultaneously attempt to send messages to each other.
- Alice adds Bob to connectionsAwaitingSasl and sends a message to Bob to begin negotiating the connection.
- Bob's first message to Alice, negotiating his sending connection, arrives at Alice before Bob receives and responds to Alice's request.
- When Bob receives Alice's first message, he incorrectly assumes that it's a response to his send and chooses to handle it with handleClientAuthentication, even though he should have handled it with handleServerAuthentication.
Can this problematic scenario occur? If so, it would be safer to add a field to the SecurityMessage to indicate whether it's from a SASL client or server.
This is what the unique connectionId is for. If the message received carries the connectionId from when this side sent its security message, and that id is in the connectionsAwaitingSasl queue, then it's a response; otherwise the connectionId won't match.
- In this case Alice sends a message to Bob.
- connectionsAwaitingSasl on Alice gets Alice_1 added, and Alice_1 is the connectionId of that message.
- Bob sends a message to Alice.
- connectionsAwaitingSasl on Bob gets Bob_1 added, and Bob_1 is the connectionId of that message.
- Bob gets Alice's first message and checks the connectionId in that message, which is Alice_1; it isn't in connectionsAwaitingSasl (which only contains Bob_1), so Bob acts as the server.
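To make that dispatch concrete, here is a minimal sketch of the connectionId check under simplified types. The names mirror the identifiers discussed above (SecurityMessage, ConnectionId, connectionsAwaitingSasl, handleClientAuthentication, handleServerAuthentication), but the bodies and the map type are assumptions, not the actual ConnectionManager code:

```scala
import scala.collection.concurrent.TrieMap

// Illustrative stand-ins for the real classes discussed in this thread.
case class ConnectionId(host: String, port: Int, id: Int)
case class SecurityMessage(connectionId: ConnectionId, token: Array[Byte])

class SaslDispatcher {
  // Negotiations *we* initiated, keyed by the connectionId we generated when
  // we sent our first security message to the peer.
  private val connectionsAwaitingSasl = new TrieMap[ConnectionId, Array[Byte]]()

  def handleSecurityMessage(msg: SecurityMessage): Unit =
    connectionsAwaitingSasl.get(msg.connectionId) match {
      case Some(_) =>
        // The message carries a connectionId we created, so it can only be a
        // response to our own negotiation: act as the SASL client.
        handleClientAuthentication(msg)
      case None =>
        // Unknown connectionId: the peer initiated this exchange, so act as
        // the SASL server, even if we also have our own negotiation
        // outstanding under a different connectionId.
        handleServerAuthentication(msg)
    }

  private def handleClientAuthentication(msg: SecurityMessage): Unit = ()
  private def handleServerAuthentication(msg: SecurityMessage): Unit = ()
}
```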
Fix Datadog metric name sanitization
resolve merge conflicts in vectorized parquet reader
## What changes were proposed in this pull request? This PR upgrades Janino to version 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix that reduces the number of constant pool entries by using the 'sipush' java bytecode. * SIPUSH bytecode is not used for short integer constants [apache#33](janino-compiler/janino#33). Please see details in [this discussion thread](apache#19518 (comment)). ## How was this patch tested? Existing tests. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#19890 from kiszk/SPARK-22688.
…nput of UDF as double in the failed test in udf-aggregate_part1.sql ## What changes were proposed in this pull request? It can still be flaky on certain environments due to the float limitation described at apache#25110 . See apache#25110 (comment) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6584/testReport/org.apache.spark.sql/SQLQueryTestSuite/udf_pgSQL_udf_aggregates_part1_sql___Regular_Python_UDF/ ``` Expected "700000000000[6] 1", but got "700000000000[5] 1" Result did not match for query #33 SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3)) FROM (VALUES (7000000000005), (7000000000007)) v(x) ``` Here's what's going on: apache#25110 (comment) ``` scala> Seq("7000000000004.999", "7000000000006.999").toDF().selectExpr("CAST(avg(value) AS long)").show() +--------------------------+ |CAST(avg(value) AS BIGINT)| +--------------------------+ | 7000000000005| +--------------------------+ ``` Therefore, this PR just avoids the cast in the specific test. This is a temporary fix; we need a more robust way to avoid such cases. ## How was this patch tested? It passes with Maven locally before/after this PR. I believe the problem is similarly related to the Python or OS installed on the machine. I should test this against the PR builder with `test-maven` to be sure. Closes apache#25128 from HyukjinKwon/SPARK-28270-2. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Support separate LBaaS tests for terraform-openstack-provider
[YSPARK-1595][FOLLOWUP] Add extraClassPath option to the rhel8 images for oozie workflow
### What changes were proposed in this pull request? As title. This PR is to add code-gen support for LEFT SEMI sort merge join. The main change is to add `semiJoin` code path in `SortMergeJoinExec.doProduce()` and introduce `onlyBufferFirstMatchedRow` in `SortMergeJoinExec.genScanner()`. The latter is for left semi sort merge join without condition. For this kind of query, we don't need to buffer all matched rows, but only the first one (this is same as non-code-gen code path). Example query: ``` val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(4).select($"id".as("k2")) val oneJoinDF = df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2", "left_semi") ``` Example of generated code for the query: ``` == Subtree 5 / 5 (maxMethodCodeSize:302; maxConstantPoolSize:156(0.24% used); numInnerClasses:0) == *(5) Project [id#0L AS k1#2L] +- *(5) SortMergeJoin [id#0L], [k2#6L], LeftSemi :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#0L, 5), ENSURE_REQUIREMENTS, [id=#27] : +- *(1) Range (0, 10, step=1, splits=2) +- *(4) Sort [k2#6L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#33] +- *(3) Project [id#4L AS k2#6L] +- *(3) Range (0, 4, step=1, splits=2) Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage5(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=5 /* 006 */ final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator smj_streamedInput_0; /* 010 */ private scala.collection.Iterator smj_bufferedInput_0; /* 011 */ private InternalRow smj_streamedRow_0; /* 012 */ private InternalRow smj_bufferedRow_0; /* 013 */ private long smj_value_2; /* 014 */ private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0; /* 015 */ private long smj_value_3; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; /* 017 */ /* 018 */ public GeneratedIteratorForCodegenStage5(Object[] references) { /* 019 */ this.references = references; /* 020 */ } /* 021 */ /* 022 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 023 */ partitionIndex = index; /* 024 */ this.inputs = inputs; /* 025 */ smj_streamedInput_0 = inputs[0]; /* 026 */ smj_bufferedInput_0 = inputs[1]; /* 027 */ /* 028 */ smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(1, 2147483647); /* 029 */ smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 030 */ smj_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 031 */ /* 032 */ } /* 033 */ /* 034 */ private boolean findNextJoinRows( /* 035 */ scala.collection.Iterator streamedIter, /* 036 */ scala.collection.Iterator bufferedIter) { /* 037 */ smj_streamedRow_0 = null; /* 038 */ int comp = 0; /* 039 */ while (smj_streamedRow_0 == null) { /* 040 */ if (!streamedIter.hasNext()) return false; /* 041 */ smj_streamedRow_0 = (InternalRow) streamedIter.next(); /* 042 */ long smj_value_0 = smj_streamedRow_0.getLong(0); /* 043 */ if (false) { /* 044 */ smj_streamedRow_0 = null; /* 045 */ continue; /* 046 */ /* 047 */ } /* 048 */ if 
(!smj_matches_0.isEmpty()) { /* 049 */ comp = 0; /* 050 */ if (comp == 0) { /* 051 */ comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? -1 : 0); /* 052 */ } /* 053 */ /* 054 */ if (comp == 0) { /* 055 */ return true; /* 056 */ } /* 057 */ smj_matches_0.clear(); /* 058 */ } /* 059 */ /* 060 */ do { /* 061 */ if (smj_bufferedRow_0 == null) { /* 062 */ if (!bufferedIter.hasNext()) { /* 063 */ smj_value_3 = smj_value_0; /* 064 */ return !smj_matches_0.isEmpty(); /* 065 */ } /* 066 */ smj_bufferedRow_0 = (InternalRow) bufferedIter.next(); /* 067 */ long smj_value_1 = smj_bufferedRow_0.getLong(0); /* 068 */ if (false) { /* 069 */ smj_bufferedRow_0 = null; /* 070 */ continue; /* 071 */ } /* 072 */ smj_value_2 = smj_value_1; /* 073 */ } /* 074 */ /* 075 */ comp = 0; /* 076 */ if (comp == 0) { /* 077 */ comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0); /* 078 */ } /* 079 */ /* 080 */ if (comp > 0) { /* 081 */ smj_bufferedRow_0 = null; /* 082 */ } else if (comp < 0) { /* 083 */ if (!smj_matches_0.isEmpty()) { /* 084 */ smj_value_3 = smj_value_0; /* 085 */ return true; /* 086 */ } else { /* 087 */ smj_streamedRow_0 = null; /* 088 */ } /* 089 */ } else { /* 090 */ if (smj_matches_0.isEmpty()) { /* 091 */ smj_matches_0.add((UnsafeRow) smj_bufferedRow_0); /* 092 */ } /* 093 */ /* 094 */ smj_bufferedRow_0 = null; /* 095 */ } /* 096 */ } while (smj_streamedRow_0 != null); /* 097 */ } /* 098 */ return false; // unreachable /* 099 */ } /* 100 */ /* 101 */ protected void processNext() throws java.io.IOException { /* 102 */ while (findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0)) { /* 103 */ long smj_value_4 = -1L; /* 104 */ smj_value_4 = smj_streamedRow_0.getLong(0); /* 105 */ scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator(); /* 106 */ boolean smj_hasOutputRow_0 = false; /* 107 */ /* 108 */ while (!smj_hasOutputRow_0 && smj_iterator_0.hasNext()) { /* 109 */ InternalRow smj_bufferedRow_1 = (InternalRow) smj_iterator_0.next(); /* 110 */ /* 111 */ smj_hasOutputRow_0 = true; /* 112 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 113 */ /* 114 */ // common sub-expressions /* 115 */ /* 116 */ smj_mutableStateArray_0[1].reset(); /* 117 */ /* 118 */ smj_mutableStateArray_0[1].write(0, smj_value_4); /* 119 */ append((smj_mutableStateArray_0[1].getRow()).copy()); /* 120 */ /* 121 */ } /* 122 */ if (shouldStop()) return; /* 123 */ } /* 124 */ ((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] /* plan */).cleanupResources(); /* 125 */ } /* 126 */ /* 127 */ } ``` ### Why are the changes needed? Improve query CPU performance. 
Test with one query: ``` def sortMergeJoin(): Unit = { val N = 2 << 20 codegenBenchmark("left semi sort merge join", N) { val df1 = spark.range(N).selectExpr(s"id * 2 as k1") val df2 = spark.range(N).selectExpr(s"id * 3 as k2") val df = df1.join(df2, col("k1") === col("k2"), "left_semi") assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.noop() } } ``` Seeing 30% of run-time improvement: ``` Running benchmark: left semi sort merge join Running case: left semi sort merge join code-gen off Stopped after 2 iterations, 1369 ms Running case: left semi sort merge join code-gen on Stopped after 5 iterations, 2743 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz left semi sort merge join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ left semi sort merge join code-gen off 676 685 13 3.1 322.2 1.0X left semi sort merge join code-gen on 524 549 32 4.0 249.7 1.3X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `WholeStageCodegenSuite.scala` and `ExistenceJoinSuite.scala`. Closes #32528 from c21/smj-left-semi. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? As title. This PR is to add code-gen support for LEFT ANTI sort merge join. The main change is to extract `loadStreamed` in `SortMergeJoinExec.doProduce()`. That is to set all columns values for streamed row, when the streamed row has no output row. Example query: ``` val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(4).select($"id".as("k2")) df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2", "left_anti") ``` Example generated code: ``` == Subtree 5 / 5 (maxMethodCodeSize:296; maxConstantPoolSize:156(0.24% used); numInnerClasses:0) == *(5) Project [id#0L AS k1#2L] +- *(5) SortMergeJoin [id#0L], [k2#6L], LeftAnti :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#0L, 5), ENSURE_REQUIREMENTS, [id=#27] : +- *(1) Range (0, 10, step=1, splits=2) +- *(4) Sort [k2#6L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#33] +- *(3) Project [id#4L AS k2#6L] +- *(3) Range (0, 4, step=1, splits=2) Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage5(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=5 /* 006 */ final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator smj_streamedInput_0; /* 010 */ private scala.collection.Iterator smj_bufferedInput_0; /* 011 */ private InternalRow smj_streamedRow_0; /* 012 */ private InternalRow smj_bufferedRow_0; /* 013 */ private long smj_value_2; /* 014 */ private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0; /* 015 */ private long smj_value_3; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; /* 017 */ /* 018 */ public GeneratedIteratorForCodegenStage5(Object[] references) { /* 019 */ this.references = references; /* 020 */ } /* 021 */ /* 022 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 023 */ partitionIndex = index; /* 024 */ this.inputs = inputs; /* 025 */ smj_streamedInput_0 = inputs[0]; /* 026 */ smj_bufferedInput_0 = inputs[1]; /* 027 */ /* 028 */ smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(1, 2147483647); /* 029 */ smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 030 */ smj_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 031 */ /* 032 */ } /* 033 */ /* 034 */ private boolean findNextJoinRows( /* 035 */ scala.collection.Iterator streamedIter, /* 036 */ scala.collection.Iterator bufferedIter) { /* 037 */ smj_streamedRow_0 = null; /* 038 */ int comp = 0; /* 039 */ while (smj_streamedRow_0 == null) { /* 040 */ if (!streamedIter.hasNext()) return false; /* 041 */ smj_streamedRow_0 = (InternalRow) streamedIter.next(); /* 042 */ long smj_value_0 = smj_streamedRow_0.getLong(0); /* 043 */ if (false) { /* 044 */ if (!smj_matches_0.isEmpty()) { /* 045 */ smj_matches_0.clear(); /* 046 */ } /* 047 */ return false; /* 048 */ /* 049 */ } /* 050 */ if (!smj_matches_0.isEmpty()) { /* 051 */ comp = 0; /* 052 */ if (comp == 0) { /* 053 */ comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? 
-1 : 0); /* 054 */ } /* 055 */ /* 056 */ if (comp == 0) { /* 057 */ return true; /* 058 */ } /* 059 */ smj_matches_0.clear(); /* 060 */ } /* 061 */ /* 062 */ do { /* 063 */ if (smj_bufferedRow_0 == null) { /* 064 */ if (!bufferedIter.hasNext()) { /* 065 */ smj_value_3 = smj_value_0; /* 066 */ return !smj_matches_0.isEmpty(); /* 067 */ } /* 068 */ smj_bufferedRow_0 = (InternalRow) bufferedIter.next(); /* 069 */ long smj_value_1 = smj_bufferedRow_0.getLong(0); /* 070 */ if (false) { /* 071 */ smj_bufferedRow_0 = null; /* 072 */ continue; /* 073 */ } /* 074 */ smj_value_2 = smj_value_1; /* 075 */ } /* 076 */ /* 077 */ comp = 0; /* 078 */ if (comp == 0) { /* 079 */ comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0); /* 080 */ } /* 081 */ /* 082 */ if (comp > 0) { /* 083 */ smj_bufferedRow_0 = null; /* 084 */ } else if (comp < 0) { /* 085 */ if (!smj_matches_0.isEmpty()) { /* 086 */ smj_value_3 = smj_value_0; /* 087 */ return true; /* 088 */ } else { /* 089 */ return false; /* 090 */ } /* 091 */ } else { /* 092 */ if (smj_matches_0.isEmpty()) { /* 093 */ smj_matches_0.add((UnsafeRow) smj_bufferedRow_0); /* 094 */ } /* 095 */ /* 096 */ smj_bufferedRow_0 = null; /* 097 */ } /* 098 */ } while (smj_streamedRow_0 != null); /* 099 */ } /* 100 */ return false; // unreachable /* 101 */ } /* 102 */ /* 103 */ protected void processNext() throws java.io.IOException { /* 104 */ while (smj_streamedInput_0.hasNext()) { /* 105 */ findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0); /* 106 */ /* 107 */ long smj_value_4 = -1L; /* 108 */ smj_value_4 = smj_streamedRow_0.getLong(0); /* 109 */ scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator(); /* 110 */ /* 111 */ boolean wholestagecodegen_hasOutputRow_0 = false; /* 112 */ /* 113 */ while (!wholestagecodegen_hasOutputRow_0 && smj_iterator_0.hasNext()) { /* 114 */ InternalRow smj_bufferedRow_1 = (InternalRow) smj_iterator_0.next(); /* 115 */ /* 116 */ wholestagecodegen_hasOutputRow_0 = true; /* 117 */ } /* 118 */ /* 119 */ if (!wholestagecodegen_hasOutputRow_0) { /* 120 */ // load all values of streamed row, because the values not in join condition are not /* 121 */ // loaded yet. /* 122 */ /* 123 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 124 */ /* 125 */ // common sub-expressions /* 126 */ /* 127 */ smj_mutableStateArray_0[1].reset(); /* 128 */ /* 129 */ smj_mutableStateArray_0[1].write(0, smj_value_4); /* 130 */ append((smj_mutableStateArray_0[1].getRow()).copy()); /* 131 */ /* 132 */ } /* 133 */ if (shouldStop()) return; /* 134 */ } /* 135 */ ((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] /* plan */).cleanupResources(); /* 136 */ } /* 137 */ /* 138 */ } ``` ### Why are the changes needed? Improve the query CPU performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `WholeStageCodegenSuite.scala`, and existed unit test in `ExistenceJoinSuite.scala`. Closes #32547 from c21/smj-left-anti. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request? Optimize the TransposeWindow rule to extend applicable cases and optimize time complexity. TransposeWindow rule will try to eliminate unnecessary shuffle: but the function compatiblePartitions will only take the first n elements of the window2 partition sequence, for some cases, this will not take effect, like the case below: val df = spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS d") df.selectExpr( "sum(`d`) OVER(PARTITION BY `b`,`a`) as e", "sum(`c`) OVER(PARTITION BY `a`) as f" ).explain Current plan == Physical Plan == *(5) Project [e#10L, f#11L] +- Window [sum(c#4L) windowspecdefinition(a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#11L], [a#2L] +- *(4) Sort [a#2L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#2L, 200), true, [id=#41] +- *(3) Project [a#2L, c#4L, e#10L] +- Window [sum(d#5L) windowspecdefinition(b#3L, a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#10L], [b#3L, a#2L] +- *(2) Sort [b#3L ASC NULLS FIRST, a#2L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(b#3L, a#2L, 200), true, [id=#33] +- *(1) Project [id#0L AS d#5L, id#0L AS b#3L, id#0L AS a#2L, id#0L AS c#4L] +- *(1) Range (0, 10, step=1, splits=10) Expected plan: == Physical Plan == *(4) Project [e#924L, f#925L] +- Window [sum(d#43L) windowspecdefinition(b#41L, a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#924L], [b#41L, a#40L] +- *(3) Sort [b#41L ASC NULLS FIRST, a#40L ASC NULLS FIRST], false, 0 +- *(3) Project [d#43L, b#41L, a#40L, f#925L] +- Window [sum(c#42L) windowspecdefinition(a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#925L], [a#40L] +- *(2) Sort [a#40L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#40L, 200), true, [id=#282] +- *(1) Project [id#38L AS d#43L, id#38L AS b#41L, id#38L AS a#40L, id#38L AS c#42L] +- *(1) Range (0, 10, step=1, splits=10) Also the permutations method has a O(n!) time complexity, which is very expensive when there are many partition columns, we could try to optimize it. ### Why are the changes needed? We could apply the rule for more cases, which could improve the execution performance by eliminate unnecessary shuffle, and by reducing the time complexity from O(n!) to O(n2), the performance for the rule itself could improve ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT Closes #35334 from constzhou/SPARK-38034_optimize_transpose_window_rule. Authored-by: xzhou <15210830305@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
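The subset check that replaces the permutation-based compatiblePartitions test can be sketched as follows. This is an illustrative reconstruction of the idea described above, not the exact Catalyst code, and it assumes the Catalyst `Expression.semanticEquals` API:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Window w1 can be transposed below window w2 when w1's partition keys are a
// strict subset of w2's, compared by semantic equality. Checking containment
// is O(n^2) in the number of partition expressions, instead of enumerating
// O(n!) permutations to look for a matching prefix.
def compatiblePartitions(ps1: Seq[Expression], ps2: Seq[Expression]): Boolean =
  ps1.length < ps2.length && ps1.forall(e1 => ps2.exists(e1.semanticEquals))
```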
Resubmitted pull request; was https://github.com/apache/incubator-spark/pull/332.