[CORE] Cannot destroy broadcast in Spark 2.4.3 #26408
Closed
Conversation
…nsSuite
## What changes were proposed in this pull request?
Update `HiveExternalCatalogVersionsSuite` to test 2.4.2, as 2.4.1 will be removed from the mirror network soon.
## How was this patch tested?
N/A
Closes #24452 from cloud-fan/release.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7f9830)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Right now Kafka source v2 doesn't support null values. The issue is in org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow, which doesn't handle null values.
## How was this patch tested?
Add new unit tests.
Closes #24441 from uncleGen/SPARK-27494.
Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d2656aa)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
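For context (not part of the commit), a minimal sketch of a Kafka read whose records pass through this converter; the broker address and topic name are placeholders:
```scala
// Assumes a spark-shell session (`spark` in scope) with the
// spark-sql-kafka-0-10 package on the classpath.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()

// key and value are binary and may be NULL (e.g. tombstone records on a
// compacted topic); casting keeps the nulls instead of failing.
val parsed = kafkaDf.selectExpr("CAST(key AS STRING) AS key",
                                "CAST(value AS STRING) AS value")
```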
…in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/ ## How was this patch tested? manually. Closes #24454 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ackendSuite ## What changes were proposed in this pull request? The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi threaded access of the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test TaskSchedulerImpl is simply subclassed. This multithreaded access of the `nodeBlacklist()` method is coming from: 1) the unit test thread via calling of the method `prepareRequestExecutors()` 2) the `DriverEndpoint.onStart` which runs a periodic task that ends up calling this method ## How was this patch tested? Existing unittest. (cherry picked from commit e4e4e2b) Closes #24474 from attilapiros/SPARK-26891-branch-2.4. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…mons-crypto. The commons-crypto library does some questionable error handling internally, which can lead to JVM crashes if some call into native code fails and cleans up state it should not. While the library is not fixed, this change adds some workarounds in Spark code so that when an error is detected in the commons-crypto side, Spark avoids calling into the library further. Tested with existing and added unit tests. Closes #24476 from vanzin/SPARK-25535-2.4. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
… count This PR consists of the `test` components of #23665 only, minus the associated patch from that PR. It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior. This PR is intended to be deployed alongside #23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745). Manual testing, existing `JsonSuite` unit tests. Closes #23674 from sumitsu/json_emptyline_count_test. Authored-by: Branden Smith <branden.smith@publicismedia.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 63bced9) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…H in Hive UDAF adapter ## What changes were proposed in this pull request? This is a followup of #24144 . #24144 missed one case: when hash aggregate fallback to sort aggregate, the life cycle of UDAF is: INIT -> UPDATE -> MERGE -> FINISH. However, not all Hive UDAF can support it. Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer for UPDATE may not support MERGE. This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH, by turning it to INIT -> UPDATE -> FINISH + IINIT -> MERGE -> FINISH. ## How was this patch tested? a new test case Closes #24459 from cloud-fan/hive-udaf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7432e7d) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…2.9.8 ## What changes were proposed in this pull request? This reverts commit 6f394a2. In general, we need to be very cautious about the Jackson upgrade in the patch releases, especially when this upgrade could break the existing behaviors of the external packages or data sources, and generate different results after the upgrade. The external packages and data sources need to change their source code to keep the original behaviors. The upgrade requires more discussions before releasing it, I think. In the previous PR #22071, we turned off `spark.master.rest.enabled` by default and added the following claim in our security doc: > The Rest Submission Server and the MesosClusterDispatcher do not support authentication. You should ensure that all network access to the REST API & MesosClusterDispatcher (port 6066 and 7077 respectively by default) are restricted to hosts that are trusted to submit jobs. We need to understand whether this Jackson CVE applies to Spark. Before officially releasing it, we need more inputs from all of you. Currently, I would suggest to revert this upgrade from the upcoming 2.4.3 release, which is trying to fix the accidental default Scala version changes in pre-built artifacts. ## How was this patch tested? N/A Closes #24493 from gatorsmile/revert24418. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
… 2.4 release script ## What changes were proposed in this pull request? This PR is to cherry-pick all the missing and relevant commits that were merged to master but not to branch-2.4. Previously, dbtsai used the release script in the branch 2.4 to release 2.4.1. After more investigation, I found it is risky to make a 2.4 release by using the release script in the master branch since the release script has various changes. It could easily introduce unnoticeable issues, like what we did for 2.4.2. Thus, I would cherry-pick all the missing fixes and use the updated release script to release 2.4.3 ## How was this patch tested? N/A Closes #24503 from gatorsmile/upgradeReleaseScript. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: wright <wright@semmle.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…h shell env Although we use shebang `#!/usr/bin/env bash`, `minikube docker-env` returns invalid commands in `non-bash` environment and causes failures at `eval` because it only recognizes the default shell. We had better add `--shell bash` option explicitly in our `bash` script. ```bash $ bash -c 'eval $(minikube docker-env)' bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] $ bash -c 'eval $(minikube docker-env --shell bash)' ``` Manual. Run the script with non-bash shell environment. ``` bin/docker-image-tool.sh -m -t testing build ``` Closes #24517 from dongjoon-hyun/SPARK-27626. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6c2d351) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…s such as loss only during fitting phase ## What changes were proposed in this pull request? When transform(...) method is called on a LinearRegressionModel created directly with the coefficients and intercepts, the following exception is encountered. ``` java.util.NoSuchElementException: Failed to find a default value for loss at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at org.apache.spark.ml.param.Params$class.$(params.scala:786) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) ``` This is because validateAndTransformSchema() is called both during training and scoring phases, but the checks against the training related params like loss should really be performed during training phase only, I think, please correct me if I'm missing anything :) This issue was first reported for mleap (combust/mleap#455) because basically when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case, we no longer have all the training params. ## How was this patch tested? Added a unit test to check this scenario. Please let me know if there's anything additional required, this is the first PR that I've raised in this project. Closes #24509 from ancasarb/linear_regression_params_fix. Authored-by: asarb <asarb@expedia.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 4241a72) Signed-off-by: Sean Owen <sean.owen@databricks.com>
…tabase
## What changes were proposed in this pull request?
**Description from JIRA**
For the JDBC option `query`, we generate an identifier name that starts with an underscore: s"(${subquery}) _SPARK_GEN_JDBC_SUBQUERY_NAME${curId.getAndIncrement()}". This is not supported by Oracle.
Oracle doesn't seem to support identifier names that start with a non-alphabetic character (unless they are quoted), and it has length restrictions as well. [link](https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm)
In this PR, the generated alias name 'SPARK_GEN_JDBC_SUBQUERY_NAME<int value>' is fixed to remove the "_" prefix, and the alias is also shortened so that it does not exceed the identifier length limit. (A usage sketch of the JDBC `query` option follows this commit message.)
## How was this patch tested?
Tests are added for MySQL, Postgres, Oracle and DB2 to ensure enough coverage.
Closes #24532 from dilipbiswal/SPARK-27596.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit 6001d47)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
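For context (not part of the patch), a minimal sketch of the JDBC `query` option whose generated subquery alias this commit changes; the URL, credentials, and query text are placeholders:
```scala
// Assumes a spark-shell session (`spark` in scope) with the Oracle JDBC
// driver on the classpath. Spark wraps the query below in a generated
// subquery alias; before this fix that alias started with "_", which
// Oracle rejects for unquoted identifiers.
val employees = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")        // placeholder
  .option("query", "SELECT id, name FROM employees WHERE id > 10") // placeholder
  .option("user", "scott")                                         // placeholder
  .option("password", "secret")                                    // placeholder
  .load()
```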
…cationMetrics ## What changes were proposed in this pull request? Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`. ## How was this patch tested? A new unit test is added to verify thresholds from downsampled records. Closes #24470 from shishaochen/spark-mllib-binary-metrics. Authored-by: Shaochen Shi <shishaochen@bytedance.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit d5308cd) Signed-off-by: Sean Owen <sean.owen@databricks.com>
…rrectly
## What changes were proposed in this pull request?
If the interval is `0`, neither the value `0` nor the unit is shown at all. For example, this happens in explain plans and in the Spark Web UI on the `EventTimeWatermark` diagram.
**BEFORE**
```scala
scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#0: timestamp, interval 1 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#0]
scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#3: timestamp, interval
+- StreamingRelation FileSource[/tmp/t], [ts#3]
```
**AFTER**
```scala
scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#0: timestamp, interval 1 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#0]
scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#3: timestamp, interval 0 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#3]
```
## How was this patch tested?
Pass the Jenkins with the updated test case.
Closes #24516 from dongjoon-hyun/SPARK-27624.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 614a5cc)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
When following the documented example that uses `spark.streams().awaitAnyTermination()`,
otherwise valid PySpark code outputs the following error:
```
Traceback (most recent call last):
File "pyspark_app.py", line 182, in <module>
spark.streams().awaitAnyTermination()
TypeError: 'StreamingQueryManager' object is not callable
```
Docs URL: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries
This changes the documentation line to properly call the method under the StreamingQueryManager Class
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager
## How was this patch tested?
After changing the syntax, the error no longer occurs and the PySpark application works.
This is a docs-only change.
Closes #24547 from asaf400/patch-1.
Authored-by: Asaf Levy <asaf400@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 09422f5)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…cutor in PythonRunner ## What changes were proposed in this pull request? Backport #24542 to 2.4. ## How was this patch tested? existing tests Closes #24552 from jiangxb1987/SPARK-25139-2.4. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request? - the accumulator warning is too verbose - when a test fails with schema mismatch, you never see the error message / exception Closes #24549 from ericl/test-nits. Lead-authored-by: Eric Liang <ekl@databricks.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 80de449) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request? This PR adds since information to the all string expressions below: SPARK-8241 ConcatWs SPARK-16276 Elt SPARK-1995 Upper / Lower SPARK-20750 StringReplace SPARK-8266 StringTranslate SPARK-8244 FindInSet SPARK-8253 StringTrimLeft SPARK-8260 StringTrimRight SPARK-8267 StringTrim SPARK-8247 StringInstr SPARK-8264 SubstringIndex SPARK-8249 StringLocate SPARK-8252 StringLPad SPARK-8259 StringRPad SPARK-16281 ParseUrl SPARK-9154 FormatString SPARK-8269 Initcap SPARK-8257 StringRepeat SPARK-8261 StringSpace SPARK-8263 Substring SPARK-21007 Right SPARK-21007 Left SPARK-8248 Length SPARK-20749 BitLength SPARK-20749 OctetLength SPARK-8270 Levenshtein SPARK-8271 SoundEx SPARK-8238 Ascii SPARK-20748 Chr SPARK-8239 Base64 SPARK-8268 UnBase64 SPARK-8242 Decode SPARK-8243 Encode SPARK-8245 format_number SPARK-16285 Sentences ## How was this patch tested? N/A Closes #24578 from HyukjinKwon/SPARK-27672. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3442fca) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
We should add since info to all expressions. SPARK-7886 Rand / Randn af3746c RLike, Like (I manually checked that it exists from 1.0.0) SPARK-8262 Split SPARK-8256 RegExpReplace SPARK-8255 RegExpExtract 9aadcff Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0) SPARK-14541 IfNull / NullIf / Nvl / Nvl2 SPARK-9080 IsNaN SPARK-9168 NaNvl N/A Closes #24579 from HyukjinKwon/SPARK-27673. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c71f217) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…asks ## What changes were proposed in this pull request? This patch fixes a bug where `--supervised` Spark jobs would retry multiple times whenever an agent would crash, come back, and re-register even when those jobs had already relaunched on a different agent. That is: ``` - supervised driver is running on agent1 - agent1 crashes - driver is relaunched on another agent as `<task-id>-retry-1` - agent1 comes back online and re-registers with scheduler - spark relaunches the same job as `<task-id>-retry-2` - now there are two jobs running simultaneously ``` This is because when an agent would come back and re-register it would send a status update `TASK_FAILED` for its old driver-task. Previous logic would indiscriminately remove the `submissionId` from Zookeeper's `launchedDrivers` node and add it to `retryList` node. Then, when a new offer came in, it would relaunch another `-retry-` task even though one was previously running. For example logs, scroll to bottom ## How was this patch tested? - Added a unit test to simulate behavior described above - Tested manually on a DC/OS cluster by ``` - launching a --supervised spark job - dcos node ssh <to the agent with the running spark-driver> - systemctl stop dcos-mesos-slave - docker kill <driver-container-id> - [ wait until spark job is relaunched ] - systemctl start dcos-mesos-slave - [ observe spark driver is not relaunched as `-retry-2` ] ``` Log snippets included below. Notice the `-retry-1` task is running when status update for the old task comes in afterward: ``` 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: ... [offers] ... 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001" ... 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message='' 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message='' ... 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED ... 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1" ... 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message='' 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message='' ... 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered' ... 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update ... 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2" ... 
19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message='' 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message='' ``` Closes #24276 from samvantran/SPARK-27347-duplicate-retries. Authored-by: Sam Tran <stran@mesosphere.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bcd3b61) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Currently, when there is a null in a nested field of a struct, casting the struct throws an error.
```scala
scala> sql("select cast(struct(1, null) as struct<a:int,b:int>)").show
scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)
at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
```
Similarly, an inline table, which casts nulls in nested fields under the hood, also throws an error.
```scala
scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show
org.apache.spark.sql.AnalysisException: failed to evaluate expression named_struct('col1', 10, 'col2', NULL): NullType (of class org.apache.spark.sql.t
ypes.NullType$); line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
```
This fixes the issue.
Added tests.
Closes #24576 from viirya/cast-null.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 8b0bdaa)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ex datatypes in Union ## What changes were proposed in this pull request? When there is a `Union`, the reported output datatypes are the ones of the first plan and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means that the nullability of the inner elements is the one of the first plan. If this is not compatible with the one of other plans, errors can happen (as reported in the JIRA). The PR proposes to update the nullability of the inner elements of complex datatypes according to most permissive value of all the plans. ## How was this patch tested? added UT Closes #24604 from mgaido91/SPARK-26812_2.4. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ation`. ## What changes were proposed in this pull request? `StructuredSessionization` comment contains duplicate 'add', I think it should be changed. ## How was this patch tested? Exists UT. Closes #24589 from beliefer/remove-duplicate-add-in-comment. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 7dd2dd5) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…in SS
Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings.
This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string.
The new unit test.
Closes #24619 from zsxwing/SPARK-27735.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 6a317c8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
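For context, a hedged sketch of one Structured Streaming API that parses such an interval string; `streamDf` is an assumed streaming DataFrame and the trigger value is illustrative:
```scala
import org.apache.spark.sql.streaming.Trigger

// With case-insensitive parsing, an upper-case unit such as "10 SECONDS"
// is accepted the same way as "10 seconds".
val query = streamDf.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 SECONDS"))
  .start()
```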
…rollup, grouping and grouping_id) ## What changes were proposed in this pull request? Both look added as of 2.0 (see SPARK-12541 and SPARK-12706). I referred existing docs and examples in other API docs. ## How was this patch tested? Manually built the documentation and, by running examples, by running `DESCRIBE FUNCTION EXTENDED`. Closes #24642 from HyukjinKwon/SPARK-27771. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2431ab0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request? Don't use internal Spark logging in user examples, because users shouldn't / can't use it directly anyway. These examples already use println in some cases. Note that the usage in StreamingExamples is on purpose. ## How was this patch tested? N/A Closes #24649 from srowen/ExampleLog. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit db24b04) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request? Documentation has an error, https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#hive-tables. The example: ```scala scala> val dataDir = "/tmp/parquet_data" dataDir: String = /tmp/parquet_data scala> spark.range(10).write.parquet(dataDir) scala> sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'") res6: org.apache.spark.sql.DataFrame = [] scala> sql("SELECT * FROM hive_ints").show() +----+ | key| +----+ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| +----+ ``` Range does not emit `key`, but `id` instead. Closes #24657 from ScrapCodes/fix_hive_example. Lead-authored-by: Prashant Sharma <prashant@apache.org> Co-authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 5f4b505) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Minor version bump of Netty to patch reported CVE. Patches: https://www.cvedetails.com/cve/CVE-2019-16869/ No Compiled locally using `mvn clean install -DskipTests` Closes #26099 from Fokko/SPARK-29445. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit b5b1b69) Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request? This PR aims to update the validation check on `length` from `length >= 0` to `length >= -1` in order to allow set `-1` to keep the default value. ### Why are the changes needed? At Apache Spark 2.2.0, [SPARK-18702](https://github.com/apache/spark/pull/16133/files#diff-2c5519b1cf4308d77d6f12212971544fR27-R38) adds `class FileBlock` with the default `length` value, `-1`, initially. There is no way to set `filePath` only while keeping `length` is `-1`. ```scala def set(filePath: String, startOffset: Long, length: Long): Unit = { require(filePath != null, "filePath cannot be null") require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative") require(length >= 0, s"length ($length) cannot be negative") inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length)) } ``` For compressed files (like GZ), the size of split can be set to -1. This was allowed till Spark 2.1 but regressed starting with spark 2.2.x. Please note that split length of -1 also means the length was unknown - a valid scenario. Thus, split length of -1 should be acceptable like pre Spark 2.2. ### Does this PR introduce any user-facing change? No ### How was this patch tested? This is updating the corner case on the requirement check. Manually check the code. Closes #26123 from praneetsharma/fix-SPARK-27259. Authored-by: prasha2 <prasha22@mail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 57edb42) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
# What changes were proposed in this pull request? Backport of #26093 to `branch-2.4` ### Why are the changes needed? https://issues.apache.org/jira/browse/SPARK-27812 https://issues.apache.org/jira/browse/SPARK-27927 We need this fix fabric8io/kubernetes-client#1768 that was released on version 4.6 of the client. The root cause of the problem is better explained in #25785 ### Does this PR introduce any user-facing change? No ### How was this patch tested? This patch was tested manually using a simple pyspark job ```python from pyspark.sql import SparkSession if __name__ == '__main__': spark = SparkSession.builder.getOrCreate() ``` The expected behaviour of this "job" is that both python's and jvm's process exit automatically after the main runs. This is the case for spark versions <= 2.4. On version 2.4.3, the jvm process hangs because there's a non daemon thread running ``` "OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000] "OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000] ``` This is caused by a bug on `kubernetes-client` library, which is fixed on the version that we are upgrading to. When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4.0 is restored and both processes terminate successfully Closes #26152 from igorcalabria/k8s-client-update-2.4. Authored-by: igor.calabria <igor.calabria@ubee.in> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
… string to timestamp ### What changes were proposed in this pull request? * Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':' * Added a test to make sure this works. ### Why are the changes needed? In a couple of scenarios while converting from String to Timestamp `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test in the `DateTimeTestUtils` suite to test if my fix works. Closes #26143 from rahulsmahadev/SPARK-29494. Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com> Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 4cfce3e) Signed-off-by: Sean Owen <sean.owen@databricks.com>
…nverting string to timestamp" This reverts commit 4d476ed.
…rting string to timestamp ### What changes were proposed in this pull request? * Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':' * Added a test to make sure this works. ### Why are the changes needed? In a couple of scenarios while converting from String to Timestamp `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test in the `DateTimeTestUtils` suite to test if my fix works. Closes #26171 from rahulsmahadev/araryOB. Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
… older releases
### What changes were proposed in this pull request?
Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.
### Why are the changes needed?
If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested different paths and failures by commenting in/out parts of the script and modifying it directly.
Closes #25667 from srowen/SPARK-28963.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit df39855)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…rrorServlet ### What changes were proposed in this pull request? Don't include `$path` from user query in the error response. ### Why are the changes needed? The path could contain input that is then rendered as HTML in the error response. It's not clear whether it's exploitable, but better safe than sorry as the path info really isn't that important in this context. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #26211 from srowen/SPARK-29556. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 8009468) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This add `typesafe` bintray repo for `sbt-mima-plugin`. Since Oct 21, the following plugin causes [Jenkins failures](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/611/console ) due to the missing jar. - `branch-2.4`: `sbt-mima-plugin:0.1.17` is missing. - `master`: `sbt-mima-plugin:0.3.0` is missing. These versions of `sbt-mima-plugin` seems to be removed from the old repo. ``` $ rm -rf ~/.ivy2/ $ build/sbt scalastyle test:scalastyle ... [warn] :::::::::::::::::::::::::::::::::::::::::::::: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :::::::::::::::::::::::::::::::::::::::::::::: [warn] :: com.typesafe#sbt-mima-plugin;0.1.17: not found [warn] :::::::::::::::::::::::::::::::::::::::::::::: ``` No. Check `GitHub Action` linter result. This PR should pass. Or, manual check. (Note that Jenkins PR builder didn't fail until now due to the local cache.) Closes #26217 from dongjoon-hyun/SPARK-29560. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f23c5d7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? We shall have a new mechanism that the downstream operators may notify its parents that they may release the output data stream. In this PR, we implement the mechanism as below: - Add function named `cleanupResources` in SparkPlan, which default call children's `cleanupResources` function, the operator which need a resource cleanup should rewrite this with the self cleanup and also call `super.cleanupResources`, like SortExec in this PR. - Add logic support on the trigger side, in this PR is SortMergeJoinExec, which make sure and call the `cleanupResources` to do the cleanup job for all its upstream(children) operator. ### Why are the changes needed? Bugfix for SortMergeJoin memory leak, and implement a general framework for SparkPlan resource cleanup. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT: Add new test suite JoinWithResourceCleanSuite to check both standard and code generation scenario. Integrate Test: Test with driver/executor default memory set 1g, local mode 10 thread. The below test(thanks taosaildrone for providing this test [here](#23762 (comment))) will pass with this PR. ``` from pyspark.sql.functions import rand, col spark.conf.set("spark.sql.join.preferSortMergeJoin", "true") spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) r1 = spark.range(1, 1001).select(col("id").alias("timestamp1")) r1 = r1.withColumn('value', rand()) r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2")) r2 = r2.withColumn('value2', rand()) joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner") joined = joined.coalesce(1) joined.explain() joined.show() ``` Closes #26210 from xuanyuanking/SPARK-21492-backport. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…database style iterator ### What changes were proposed in this pull request? Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base. ### Why are the changes needed? During the job in #26164, after involving a var `isReleased` in `hasNext`, there's possible that `isReleased` is false when calling `hasNext`, but it becomes true before calling `next`. A safer way is using database-style iterator: `advanceNext` and `getRow`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #26229 from xuanyuanking/SPARK-21492-follow-up. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9e77d48) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the requirement of fetch_size >= 0 from JDBCOptions to allow a negative fetch size.
### Why are the changes needed?
Namely, to allow data to be fetched in a streaming manner (row-by-row fetch) against a MySQL database.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test (JDBCSuite)
This closes #26230.
Closes #26244 from fuwhu/SPARK-21287-FIX.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 92b2529)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
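For context, a hedged sketch of the streaming-fetch pattern this change enables against MySQL; the connection details are placeholders, and `Int.MinValue` follows the MySQL Connector/J convention for requesting row-by-row result streaming:
```scala
// Assumes a spark-shell session (`spark` in scope) with the MySQL JDBC
// driver on the classpath. With the >= 0 requirement removed, the negative
// fetch size is passed through to the driver unchanged.
val bigTable = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder
  .option("dbtable", "big_table")                   // placeholder
  .option("user", "app")                            // placeholder
  .option("password", "secret")                     // placeholder
  .option("fetchsize", Int.MinValue.toString)       // -2147483648: stream rows one by one
  .load()
```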
### What changes were proposed in this pull request? SparkSession.sql() method parse process not under current sparksession's conf, so some configuration about parser is not valid in multi-thread situation. In this pr, we add a SQLConf parameter to AbstractSqlParser and initial it with SessionState's conf. Then for each SparkSession's parser process. It will use's it's own SessionState's SQLConf and to be thread safe ### Why are the changes needed? Fix bug ### Does this PR introduce any user-facing change? NO ### How was this patch tested? NO Closes #26240 from AngersZhuuuu/SPARK-29530-V2.4. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…he table's ownership ### What changes were proposed in this pull request? This PR backport #26160 to branch-2.4. ### Why are the changes needed? Backport from master. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? unit test Closes #26248 from wangyum/SPARK-29498-branch-2.4. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…s " for big endian architecture continuation to #24861 As mentioned by srowen, added changes specific to s390x on branch-2.4. > The change is in the master branch so it will be released with Spark 3.0 right now. It is not in branch-2.4 or others. I'm saying it's fine to open a PR for 2.4 too, as it needs a very slightly different change, and then it could be part of 2.4. Yes, you need a new PR. This one is merged and does not cherry-pick into 2.4. Please review. Closes #26254 from vibhutisawant/branch-2.4. Authored-by: vibhutisawant <41043754+vibhutisawant@users.noreply.github.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
…in/tini` ### What changes were proposed in this pull request? Change entrypoint.sh script for the kubernetes manager image to reference /usr/sbin/tini ### Why are the changes needed? This makes running commands like /bin/bash via pass-through work. This was missing from #26046 ### Does this PR introduce any user-facing change? It makes pass-through work. ### How was this patch tested? I built an image and verified that the following worked: `docker run -it --rm image:version /bin/bash` Closes #26296 from jkleckner/fix-pass-through-38938. Authored-by: Jim Kleckner <jim@cloudphysics.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Starting from Spark 2.3, the SHS REST API endpoint `/applications/<app_id>/jobs/` does not include `description` in the returned JobData. Up to and including Spark 2.2, it was included.
In this PR I've added the mentioned field.
Yes.
Old API response:
```
[ {
"jobId" : 0,
"name" : "foreach at <console>:26",
"submissionTime" : "2019-10-28T12:41:54.301GMT",
"completionTime" : "2019-10-28T12:41:54.731GMT",
"stageIds" : [ 0 ],
"jobGroup" : "test",
"status" : "SUCCEEDED",
"numTasks" : 1,
"numActiveTasks" : 0,
"numCompletedTasks" : 1,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numKilledTasks" : 0,
"numCompletedIndices" : 1,
"numActiveStages" : 0,
"numCompletedStages" : 1,
"numSkippedStages" : 0,
"numFailedStages" : 0,
"killedTasksSummary" : { }
} ]
```
New API response:
```
[ {
"jobId" : 0,
"name" : "foreach at <console>:26",
"description" : "job", <= This is the addition here
"submissionTime" : "2019-10-28T13:37:24.107GMT",
"completionTime" : "2019-10-28T13:37:24.613GMT",
"stageIds" : [ 0 ],
"jobGroup" : "test",
"status" : "SUCCEEDED",
"numTasks" : 1,
"numActiveTasks" : 0,
"numCompletedTasks" : 1,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numKilledTasks" : 0,
"numCompletedIndices" : 1,
"numActiveStages" : 0,
"numCompletedStages" : 1,
"numSkippedStages" : 0,
"numFailedStages" : 0,
"killedTasksSummary" : { }
} ]
```
Extended + existing unit tests.
Manually:
* Open spark-shell
```
scala> sc.setJobGroup("test", "job", false);
scala> val foo = sc.textFile("/user/foo.txt");
foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> foo.foreach(println);
```
* Access REST API `http://SHS-host:port/api/v1/applications/<app-id>/jobs/`
Closes #26295 from gaborgsomogyi/SPARK-29637.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 9c817a8)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…lint-python ### What changes were proposed in this pull request? Currently, GitHub Action on `branch-2.4` is broken because `branch-2.4` is incompatible with Python 3.7. This PR aims to recover the GitHub Action `lint-python` first. - https://github.com/apache/spark/commits/branch-2.4 ### Why are the changes needed? This recovers GitHub Action for the other PRs and commits. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The GitHub Action on this PR should passed. Closes #26308 from dongjoon-hyun/GHA-PYTHON-3.7. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…zing HiveClient in SparkSQLEnv
### What changes were proposed in this pull request?
This patch fixes the issue that external listeners are not initialized properly when `spark.sql.hive.metastore.jars` is set to either "maven" or custom list of jar.
("builtin" is not a case here - all jars in Spark classloader are also available in separate classloader)
The culprit is lazy initialization (lazy val or passing builder function) & thread context classloader. HiveClient leverages IsolatedClientLoader to properly load Hive and relevant libraries without issue - to not mess up with Spark classpath it uses separate classloader with leveraging thread context classloader.
But there's a messed-up case - SessionState is being initialized while HiveClient changed the thread context classloader from Spark classloader to Hive isolated one, and streaming query listeners are loaded from changed classloader while initializing SessionState.
This patch forces initializing SessionState in SparkSQLEnv to avoid such case.
### Why are the changes needed?
ClassNotFoundException could occur in spark-sql with specific configuration, as explained above.
### Does this PR introduce any user-facing change?
No, as I don't think end users assume the classloader of external listeners is only containing jars for Hive client.
### How was this patch tested?
New UT added which fails on master branch and passes with the patch.
The error message with master branch when running UT:
```
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:147)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:137)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$2(SparkSQLEnvSuite.scala:44)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.withSystemProperties(SparkSQLEnvSuite.scala:61)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$1(SparkSQLEnvSuite.scala:43)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite.run(Suite.scala:1124)
at org.scalatest.Suite.run$(Suite.scala:1106)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1349)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1343)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1343)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1033)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1011)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1509)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1011)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:133)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:27)
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1054)
at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:156)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:154)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:151)
at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:105)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:105)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:164)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:127)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:300)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:421)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:314)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
... 58 more
Caused by: java.lang.ClassNotFoundException: test.custom.listener.DummyQueryExecutionListener
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:206)
at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2746)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2744)
at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1(QueryExecutionListener.scala:83)
at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1$adapted(QueryExecutionListener.scala:82)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:82)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$listenerManager$2(BaseSessionStateBuilder.scala:293)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:320)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1051)
... 80 more
```
Closes #26316 from HeartSaVioR/SPARK-29604-branch-2.4.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
`release-build.sh` fail to publish release under dry run mode with the following error message: ``` /opt/spark-rm/release-build.sh: line 429: pushd: spark-repo-g4MBm/org/apache/spark: No such file or directory ``` We need to at least run the `mvn clean install` command once to create the `$tmp_repo` path, but now those steps are all skipped under dry-run mode. This PR fixes the issue. Tested locally. Closes #26329 from jiangxb1987/dryrun. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 155a67d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? In the PR, I propose to extract parsing of the seconds interval units to the private method `parseNanos` in `CalendarInterval` and modify the code to correctly parse the fractional part of the seconds unit of intervals in the cases: - When the fractional part has less than 9 digits - The seconds unit is negative This is a back port of the commit 3206a99. ### Why are the changes needed? The changes are needed to fix the issues: ```sql spark-sql> select interval 10.123456 seconds; interval 10 seconds 123 microseconds ``` The correct result must be `interval 10 seconds 123 milliseconds 456 microseconds` ```sql spark-sql> select interval -10.123456789 seconds; interval -9 seconds -876 milliseconds -544 microseconds ``` but the whole interval should be negated, and the result must be `interval -10 seconds -123 milliseconds -456 microseconds`, taking into account the truncation to microseconds. ### Does this PR introduce any user-facing change? Yes. After changes: ```sql spark-sql> select interval 10.123456 seconds; interval 10 seconds 123 milliseconds 456 microseconds spark-sql> select interval -10.123456789 seconds; interval -10 seconds -123 milliseconds -456 microseconds ``` ### How was this patch tested? By existing test suite, `literals.sql` and new tests in `ExpressionParserSuite`. Closes #26355 from MaxGekk/fix-interval-nanos-parsing-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Upgrading the amazon-kinesis-client dependency to 1.12.0. ### Why are the changes needed? The current amazon-kinesis-client version is 1.8.10. This version depends on the use of `describeStream`, which has a hard limit on an AWS account (10 reqs / second). Versions 1.9.0 and up leverage `listShards`, which has no such limit. For large customers, this can be a major problem. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #26333 from etspaceman/kclUpgrade. Authored-by: Eric Meisel <eric.steven.meisel@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit be022d9) Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request? This PR aims to remove `check-cran` from `run-tests.sh`. We had better add an independent Jenkins job to run `check-cran`. ### Why are the changes needed? CRAN instability has been a blocker for our daily dev process. The following simple check causes consecutive failures in 4 of 9 Jenkins jobs + PR builder. ``` * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] ``` - spark-branch-2.4-test-sbt-hadoop-2.6 - spark-branch-2.4-test-sbt-hadoop-2.7 - spark-master-test-sbt-hadoop-2.7 - spark-master-test-sbt-hadoop-3.2 - PRBuilder ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Currently, PR builder is failing due to the above issue. This PR should pass the Jenkins. Closes #26375 from dongjoon-hyun/SPARK-24152. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 91d9901) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? This reverts commit 91d9901. ### Why are the changes needed? CRAN check is pretty important for R package, we should enable it. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests. Closes #26381 from viirya/revert-SPARK-24152. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e726324) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Back-ports commit 1e1b730 from master.
### What changes were proposed in this pull request?
I propose that we change the example code documentation to call the proper function. For example, under the `foreachBatch` function, the example code was calling the `foreach()` function by mistake.
### Why are the changes needed?
I suppose it could confuse some people, and it is a typo.
### Does this PR introduce any user-facing change?
No, there is no "meaningful" code being changed, simply the documentation.
### How was this patch tested?
I made the changes on a fork, and had pushed to master earlier.
Closes #26363 from mstill3/patch-2.
Authored-by: Matt Stillwell <18670089+mstill3@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
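For context, a hedged sketch of `foreachBatch` (the method the corrected example should call); `streamDf` is an assumed streaming DataFrame, and the sink and checkpoint paths are placeholders:
```scala
import org.apache.spark.sql.DataFrame

// An explicitly typed function value helps avoid the Scala/Java overload
// ambiguity of foreachBatch on Scala 2.12.
val processBatch: (DataFrame, Long) => Unit = { (batchDf, batchId) =>
  // each micro-batch can be written with ordinary batch APIs
  batchDf.write.mode("append").parquet("/tmp/out")       // placeholder path
}

val query = streamDf.writeStream
  .foreachBatch(processBatch)
  .option("checkpointLocation", "/tmp/checkpoints")      // placeholder path
  .start()
```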
…ild is
`SampleExec` has a bug: it sets `needCopyResult` to false as long as the `withReplacement` parameter is false. This causes problems if its child needs to copy the result, e.g. a join.
This is needed to fix a correctness issue.
Yes, the result will be corrected.
A new test.
Closes #26387 from cloud-fan/sample-bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 326b789)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
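For context, a hedged sketch of the query shape affected (a sample whose child is a join); the data and column names are illustrative:
```scala
// Before this fix, sampling on top of a join could yield corrupted rows,
// because SampleExec skipped copying the rows produced by its join child.
val left  = spark.range(0, 1000).selectExpr("id", "id * 2 AS v1")
val right = spark.range(0, 1000).selectExpr("id", "id * 3 AS v2")

val sampled = left.join(right, "id").sample(withReplacement = false, fraction = 0.1)
sampled.show()
```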
Can one of the admins verify this patch?
Member
Please file a JIRA for an issue or ask a question to the mailing list.
When I submit my program on Spark 2.4.3, the broadcast cannot be destroyed. But when I submit it on CDH Spark 2 (version 2.1.0.cloudera2), it can be destroyed.
My code looks like this:
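(The reporter's snippet is not shown above; the sketch below is an illustrative reconstruction of the create / use / destroy pattern being described, not the actual code from the report.)
```scala
import org.apache.spark.sql.SparkSession

object BroadcastDestroyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-destroy-demo").getOrCreate()
    val sc = spark.sparkContext

    // create a broadcast variable and use it in a job
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val matched = sc.parallelize(Seq("a", "b", "c"))
      .filter(k => lookup.value.contains(k))
      .count()
    println(s"matched = $matched")

    // destroy() removes the broadcast's data and metadata from the driver and
    // executors; the variable cannot be used again afterwards (unlike
    // unpersist(), which only drops the cached copies on the executors).
    lookup.destroy()

    spark.stop()
  }
}
```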