
Branch 1.6 #21507

Closed
wants to merge 975 commits into from

Conversation

deepaksonu

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

cloud-fan and others added 30 commits February 3, 2016 16:13
…ld not fail analysis of encoder

nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check.

backport #11035 to 1.6
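A hedged illustration of the intended behavior (hypothetical data, assuming `sqlContext.implicits._` is in scope; not the patch's own test case):

```scala
case class Person(name: String, age: Int)   // Int is non-nullable in the target encoder

// The age column below is nullable. With this change, analysis succeeds and a
// runtime null check is inserted instead of failing with an AnalysisException.
val df = Seq(("a", Some(1)), ("b", Option.empty[Int])).toDF("name", "age")
val ds = df.as[Person]   // passes analysis
ds.collect()             // fails only at runtime, because row "b" really contains a null
```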

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11042 from cloud-fan/branch-1.6.
minor fix for api link in ml onevsrest

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11068 from hhbyyh/onevsrestDoc.

(cherry picked from commit c2c956b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…ot set but timeoutThreshold is defined

Check whether the state exists before calling `get`.
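A minimal sketch of the guarded pattern in a `mapWithState` function (simplified, hypothetical update logic):

```scala
import org.apache.spark.streaming.State

// When a key times out, state may never have been set for it, so state.get()
// would throw; guard the access with state.exists().
def trackCount(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  if (state.isTimingOut()) {
    if (state.exists()) Some((key, state.get())) else None
  } else {
    val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
    state.update(newCount)
    Some((key, newCount))
  }
}
```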

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11081 from zsxwing/SPARK-13195.

(cherry picked from commit 8e2f296)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Author: Bill Chambers <bill@databricks.com>

Closes #11094 from anabranch/dynamic-docs.

(cherry picked from commit 66e1383)
Signed-off-by: Andrew Or <andrew@databricks.com>
There is a bug when we try to grow the buffer: an OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, and this time it triggers spilling that frees the current page, so the record we just inserted becomes invalid.

The root cause is that the JVM has less free memory than the MemoryManager thought, so it OOMs when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.

Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing).
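A generic, hedged sketch of the recovery pattern described above (hypothetical `PageAllocator` API, not Spark's internal types):

```scala
// The JVM can have less free memory than the manager's accounting suggests,
// so a plain allocation may throw even though the request was granted.
trait PageAllocator {
  def allocate(size: Int): Array[Byte]          // may throw OutOfMemoryError
  def spillAndAllocate(size: Int): Array[Byte]  // frees cached pages first, then retries
}

def allocatePage(alloc: PageAllocator, size: Int): Array[Byte] =
  try alloc.allocate(size)
  catch {
    case _: OutOfMemoryError =>
      // Don't let the error escape (and don't rely on asserts the JVM may elide):
      // trigger spilling and acquire the memory again.
      alloc.spillAndAllocate(size)
  }
```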

Author: Davies Liu <davies@databricks.com>

Closes #11095 from davies/fix_expand.
…ters with Jackson 2.2.3

Patch to

1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
2. Use maven antrun to verify the JAR has the renamed classes

Being Maven-based, I don't know if the verification phase kicks in on an SBT/Jenkins build; it will on a `mvn install`.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.

(cherry picked from commit 34d0b70)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
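A tiny worked example of the difference, using hypothetical per-category class counts:

```scala
// Class counts for one categorical bin: 3 samples of class 0, 5 of class 1.
val stats = Array(3.0, 5.0)

val hard = stats.indexOf(stats.max)   // hard prediction: the majority label, here 1
val soft = stats(1) / stats.sum       // soft prediction: P(class 1) = 0.625

// Ordering bins by `soft` keeps the ranking information between categories
// that `hard` collapses down to just {0, 1}.
```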

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8734 from viirya/dt-soft-centroids.

(cherry picked from commit 9267bc6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
… SpecificParquetRecordReaderBase

This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11131 from JoshRosen/SPARK-12921-take-2.
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator

Author: raela <raela@databricks.com>

Closes #11158 from raelawang/master.

(cherry picked from commit 719973b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…e system besides HDFS

jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that, so it should support S3 besides HDFS. Can you review it when you have time? Thanks!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11151 from yu-iskw/SPARK-13265.

(cherry picked from commit efb65e0)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…n error

Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false

cc holdenk

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10962 from sethah/SPARK-13047.

(cherry picked from commit b354673)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…alue parameter

Fix this defect by checking whether a default value exists or not.

yanboliang Please help to review.

Author: Tommy YU <tummyyu@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.

(cherry picked from commit d3e2e20)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
… Windows

Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.

Is it worth considering also including this fix in any future 1.5.x releases (if any)?

I confirm this is my own original work and license it to the Spark project under its open source license.

Author: markpavey <mark.pavey@thefilter.com>

Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.

(cherry picked from commit 374c4b2)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…ailed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps.

Author: Amit Dev <amitdev@gmail.com>

Closes #11180 from amitdev/master.

(cherry picked from commit 331293c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…-guide

Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.

This contribution is my original work and I license the work to this project.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11199 from JeremyNixon/update_train_val_split_example.

(cherry picked from commit adb5483)
Signed-off-by: Sean Owen <sowen@cloudera.com>
There's a small typo in the SparseVector.parse docstring, which incorrectly says that it returns a DenseVector rather than a SparseVector.

Author: Miles Yucht <miles@databricks.com>

Closes #11213 from mgyucht/fix-sparsevector-docs.

(cherry picked from commit 827ed1c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
This commit removes an unnecessary duplicate check in addPendingTask that meant
that scheduling a task set took time proportional to (# tasks)^2.
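A hedged illustration of the complexity change (hypothetical structure names; in the real scheduler, duplicate entries are tolerated and filtered out when tasks are dequeued):

```scala
import scala.collection.mutable.ArrayBuffer

val pendingTasks = new ArrayBuffer[Int]

// Before: a linear scan on every insertion made scheduling O(#tasks^2) per task set.
def addPendingTaskOld(index: Int): Unit =
  if (!pendingTasks.contains(index)) pendingTasks += index

// After: just append; any duplicates are skipped later at dequeue time.
def addPendingTaskNew(index: Int): Unit =
  pendingTasks += index
```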

Author: Sital Kedia <skedia@fb.com>

Closes #11175 from sitalkedia/fix_stuck_driver.

(cherry picked from commit 1e1e31e)
Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
… default is "python2.7"

Author: Christopher C. Aycock <chris@chrisaycock.com>

Closes #11239 from chrisaycock/master.

(cherry picked from commit a7c74d7)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
…pares Option and String directly.

## What changes were proposed in this pull request?

Fix some comparisons between unequal types that cause IJ warnings and, in at least one case, a likely bug (TaskSetManager).

## How was this patch tested?

Running Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11253 from srowen/SPARK-13371.

(cherry picked from commit 7856253)
Signed-off-by: Andrew Or <andrew@databricks.com>
A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs.  The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms.  As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.

Author: Michael Armbrust <michael@databricks.com>

Closes #11308 from marmbrus/parquetWriteOOM.

(cherry picked from commit 173aa94)
Signed-off-by: Michael Armbrust <michael@databricks.com>
…ome special character

## What changes were proposed in this pull request?

When there are some special characters (e.g., `"`, `\`) in `label`, DAG will be broken. This patch just escapes `label` to avoid DAG being broken by some special characters
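A minimal sketch of the escaping idea (simplified; the real patch escapes labels emitted into the DAG visualization's DOT output):

```scala
// A label like  my "weird" \ label  can then be embedded in a quoted DOT
// attribute without terminating the string early.
def escapeLabel(label: String): String = label.flatMap {
  case '"'  => "\\\""   // "  becomes  \"
  case '\\' => "\\\\"   // \  becomes  \\
  case c    => c.toString
}
```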

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11309 from zsxwing/SPARK-13298.

(cherry picked from commit a11b399)
Signed-off-by: Andrew Or <andrew@databricks.com>
In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value is an instance of `SessionState`. See the exception below.

```
spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
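A heavily hedged sketch of the fix's idea (the actual change spans SparkSQLCLIDriver/ClientWrapper; this is only the reuse pattern):

```scala
import org.apache.hadoop.hive.cli.CliSessionState
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Reuse the CliSessionState the CLI driver already started instead of starting a
// second, plain SessionState, so that a later SessionState.get() still returns
// the CliSessionState that processCmd expects to cast to.
def sessionStateFor(hiveConf: HiveConf): SessionState = SessionState.get() match {
  case cli: CliSessionState => cli
  case _                    => SessionState.start(new SessionState(hiveConf))
}
```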

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9589 from adrian-wang/clicommand.

(cherry picked from commit 5d80fac)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
…false) fix for branch-1.6

https://issues.apache.org/jira/browse/SPARK-13359

Author: Earthson Lu <Earthson.Lu@gmail.com>

Closes #11237 from Earthson/SPARK-13359.
`GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We call it in LDA without validating this requirement, so it might introduce errors. Replacing it with `Graph.apply` would be safer and more proper because it is a public API. The tests still pass, so maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation), or the test cases are special. jkbradley ankurdave
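A minimal sketch of the substitution (the vertex/edge RDD names are stand-ins for LDA's actual RDDs):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

def build(vertices: RDD[(Long, Double)], edges: RDD[Edge[Double]]): Graph[Double, Double] =
  Graph(vertices, edges)   // public API; replaces GraphImpl.fromExistingRDDs(vertices, edges)
```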

Author: Xiangrui Meng <meng@databricks.com>

Closes #11226 from mengxr/SPARK-13355.

(cherry picked from commit 764ca18)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.

This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.

```python
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here : https://issues.apache.org/jira/browse/SPARK-13410

rxin

Author: Franklyn D'souza <franklynd@gmail.com>

Closes #11333 from damnMeddlingKid/udt-union-patch.
…q is not Serializable

## What changes were proposed in this pull request?

`scala.collection.Iterator`'s methods (e.g., map, filter) will return an `AbstractIterator` which is not Serializable. E.g.,
```scala
scala> val iter = Array(1, 2, 3).iterator.map(_ + 1)
iter: Iterator[Int] = non-empty iterator

scala> println(iter.isInstanceOf[Serializable])
false
```
If we call something like `Iterator.map(...).toSeq`, it will create a `Stream` that contains a non-serializable `AbstractIterator` field and make the `Stream` be non-serializable.

This PR uses `toArray` instead of `toSeq` to fix such issue in `def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`.
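A small sketch of why the change helps:

```scala
// `def` gives a fresh iterator each time (an iterator can only be consumed once).
def mapped = Array(1, 2, 3).iterator.map(_ + 1)

val asSeq   = mapped.toSeq     // a Stream that lazily holds the non-serializable iterator
val asArray = mapped.toArray   // eagerly materialized: a plain Array, safe to serialize
```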

## How was this patch tested?

Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11334 from zsxwing/SPARK-13390.
clockfly and others added 25 commits September 21, 2016 16:57
…ult on double value

## What changes were proposed in this pull request?

Remainder(%) expression's `eval()` returns an incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do the "%", and that loses precision.

This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted.

### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0

scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
|  a|
+---+
|0.0|
+---+
```

### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
|   a|
+----+
|-6.0|
+----+
```
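A plain-Scala illustration of where the precision goes, assuming Scala's `BigDecimal(Double)` routes through `Double.toString` (a shortest round-trip decimal string):

```scala
val d = -5083676433652386516D      // actually stored as the double -5083676433652386816

val exact      = d % 10                          // -6.0: remainder on the stored double
val viaDecimal = (BigDecimal(d) % 10).toDouble   // 0.0: the toString round-trip drops the
                                                 // low-order digits that decide the remainder
```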

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15171 from clockfly/SPARK-17617.

(cherry picked from commit 3977223)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…shed

This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds.

The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly.
…ng entire job (branch-1.6 backport)

This patch is a branch-1.6 backport of #15037:

## What changes were proposed in this pull request?

In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).

In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.

Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.
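A hedged sketch of the resulting control flow (simplified signatures, not Spark's exact code):

```scala
// getRemoteBytes-style reads now return None on fetch failure instead of throwing,
// so a failed read of a cached RDD block degrades to recomputation.
def getOrCompute(readLocal: () => Option[Array[Byte]],
                 readRemote: () => Option[Array[Byte]],
                 recompute: () => Array[Byte]): Array[Byte] =
  readLocal().orElse(readRemote()).getOrElse(recompute())
```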

## How was this patch tested?

Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.

I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15186 from JoshRosen/SPARK-17485-branch-1.6-backport.
…nousListenerBus (branch 1.6)

## What changes were proposed in this pull request?

Backport #15220 to 1.6.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15226 from zsxwing/SPARK-17649-branch-1.6.
… formats

## What changes were proposed in this pull request?

This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618).

The problem is that the validity of our row format conversion rules depends on operators which handle `unsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison.

Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design that coupled the row-format methods and metadata into a single method or class, so that all of them had to be overridden at the same time.

This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.
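A schematic sketch of the missing override (operator and member names simplified from the 1.6 `SparkPlan` internals):

```scala
// In Spark 1.6, a physical operator declares its row-format contract through
// overridable flags. #9024 set only the first one for coalesce():
abstract class PhysicalOp {
  def canProcessUnsafeRows: Boolean = false
  def outputsUnsafeRows: Boolean = false
}

class CoalesceOp extends PhysicalOp {
  override def canProcessUnsafeRows: Boolean = true
  override def outputsUnsafeRows: Boolean = true   // the override that was forgotten
}
```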

## How was this patch tested?

I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15185 from JoshRosen/SPARK-17618.
From the original commit message:

This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0).

Author: Charles Allen <charles@allen-net.com>

(cherry picked from commit 2eaeafe)

Author: Charles Allen <charles@allen-net.com>

Closes #15270 from vanzin/SPARK-17696.
…atrix with SparseVector

Backport PR of changes relevant to mllib only, but otherwise identical to #15296

jkbradley

Author: Bjarne Fruergaard <bwahlgreen@gmail.com>

Closes #15311 from bwahlgreen/bugfix-spark-17721-1.6.
This backports 733cbaa
to Branch 1.6. It's a pretty simple patch, and would be nice to have for Spark 1.6.3.

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15380 from brkyvz/bp-SPARK-15062.

Signed-off-by: Michael Armbrust <michael@databricks.com>
## What changes were proposed in this pull request?

This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default.
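A hedged usage sketch (the flag name is from this patch; it defaults to `true` here, so set it to `false` to surface corrupt-file errors):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("corrupt-files-demo")
  .set("spark.files.ignoreCorruptFiles", "false")  // fail loudly instead of skipping
val sc = new SparkContext(conf)
```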

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15454 from zsxwing/SPARK-17850-1.6.
…cala-2.11 repl

## What changes were proposed in this pull request?

Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" configuration, so user cannot set a fixed port number through "spark.replClassServer.port".

## How was this patch tested?

N/A

Author: jerryshao <sshao@hortonworks.com>

Closes #15253 from jerryshao/SPARK-17678.
…m empty string to interval type

## What changes were proposed in this pull request?
This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, the isNull variable is set to true.

Earlier, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for the reason mentioned above.
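A minimal sketch of the guard, assuming `CalendarInterval.fromString` returns null for unparseable input (as the description implies):

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Map a null parse result to SQL NULL instead of handing a raw null downstream.
def castToIntervalSafe(s: String): Option[CalendarInterval] =
  Option(CalendarInterval.fromString(s))   // None for "" rather than an NPE later
```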

## How was this patch tested?
Added test case in CastSuite.scala

jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884

Author: prigarg <prigarg@adobe.com>

Closes #15479 from priyankagargnitk/cast_empty_string_bug.
…ld not depends on local timezone

## What changes were proposed in this pull request?

Back-port of #13784 to `branch-1.6`

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #15554 from srowen/SPARK-16078.
…executor loss

## What changes were proposed in this pull request?

_This is the branch-1.6 version of #15986; the original description follows:_

This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails.

This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map.

Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here.

This patch's fix is to maintain a `executorId -> running task id` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss.

There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`, so I'd appreciate a very careful review of these changes.
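A hedged sketch of the new bookkeeping (simplified from `TaskSchedulerImpl`; `AnyRef` stands in for `TaskSetManager`):

```scala
import scala.collection.mutable

val taskIdToTaskSetManager = new mutable.HashMap[Long, AnyRef]
val executorIdToRunningTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]

def onTaskLaunched(execId: String, tid: Long, tsm: AnyRef): Unit = {
  taskIdToTaskSetManager(tid) = tsm
  executorIdToRunningTaskIds.getOrElseUpdate(execId, new mutable.HashSet[Long]) += tid
}

def removeExecutor(execId: String): Unit = {
  // Drop every taskId -> * mapping for tasks that were running on the lost executor,
  // so zombie TaskSetManagers (and the RDDs/ShuffleDependencies they reference)
  // become garbage-collectable.
  executorIdToRunningTaskIds.remove(execId)
    .foreach(_.foreach(taskIdToTaskSetManager.remove))
}
```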

## How was this patch tested?

I added a new unit test to `TaskSchedulerImplSuite`.

/cc kayousterhout and markhamstra, who reviewed #15986.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16070 from JoshRosen/fix-leak-following-total-executor-loss-1.6.
…functions

No tests done for JDBCRDD#compileFilter.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10409 from maropu/AddTestsInJdbcRdd.

(cherry picked from commit 8c1b867)

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16124 from dongjoon-hyun/SPARK-12446-BRANCH-1.6.
## What changes were proposed in this pull request?

This fix is related to the bug: https://issues.apache.org/jira/browse/SPARK-18372.
insertIntoHiveTable would generate a .staging directory, but this directory fails to be removed in the end.

This is a backport from the Spark 2.0.x code, and is related to PR #12770

## How was this patch tested?
manual tests

Author: Mingjie Tang <mtang@hortonworks.com>

Author: Mingjie Tang <mtang@hortonworks.com>
Author: Mingjie Tang <mtang@HW12398.local>

Closes #15819 from merlintang/branch-1.6.
The Hive client library is not smart enough to notice that the current
user is a proxy user; so when using a proxy user, it fails to fetch
delegation tokens from the metastore because of a missing kerberos
TGT for the current user.

To fix it, just run the code that fetches the delegation token as the
real logged in user.

Tested on a kerberos cluster both submitting normally and with a proxy
user; Hive and HBase tokens are retrieved correctly in both cases.
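A hedged sketch of the doAs pattern described above (`fetchToken` stands in for the actual Hive metastore delegation-token call):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def obtainDelegationToken(fetchToken: () => String): String = {
  val currentUser = UserGroupInformation.getCurrentUser
  // getRealUser is non-null only when currentUser is a proxy; the real user is
  // the one that actually holds the kerberos TGT.
  val tokenUser = Option(currentUser.getRealUser).getOrElse(currentUser)
  tokenUser.doAs(new PrivilegedExceptionAction[String] {
    override def run(): String = fetchToken()
  })
}
```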

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11358 from vanzin/SPARK-13478.

(cherry picked from commit c7fccb5)

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16665 from vanzin/SPARK-13478_1.6.
## What changes were proposed in this pull request?

This PR backports PR #16866 to branch-1.6

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #16917 from liancheng/spark-19529-1.6-backport.
… beyond 64 KB

## What changes were proposed in this pull request?

This is a backport pr of #15480 into `branch-1.6`.

## How was this patch tested?

Existing tests.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17158 from ueshin/issues/SPARK-16845_1.6.
…e` and port cloudpickle changes for PySpark to work with Python 3.6.0

## What changes were proposed in this pull request?

This PR proposes to backports #16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x.

## How was this patch tested?

Manually, via

```
./run-tests --python-executables=python3.6
```

```
Finished test(python3.6): pyspark.conf (5s)
Finished test(python3.6): pyspark.broadcast (7s)
Finished test(python3.6): pyspark.accumulators (9s)
Finished test(python3.6): pyspark.rdd (16s)
Finished test(python3.6): pyspark.shuffle (0s)
Finished test(python3.6): pyspark.serializers (11s)
Finished test(python3.6): pyspark.profiler (5s)
Finished test(python3.6): pyspark.context (21s)
Finished test(python3.6): pyspark.ml.clustering (12s)
Finished test(python3.6): pyspark.ml.feature (16s)
Finished test(python3.6): pyspark.ml.classification (16s)
Finished test(python3.6): pyspark.ml.recommendation (16s)
Finished test(python3.6): pyspark.ml.tuning (14s)
Finished test(python3.6): pyspark.ml.regression (16s)
Finished test(python3.6): pyspark.ml.evaluation (12s)
Finished test(python3.6): pyspark.ml.tests (17s)
Finished test(python3.6): pyspark.mllib.classification (18s)
Finished test(python3.6): pyspark.mllib.evaluation (12s)
Finished test(python3.6): pyspark.mllib.feature (19s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.fpm (12s)
Finished test(python3.6): pyspark.mllib.clustering (31s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (17s)
Finished test(python3.6): pyspark.mllib.recommendation (23s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.stat._statistics (13s)
Finished test(python3.6): pyspark.mllib.regression (22s)
Finished test(python3.6): pyspark.mllib.util (9s)
Finished test(python3.6): pyspark.mllib.tree (14s)
Finished test(python3.6): pyspark.sql.types (9s)
Finished test(python3.6): pyspark.sql.context (16s)
Finished test(python3.6): pyspark.sql.column (14s)
Finished test(python3.6): pyspark.sql.group (16s)
Finished test(python3.6): pyspark.sql.dataframe (25s)
Finished test(python3.6): pyspark.tests (164s)
Finished test(python3.6): pyspark.sql.window (6s)
Finished test(python3.6): pyspark.sql.functions (19s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.readwriter (24s)
Finished test(python3.6): pyspark.sql.tests (38s)
Finished test(python3.6): pyspark.mllib.tests (133s)
Finished test(python3.6): pyspark.streaming.tests (189s)
Tests passed in 380 seconds
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17375 from HyukjinKwon/SPARK-19019-backport-1.6.
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manually tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think the result will be the same.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <shi.indetail@gmail.com>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

kiszk commented Jun 7, 2018

@deepaksonu Would it be possible to close this PR?

HyukjinKwon commented

ping @deepaksonu close this PR please.

AmplabJenkins commented

Can one of the admins verify this patch?

srowen mentioned this pull request Jul 3, 2018
asfgit closed this in 5bf95f2 Jul 4, 2018