
Branch 1.6 #21507

Closed
wants to merge 975 commits into from

Conversation

deepaksonu

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

cloud-fan and others added 30 commits February 3, 2016 16:13
…ld not fail analysis of encoder

nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check.

backport #11035 to 1.6
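A hedged illustration of the intended behavior (hypothetical data, assuming `sqlContext.implicits._` is in scope; not the patch's own test case):

```scala
case class Person(name: String, age: Int)   // Int is non-nullable in the target encoder

// The age column below is nullable. With this change, analysis succeeds and a
// runtime null check is inserted instead of failing with an AnalysisException.
val df = Seq(("a", Some(1)), ("b", Option.empty[Int])).toDF("name", "age")
val ds = df.as[Person]   // passes analysis
ds.collect()             // fails only at runtime, because row "b" really contains a null
```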

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11042 from cloud-fan/branch-1.6.
minor fix for api link in ml onevsrest

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11068 from hhbyyh/onevsrestDoc.

(cherry picked from commit c2c956b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…ot set but timeoutThreshold is defined

Check whether the state exists before calling `get`.
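A minimal sketch of the guarded pattern in a `mapWithState` function (simplified, hypothetical update logic):

```scala
import org.apache.spark.streaming.State

// When a key times out, state may never have been set for it, so state.get()
// would throw; guard the access with state.exists().
def trackCount(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  if (state.isTimingOut()) {
    if (state.exists()) Some((key, state.get())) else None
  } else {
    val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
    state.update(newCount)
    Some((key, newCount))
  }
}
```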

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11081 from zsxwing/SPARK-13195.

(cherry picked from commit 8e2f296)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Author: Bill Chambers <bill@databricks.com>

Closes #11094 from anabranch/dynamic-docs.

(cherry picked from commit 66e1383)
Signed-off-by: Andrew Or <andrew@databricks.com>
There is a bug when we try to grow the buffer: an OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, and this time it triggers spilling that frees the current page, so the record we just inserted becomes invalid.

The root cause is that the JVM has less free memory than the MemoryManager thought, so it OOMs when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.

Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing).
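A generic, hedged sketch of the recovery pattern described above (hypothetical `PageAllocator` API, not Spark's internal types):

```scala
// The JVM can have less free memory than the manager's accounting suggests,
// so a plain allocation may throw even though the request was granted.
trait PageAllocator {
  def allocate(size: Int): Array[Byte]          // may throw OutOfMemoryError
  def spillAndAllocate(size: Int): Array[Byte]  // frees cached pages first, then retries
}

def allocatePage(alloc: PageAllocator, size: Int): Array[Byte] =
  try alloc.allocate(size)
  catch {
    case _: OutOfMemoryError =>
      // Don't let the error escape (and don't rely on asserts the JVM may elide):
      // trigger spilling and acquire the memory again.
      alloc.spillAndAllocate(size)
  }
```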

Author: Davies Liu <davies@databricks.com>

Closes #11095 from davies/fix_expand.
…ters with Jackson 2.2.3

Patch to

1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
2. Use maven antrun to verify the JAR has the renamed classes

Being Maven-based, I don't know if the verification phase kicks in on an SBT/Jenkins build; it will on a `mvn install`.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.

(cherry picked from commit 34d0b70)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
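A tiny worked example of the difference, using hypothetical per-category class counts:

```scala
// Class counts for one categorical bin: 3 samples of class 0, 5 of class 1.
val stats = Array(3.0, 5.0)

val hard = stats.indexOf(stats.max)   // hard prediction: the majority label, here 1
val soft = stats(1) / stats.sum       // soft prediction: P(class 1) = 0.625

// Ordering bins by `soft` keeps the ranking information between categories
// that `hard` collapses down to just {0, 1}.
```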

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8734 from viirya/dt-soft-centroids.

(cherry picked from commit 9267bc6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
… SpecificParquetRecordReaderBase

This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11131 from JoshRosen/SPARK-12921-take-2.
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator

Author: raela <raela@databricks.com>

Closes #11158 from raelawang/master.

(cherry picked from commit 719973b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…e system besides HDFS

jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that, so it should support S3 besides HDFS. Can you review it when you have time? Thanks!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11151 from yu-iskw/SPARK-13265.

(cherry picked from commit efb65e0)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…n error

Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false

cc holdenk

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10962 from sethah/SPARK-13047.

(cherry picked from commit b354673)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…alue parameter

Fix this defect by checking whether a default value exists or not.

yanboliang Please help to review.

Author: Tommy YU <tummyyu@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.

(cherry picked from commit d3e2e20)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
… Windows

Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.

Is it worth considering also including this fix in any future 1.5.x releases (if any)?

I confirm this is my own original work and license it to the Spark project under its open source license.

Author: markpavey <mark.pavey@thefilter.com>

Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.

(cherry picked from commit 374c4b2)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…ailed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps.

Author: Amit Dev <amitdev@gmail.com>

Closes #11180 from amitdev/master.

(cherry picked from commit 331293c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…-guide

Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.

This contribution is my original work and I license the work to this project.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11199 from JeremyNixon/update_train_val_split_example.

(cherry picked from commit adb5483)
Signed-off-by: Sean Owen <sowen@cloudera.com>
There's a small typo in the SparseVector.parse docstring, which incorrectly says that it returns a DenseVector rather than a SparseVector.

Author: Miles Yucht <miles@databricks.com>

Closes #11213 from mgyucht/fix-sparsevector-docs.

(cherry picked from commit 827ed1c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
This commit removes an unnecessary duplicate check in addPendingTask that meant
that scheduling a task set took time proportional to (# tasks)^2.
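A hedged illustration of the complexity change (hypothetical structure names; in the real scheduler, duplicate entries are tolerated and filtered out when tasks are dequeued):

```scala
import scala.collection.mutable.ArrayBuffer

val pendingTasks = new ArrayBuffer[Int]

// Before: a linear scan on every insertion made scheduling O(#tasks^2) per task set.
def addPendingTaskOld(index: Int): Unit =
  if (!pendingTasks.contains(index)) pendingTasks += index

// After: just append; any duplicates are skipped later at dequeue time.
def addPendingTaskNew(index: Int): Unit =
  pendingTasks += index
```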

Author: Sital Kedia <skedia@fb.com>

Closes #11175 from sitalkedia/fix_stuck_driver.

(cherry picked from commit 1e1e31e)
Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
… default is "python2.7"

Author: Christopher C. Aycock <chris@chrisaycock.com>

Closes #11239 from chrisaycock/master.

(cherry picked from commit a7c74d7)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
…pares Option and String directly.

## What changes were proposed in this pull request?

Fix some comparisons between unequal types that cause IJ warnings and, in at least one case, a likely bug (TaskSetManager).

## How was this patch tested?

Running Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11253 from srowen/SPARK-13371.

(cherry picked from commit 7856253)
Signed-off-by: Andrew Or <andrew@databricks.com>
A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs.  The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms.  As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.

Author: Michael Armbrust <michael@databricks.com>

Closes #11308 from marmbrus/parquetWriteOOM.

(cherry picked from commit 173aa94)
Signed-off-by: Michael Armbrust <michael@databricks.com>
…ome special character

## What changes were proposed in this pull request?

When there are some special characters (e.g., `"`, `\`) in `label`, DAG will be broken. This patch just escapes `label` to avoid DAG being broken by some special characters
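A minimal sketch of the escaping idea (simplified; the real patch escapes labels emitted into the DAG visualization's DOT output):

```scala
// A label like  my "weird" \ label  can then be embedded in a quoted DOT
// attribute without terminating the string early.
def escapeLabel(label: String): String = label.flatMap {
  case '"'  => "\\\""   // "  becomes  \"
  case '\\' => "\\\\"   // \  becomes  \\
  case c    => c.toString
}
```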

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11309 from zsxwing/SPARK-13298.

(cherry picked from commit a11b399)
Signed-off-by: Andrew Or <andrew@databricks.com>
In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value is an instance of `SessionState`. See the exception below.

```
spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
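A heavily hedged sketch of the fix's idea (the actual change spans SparkSQLCLIDriver/ClientWrapper; this is only the reuse pattern):

```scala
import org.apache.hadoop.hive.cli.CliSessionState
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Reuse the CliSessionState the CLI driver already started instead of starting a
// second, plain SessionState, so that a later SessionState.get() still returns
// the CliSessionState that processCmd expects to cast to.
def sessionStateFor(hiveConf: HiveConf): SessionState = SessionState.get() match {
  case cli: CliSessionState => cli
  case _                    => SessionState.start(new SessionState(hiveConf))
}
```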

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9589 from adrian-wang/clicommand.

(cherry picked from commit 5d80fac)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
…false) fix for branch-1.6

https://issues.apache.org/jira/browse/SPARK-13359

Author: Earthson Lu <Earthson.Lu@gmail.com>

Closes #11237 from Earthson/SPARK-13359.
`GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We call it in LDA without validating this requirement, so it might introduce errors. Replacing it with `Graph.apply` would be safer and more proper because it is a public API. The tests still pass, so maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation), or the test cases are special. jkbradley ankurdave
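A minimal sketch of the substitution (the vertex/edge RDD names are stand-ins for LDA's actual RDDs):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

def build(vertices: RDD[(Long, Double)], edges: RDD[Edge[Double]]): Graph[Double, Double] =
  Graph(vertices, edges)   // public API; replaces GraphImpl.fromExistingRDDs(vertices, edges)
```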

Author: Xiangrui Meng <meng@databricks.com>

Closes #11226 from mengxr/SPARK-13355.

(cherry picked from commit 764ca18)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.

This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.

```python
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here : https://issues.apache.org/jira/browse/SPARK-13410

rxin

Author: Franklyn D'souza <franklynd@gmail.com>

Closes #11333 from damnMeddlingKid/udt-union-patch.
…q is not Serializable

## What changes were proposed in this pull request?

`scala.collection.Iterator`'s methods (e.g., map, filter) will return an `AbstractIterator` which is not Serializable. E.g.,
```scala
scala> val iter = Array(1, 2, 3).iterator.map(_ + 1)
iter: Iterator[Int] = non-empty iterator

scala> println(iter.isInstanceOf[Serializable])
false
```
If we call something like `Iterator.map(...).toSeq`, it will create a `Stream` that contains a non-serializable `AbstractIterator` field and make the `Stream` be non-serializable.

This PR uses `toArray` instead of `toSeq` to fix such issue in `def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`.
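A small sketch of why the change helps:

```scala
// `def` gives a fresh iterator each time (an iterator can only be consumed once).
def mapped = Array(1, 2, 3).iterator.map(_ + 1)

val asSeq   = mapped.toSeq     // a Stream that lazily holds the non-serializable iterator
val asArray = mapped.toArray   // eagerly materialized: a plain Array, safe to serialize
```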

## How was this patch tested?

Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11334 from zsxwing/SPARK-13390.
clockfly and others added 25 commits September 21, 2016 16:57
…ult on double value

## What changes were proposed in this pull request?

Remainder(%) expression's `eval()` returns an incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do the "%", and that loses precision.

This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted.

### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0

scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
|  a|
+---+
|0.0|
+---+
```

### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
|   a|
+----+
|-6.0|
+----+
```
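A plain-Scala illustration of where the precision goes, assuming Scala's `BigDecimal(Double)` routes through `Double.toString` (a shortest round-trip decimal string):

```scala
val d = -5083676433652386516D      // actually stored as the double -5083676433652386816

val exact      = d % 10                          // -6.0: remainder on the stored double
val viaDecimal = (BigDecimal(d) % 10).toDouble   // 0.0: the toString round-trip drops the
                                                 // low-order digits that decide the remainder
```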

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15171 from clockfly/SPARK-17617.

(cherry picked from commit 3977223)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…shed

This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds.

The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly.
…ng entire job (branch-1.6 backport)

This patch is a branch-1.6 backport of #15037:

## What changes were proposed in this pull request?

In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).

In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.

Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.
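A hedged sketch of the resulting control flow (simplified signatures, not Spark's exact code):

```scala
// getRemoteBytes-style reads now return None on fetch failure instead of throwing,
// so a failed read of a cached RDD block degrades to recomputation.
def getOrCompute(readLocal: () => Option[Array[Byte]],
                 readRemote: () => Option[Array[Byte]],
                 recompute: () => Array[Byte]): Array[Byte] =
  readLocal().orElse(readRemote()).getOrElse(recompute())
```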

## How was this patch tested?

Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.

I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15186 from JoshRosen/SPARK-17485-branch-1.6-backport.
…nousListenerBus (branch 1.6)

## What changes were proposed in this pull request?

Backport #15220 to 1.6.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15226 from zsxwing/SPARK-17649-branch-1.6.
… formats

## What changes were proposed in this pull request?

This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618).

The problem is that the validity of our row format conversion rules depends on operators which handle `unsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison.

Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design that coupled the row-format methods and metadata into a single method or class, so that all of them had to be overridden at the same time.

This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.
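A schematic sketch of the missing override (operator and member names simplified from the 1.6 `SparkPlan` internals):

```scala
// In Spark 1.6, a physical operator declares its row-format contract through
// overridable flags. #9024 set only the first one for coalesce():
abstract class PhysicalOp {
  def canProcessUnsafeRows: Boolean = false
  def outputsUnsafeRows: Boolean = false
}

class CoalesceOp extends PhysicalOp {
  override def canProcessUnsafeRows: Boolean = true
  override def outputsUnsafeRows: Boolean = true   // the override that was forgotten
}
```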

## How was this patch tested?

I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15185 from JoshRosen/SPARK-17618.
From the original commit message:

This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0).

Author: Charles Allen <charles@allen-net.com>

(cherry picked from commit 2eaeafe)

Author: Charles Allen <charles@allen-net.com>

Closes #15270 from vanzin/SPARK-17696.
…atrix with SparseVector

Backport PR of changes relevant to mllib only, but otherwise identical to #15296

jkbradley

Author: Bjarne Fruergaard <bwahlgreen@gmail.com>

Closes #15311 from bwahlgreen/bugfix-spark-17721-1.6.
This backports 733cbaa
to Branch 1.6. It's a pretty simple patch, and would be nice to have for Spark 1.6.3.

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15380 from brkyvz/bp-SPARK-15062.

Signed-off-by: Michael Armbrust <michael@databricks.com>
## What changes were proposed in this pull request?

This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default.
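A hedged usage sketch (the flag name is from this patch; it defaults to `true` here, so set it to `false` to surface corrupt-file errors):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("corrupt-files-demo")
  .set("spark.files.ignoreCorruptFiles", "false")  // fail loudly instead of skipping
val sc = new SparkContext(conf)
```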

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15454 from zsxwing/SPARK-17850-1.6.
…cala-2.11 repl

## What changes were proposed in this pull request?

Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" configuration, so user cannot set a fixed port number through "spark.replClassServer.port".

## How was this patch tested?

N/A

Author: jerryshao <sshao@hortonworks.com>

Closes #15253 from jerryshao/SPARK-17678.
…m empty string to interval type

## What changes were proposed in this pull request?
This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, the isNull variable is set to true.

Earlier, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for the reason mentioned above.
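A minimal sketch of the guard, assuming `CalendarInterval.fromString` returns null for unparseable input (as the description implies):

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Map a null parse result to SQL NULL instead of handing a raw null downstream.
def castToIntervalSafe(s: String): Option[CalendarInterval] =
  Option(CalendarInterval.fromString(s))   // None for "" rather than an NPE later
```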

## How was this patch tested?
Added test case in CastSuite.scala

jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884

Author: prigarg <prigarg@adobe.com>

Closes #15479 from priyankagargnitk/cast_empty_string_bug.
…ld not depends on local timezone

## What changes were proposed in this pull request?

Back-port of #13784 to `branch-1.6`

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #15554 from srowen/SPARK-16078.
…executor loss

## What changes were proposed in this pull request?

_This is the branch-1.6 version of #15986; the original description follows:_

This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails.

This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map.

Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here.

This patch's fix is to maintain a `executorId -> running task id` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss.

There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`, so I'd appreciate a very careful review of these changes.
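A hedged sketch of the new bookkeeping (simplified from `TaskSchedulerImpl`; `AnyRef` stands in for `TaskSetManager`):

```scala
import scala.collection.mutable

val taskIdToTaskSetManager = new mutable.HashMap[Long, AnyRef]
val executorIdToRunningTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]

def onTaskLaunched(execId: String, tid: Long, tsm: AnyRef): Unit = {
  taskIdToTaskSetManager(tid) = tsm
  executorIdToRunningTaskIds.getOrElseUpdate(execId, new mutable.HashSet[Long]) += tid
}

def removeExecutor(execId: String): Unit = {
  // Drop every taskId -> * mapping for tasks that were running on the lost executor,
  // so zombie TaskSetManagers (and the RDDs/ShuffleDependencies they reference)
  // become garbage-collectable.
  executorIdToRunningTaskIds.remove(execId)
    .foreach(_.foreach(taskIdToTaskSetManager.remove))
}
```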

## How was this patch tested?

I added a new unit test to `TaskSchedulerImplSuite`.

/cc kayousterhout and markhamstra, who reviewed #15986.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16070 from JoshRosen/fix-leak-following-total-executor-loss-1.6.
…functions

No tests done for JDBCRDD#compileFilter.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10409 from maropu/AddTestsInJdbcRdd.

(cherry picked from commit 8c1b867)

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16124 from dongjoon-hyun/SPARK-12446-BRANCH-1.6.
## What changes were proposed in this pull request?

This fix is related to the bug: https://issues.apache.org/jira/browse/SPARK-18372.
insertIntoHiveTable would generate a .staging directory, but this directory fails to be removed in the end.

This is a backport from the Spark 2.0.x code, and is related to PR #12770

## How was this patch tested?
manual tests

Author: Mingjie Tang <mtang@hortonworks.com>

Author: Mingjie Tang <mtang@hortonworks.com>
Author: Mingjie Tang <mtang@HW12398.local>

Closes #15819 from merlintang/branch-1.6.
The Hive client library is not smart enough to notice that the current
user is a proxy user; so when using a proxy user, it fails to fetch
delegation tokens from the metastore because of a missing kerberos
TGT for the current user.

To fix it, just run the code that fetches the delegation token as the
real logged in user.

Tested on a kerberos cluster both submitting normally and with a proxy
user; Hive and HBase tokens are retrieved correctly in both cases.
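A hedged sketch of the doAs pattern described above (`fetchToken` stands in for the actual Hive metastore delegation-token call):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def obtainDelegationToken(fetchToken: () => String): String = {
  val currentUser = UserGroupInformation.getCurrentUser
  // getRealUser is non-null only when currentUser is a proxy; the real user is
  // the one that actually holds the kerberos TGT.
  val tokenUser = Option(currentUser.getRealUser).getOrElse(currentUser)
  tokenUser.doAs(new PrivilegedExceptionAction[String] {
    override def run(): String = fetchToken()
  })
}
```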

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11358 from vanzin/SPARK-13478.

(cherry picked from commit c7fccb5)

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16665 from vanzin/SPARK-13478_1.6.
## What changes were proposed in this pull request?

This PR backports PR #16866 to branch-1.6

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #16917 from liancheng/spark-19529-1.6-backport.
… beyond 64 KB

## What changes were proposed in this pull request?

This is a backport pr of #15480 into `branch-1.6`.

## How was this patch tested?

Existing tests.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17158 from ueshin/issues/SPARK-16845_1.6.
…e` and port cloudpickle changes for PySpark to work with Python 3.6.0

## What changes were proposed in this pull request?

This PR proposes to backports #16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x.

## How was this patch tested?

Manually, via

```
./run-tests --python-executables=python3.6
```

```
Finished test(python3.6): pyspark.conf (5s)
Finished test(python3.6): pyspark.broadcast (7s)
Finished test(python3.6): pyspark.accumulators (9s)
Finished test(python3.6): pyspark.rdd (16s)
Finished test(python3.6): pyspark.shuffle (0s)
Finished test(python3.6): pyspark.serializers (11s)
Finished test(python3.6): pyspark.profiler (5s)
Finished test(python3.6): pyspark.context (21s)
Finished test(python3.6): pyspark.ml.clustering (12s)
Finished test(python3.6): pyspark.ml.feature (16s)
Finished test(python3.6): pyspark.ml.classification (16s)
Finished test(python3.6): pyspark.ml.recommendation (16s)
Finished test(python3.6): pyspark.ml.tuning (14s)
Finished test(python3.6): pyspark.ml.regression (16s)
Finished test(python3.6): pyspark.ml.evaluation (12s)
Finished test(python3.6): pyspark.ml.tests (17s)
Finished test(python3.6): pyspark.mllib.classification (18s)
Finished test(python3.6): pyspark.mllib.evaluation (12s)
Finished test(python3.6): pyspark.mllib.feature (19s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.fpm (12s)
Finished test(python3.6): pyspark.mllib.clustering (31s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (17s)
Finished test(python3.6): pyspark.mllib.recommendation (23s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.stat._statistics (13s)
Finished test(python3.6): pyspark.mllib.regression (22s)
Finished test(python3.6): pyspark.mllib.util (9s)
Finished test(python3.6): pyspark.mllib.tree (14s)
Finished test(python3.6): pyspark.sql.types (9s)
Finished test(python3.6): pyspark.sql.context (16s)
Finished test(python3.6): pyspark.sql.column (14s)
Finished test(python3.6): pyspark.sql.group (16s)
Finished test(python3.6): pyspark.sql.dataframe (25s)
Finished test(python3.6): pyspark.tests (164s)
Finished test(python3.6): pyspark.sql.window (6s)
Finished test(python3.6): pyspark.sql.functions (19s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.readwriter (24s)
Finished test(python3.6): pyspark.sql.tests (38s)
Finished test(python3.6): pyspark.mllib.tests (133s)
Finished test(python3.6): pyspark.streaming.tests (189s)
Tests passed in 380 seconds
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17375 from HyukjinKwon/SPARK-19019-backport-1.6.
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manually tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think the result will be the same.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <shi.indetail@gmail.com>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

kiszk commented Jun 7, 2018

@deepaksonu Would it be possible to close this PR?

HyukjinKwon commented

ping @deepaksonu close this PR please.

AmplabJenkins commented

Can one of the admins verify this patch?

srowen mentioned this pull request Jul 3, 2018
asfgit closed this in 5bf95f2 Jul 4, 2018