Branch 1.6 #11668 (Closed)

wants to merge 794 commits into from

This pull request is big! We’re only showing the most recent 250 commits.

Commits on Dec 11, 2015

  1. Commit 3e39925
  2. Commit 250249e
  3. [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up)

    This is a follow-up PR for #10259
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #10266 from davies/null_udf2.
    
    (cherry picked from commit c119a34)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    Davies Liu authored and davies committed Dec 11, 2015
    Commit eec3660
  4. Commit 23f8dfd
  5. Commit 2e45231
  6. [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
    
    * ```jsonFile``` should support multiple input files, such as:
    ```R
    jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments
    jsonFile(sqlContext, "path1,path2")
    ```
    * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0, so we mark ```jsonFile``` deprecated and use ```read.json``` on the SparkR side.
    * Replace all ```jsonFile``` with ```read.json``` in test_sparkSQL.R, but still keep the jsonFile test case.
    * If this PR is accepted, we should also make almost the same change for ```parquetFile```.
    
    cc felixcheung sun-rui shivaram
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10145 from yanboliang/spark-12146.
    
    (cherry picked from commit 0fb9825)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    yanboliang authored and shivaram committed Dec 11, 2015
    Commit f05bae4
  7. [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation

    Adding in Pipeline Import and Export Documentation.
    
    Author: anabranch <wac.chambers@gmail.com>
    Author: Bill Chambers <wchambers@ischool.berkeley.edu>
    
    Closes #10179 from anabranch/master.
    
    (cherry picked from commit aa305dc)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    bllchmbrs authored and jkbradley committed Dec 11, 2015
    Commit 2ddd104
  8. [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
    
    As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.
    
    This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, so once this is merged, the other can be rebased.
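    As a hedged illustration of the retag idea (`retagAs` is a hypothetical helper, not Spark's actual private method), re-applying a ClassTag to an erased RDD can look like this:

    ```scala
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Rebuild an RDD whose element type was erased to Object so that it carries
    // an explicit ClassTag[T]; the map(identity) re-captures the implicit tag.
    def retagAs[T: ClassTag](rdd: RDD[_]): RDD[T] =
      rdd.asInstanceOf[RDD[T]].map(identity)
    ```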
    
    cc holdenk
    
    Author: Mike Dusenberry <mwdusenb@us.ibm.com>
    
    Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
    
    (cherry picked from commit 1b82203)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    dusenberrymw authored and jkbradley committed Dec 11, 2015
    Commit bfcc8cf
  9. [SPARK-12217][ML] Document invalid handling for StringIndexer

    Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.
    
    I wonder if I should also add a snippet to the code example; input is welcome.
    
    Author: BenFradet <benjamin.fradet@gmail.com>
    
    Closes #10257 from BenFradet/SPARK-12217.
    
    (cherry picked from commit aea676c)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    BenFradet authored and jkbradley committed Dec 11, 2015
    Commit 75531c7

Commits on Dec 12, 2015

  1. [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py
    
    Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion.
    #9873 finished the work for the Scala example; here we focus on the Python one.
    Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```.
    BTW, fix minor missing issues of #9873.
    cc mengxr
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #9957 from yanboliang/SPARK-11978.
    
    (cherry picked from commit a0ff6d1)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    yanboliang authored and jkbradley committed Dec 12, 2015
    Commit c2f2046
  2. [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions

    Modifies the String overload to call the Column overload and ensures this is called in a test.
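    A hedged sketch of that delegation pattern (illustrative types, not the actual DataFrame source): the String overload forwards to the Column overload, so there is a single implementation and no accidental self-recursion.

    ```scala
    case class Column(name: String)

    class Frame {
      // The String overload delegates to the Column overload instead of calling itself.
      def sortWithinPartitions(sortCol: String, sortCols: String*): Frame =
        sortWithinPartitions((sortCol +: sortCols).map(Column(_)): _*)

      // The Column overload does the real work.
      def sortWithinPartitions(sortExprs: Column*): Frame = this
    }
    ```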
    
    Author: Ankur Dave <ankurdave@gmail.com>
    
    Closes #10271 from ankurdave/SPARK-12298.
    
    (cherry picked from commit 1e799d6)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    ankurdave authored and yhuai committed Dec 12, 2015
    Commit 03d8015
  3. [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
    
    The existing sample functions are missing the parameter `seed`, even though the corresponding function interface in `generics` has such a parameter. Thus, although the caller can pass a 'seed' to the function, we are not using the value.

    This could cause SparkR unit tests to fail. For example, I hit it in another PR:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #10160 from gatorsmile/sampleR.
    
    (cherry picked from commit 1e3526c)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    gatorsmile authored and shivaram committed Dec 12, 2015
    Commit 47461fe
  4. [SPARK-11193] Use Java ConcurrentHashMap instead of SynchronizedMap trait in order to avoid ClassCastException due to KryoSerializer in KinesisReceiver
    
    Author: Jean-Baptiste Onofré <jbonofre@apache.org>
    
    Closes #10203 from jbonofre/SPARK-11193.
    
    (cherry picked from commit 03138b6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    jbonofre authored and srowen committed Dec 12, 2015
    Commit 2679fce

Commits on Dec 13, 2015

  1. [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md

    https://issues.apache.org/jira/browse/SPARK-12199
    
    Follow-up PR of SPARK-11551. Fix some errors in ml-features.md
    
    mengxr
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #10193 from yinxusen/SPARK-12199.
    
    (cherry picked from commit 98b212d)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    yinxusen authored and jkbradley committed Dec 13, 2015
    Commit e05364b
  2. [SPARK-12267][CORE] Store the remote RpcEnv address to send the correct disconnection message
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10261 from zsxwing/SPARK-12267.
    
    (cherry picked from commit 8af2f8c)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Dec 13, 2015
    Commit d7e3bfd

Commits on Dec 14, 2015

  1. [SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook
    
    1. Make sure workers and masters exit so that no worker or master will still be running when triggering the shutdown hook.
    2. Set ExecutorState to FAILED if it's still RUNNING when executing the shutdown hook.
    
    This should fix the potential exceptions when exiting a local cluster
    ```
    java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
    	at scala.Predef$.assert(Predef.scala:179)
    	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
    	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    
    java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
    	at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
    	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
    	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
    	at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
    	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
    	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10269 from zsxwing/executor-state.
    
    (cherry picked from commit 2aecda2)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Dec 14, 2015
    Commit fbf16da
  2. [SPARK-12275][SQL] No plan for BroadcastHint in some condition

    When SparkStrategies.BasicOperators's "case BroadcastHint(child) => apply(child)" is hit, it only recursively invokes BasicOperators.apply with this "child". Many strategies therefore get no chance to process this plan, which probably leads to the "No plan" issue, so we use planLater to go through all strategies.
    
    https://issues.apache.org/jira/browse/SPARK-12275
    
    Author: yucai <yucai.yu@intel.com>
    
    Closes #10265 from yucai/broadcast_hint.
    
    (cherry picked from commit ed87f6d)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    yucai authored and yhuai committed Dec 14, 2015
    Commit 94ce502
  3. [MINOR][DOC] Fix broken word2vec link

    Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is.
    
    Author: BenFradet <benjamin.fradet@gmail.com>
    
    Closes #10282 from BenFradet/SPARK-12199.
    
    (cherry picked from commit e25f1fe)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    BenFradet authored and srowen committed Dec 14, 2015
    Commit c0f0f6c

Commits on Dec 15, 2015

  1. [SPARK-12327] Disable commented code lintr temporarily

    cc yhuai felixcheung shaneknapp
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #10300 from shivaram/comment-lintr-disable.
    
    (cherry picked from commit fb3778d)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    shivaram committed Dec 15, 2015
    Commit 352a0c8
  2. [STREAMING][MINOR] Fix typo in function name of StateImpl

    cc tdas zsxwing, please review. Thanks a lot.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #10305 from jerryshao/fix-typo-state-impl.
    
    (cherry picked from commit bc1ff9f)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    jerryshao authored and zsxwing committed Dec 15, 2015
    Commit 23c8846
  3. Update branch-1.6 for 1.6.0 release

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #10317 from marmbrus/versions.
    marmbrus committed Dec 15, 2015
    Commit 80d2617
  4. Commit 00a39d9
  5. Commit 08aa3b4

Commits on Dec 16, 2015

  1. [SPARK-12056][CORE] Part 2 Create a TaskAttemptContext only after calling setConf
    
    This is a continuation of SPARK-12056, where the change is applied to SqlNewHadoopRDD.scala.
    
    andrewor14
    FYI
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #10164 from tedyu/master.
    
    (cherry picked from commit f725b2e)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    tedyu authored and Andrew Or committed Dec 16, 2015
    Commit 9e4ac56
  2. [SPARK-12351][MESOS] Add documentation about submitting Spark with mesos cluster mode.
    
    Adding more documentation about submitting jobs with mesos cluster mode.
    
    Author: Timothy Chen <tnachen@gmail.com>
    
    Closes #10086 from tnachen/mesos_supervise_docs.
    
    (cherry picked from commit c2de99a)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    tnachen authored and Andrew Or committed Dec 16, 2015
    Commit 2c324d3
  3. [SPARK-9886][CORE] Fix to use ShutdownHookManager in ExternalBlockStore.scala
    
    Author: Naveen <naveenminchu@gmail.com>
    
    Closes #10313 from naveenminchu/branch-fix-SPARK-9886.
    
    (cherry picked from commit 8a215d2)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    naveenminchu authored and Andrew Or committed Dec 16, 2015
    Commit 8e9a600
  4. [SPARK-12062][CORE] Change Master to async rebuild UI when application completes
    
    This change builds the event history of completed apps asynchronously, so the RPC thread is not blocked and new workers can still register or be removed, even if the event log history is very large and takes a long time to rebuild.
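    A hedged sketch of the general approach (illustrative names, not the actual Master code): hand the expensive rebuild to a dedicated executor so the calling RPC thread returns immediately.

    ```scala
    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // Single-threaded pool: rebuilds are serialized but never block the caller.
    implicit val rebuildEc: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

    def asyncRebuildUI(appId: String): Future[Unit] = Future {
      // ... replay the (possibly huge) event log and rebuild the UI here ...
      println(s"rebuilt UI for $appId")
    }
    ```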
    
    Author: Bryan Cutler <bjcutler@us.ibm.com>
    
    Closes #10284 from BryanCutler/async-MasterUI-SPARK-12062.
    
    (cherry picked from commit c5b6b39)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    BryanCutler authored and Andrew Or committed Dec 16, 2015
    Commit 93095eb
  5. [SPARK-10477][SQL] using DSL in ColumnPruningSuite to improve readability
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8645 from cloud-fan/test.
    
    (cherry picked from commit a89e8b6)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    cloud-fan authored and Andrew Or committed Dec 16, 2015
    Commit fb08f7b
  6. [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation

    This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow.
    Credit goes to the original author Titan-C (mentioned in the NOTICE).
    
    Note that I am not a CSS expert, so I can only address comments up to some extent.
    
    Default view:
    <img width="936" alt="screen shot 2015-12-14 at 12 46 39 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png">
    
    When collapsed manually by the user:
    <img width="1004" alt="screen shot 2015-12-14 at 12 54 02 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png">
    
    Disappears when column is too narrow:
    <img width="697" alt="screen shot 2015-12-14 at 12 47 22 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png">
    
    Can still be opened by the user if necessary:
    <img width="651" alt="screen shot 2015-12-14 at 12 51 15 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png">
    
    Author: Timothy Hunter <timhunter@databricks.com>
    
    Closes #10297 from thunterdb/12324.
    
    (cherry picked from commit a6325fc)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    thunterdb authored and jkbradley committed Dec 16, 2015
    Commit a2d584e
  7. [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR

    Add ```write.json``` and ```write.parquet``` for SparkR, and deprecate ```saveAsParquetFile```.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10281 from yanboliang/spark-12310.
    
    (cherry picked from commit 22f6cd8)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    yanboliang authored and shivaram committed Dec 16, 2015
    Commit ac0e2ea
  8. [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml

    cc jkbradley
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #10244 from yu-iskw/SPARK-12215.
    
    (cherry picked from commit 26d70bd)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    yu-iskw authored and jkbradley committed Dec 16, 2015
    Commit 16edd93
  9. [SPARK-12318][SPARKR] Save mode in SparkR should be error by default

    shivaram  Please help review.
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #10290 from zjffdu/SPARK-12318.
    
    (cherry picked from commit 2eb5af5)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    zjffdu authored and shivaram committed Dec 16, 2015
    Commit f815127
  10. [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode.
    
    SPARK_HOME is now causing problems with Mesos cluster mode, since the spark-submit script has recently been changed so that spark-class scripts give precedence to SPARK_HOME if it's defined.
    
    We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead.
    
    Author: Timothy Chen <tnachen@gmail.com>
    
    Closes #10332 from tnachen/scheduler_ui.
    
    (cherry picked from commit ad8c1f0)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    tnachen authored and Andrew Or committed Dec 16, 2015
    Commit e5b8571
  11. [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
    
    This PR includes only the example code in order to finish it quickly.
    I'll send another PR for the docs soon.
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #9952 from yu-iskw/SPARK-6518.
    
    (cherry picked from commit 7b6dc29)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    yu-iskw authored and jkbradley committed Dec 16, 2015
    Commit e1adf6d
  12. Commit 168c89e
  13. Commit aee88eb
  14. [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6

    No known breaking changes, but some deprecations and changes of behavior.
    
    CC: mengxr
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #10235 from jkbradley/mllib-guide-update-1.6.
    
    (cherry picked from commit 8148cc7)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    jkbradley committed Dec 16, 2015
    Commit dffa610
  15. [SPARK-12364][ML][SPARKR] Add ML example for SparkR

    We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```.
    
    cc mengxr jkbradley shivaram
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10324 from yanboliang/spark-12364.
    
    (cherry picked from commit 1a8b2a1)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    yanboliang authored and jkbradley committed Dec 16, 2015
    Commit 04e868b
  16. [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib

    MLlib should use SQLContext.getOrCreate() instead of creating a new SQLContext.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #10338 from davies/create_context.
    
    (cherry picked from commit 27b98e9)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    Davies Liu authored and davies committed Dec 16, 2015
    Commit 552b38f

Commits on Dec 17, 2015

  1. [MINOR] Add missing interpolation in NettyRPCEnv

    ```
    Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException:
    Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout
    	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    ```
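    As a hedged reminder of the class of fix (illustrative snippet, not the actual NettyRPCEnv source): the placeholder is only expanded when the literal is an s-interpolated string.

    ```scala
    case class Timeout(duration: String)
    val timeout = Timeout("120 seconds")

    // Missing interpolation: the text "${timeout.duration}" is emitted literally.
    val bad = "Cannot receive any reply in ${timeout.duration}."
    // With the s prefix the expression is actually substituted.
    val good = s"Cannot receive any reply in ${timeout.duration}."
    ```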
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #10334 from andrewor14/rpc-typo.
    
    (cherry picked from commit 861549a)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    Andrew Or authored and zsxwing committed Dec 17, 2015
    Commit 638b89b
  2. [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests

    `DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs).  However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.
    
    This was suggested by mateiz on #7699.  It may have already turned up an issue in "zero split job".
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes #8466 from squito/SPARK-10248.
    
    (cherry picked from commit 38d9795)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    squito authored and Andrew Or committed Dec 17, 2015
    Commit fb02e4e
  3. [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
    
    SPARK-9886 fixed ExternalBlockStore.scala
    
    This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook()
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #10325 from ted-yu/master.
    
    (cherry picked from commit f590178)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    
    Conflicts:
    	sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
    tedyu authored and Andrew Or committed Dec 17, 2015
    Commit 4af6438
  4. [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.
    
    Author: Rohit Agarwal <rohita@qubole.com>
    
    Closes #10180 from mindprince/SPARK-12186.
    
    (cherry picked from commit fdb3822)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Rohit Agarwal authored and Andrew Or committed Dec 17, 2015
    Commit 154567d
  5. [SPARK-12386][CORE] Fix NPE when spark.executor.port is set.

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #10339 from vanzin/SPARK-12386.
    
    (cherry picked from commit d1508dd)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Marcelo Vanzin authored and Andrew Or committed Dec 17, 2015
    Commit 4ad0803
  6. [SPARK-12057][SQL] Prevent failure on corrupt JSON records

    This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failing test and updates the logic of schema inference.
    
    Regarding the schema inference change, if we have something like
    ```
    {"f1":1}
    [1,2,3]
    ```
    originally, we will get a DF without any column.
    After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.
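    A hedged sketch of that behavior (assumes a spark-shell on branch-1.6 with `sc` and `sqlContext` in scope):

    ```scala
    // One valid JSON object and one line the parser cannot turn into a row.
    val lines = sc.parallelize(Seq("""{"f1":1}""", "[1,2,3]"))
    val df = sqlContext.read.json(lines)

    df.printSchema()  // expect both f1 and _corrupt_record columns
    df.show()         // the second row keeps "[1,2,3]" under _corrupt_record
    ```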
    
    When merging this PR, please make sure that the author is simplyianm.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12057
    
    Closes #10043
    
    Author: Ian Macalinao <me@ian.pw>
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #10288 from yhuai/handleCorruptJson.
    
    (cherry picked from commit 9d66c42)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    yhuai authored and rxin committed Dec 17, 2015
    Commit d509194
  7. Once driver register successfully, stop it to connect to master.

    This commit is to resolve SPARK-12396.
    
    Author: echo2mei <534384876@qq.com>
    
    Closes #10354 from echoTomei/master.
    
    (cherry picked from commit 5a514b6)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    echoTomei authored and davies committed Dec 17, 2015
    Commit da7542f
  8. Revert "Once driver register successfully, stop it to connect to master."
    
    This reverts commit da7542f.
    davies committed Dec 17, 2015
    Commit a846648
  9. [SPARK-12395] [SQL] fix resulting columns of outer join

    For API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (will be null).
    
    The order of columns had been changed to match that with MySQL and PostgreSQL [1].
    
    This PR also fixes the nullability of the output for outer joins.
    
    [1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html
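    A hedged example of the API in question (assumes a spark-shell on branch-1.6 with `sqlContext` available; the data is illustrative):

    ```scala
    import sqlContext.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((2, "x"), (3, "y")).toDF("id", "r")

    // With usingColumns, the shared "id" column should come back populated for
    // rows that exist on only one side of a right_outer/full_outer join.
    left.join(right, Seq("id"), "full_outer").show()
    ```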
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #10353 from davies/fix_join.
    
    (cherry picked from commit a170d34)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    Davies Liu authored and davies committed Dec 17, 2015
    Commit 1ebedb2
  10. [SQL] Update SQLContext.read.text doc

    Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the doc.
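    A hedged usage sketch (assumes a spark-shell on branch-1.6 and a README.md in the working directory):

    ```scala
    val df = sqlContext.read.text("README.md")
    df.printSchema()            // a single string column, now named "value"
    df.select("value").show(3)
    ```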
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10349 from yanboliang/text-value.
    
    (cherry picked from commit 6e07716)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    yanboliang authored and rxin committed Dec 17, 2015
    Commit 41ad8ac
  11. [SPARK-12220][CORE] Make Utils.fetchFile support files that contain special characters
    
    This PR encodes and decodes the file name to fix the issue.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10208 from zsxwing/uri.
    
    (cherry picked from commit 86e405f)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Dec 17, 2015
    Commit 1fbca41
  12. [SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server
    
    Fixes a problem with #10332; this one should fix cluster mode on Mesos.
    
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes #10359 from dragos/issue/fix-spark-12345-one-more-time.
    
    (cherry picked from commit 8184568)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    dragos authored and sarutak committed Dec 17, 2015
    Commit 881f254
  13. [SPARK-12390] Clean up unused serializer parameter in BlockManager

    No change in functionality is intended. This only changes internal API.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #10343 from andrewor14/clean-bm-serializer.
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
    Andrew Or committed Dec 17, 2015
    Commit 88bbb54
  14. [SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split
    
    String.split accepts a regular expression, so we should escape "." and "|".
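    A small self-contained Scala illustration of the pitfall (not taken from the patch itself):

    ```scala
    // "." as a regex matches any character, so every position becomes a split
    // point and Java drops the resulting trailing empty strings.
    "a.b.c".split(".")    // Array() -- not what was intended
    "a.b.c".split("\\.")  // Array(a, b, c)

    // "|" is regex alternation with the empty pattern; escape it to split literally.
    "x|y".split("\\|")    // Array(x, y)
    ```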
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10361 from zsxwing/reg-bug.
    
    (cherry picked from commit 540b5ae)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Dec 17, 2015
    Commit c0ab14f
  15. [SPARK-12397][SQL] Improve error messages for data sources when they are not found
    
    Point users to spark-packages.org to find them.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #10351 from rxin/SPARK-12397.
    
    (cherry picked from commit e096a65)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    rxin authored and marmbrus committed Dec 17, 2015
    Commit 48dcee4
  16. [SPARK-12376][TESTS] Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
    
    org.apache.spark.streaming.Java8APISuite.java is failing because it tries to sort an immutable list in the assertOrderInvariantEquals method.
    
    Author: Evan Chen <chene@us.ibm.com>
    
    Closes #10336 from evanyc15/SPARK-12376-StreamingJavaAPISuite.
    Evan Chen authored and zsxwing committed Dec 17, 2015
    Commit 4df1dd4

Commits on Dec 18, 2015

  1. [SPARK-11749][STREAMING] Duplicate creating the RDD in file stream when recovering from checkpoint data
    
    Add a transient flag `DStream.restoredFromCheckpointData` to control the restore processing in DStream and avoid duplicate work: check this flag first in `DStream.restoreCheckpointData`; only when it is `false` will the restore process be executed.
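    A hedged sketch of that guard-flag pattern (illustrative class, not the actual DStream source): the flag ensures the restore body runs at most once even if recovery calls it repeatedly.

    ```scala
    class CheckpointedThing {
      @transient private var restoredFromCheckpointData = false

      def restoreCheckpointData(): Unit = {
        if (!restoredFromCheckpointData) {
          // ... rebuild the generated RDDs from checkpoint data here ...
          restoredFromCheckpointData = true
        }
      }
    }
    ```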
    
    Author: jhu-chang <gt.hu.chang@gmail.com>
    
    Closes #9765 from jhu-chang/SPARK-11749.
    
    (cherry picked from commit f4346f6)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    jhu-chang authored and zsxwing committed Dec 18, 2015
    Commit 9177ea3
  2. [SPARK-12413] Fix Mesos ZK persistence

    I believe this fixes SPARK-12413.  I'm currently running an integration test to verify.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes #10366 from mgummelt/fix-zk-mesos.
    
    (cherry picked from commit 2bebaa3)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    Michael Gummelt authored and sarutak committed Dec 18, 2015
    Commit df02319
  3. [SPARK-12218][SQL] Invalid splitting of nested AND expressions in Data Source filter API
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12218
    
    When creating filters for Parquet/ORC, we should not push nested AND expressions partially.
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #10362 from yhuai/SPARK-12218.
    
    (cherry picked from commit 41ee7c5)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    yhuai committed Dec 18, 2015
    Commit 1dc71ec
  4. Revert "[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called"
    
    This reverts commit 4af6438.
    Andrew Or committed Dec 18, 2015
    Commit 3b903e4
  5. [SPARK-12404][SQL] Ensure objects passed to StaticInvoke is Serializable

    Now `StaticInvoke` receives `Any` as an object, and while `StaticInvoke` itself can be serialized, the object passed in is sometimes not serializable.

    For example, the following code raises an exception because `RowEncoder#extractorsFor`, invoked indirectly, creates a `StaticInvoke`.
    
    ```
    case class TimestampContainer(timestamp: java.sql.Timestamp)
    val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(System.currentTimeMillis))
    val df = rdd.toDF
    val ds = df.as[TimestampContainer]
    val rdd2 = ds.rdd                                 <----------------- invokes extractorsFor indirectly
    ```
    
    I'll add test cases.
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #10357 from sarutak/SPARK-12404.
    
    (cherry picked from commit 6eba655)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    sarutak authored and marmbrus committed Dec 18, 2015
    Commit bd33d4e
  6. [SPARK-11985][STREAMING][KINESIS][DOCS] Update Kinesis docs

     - Provide an example of the `message handler`
     - Provide a bit on KPL record de-aggregation
     - Fix typos
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #9970 from brkyvz/kinesis-docs.
    
    (cherry picked from commit 2377b70)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    brkyvz authored and zsxwing committed Dec 18, 2015
    Commit eca401e

Commits on Dec 19, 2015

  1. [SQL] Fix mistake doc of join type for dataframe.join

    Fix a mistake in the doc of the join type for ```dataframe.join```.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10378 from yanboliang/leftsemi.
    
    (cherry picked from commit a073a73)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    yanboliang authored and rxin committed Dec 19, 2015
    Commit d6a519f

Commits on Dec 21, 2015

  1. Doc typo: ltrim = trim from left end, not right

    Author: pshearer <pshearer@massmutual.com>
    
    Closes #10414 from pshearer/patch-1.
    
    (cherry picked from commit fc6dbcc)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    shearerpmm authored and Andrew Or committed Dec 21, 2015
    Commit c754a08
  2. [SPARK-12466] Fix harmless NPE in tests

    ```
    [info] ReplayListenerSuite:
    [info] - Simple replay (58 milliseconds)
    java.lang.NullPointerException
    	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
    	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
    ```
    https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull
    
    This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests).
    
    Tested locally to verify that the NPE is gone.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #10417 from andrewor14/fix-harmless-npe.
    
    (cherry picked from commit d655d37)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Andrew Or committed Dec 21, 2015
    Commit ca39985

Commits on Dec 22, 2015

  1. Commit 4062cda
  2. Commit 5b19e7c
  3. [MINOR] Fix typos in JavaStreamingContext

    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10424 from zsxwing/typo.
    
    (cherry picked from commit 93da856)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    zsxwing authored and rxin committed Dec 22, 2015
    Commit 309ef35
  4. [SPARK-11823][SQL] Fix flaky JDBC cancellation test in HiveThriftBinaryServerSuite
    
    This patch fixes a flaky "test jdbc cancel" test in HiveThriftBinaryServerSuite. This test is prone to a race condition which causes it to block indefinitely while waiting for an extremely slow query to complete, which caused many Jenkins builds to time out.
    
    For more background, see my comments on #6207 (the PR which introduced this test).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #10425 from JoshRosen/SPARK-11823.
    
    (cherry picked from commit 2235cd4)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    JoshRosen committed Dec 22, 2015
    Commit 0f905d7
  5. [SPARK-12487][STREAMING][DOCUMENT] Add docs for Kafka message handler

    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10439 from zsxwing/kafka-message-handler-doc.
    
    (cherry picked from commit 93db50d)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Dec 22, 2015
    Commit 94fb5e8

Commits on Dec 23, 2015

  1. [SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for Streaming
    
    This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10385 from zsxwing/accumulator-broadcast-example.
    
    (cherry picked from commit 20591af)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Dec 23, 2015
    Commit 942c057
  2. [SPARK-12477][SQL] - Tungsten projection fails for null values in array fields
    
    Accessing null elements in an array field fails when tungsten is enabled.
    It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.
    
    This PR solves this by checking if the accessed element in the array field is null, in the generated code.
    
    Example:
    ```
    // Array of String
    case class AS( as: Seq[String] )
    val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
    dfAS.registerTempTable("T_AS")
    for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
    ```
    
    With Tungsten disabled:
    ```
    0 = [a]
    1 = [null]
    2 = [b]
    ```
    
    With Tungsten enabled:
    ```
    0 = [a]
    15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
    java.lang.NullPointerException
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
    	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    ```
    
    Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>
    
    Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.
    
    (cherry picked from commit 43b2a63)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    pierre-borckmans authored and rxin committed Dec 23, 2015
    Commit c6c9bf9

Commits on Dec 24, 2015

  1. [SPARK-12499][BUILD] don't force MAVEN_OPTS

    allow the user to override MAVEN_OPTS (2GB wasn't sufficient for me)
    
    Author: Adrian Bridgett <adrian@smop.co.uk>
    
    Closes #10448 from abridgett/feature/do_not_force_maven_opts.
    
    (cherry picked from commit ead6abf)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    abridgett authored and JoshRosen committed Dec 24, 2015
    Commit 5987b16
  2. [SPARK-12411][CORE] Decrease executor heartbeat timeout to match heartbeat interval
    
    Previously, the rpc timeout was the default network timeout, which is the same value
    the driver uses to determine dead executors. This means if there is a network issue,
    the executor is determined dead after one heartbeat attempt. There is a separate config
    for the heartbeat interval which is a better value to use for the heartbeat RPC. With
    this change, the executor will make multiple heartbeat attempts even with RPC issues.
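    As a hedged illustration of the two settings involved (the values are examples only, not recommendations):

    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Interval between executor heartbeats; with this change it also bounds
      // each heartbeat RPC attempt instead of the much larger network timeout.
      .set("spark.executor.heartbeatInterval", "10s")
      // Separate, larger timeout the driver uses to declare an executor dead.
      .set("spark.network.timeout", "120s")
    ```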
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #10365 from nongli/spark-12411.
    nongli authored and Andrew Or committed Dec 24, 2015
    Commit b49856a
  3. [SPARK-12502][BUILD][PYTHON] Script /dev/run-tests fails when IBM Java is used
    
    Fix an exception with the IBM JDK by removing the update field from the JavaVersion tuple, because the IBM JDK does not provide update information ('_xx').
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #10463 from kiszk/SPARK-12502.
    
    (cherry picked from commit 9e85bb7)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    kiszk authored and sarutak committed Dec 24, 2015
    Commit 4dd8712
  4. [SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax
    
    In the past, Spark JDBC writes only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()):
    
    INSERT INTO $table VALUES ( ?, ?, ..., ? )
    
    But some technologies require a list of column names:
    
    INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
    
    This was blocking the use of e.g. the Progress JDBC Driver for Cassandra.
    
    Another limitation is that syntax 1 relies on the DataFrame field ordering matching that of the target table. This works fine as long as the target table has been created by writer.jdbc().

    If the target table contains more columns (not created by writer.jdbc()), then the insert fails due to a mismatch in the number of columns or their data types.

    This PR switches to the recommended second INSERT syntax. Column names are taken from DataFrame field names.
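    A hedged sketch of building the second form from a schema (an illustrative helper, not the actual JdbcUtils code):

    ```scala
    // Build "INSERT INTO tbl ( c1, c2, ... ) VALUES ( ?, ?, ... )" from column names.
    def insertStatement(table: String, columns: Seq[String]): String = {
      val cols = columns.mkString(", ")
      val placeholders = columns.map(_ => "?").mkString(", ")
      s"INSERT INTO $table ( $cols ) VALUES ( $placeholders )"
    }

    insertStatement("people", Seq("name", "age"))
    // INSERT INTO people ( name, age ) VALUES ( ?, ? )
    ```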
    
    Author: CK50 <christian.kurz@oracle.com>
    
    Closes #10380 from CK50/master-SPARK-12010-2.
    
    (cherry picked from commit 502476e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    CK50 authored and srowen committed Dec 24, 2015
    Commit 865dd8b

Commits on Dec 28, 2015

  1. [SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join
    
    After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double checked the code.
    
    For example, users can do the Equi-Join like
      ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
    - There exists a bug in 1.5 and 1.4: the code just ignores the third parameter (join type) users pass. The join type actually used is `Inner`, even if the user specified another type (e.g., `Outer`).
    - After PR #8600, 1.6 does not have this issue, but the description has not been updated.
    
    Plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using Equi-Join.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #10477 from gatorsmile/pyOuterJoin.
    gatorsmile authored and davies committed Dec 28, 2015
    Commit b8da77e
  2. [SPARK-12517] add default RDD name for one created via sc.textFile

    The feature was first added at commit: 7b877b2 but was later removed (probably by mistake) at commit: fc8b581.
    This change sets the default name of RDDs created via sc.textFile(...) to the path argument.
    
    Here is the symptom:
    
    * Using spark-1.5.2-bin-hadoop2.6:
    
    scala> sc.textFile("/home/root/.bashrc").name
    res5: String = null
    
    scala> sc.binaryFiles("/home/root/.bashrc").name
    res6: String = /home/root/.bashrc
    
    * while using Spark 1.3.1:
    
    scala> sc.textFile("/home/root/.bashrc").name
    res0: String = /home/root/.bashrc
    
    scala> sc.binaryFiles("/home/root/.bashrc").name
    res1: String = /home/root/.bashrc
    
    Author: Yaron Weinsberg <wyaron@gmail.com>
    Author: yaron <yaron@il.ibm.com>
    
    Closes #10456 from wyaron/master.
    
    (cherry picked from commit 73b70f0)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    wyaron authored and sarutak committed Dec 28, 2015
    Commit 1fbcb6e
  3. [SPARK-12424][ML] The implementation of ParamMap#filter is wrong.

    ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`.
    Also, the return type of Map#filterKeys is not Serializable. That is an issue in Scala itself (https://issues.scala-lang.org/browse/SI-6654).
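    A small self-contained Scala illustration of the first pitfall (not the ParamMap code itself):

    ```scala
    import scala.collection.mutable

    val m = mutable.Map(1 -> "a", 2 -> "b")

    // filterKeys returns a collection.Map view, not a mutable.Map ...
    val view: collection.Map[Int, String] = m.filterKeys(_ == 1)
    // ... so casting it back would fail at runtime:
    // view.asInstanceOf[mutable.Map[Int, String]]   // ClassCastException

    // Materializing the filtered entries into a new mutable.Map is safe.
    val safe = mutable.Map(m.toSeq.filter { case (k, _) => k == 1 }: _*)
    ```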
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    
    Closes #10381 from sarutak/SPARK-12424.
    
    (cherry picked from commit 07165ca)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    sarutak committed Dec 28, 2015
    Commit 7c7d76f
  4. [SPARK-12222][CORE] Deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception
    
    Since we only need to implement `def skipBytes(n: Int)`,
    code in #10213 could be simplified.
    davies scwf
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes #10253 from adrian-wang/kryo.
    
    (cherry picked from commit a6d3853)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    adrian-wang authored and sarutak committed Dec 28, 2015
    Commit a9c52d4
  5. [SPARK-12489][CORE][SQL][MLIB] Fix minor issues found by FindBugs

    Include the following changes:
    
    1. Close `java.sql.Statement`
    2. Fix incorrect `asInstanceOf`.
    3. Remove unnecessary `synchronized` and `ReentrantLock`.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10440 from zsxwing/findbugs.
    
    (cherry picked from commit 710b411)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Dec 28, 2015
    Commit fd20248

Commits on Dec 29, 2015

  1. [SPARK-11394][SQL] Throw IllegalArgumentException for unsupported types in postgresql
    
    If a DataFrame has BYTE types, it throws an exception:
    org.postgresql.util.PSQLException: ERROR: type "byte" does not exist
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes #9350 from maropu/FixBugInPostgreJdbc.
    
    (cherry picked from commit 73862a1)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    maropu authored and yhuai committed Dec 29, 2015
    Commit 85a8718
  2. [SPARK-12526][SPARKR] `ifelse`, `when`, `otherwise` unable to take Column as value
    
    `ifelse`, `when`, `otherwise` are unable to take a `Column`-typed S4 object as value.
    
    For example:
    ```r
    ifelse(lit(1) == lit(1), lit(2), lit(3))
    ifelse(df$mpg > 0, df$mpg, 0)
    ```
    will both fail with
    ```r
    attempt to replicate an object of type 'environment'
    ```
    
    The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid an attempt to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenarios in which these functions would want to be vectorized in SparkR.
    
    For reference, added test cases which trigger failures:
    ```r
    . Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
    error in evaluating the argument 'x' in selecting a method for function 'collect':
      error in evaluating the argument 'col' in selecting a method for function 'select':
      attempt to replicate an object of type 'environment'
    Calls: when -> when -> ifelse -> ifelse
    
    1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
    2: eval(code, new_test_environment)
    3: eval(expr, envir, enclos)
    4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
    5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
    6: condition(object)
    7: compare(actual, expected, ...)
    8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
    Error: Test failures
    Execution halted
    ```
    
    Author: Forest Fang <forest.fang@outlook.com>
    
    Closes #10481 from saurfang/spark-12526.
    
    (cherry picked from commit d80cc90)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    saurfang authored and shivaram committed Dec 29, 2015
    Commit: c069ffc

Commits on Dec 30, 2015

  1. [SPARK-12300] [SQL] [PYSPARK] fix schema inferance on local collections

    Current schema inference for local python collections halts as soon as there are no NullTypes. This is different than when we specify a sampling ratio of 1.0 on a distributed collection. This could result in incomplete schema information.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
    
    (cherry picked from commit d1ca634)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    holdenk authored and davies committed Dec 30, 2015
    Commit: 8dc6549
  2. [SPARK-12399] Display correct error message when accessing REST API w…

    …ith an unknown app Id
    
    I got an exception when accessing the below REST API with an unknown application Id.
    `http://<server-url>:18080/api/v1/applications/xxx/jobs`
    Instead of an exception, I expect an error message "no such app: xxx", similar to the error message returned when I access `/api/v1/applications/xxx`.
    ```
    org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
    	at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
    	at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
    	at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    	at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    	at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
    	at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
    	at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
    	at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
    ```
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes #10352 from carsonwang/unknownAppFix.
    
    (cherry picked from commit b244297)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    carsonwang authored and Marcelo Vanzin committed Dec 30, 2015
    Commit: cd86075

Commits on Jan 3, 2016

  1. [SPARK-12327][SPARKR] fix code for lintr warning for commented code

    shivaram
    
    Author: felixcheung <felixcheung_m@hotmail.com>
    
    Closes #10408 from felixcheung/rcodecomment.
    
    (cherry picked from commit c3d5056)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    felixcheung authored and shivaram committed Jan 3, 2016
    Commit: 4e9dd16

Commits on Jan 4, 2016

  1. [SPARK-12562][SQL] DataFrame.write.format(text) requires the column n…

    …ame to be called value
    
    Author: Xiu Guo <xguo27@gmail.com>
    
    Closes #10515 from xguo27/SPARK-12562.
    
    (cherry picked from commit 84f8492)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    xguo27 authored and rxin committed Jan 4, 2016
    Commit: f7a3223
  2. [SPARK-12486] Worker should kill the executors more forcefully if pos…

    …sible.
    
    This patch updates the ExecutorRunner's terminate path to use the new java 8 API
    to terminate processes more forcefully if possible. If the executor is unhealthy,
    it would previously ignore the destroy() call. Presumably, the new java API was
    added to handle cases like this.
    
    We could update the termination path in the future to use OS specific commands
    for older java versions.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #10438 from nongli/spark-12486-executors.
    
    (cherry picked from commit 8f65939)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    nongli authored and Andrew Or committed Jan 4, 2016
    Commit: cd02038
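    A minimal Scala sketch of the Java 8 termination path described above (the helper name and grace period are illustrative, not the actual ExecutorRunner code): ask the process to exit, then force-kill it if it ignores the request.

    ```scala
    import java.util.concurrent.TimeUnit

    // Polite shutdown request first; forceful kill only if the process keeps running.
    def terminate(process: Process, graceMillis: Long = 5000L): Unit = {
      process.destroy()
      if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
        process.destroyForcibly()   // Java 8+: forceful termination of an unhealthy process
      }
    }
    ```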
  3. [SPARK-12470] [SQL] Fix size reduction calculation

    also only allocate required buffer size
    
    Author: Pete Robbins <robbinspg@gmail.com>
    
    Closes #10421 from robbinspg/master.
    
    (cherry picked from commit b504b6a)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    
    Conflicts:
    	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala
    robbinspg authored and davies committed Jan 4, 2016
    Commit: b5a1f56
  4. [SPARK-12579][SQL] Force user-specified JDBC driver to take precedence

    Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.
    
    In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
    
    This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).
    
    If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
    
    This patch is inspired by a similar patch that I made to the `spark-redshift` library (databricks/spark-redshift#143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #10519 from JoshRosen/jdbc-driver-precedence.
    
    (cherry picked from commit 6c83d93)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    JoshRosen authored and yhuai committed Jan 4, 2016
    Commit: 7f37c1e
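    A hedged Scala sketch of the driver-selection idea described above (names are illustrative, not the Spark code itself): register the requested driver class, then pick it explicitly out of DriverManager's registered drivers instead of relying on DriverManager.getConnection's subprotocol lookup.

    ```scala
    import java.sql.{Connection, Driver, DriverManager}
    import java.util.Properties
    import scala.collection.JavaConverters._

    def connectWithExplicitDriver(url: String, driverClass: String, props: Properties): Connection = {
      Class.forName(driverClass)                       // loading the class makes the driver register itself
      val driver: Driver = DriverManager.getDrivers.asScala
        .find(_.getClass.getName == driverClass)
        .getOrElse(throw new IllegalStateException(s"Driver $driverClass is not registered"))
      driver.connect(url, props)                       // bypass subprotocol-based driver resolution
    }
    ```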
  5. [DOC] Adjust coverage for partitionBy()

    This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark
    
    Michael suggested fixing the doc.
    
    Please review.
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #10499 from ted-yu/master.
    
    (cherry picked from commit 40d0396)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    tedyu authored and marmbrus committed Jan 4, 2016
    Commit: 1005ee3
  6. [SPARK-12589][SQL] Fix UnsafeRowParquetRecordReader to properly set t…

    …he row length.
    
    The reader was previously not setting the row length, meaning the row was wrong if there were variable-length columns.
    This problem does not usually manifest, since the value in the column is correct and projecting the row fixes the issue.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #10576 from nongli/spark-12589.
    
    (cherry picked from commit 34de24a)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    
    Conflicts:
    	sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
    nongli authored and yhuai committed Jan 4, 2016
    Commit: 8ac9198

Commits on Jan 5, 2016

  1. [SPARKR][DOC] minor doc update for version in migration guide

    checked that the change is in Spark 1.6.0.
    shivaram
    
    Author: felixcheung <felixcheung_m@hotmail.com>
    
    Closes #10574 from felixcheung/rwritemodedoc.
    
    (cherry picked from commit 8896ec9)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    felixcheung authored and shivaram committed Jan 5, 2016
    Commit: 8950482
  2. [SPARK-12568][SQL] Add BINARY to Encoders

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #10516 from marmbrus/datasetCleanup.
    
    (cherry picked from commit 53beddc)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    marmbrus committed Jan 5, 2016
    Commit: d9e4438
  3. [SPARK-12647][SQL] Fix o.a.s.sqlexecution.ExchangeCoordinatorSuite.de…

    …termining the number of reducers: aggregate operator
    
    change expected partition sizes
    
    Author: Pete Robbins <robbinspg@gmail.com>
    
    Closes #10599 from robbinspg/branch-1.6.
    robbinspg authored and yhuai committed Jan 5, 2016
    Commit: 5afa62b
  4. [SPARK-12617] [PYSPARK] Clean up the leak sockets of Py4J

    This patch added Py4jCallbackConnectionCleaner to clean the leak sockets of Py4J every 30 seconds. This is a workaround before Py4J fixes the leak issue py4j/py4j#187
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10579 from zsxwing/SPARK-12617.
    
    (cherry picked from commit 047a31b)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    zsxwing authored and davies committed Jan 5, 2016
    Commit: f31d0fd
  5. [SPARK-12511] [PYSPARK] [STREAMING] Make sure PythonDStream.registerS…

    …erializer is called only once
    
    There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (py4j/py4j#184)
    
    Py4j will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one, then the first one will be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10514 from zsxwing/SPARK-12511.
    
    (cherry picked from commit 6cfe341)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    zsxwing authored and davies committed Jan 5, 2016
    Commit: 83fe5cf
  6. [SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans

    SPARK-12450 . Un-persist broadcasted variables in KMeans.
    
    Author: RJ Nowling <rnowling@gmail.com>
    
    Closes #10415 from rnowling/spark-12450.
    
    (cherry picked from commit 78015a8)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    rnowling authored and jkbradley committed Jan 5, 2016
    Commit: 0afad66
  7. [SPARK-12453][STREAMING] Remove explicit dependency on aws-java-sdk

    Successfully ran kinesis demo on a live, aws hosted kinesis stream against master and 1.6 branches.  For reasons I don't entirely understand it required a manual merge to 1.5 which I did as shown here: BrianLondon@075c22e
    
    The demo ran successfully on the 1.5 branch as well.
    
    According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the kinesis regression in 1.5.2.
    
    Author: BrianLondon <brian@seatgeek.com>
    
    Closes #10492 from BrianLondon/remove-only.
    
    (cherry picked from commit ff89975)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    BrianLondon authored and srowen committed Jan 5, 2016
    Commit: bf3dca2

Commits on Jan 6, 2016

  1. [SPARK-12393][SPARKR] Add read.text and write.text for SparkR

    Add ```read.text``` and ```write.text``` for SparkR.
    cc sun-rui felixcheung shivaram
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10348 from yanboliang/spark-12393.
    
    (cherry picked from commit d1fea41)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    yanboliang authored and shivaram committed Jan 6, 2016
    Commit: c3135d0
  2. [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None

    If initial model passed to GMM is not empty it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`.
    
    Author: zero323 <matthew.szymkiewicz@gmail.com>
    
    Closes #9986 from zero323/SPARK-12006.
    
    (cherry picked from commit fcd013c)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    zero323 authored and jkbradley committed Jan 6, 2016
    Commit: 1756819
  3. [SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming

    Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10621 from zsxwing/SPARK-12617-2.
    
    (cherry picked from commit 1e6648d)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jan 6, 2016
    Commit: d821fae
  4. [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of defau…

    …lt root path to gain the streaming batch url.
    
    Author: huangzhaowei <carlmartinmax@gmail.com>
    
    Closes #10617 from SaintBacchus/SPARK-12672.
    SaintBacchus authored and zsxwing committed Jan 6, 2016
    Commit: 8f0ead3
  5. Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead …

    …of default root path to gain the streaming batch url."
    
    This reverts commit 8f0ead3. Will merge #10618 instead.
    zsxwing committed Jan 6, 2016
    Commit: 39b0a34

Commits on Jan 7, 2016

  1. [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in…

    … pyspark
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12016
    
    We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.
    
    Author: Liang-Chi Hsieh <viirya@appier.com>
    
    Closes #10100 from viirya/fix-load-py-wordvecmodel.
    
    (cherry picked from commit b51a4cd)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    viirya authored and jkbradley committed Jan 7, 2016
    Commit: 11b901b
  2. [SPARK-12673][UI] Add missing uri prepending for job description

    Otherwise the URL will fail to be proxied to the right one when in YARN mode. Here is the screenshot:
    
    ![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #10618 from jerryshao/SPARK-12673.
    
    (cherry picked from commit 174e72c)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    jerryshao authored and zsxwing committed Jan 7, 2016
    Commit: 94af69c
  3. [SPARK-12678][CORE] MapPartitionsRDD clearDependencies

    MapPartitionsRDD was keeping a reference to `prev` after a call to
    `clearDependencies`, which could lead to a memory leak.
    
    Author: Guillaume Poulin <poulin.guillaume@gmail.com>
    
    Closes #10623 from gpoulin/map_partition_deps.
    
    (cherry picked from commit b673852)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    gpoulin authored and rxin committed Jan 7, 2016
    Commit: d061b85
  4. Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is …

    …not None"
    
    This reverts commit fcd013c.
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #10632 from yhuai/pythonStyle.
    
    (cherry picked from commit e5cde7a)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    yhuai committed Jan 7, 2016
    Commit: 34effc4
  5. [DOC] fix 'spark.memory.offHeap.enabled' default value to false

    modify 'spark.memory.offHeap.enabled' default value to false
    
    Author: zzcclp <xm_zzc@sina.com>
    
    Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
    
    (cherry picked from commit 84e77a1)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    zzcclp authored and rxin committed Jan 7, 2016
    Commit: 47a58c7
  6. [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None

    If initial model passed to GMM is not empty it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to list.
    
    Author: zero323 <matthew.szymkiewicz@gmail.com>
    
    Closes #10644 from zero323/SPARK-12006.
    
    (cherry picked from commit 592f649)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    zero323 authored and jkbradley committed Jan 7, 2016
    Commit: 69a885a
  7. [SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overla…

    …pping splits
    
    https://issues.apache.org/jira/browse/SPARK-12662
    
    cc yhuai
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #10626 from sameeragarwal/randomsplit.
    
    (cherry picked from commit f194d99)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    sameeragarwal authored and rxin committed Jan 7, 2016
    Commit: 017b73e
  8. [SPARK-12598][CORE] bug in setMinPartitions

    There is a bug in the calculation of ```maxSplitSize```.  The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```.
    
    Author: Darek Blasiak <darek.blasiak@640labs.com>
    
    Closes #10546 from datafarmer/setminpartitionsbug.
    
    (cherry picked from commit 8346518)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    datafarmer authored and srowen committed Jan 7, 2016
    Commit: 6ef8235
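    A one-line Scala sketch of the corrected calculation described above (names are illustrative): the split size comes from dividing the total length by the requested number of partitions, not by the number of files.

    ```scala
    // Before the fix: totalLen / files.size; after the fix: totalLen / minPartitions.
    def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
      math.ceil(totalLen * 1.0 / math.max(minPartitions, 1)).toLong
    ```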

Commits on Jan 8, 2016

  1. [SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and all…

    …owBatching configurations for Streaming
    
    /cc tdas brkyvz
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10453 from zsxwing/streaming-conf.
    
    (cherry picked from commit c94199e)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Jan 8, 2016
    Commit: a7c3636
  2. [SPARK-12591][STREAMING] Register OpenHashMapBasedStateMap for Kryo (…

    …branch 1.6)
    
    backport #10609 to branch 1.6
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10656 from zsxwing/SPARK-12591-branch-1.6.
    zsxwing authored and tdas committed Jan 8, 2016
    Commit: 0d96c54
  3. [DOCUMENTATION] doc fix of job scheduling

    spark.shuffle.service.enabled is a Spark application-related configuration; it is not necessary to set it in yarn-site.xml.
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #10657 from zjffdu/doc-fix.
    
    (cherry picked from commit 00d9261)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    zjffdu authored and Marcelo Vanzin committed Jan 8, 2016
    Commit: fe2cf34
  4. fixed numVertices in transitive closure example

    Author: Udo Klein <git@blinkenlight.net>
    
    Closes #10642 from udoklein/patch-2.
    
    (cherry picked from commit 8c70cb4)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    udoklein authored and srowen committed Jan 8, 2016
    Commit: e4227cb
  5. [SPARK-12654] sc.wholeTextFiles with spark.hadoop.cloneConf=true fail…

    …s on secure Hadoop
    
    https://issues.apache.org/jira/browse/SPARK-12654
    
    So the bug here is that WholeTextFileRDD.getPartitions has:
    val conf = getConf
    In getConf, if cloneConf=true it creates a new Hadoop Configuration, and then uses that to create a new newJobContext.
    The newJobContext will copy credentials around, but credentials are only present in a JobConf, not in a Hadoop Configuration. So when it clones the Hadoop configuration, it changes it from a JobConf to a Configuration and drops the credentials that were there. NewHadoopRDD just uses the conf passed in for getPartitions (not getConf), which is why it works.
    
    Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
    
    Closes #10651 from tgravescs/SPARK-12654.
    
    (cherry picked from commit 553fd7b)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    tgravescs authored and Tom Graves committed Jan 8, 2016
    Commit: faf094c
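    A hedged Scala illustration of the distinction described above (assumes Hadoop on the classpath; this is not the Spark patch itself): copying a JobConf into a plain Configuration loses the credentials, while staying a JobConf keeps them.

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    val secureConf = new JobConf()                   // on secure Hadoop this carries the job credentials
    val asPlainConf = new Configuration(secureConf)  // plain Configuration: credentials are not part of it
    val asJobConf = new JobConf(secureConf)          // JobConf copy: credentials travel with it
    ```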
  6. [SPARK-12696] Backport Dataset Bug fixes to 1.6

    We've fixed a lot of bugs in master, and since this is experimental in 1.6 we should consider back porting the fixes.  The only thing that is obviously risky to me is 0e07ed3, we might try to remove that.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Cheng Lian <lian@databricks.com>
    Author: Nong Li <nong@databricks.com>
    
    Closes #10650 from marmbrus/dataset-backports.
    marmbrus committed Jan 8, 2016
    Commit: a619050

Commits on Jan 9, 2016

  1. [SPARK-12645][SPARKR] SparkR support hash function

    Add ```hash``` function for SparkR ```DataFrame```.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #10597 from yanboliang/spark-12645.
    
    (cherry picked from commit 3d77cff)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    yanboliang authored and shivaram committed Jan 9, 2016
    Commit: 8b5f230

Commits on Jan 10, 2016

  1. [SPARK-10359][PROJECT-INFRA] Backport dev/test-dependencies script to…

    … branch-1.6
    
    This patch backports the `dev/test-dependencies` script (from #10461) to branch-1.6.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #10680 from JoshRosen/test-deps-16-backport.
    JoshRosen committed Jan 10, 2016
    Commit: 7903b06

Commits on Jan 11, 2016

  1. [SPARK-12734][BUILD] Backport Netty exclusion + Maven enforcer fixes …

    …to branch-1.6
    
    This patch backports the Netty exclusion fixes from #10672 to branch-1.6.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #10691 from JoshRosen/netty-exclude-16-backport.
    JoshRosen committed Jan 11, 2016
    Commit: 43b72d8
  2. removed lambda from sortByKey()

    According to the documentation, the sortByKey method does not take a lambda as an argument, so the example is flawed. Removed the argument completely, as this defaults to an ascending sort.
    
    Author: Udo Klein <git@blinkenlight.net>
    
    Closes #10640 from udoklein/patch-1.
    
    (cherry picked from commit bd723bd)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    udoklein authored and srowen committed Jan 11, 2016
    Commit: d4cfd2a
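    An illustrative Scala sketch of the API shape behind the fix above (assumes an existing SparkContext `sc`; the patched example itself is Python): sortByKey takes no comparison function, only an optional ascending flag and partition count.

    ```scala
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    val ascending  = pairs.sortByKey()                   // defaults to ascending order
    val descending = pairs.sortByKey(ascending = false)  // descending order, still no lambda
    ```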
  3. [STREAMING][MINOR] Typo fixes

    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #10698 from jaceklaskowski/streaming-kafka-typo-fixes.
    
    (cherry picked from commit b313bad)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    jaceklaskowski authored and zsxwing committed Jan 11, 2016
    Commit: ce906b3
  4. [SPARK-12734][HOTFIX] Build changes must trigger all tests; clean aft…

    …er install in dep tests
    
    This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script.
    
    First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed.
    
    I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests.
    
    /cc zsxwing
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #10704 from JoshRosen/fix-build-test-problems.
    
    (cherry picked from commit a449914)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    JoshRosen committed Jan 11, 2016
    Commit: 3b32aa9
  5. [SPARK-12758][SQL] add note to Spark SQL Migration guide about Timest…

    …ampType casting
    
    Warning users about casting changes.
    
    Author: Brandon Bradley <bradleytastic@gmail.com>
    
    Closes #10708 from blbradley/spark-12758.
    
    (cherry picked from commit a767ee8)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    blbradley authored and marmbrus committed Jan 11, 2016
    Commit: dd2cf64

Commits on Jan 12, 2016

  1. [SPARK-11823] Ignores HiveThriftBinaryServerSuite's test jdbc cancel

    https://issues.apache.org/jira/browse/SPARK-11823
    
    This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test.
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #10715 from yhuai/SPARK-11823-ignore.
    
    (cherry picked from commit aaa2c3b)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    yhuai authored and JoshRosen committed Jan 12, 2016
    Commit: a6c9c68
  2. [SPARK-12638][API DOC] Parameter explanation not very accurate for rd…

    …d function "aggregate"
    
    Currently, the documentation for the RDD function aggregate does not explain its parameters well, especially "zeroValue".
    It is helpful to let junior Scala users know that "zeroValue" takes part in both the "seqOp" and "combOp" phases.
    
    Author: Tommy YU <tummyyu@163.com>
    
    Closes #10587 from Wenpei/rdd_aggregate_doc.
    
    (cherry picked from commit 9f0995b)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Wenpei authored and srowen committed Jan 12, 2016
    Commit: 46fc7a1
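    An illustrative Scala sketch (assumes an existing SparkContext `sc`) showing that zeroValue seeds both the per-partition seqOp and the cross-partition combOp:

    ```scala
    val rdd = sc.parallelize(1 to 4, numSlices = 2)
    val (sum, count) = rdd.aggregate((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: starts from zeroValue in every partition
      (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: zeroValue is also the starting accumulator here
    )
    // sum == 10, count == 4; a non-neutral zeroValue would therefore be applied more than once
    ```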
  3. [SPARK-12582][TEST] IndexShuffleBlockResolverSuite fails in windows

    [SPARK-12582][Test] IndexShuffleBlockResolverSuite fails in windows
    
    * IndexShuffleBlockResolverSuite fails in windows due to file is not closed.
    * mv IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala".
    
    https://issues.apache.org/jira/browse/SPARK-12582
    
    Author: Yucai Yu <yucai.yu@intel.com>
    
    Closes #10526 from yucai/master.
    
    (cherry picked from commit 7e15044)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Yucai Yu authored and srowen committed Jan 12, 2016
    Commit: 3221a7d
  4. [SPARK-5273][MLLIB][DOCS] Improve documentation examples for LinearRe…

    …gression
    
    Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE.
    
    Our training folks hit this exact same issue when concocting an example and had the same solution.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #10675 from srowen/SPARK-5273.
    
    (cherry picked from commit 9c7f34a)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jan 12, 2016
    Commit: 4c67d55
  5. [SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean N…

    …orm equals to zero
    
    Cosine similarity with 0 vector should be 0
    
    Related to #10152
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #10696 from srowen/SPARK-7615.
    
    (cherry picked from commit c48f2a3)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jan 12, 2016
    Commit: 94b39f7
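    A minimal Scala sketch of the guard described above (not MLlib's code): return 0 when either norm is zero instead of dividing by zero and producing NaN.

    ```scala
    def cosineSimilarity(v1: Array[Double], v2: Array[Double]): Double = {
      val dot   = v1.zip(v2).map { case (a, b) => a * b }.sum
      val norm1 = math.sqrt(v1.map(x => x * x).sum)
      val norm2 = math.sqrt(v2.map(x => x * x).sum)
      if (norm1 == 0.0 || norm2 == 0.0) 0.0 else dot / (norm1 * norm2)
    }
    ```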
  6. Revert "[SPARK-12645][SPARKR] SparkR support hash function"

    This reverts commit 8b5f230.
    yhuai committed Jan 12, 2016
    Commit: 03e523e

Commits on Jan 13, 2016

  1. [HOT-FIX] bypass hive test when parse logical plan to json

    #10311 introduces some rare, non-deterministic flakiness for hive udf tests, see #10311 (comment)
    
    I can't reproduce it locally, and may need more time to investigate, a quick solution is: bypass hive tests for json serialization.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #10430 from cloud-fan/hot-fix.
    
    (cherry picked from commit 8543997)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    cloud-fan authored and marmbrus committed Jan 13, 2016
    Commit: f71e5cc
  2. [SPARK-12558][SQL] AnalysisException when multiple functions applied …

    …in GROUP BY clause
    
    cloud-fan Can you please take a look ?
    
    In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #10520 from dilipbiswal/spark-12558.
    
    (cherry picked from commit dc7b387)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    
    Conflicts:
    	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala
    dilipbiswal authored and yhuai committed Jan 13, 2016
    Commit: dcdc864
  3. [SPARK-12805][MESOS] Fixes documentation on Mesos run modes

    The default run mode has changed, but the documentation didn't fully reflect the change.
    
    Author: Luc Bourlier <luc.bourlier@typesafe.com>
    
    Closes #10740 from skyluc/issue/mesos-modes-doc.
    
    (cherry picked from commit cc91e21)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    Luc Bourlier authored and rxin committed Jan 13, 2016
    Commit: f9ecd3a
  4. [SPARK-12685][MLLIB][BACKPORT TO 1.4] word2vec trainWordsCount gets o…

    …verflow
    
    jira: https://issues.apache.org/jira/browse/SPARK-12685
    
    master PR: #10627
    
    the log of word2vec reports
    trainWordsCount = -785727483
    during computation over a large dataset.
    
    Update the priority as it will affect the computation process.
    alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #10721 from hhbyyh/branch-1.4.
    
    (cherry picked from commit 7bd2564)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    hhbyyh authored and jkbradley committed Jan 13, 2016
    Commit: 364f799
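    A small Scala illustration of the overflow reported above and the usual remedy (the numbers are made up): accumulating the word count as an Int wraps into negative territory on large corpora, while a Long does not.

    ```scala
    val countsPerPartition = Seq(1500000000, 1500000000)       // ~3 billion words in total
    val overflowed: Int = countsPerPartition.sum               // wraps around to a negative value
    val safe: Long = countsPerPartition.map(_.toLong).sum      // 3000000000L
    ```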
  5. [SPARK-12268][PYSPARK] Make pyspark shell pythonstartup work under py…

    …thon3
    
    This replaces the `execfile` used for running custom python shell scripts
    with explicit open, compile and exec (as recommended by 2to3). The reason
    for this change is to make the pythonstartup option compatible with python3.
    
    Author: Erik Selin <erik.selin@gmail.com>
    
    Closes #10255 from tyro89/pythonstartup-python3.
    
    (cherry picked from commit e4e0b3f)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    erikselin authored and JoshRosen committed Jan 13, 2016
    Commit: cf6d506
  6. [SPARK-12690][CORE] Fix NPE in UnsafeInMemorySorter.free()

    I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception like OOM thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it.
    
    ```
    ERROR spark.TaskContextImpl: Error in TaskCompletionListener
    java.lang.NullPointerException
            at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
            at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
            at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
            at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
            at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
            at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
            at org.apache.spark.scheduler.Task.run(Task.scala:91)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
            at java.lang.Thread.run(Thread.java:722)
    ```
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes #10637 from carsonwang/FixNPE.
    
    (cherry picked from commit eabc7b8)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    carsonwang authored and JoshRosen committed Jan 13, 2016
    Commit: 26f13fa
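    A hedged Scala sketch of the guard added above (types and names are illustrative, not Spark's actual classes): only ask the consumer to free the array when a consumer was actually supplied.

    ```scala
    trait MemoryConsumerLike { def freeArray(array: Array[Long]): Unit }

    // Returns null so the caller can drop its reference to the array, mirroring the cleanup path.
    def free(consumer: MemoryConsumerLike, array: Array[Long]): Array[Long] = {
      if (consumer != null) {   // consumer may legitimately be null, as described in the entry above
        consumer.freeArray(array)
      }
      null
    }
    ```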

Commits on Jan 14, 2016

  1. [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when …

    …number of features is large
    
    jira: https://issues.apache.org/jira/browse/SPARK-12026
    
    The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.
    
    I tested on local and the change can improve the performance and the running time was stable.
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #10146 from hhbyyh/chiSq.
    
    (cherry picked from commit 021dafc)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    hhbyyh authored and jkbradley committed Jan 14, 2016
    Commit: a490787
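    A hedged Scala sketch of the idea (not the MLlib patch verbatim): index the desired column range directly instead of slicing a zipped view, whose cost grows with startCol.

    ```scala
    def columnChunk(features: Array[Double], startCol: Int, endCol: Int): Array[(Double, Int)] = {
      val out = new Array[(Double, Int)](endCol - startCol)
      var i = startCol
      while (i < endCol) {            // O(endCol - startCol), independent of where startCol sits
        out(i - startCol) = (features(i), i)
        i += 1
      }
      out
    }
    ```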
  2. [SPARK-9844][CORE] File appender race condition during shutdown

    When an Executor process is destroyed, the FileAppender that is asynchronously reading the stderr stream of the process can throw an IOException during read because the stream is closed. Before the ExecutorRunner destroys the process, the FileAppender thread is flagged to stop. This PR wraps the inputStream.read call of the FileAppender in a try/catch block so that if an IOException is thrown and the thread has been flagged to stop, it will safely ignore the exception. Additionally, the FileAppender thread was changed to use Utils.tryWithSafeFinally to better log any exceptions that do occur. Added unit tests to verify that an IOException is thrown and logged if the FileAppender is not flagged to stop, and that there is no IOException when the flag is set.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #10714 from BryanCutler/file-appender-read-ioexception-SPARK-9844.
    
    (cherry picked from commit 56cdbd6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    BryanCutler authored and srowen committed Jan 14, 2016
    Commit: 0c67993
  3. [SPARK-12784][UI] Fix Spark UI IndexOutOfBoundsException with dynamic…

    … allocation
    
    Add `listener.synchronized` to get `storageStatusList` and `execInfo` atomically.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10728 from zsxwing/SPARK-12784.
    
    (cherry picked from commit 501e99e)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jan 14, 2016
    Commit: d1855ad
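    A hedged Scala sketch of the pattern described above (the class and its fields are illustrative, not Spark's listener): take both reads inside one synchronized block so the UI renders a consistent snapshot while executors are added or removed.

    ```scala
    class ExecutorsListenerSketch {
      private var storageStatusList: Seq[String] = Nil
      private var execInfo: Seq[String] = Nil

      // Callers get a consistent pair instead of two reads that may interleave with updates.
      def snapshot(): (Seq[String], Seq[String]) = synchronized {
        (storageStatusList, execInfo)
      }

      def update(storage: Seq[String], execs: Seq[String]): Unit = synchronized {
        storageStatusList = storage
        execInfo = execs
      }
    }
    ```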

Commits on Jan 15, 2016

  1. [SPARK-12708][UI] Sorting task error in Stages Page when yarn mode.

    If the sort column contains a slash (e.g. "Executor ID / Host") in yarn mode, sorting fails with the following message.
    
    ![spark-12708](https://cloud.githubusercontent.com/assets/6679275/12193320/80814f8c-b62a-11e5-9914-7bf3907029df.png)
    
    It's similar to SPARK-4313.
    
    Author: root <root@R520T1.(none)>
    Author: Koyo Yoshida <koyo0615@gmail.com>
    
    Closes #10663 from yoshidakuy/SPARK-12708.
    
    (cherry picked from commit 32cca93)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    yoshidakuy authored and sarutak committed Jan 15, 2016
    Commit: d23e57d
  2. [SPARK-11031][SPARKR] Method str() on a DataFrame

    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
    Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
    Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
    
    Closes #9613 from olarayej/SPARK-11031.
    
    (cherry picked from commit ba4a641)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    Oscar D. Lara Yejas authored and shivaram committed Jan 15, 2016
    Commit: 5a00528
  3. [SPARK-12701][CORE] FileAppender should use join to ensure writing th…

    …read completion
    
    Changed Logging FileAppender to use join in `awaitTermination` to ensure that thread is properly finished before returning.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701.
    
    (cherry picked from commit ea104b8)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    BryanCutler authored and srowen committed Jan 15, 2016
    Commit: 7733668
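    A minimal Scala sketch of the change described above (not the actual FileAppender): awaitTermination joins the writing thread rather than polling a flag, so the appender has fully finished before the call returns.

    ```scala
    class AppenderSketch {
      @volatile private var markedForStop = false

      private val writer = new Thread(new Runnable {
        override def run(): Unit = {
          while (!markedForStop) { Thread.sleep(10) } // stand-in for copying the input stream to the file
        }
      })
      writer.start()

      def awaitTermination(): Unit = {
        markedForStop = true
        writer.join()   // returns only once the writing thread has completed
      }
    }
    ```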

Commits on Jan 16, 2016

  1. [SPARK-12722][DOCS] Fixed typo in Pipeline example

    http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
    ```
    val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
    ```
    should be
    ```
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
    ```
    cc: jkbradley
    
    Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
    
    Closes #10769 from Agent007/SPARK-12722.
    
    (cherry picked from commit 86972fa)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Jeff Lam authored and srowen committed Jan 16, 2016
    Commit: 5803fce

Commits on Jan 18, 2016

  1. [SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions ap…

    …plied in GROUP BY clause
    
    Addresses the comments from Yin.
    #10520
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #10758 from dilipbiswal/spark-12558-followup.
    
    (cherry picked from commit db9a860)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    
    Conflicts:
    	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
    dilipbiswal authored and yhuai committed Jan 18, 2016
    Commit: 53184ce
  2. [SPARK-12346][ML] Missing attribute names in GLM for vector-type feat…

    …ures
    
    Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names.
    
    cc mengxr
    
    Author: Eric Liang <ekl@databricks.com>
    
    Closes #10323 from ericl/spark-12346.
    
    (cherry picked from commit 5e492e9)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    ericl authored and mengxr committed Jan 18, 2016
    Commit: 8c2b67f
  3. [SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume i…

    …ntegration doc
    
    This PR added instructions to get flume assembly jar for Python users in the flume integration page like Kafka doc.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10746 from zsxwing/flume-doc.
    
    (cherry picked from commit a973f48)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Jan 18, 2016
    Commit: 7482c7b

Commits on Jan 19, 2016

  1. [SPARK-12894][DOCUMENT] Add deploy instructions for Python in Kinesis…

    … integration doc
    
    This PR added instructions to get Kinesis assembly jar for Python users in the Kinesis integration page like Kafka doc.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10822 from zsxwing/kinesis-doc.
    
    (cherry picked from commit 721845c)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Jan 19, 2016
    Commit: d43704d
  2. [SPARK-12841][SQL][BRANCH-1.6] fix cast in filter

    In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like filter, the `UnresolvedAlias` can't be resolved and actually we don't need a better alias for this case. This PR moves the cast wrapping logic to `Column.named` so that we will only do it when we need an alias name.
    
    backport #10781 to 1.6
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #10819 from cloud-fan/bug.
    cloud-fan authored and yhuai committed Jan 19, 2016
    Commit: 68265ac
  3. [SQL][MINOR] Fix one little mismatched comment according to the codes…

    … in interface.scala
    
    Author: proflin <proflin.me@gmail.com>
    
    Closes #10824 from proflin/master.
    
    (cherry picked from commit c00744e)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    lw-lin authored and rxin committed Jan 19, 2016
    Commit: 30f55e5
  4. [MLLIB] Fix CholeskyDecomposition assertion's message

    Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, whereas in fact it was the lapack.dppsv method.
    
    Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
    
    Closes #10818 from wjur/wjur/rename_error_message.
    
    (cherry picked from commit ebd9ce0)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    wjur authored and srowen committed Jan 19, 2016
    Commit: 962e618

Commits on Jan 21, 2016

  1. [SPARK-12921] Use SparkHadoopUtil reflection in SpecificParquetRecord…

    …ReaderBase
    
    It looks like there's one place left in the codebase, SpecificParquetRecordReaderBase, where we didn't use SparkHadoopUtil's reflective accesses of TaskAttemptContext methods, which could create problems when using a single Spark artifact with both Hadoop 1.x and 2.x.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #10843 from JoshRosen/SPARK-12921.
    JoshRosen committed Jan 21, 2016
    Commit: 40fa218

Commits on Jan 22, 2016

  1. [SPARK-12747][SQL] Use correct type name for Postgres JDBC's real array

    https://issues.apache.org/jira/browse/SPARK-12747
    
    Postgres JDBC driver uses "FLOAT4" or "FLOAT8" not "real".
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #10695 from viirya/fix-postgres-jdbc.
    
    (cherry picked from commit 55c7dd0)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    viirya authored and rxin committed Jan 22, 2016
    Commit: b5d7dbe

Commits on Jan 23, 2016

  1. [SPARK-12859][STREAMING][WEB UI] Names of input streams with receiver…

    …s don't fit in Streaming page
    
    Added CSS style to force names of input streams with receivers to wrap
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes #10873 from ajbozarth/spark12859.
    
    (cherry picked from commit 358a33b)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    ajbozarth authored and sarutak committed Jan 23, 2016
    Commit: dca238a
  2. [SPARK-12760][DOCS] invalid lambda expression in python example for …

    …local vs cluster
    
    srowen thanks for the PR at #10866! sorry it took me a while.
    
    This is related to #10866, basically the assignment in the lambda expression in the python example is actually invalid
    
    ```
    In [1]: data = [1, 2, 3, 4, 5]
    In [2]: counter = 0
    In [3]: rdd = sc.parallelize(data)
    In [4]: rdd.foreach(lambda x: counter += x)
      File "<ipython-input-4-fcb86c182bad>", line 1
        rdd.foreach(lambda x: counter += x)
                                       ^
    SyntaxError: invalid syntax
    ```
    
    Author: Mortada Mehyar <mortada.mehyar@gmail.com>
    
    Closes #10867 from mortada/doc_python_fix.
    
    (cherry picked from commit 56f57f8)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    mortada authored and srowen committed Jan 23, 2016
    Commit: e8ae242
  3. [SPARK-12760][DOCS] inaccurate description for difference between loc…

    …al vs cluster mode in closure handling
    
    Clarify that modifying a driver local variable won't have the desired effect in cluster modes, and may or may not work as intended in local mode
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #10866 from srowen/SPARK-12760.
    
    (cherry picked from commit aca2a01)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jan 23, 2016
    Commit: f13a3d1

Commits on Jan 24, 2016

  1. [SPARK-12120][PYSPARK] Improve exception message when failing to init…

    …ialize HiveContext in PySpark
    
    davies Mind to review ?
    
    This is the error message after this PR
    
    ```
    15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
      warnings.warn("You must build Spark with Hive. "
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
        return DataFrameReader(self)
      File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
        self._jreader = sqlContext._ssql_ctx.read()
      File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
        raise e
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
    : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
    	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
    	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
    	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
    	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    	at py4j.Gateway.invoke(Gateway.java:214)
    	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    	at py4j.GatewayConnection.run(GatewayConnection.java:209)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #10126 from zjffdu/SPARK-12120.
    
    (cherry picked from commit e789b1d)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    zjffdu authored and JoshRosen committed Jan 24, 2016
    Commit: f913f7e

Commits on Jan 25, 2016

  1. [SPARK-12624][PYSPARK] Checks row length when converting Java arrays …

    …to Python rows
    
    When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #10886 from liancheng/spark-12624.
    
    (cherry picked from commit 3327fd2)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    liancheng authored and yhuai committed Jan 25, 2016
    Commit: 88614dd
  2. [SPARK-12932][JAVA API] improved error message for java type inferenc…

    …e failure
    
    Author: Andy Grove <andygrove73@gmail.com>
    
    Closes #10865 from andygrove/SPARK-12932.
    
    (cherry picked from commit d8e4805)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    andygrove authored and srowen committed Jan 25, 2016
    Configuration menu
    Copy the full SHA
    88114d3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-12755][CORE] Stop the event logger before the DAG scheduler

    [SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.
    
    This contribution is my original work, and I license this work to the Spark project under the project's open source license.
    
    Author: Michael Allman <michael@videoamp.com>
    
    Closes #10700 from mallman/stop_event_logger_first.
    
    (cherry picked from commit 4ee8191)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Michael Allman authored and srowen committed Jan 25, 2016
    Configuration menu
    Copy the full SHA
    b40e58c View commit details
    Browse the repository at this point in the history

Commits on Jan 26, 2016

  1. [SPARK-12961][CORE] Prevent snappy-java memory leak

    JIRA: https://issues.apache.org/jira/browse/SPARK-12961
    
    To prevent a memory leak in snappy-java, call the method once and cache the result. Once the library releases a new version, we can remove this workaround.
    
    JoshRosen
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #10875 from viirya/prevent-snappy-memory-leak.
    
    (cherry picked from commit 5936bf9)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    viirya authored and srowen committed Jan 26, 2016
    Configuration menu
    Copy the full SHA
    572bc39 View commit details
    Browse the repository at this point in the history
  2. [SPARK-12682][SQL] Add support for (optionally) not storing tables in…

    … hive metadata format
    
    This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL.
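
    A hedged sketch of how the option might be supplied (the option name comes from this PR; routing it through the DataFrame writer's options is an assumption, not necessarily the patch's test setup):

    ```python
    # Hypothetical usage: ask the data source table not to persist Hive-compatible
    # metadata, which helps with very wide schemas.
    df.write \
        .format("parquet") \
        .option("skip_hive_metadata", "true") \
        .saveAsTable("wide_table")
    ```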
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #10826 from sameeragarwal/skip-hive-metadata.
    
    (cherry picked from commit 08c781c)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    sameeragarwal authored and yhuai committed Jan 26, 2016
    Configuration menu
    Copy the full SHA
    f0c98a6 View commit details
    Browse the repository at this point in the history
  3. [SPARK-12682][SQL][HOT-FIX] Fix test compilation

    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #10925 from yhuai/branch-1.6-hot-fix.
    yhuai committed Jan 26, 2016
    Configuration menu
    Copy the full SHA
    6ce3dd9 View commit details
    Browse the repository at this point in the history
  4. [SPARK-12611][SQL][PYSPARK][TESTS] Fix test_infer_schema_to_local

    Previously (when the PR was first created) not specifying `b=` explicitly was fine and was treated as a default null; the test is now explicit about `b` being `None`.
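
    A minimal sketch of the adjusted pattern (hypothetical values):

    ```python
    from pyspark.sql import Row

    # Be explicit that b is None rather than relying on an implicit default.
    rows = [Row(a=1, b=None), Row(a=2, b="x")]
    df = sqlContext.createDataFrame(rows)
    ```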
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #10564 from holdenk/SPARK-12611-fix-test-infer-schema-local.
    
    (cherry picked from commit 13dab9c)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    holdenk authored and yhuai committed Jan 26, 2016
    Configuration menu
    Copy the full SHA
    85518ed View commit details
    Browse the repository at this point in the history

Commits on Jan 27, 2016

  1. [SPARK-12834][ML][PYTHON][BACKPORT] Change ser/de of JavaArray and Ja…

    …vaList
    
    Backport of SPARK-12834 for branch-1.6
    
    Original PR: #10772
    
    Original commit message:
    We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way; instead, we can use type conversion, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with Ser/De of Scala Array, as noted in https://issues.apache.org/jira/browse/SPARK-12780
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #10941 from jkbradley/yinxusen-SPARK-12834-1.6.
    yinxusen authored and jkbradley committed Jan 27, 2016
    Configuration menu
    Copy the full SHA
    17d1071 View commit details
    Browse the repository at this point in the history
  2. [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata w…

    …ith `None` triggers cryptic failure
    
    The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
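
    A minimal sketch of the previously failing case (hypothetical field names):

    ```python
    from pyspark.sql.types import StructType, StructField, IntegerType

    # Metadata containing None used to surface the cryptic Tuple2 error described above.
    schema = StructType([
        StructField("id", IntegerType(), True, metadata={"a": None}),
    ])
    df = sqlContext.createDataFrame([(1,)], schema)
    ```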
    
    Author: Jason Lee <cjlee@us.ibm.com>
    
    Closes #8969 from jasoncl/SPARK-10847.
    
    (cherry picked from commit edd4737)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    jasoncl authored and yhuai committed Jan 27, 2016
    Configuration menu
    Copy the full SHA
    96e32db View commit details
    Browse the repository at this point in the history

Commits on Jan 29, 2016

  1. [SPARK-13082][PYSPARK] Backport the fix of 'read.json(rdd)' in #10559

    …to branch-1.6
    
    SPARK-13082 was actually fixed by #10559. However, that is a big PR and was not backported to 1.6. This PR backports only the 'read.json(rdd)' fix to branch-1.6.
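
    A minimal sketch of the restored usage (hypothetical records):

    ```python
    # Read JSON records from an RDD of strings.
    rdd = sc.parallelize(['{"a": 1}', '{"a": 2}'])
    df = sqlContext.read.json(rdd)
    df.show()
    ```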
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #10988 from zsxwing/json-rdd.
    zsxwing committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    84dab72 View commit details
    Browse the repository at this point in the history

Commits on Jan 30, 2016

  1. [SPARK-13088] Fix DAG viz in latest version of chrome

    Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: andrewor14/dagre-d3@7d6c000, which is taken from the fix in the main repo: dagrejs/dagre-d3@1ef067f
    
    Upstream issue: dagrejs/dagre-d3#202
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #10986 from andrewor14/fix-dag-viz.
    
    (cherry picked from commit 70e69fc)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    bb01cbe View commit details
    Browse the repository at this point in the history

Commits on Feb 1, 2016

  1. [SPARK-12231][SQL] create a combineFilters' projection when we call b…

    …uildPartitionedTableScan
    
    Hello Michael & All:
    
    We had some issues submitting the new code in the other PR (#10299), so we closed that PR and opened this one with the fix.
    
    The reason for the previous failure is that the projection for the scan, when there is a filter that is not pushed down (the "left-over" filter), could be different, in elements or ordering, from the original projection.
    
    With the new code, the approach to solving this problem is:
    
    Insert a new Project if the "left-over" filter is nonempty and (the original projection is not empty and the projection for the scan has more than one element, which could otherwise cause a different ordering in the projection).
    
    We created three test cases to cover the previously failing cases.
    
    Author: Kevin Yu <qyu@us.ibm.com>
    
    Closes #10388 from kevinyu98/spark-12231.
    
    (cherry picked from commit fd50df4)
    Signed-off-by: Cheng Lian <lian@databricks.com>
    kevinyu98 authored and liancheng committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    ddb9633 View commit details
    Browse the repository at this point in the history
  2. [SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions

    JIRA: https://issues.apache.org/jira/browse/SPARK-12989
    
    In the rule `ExtractWindowExpressions`, we simply replace an alias by the corresponding attribute. However, this causes an issue exposed by the following case:
    
    ```scala
    val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
      .withColumn("Data", struct("A", "B", "C"))
      .drop("A")
      .drop("B")
      .drop("C")
    
    val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
    data.select($"*", max("num").over(winSpec) as "max").explain(true)
    ```
    In this case, both `Data.A` and `Data.B` are aliases in `WindowSpecDefinition`. If we replace these alias expressions by their alias names, we can no longer resolve them, since they are not added to `missingExpr` either.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #10963 from gatorsmile/seletStarAfterColDrop.
    
    (cherry picked from commit 33c8a49)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    gatorsmile authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    9a5b25d View commit details
    Browse the repository at this point in the history
  3. [DOCS] Fix the jar location of datanucleus in sql-programming-guid.md

    It seems to me that `lib` is better because the `datanucleus` jars are located in `lib` for release builds.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes #10901 from maropu/DocFix.
    
    (cherry picked from commit da9146c)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    maropu authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    215d5d8 View commit details
    Browse the repository at this point in the history
  4. [SPARK-11780][SQL] Add catalyst type aliases backwards compatibility

    Changed a target at branch-1.6 from #10635.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes #10915 from maropu/pr9935-v3.
    maropu authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    70fcbf6 View commit details
    Browse the repository at this point in the history

Commits on Feb 2, 2016

  1. [SPARK-13087][SQL] Fix group by function for sort based aggregation

    It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`.  The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11011 from marmbrus/groupByFunction.
    marmbrus authored and yhuai committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    bd8efba View commit details
    Browse the repository at this point in the history
  2. [SPARK-13094][SQL] Add encoders for seq/array of primitives

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11014 from marmbrus/seqEncoders.
    
    (cherry picked from commit 29d9218)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    99594b2 View commit details
    Browse the repository at this point in the history
  3. [SPARK-12780][ML][PYTHON][BACKPORT] Inconsistency returning value of …

    …ML python models' properties
    
    Backport of [SPARK-12780] for branch-1.6
    
    Original PR for master: #10724
    
    This fixes StringIndexerModel.labels in pyspark.
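
    A minimal sketch of the property this fixes (hypothetical DataFrame and column names):

    ```python
    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    model = indexer.fit(df)
    print(model.labels)  # the fitted labels, now returned consistently with Scala
    ```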
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #10950 from jkbradley/yinxusen-spark-12780-backport.
    yinxusen authored and jkbradley committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    9a3d1bd View commit details
    Browse the repository at this point in the history
  4. [SPARK-12629][SPARKR] Fixes for DataFrame saveAsTable method

    I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
    Please let me know what you think.
    Thanks!
    
    Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
    
    Closes #10580 from NarineK/sparkrSavaAsRable.
    
    (cherry picked from commit 8a88e12)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    NarineK authored and shivaram committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    53f518a View commit details
    Browse the repository at this point in the history
  5. [SPARK-13121][STREAMING] java mapWithState mishandles scala Option

    Java mapWithState with Function3 had a wrong conversion of Java `Optional` to Scala `Option`; the fixed code uses the same conversion as the mapWithState call that takes a Function4 as input. `Optional.fromNullable(v.get)` fails if `v` is `None`; it is better to use `JavaUtils.optionToOptional(v)` instead.
    
    Author: Gabriele Nizzoli <mail@nizzoli.net>
    
    Closes #11007 from gabrielenizzoli/branch-1.6.
    gabrielenizzoli authored and zsxwing committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    4c28b4c View commit details
    Browse the repository at this point in the history
  6. [SPARK-12711][ML] ML StopWordsRemover does not protect itself from co…

    …lumn name duplication
    
    Fixes the problem and verifies the fix with a test suite.
    Also adds an optional `nullable` (Boolean) parameter to `SchemaUtils.appendColumn`
    and deduplicates the `SchemaUtils.appendColumn` functions.
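
    A sketch of the guarded case (hypothetical column names, not taken from the patch's test suite):

    ```python
    from pyspark.ml.feature import StopWordsRemover

    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    # If the input DataFrame already has a "filtered" column, schema validation
    # should now fail fast instead of producing a duplicated column name.
    result = remover.transform(df)
    ```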
    
    Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
    
    Closes #10741 from grzegorz-chilkiewicz/master.
    
    (cherry picked from commit b1835d7)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    grzegorz-chilkiewicz authored and jkbradley committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    9c0cf22 View commit details
    Browse the repository at this point in the history
  7. [SPARK-13056][SQL] map column would throw NPE if value is null

    Jira:
    https://issues.apache.org/jira/browse/SPARK-13056
    
    Create a map like
    { "a": "somestring", "b": null}
    Query like
    SELECT col["b"] FROM t1;
    NPE would be thrown.
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes #10964 from adrian-wang/npewriter.
    
    (cherry picked from commit 358300c)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    
    Conflicts:
    	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
    adrian-wang authored and marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    3c92333 View commit details
    Browse the repository at this point in the history
  8. [DOCS] Update StructType.scala

    The example will throw an error like
    `<console>:20: error: not found: value StructType`
    
    Need to add this line:
    `import org.apache.spark.sql.types._`
    
    Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com>
    
    Closes #10141 from swkimme/patch-1.
    
    (cherry picked from commit b377b03)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    swkimme authored and marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    e81333b View commit details
    Browse the repository at this point in the history

Commits on Feb 3, 2016

  1. [SPARK-13122] Fix race condition in MemoryStore.unrollSafely()

    https://issues.apache.org/jira/browse/SPARK-13122
    
    A race condition can occur in MemoryStore's unrollSafely() method if two threads that
    return the same value for currentTaskAttemptId() execute this method concurrently. This
    change makes the operation of reading the initial amount of unroll memory used, performing
    the unroll, and updating the associated memory maps atomic in order to avoid this race
    condition.
    
    The initially proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be to introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID.
    
    Author: Adam Budde <budde@amazon.com>
    
    Closes #11012 from budde/master.
    
    (cherry picked from commit ff71261)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/storage/MemoryStore.scala
    Adam Budde authored and Andrew Or committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    2f8abb4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-12739][STREAMING] Details of batch in Streaming tab uses two D…

    …uration columns
    
    I have prefixed the two 'Duration' columns in the 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration'.
    
    Author: Mario Briggs <mario.briggs@in.ibm.com>
    Author: mariobriggs <mariobriggs@in.ibm.com>
    
    Closes #11022 from mariobriggs/spark-12739.
    
    (cherry picked from commit e9eb248)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    mariobriggs authored and zsxwing committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    5fe8796 View commit details
    Browse the repository at this point in the history

Commits on Feb 4, 2016

  1. [SPARK-13101][SQL][BRANCH-1.6] nullability of array type element shou…

    …ld not fail analysis of encoder
    
    Nullability should only be considered an optimization rather than part of the type system, so instead of failing analysis for mismatched nullability, we should pass analysis and add a runtime null check.
    
    backport #11035 to 1.6
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11042 from cloud-fan/branch-1.6.
    cloud-fan authored and marmbrus committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    cdfb2a1 View commit details
    Browse the repository at this point in the history
  2. [ML][DOC] fix wrong api link in ml onevsrest

    minor fix for api link in ml onevsrest
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11068 from hhbyyh/onevsrestDoc.
    
    (cherry picked from commit c2c956b)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    hhbyyh authored and mengxr committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    2f390d3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13195][STREAMING] Fix NoSuchElementException when a state is n…

    …ot set but timeoutThreshold is defined
    
    Check whether the state exists before calling `get`.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11081 from zsxwing/SPARK-13195.
    
    (cherry picked from commit 8e2f296)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    a907c7c View commit details
    Browse the repository at this point in the history

Commits on Feb 5, 2016

  1. [SPARK-13214][DOCS] update dynamicAllocation documentation

    Author: Bill Chambers <bill@databricks.com>
    
    Closes #11094 from anabranch/dynamic-docs.
    
    (cherry picked from commit 66e1383)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Bill Chambers authored and Andrew Or committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    3ca5dc3 View commit details
    Browse the repository at this point in the history

Commits on Feb 8, 2016

  1. [SPARK-13210][SQL] catch OOM when allocate memory and expand array

    There is a bug when we try to grow the buffer: an OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, which triggers spilling that frees the current page, so the record we just inserted becomes invalid.
    
    The root cause is that the JVM has less free memory than the MemoryManager thought, so it OOMs when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.
    
    Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11095 from davies/fix_expand.
    Davies Liu authored and JoshRosen committed Feb 8, 2016
    Configuration menu
    Copy the full SHA
    9b30096 View commit details
    Browse the repository at this point in the history

Commits on Feb 9, 2016

  1. [SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clus…

    …ters with Jackson 2.2.3
    
    Patch to
    
    1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
    2. Use maven antrun to verify the JAR has the renamed classes
    
    Being Maven-based, I don't know if the verification phase kicks in on an SBT/jenkins build. It will on a `mvn install`
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.
    
    (cherry picked from commit 34d0b70)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    steveloughran authored and Marcelo Vanzin committed Feb 9, 2016
    Configuration menu
    Copy the full SHA
    82fa864 View commit details
    Browse the repository at this point in the history

Commits on Feb 10, 2016

  1. [SPARK-10524][ML] Use the soft prediction to order categories' bins

    JIRA: https://issues.apache.org/jira/browse/SPARK-10524
    
    Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #8734 from viirya/dt-soft-centroids.
    
    (cherry picked from commit 9267bc6)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    viirya authored and jkbradley committed Feb 10, 2016
    Configuration menu
    Copy the full SHA
    89818cb View commit details
    Browse the repository at this point in the history
  2. [SPARK-12921] Fix another non-reflective TaskAttemptContext access in…

    … SpecificParquetRecordReaderBase
    
    This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11131 from JoshRosen/SPARK-12921-take-2.
    JoshRosen committed Feb 10, 2016
    Configuration menu
    Copy the full SHA
    93f1d91 View commit details
    Browse the repository at this point in the history

Commits on Feb 11, 2016

  1. [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API

    Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
    
    Author: raela <raela@databricks.com>
    
    Closes #11158 from raelawang/master.
    
    (cherry picked from commit 719973b)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    raelawang authored and rxin committed Feb 11, 2016
    Configuration menu
    Copy the full SHA
    b57fac5 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13265][ML] Refactoring of basic ML import/export for other fil…

    …e system besides HDFS
    
    jkbradley I tried to improve the function that exports a model. When I tried to export a model to S3 under Spark 1.6, it wasn't possible, so the export should support S3 besides HDFS. Can you review it when you have time? Thanks!
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #11151 from yu-iskw/SPARK-13265.
    
    (cherry picked from commit efb65e0)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    yu-iskw authored and mengxr committed Feb 11, 2016
    Configuration menu
    Copy the full SHA
    91a5ca5 View commit details
    Browse the repository at this point in the history

Commits on Feb 12, 2016

  1. [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw a…

    …n error
    
    Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
    
    In Python:
    ```python
    from pyspark.ml.classification import NaiveBayes
    nb = NaiveBayes()
    print nb.hasParam("smoothing")
    print nb.hasParam("notAParam")
    ```
    produces:
    > True
    > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
    
    However, in Scala:
    ```scala
    import org.apache.spark.ml.classification.NaiveBayes
    val nb  = new NaiveBayes()
    nb.hasParam("smoothing")
    nb.hasParam("notAParam")
    ```
    produces:
    > true
    > false
    
    cc holdenk
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #10962 from sethah/SPARK-13047.
    
    (cherry picked from commit b354673)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    sethah authored and mengxr committed Feb 12, 2016
    Configuration menu
    Copy the full SHA
    9d45ec4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13153][PYSPARK] ML persistence failed when handle no default v…

    …alue parameter
    
    Fix this defect by checking whether a default value exists.
    
    yanboliang Please help to review.
    
    Author: Tommy YU <tummyyu@163.com>
    
    Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
    
    (cherry picked from commit d3e2e20)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    Wenpei authored and mengxr committed Feb 12, 2016
    Configuration menu
    Copy the full SHA
    18661a2 View commit details
    Browse the repository at this point in the history

Commits on Feb 13, 2016

  1. [SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft…

    … Windows
    
    Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.
    
    Is it worth considering also including this fix in any future 1.5.x releases (if any)?
    
    I confirm this is my own original work and license it to the Spark project under its open source license.
    
    Author: markpavey <mark.pavey@thefilter.com>
    
    Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
    
    (cherry picked from commit 374c4b2)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    markpavey authored and srowen committed Feb 13, 2016
    Configuration menu
    Copy the full SHA
    93a55f3 View commit details
    Browse the repository at this point in the history
  2. [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering f…

    …ailed test
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12363
    
    This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph do not hold the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #10539 from viirya/fix-poweriter.
    
    (cherry picked from commit e3441e3)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    viirya authored and mengxr committed Feb 13, 2016
    Configuration menu
    Copy the full SHA
    107290c View commit details
    Browse the repository at this point in the history

Commits on Feb 14, 2016

  1. [SPARK-13300][DOCUMENTATION] Added pygments.rb dependancy

    Looks like the pygments.rb gem is also required for the Jekyll build to work. At least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps.
    
    Author: Amit Dev <amitdev@gmail.com>
    
    Closes #11180 from amitdev/master.
    
    (cherry picked from commit 331293c)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    amitdev authored and srowen committed Feb 14, 2016
    Configuration menu
    Copy the full SHA
    ec40c5a View commit details
    Browse the repository at this point in the history

Commits on Feb 15, 2016

  1. [SPARK-13312][MLLIB] Update java train-validation-split example in ml…

    …-guide
    
    Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.
    
    This contribution is my original work and I license the work to this project.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes #11199 from JeremyNixon/update_train_val_split_example.
    
    (cherry picked from commit adb5483)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    JeremyNixon authored and srowen committed Feb 15, 2016
    Configuration menu
    Copy the full SHA
    71f53ed View commit details
    Browse the repository at this point in the history

Commits on Feb 16, 2016

  1. Correct SparseVector.parse documentation

    There's a small typo in the SparseVector.parse docstring (which says that it returns a DenseVector rather than a SparseVector), which seems to be incorrect.
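
    A quick check of the corrected behaviour (the string format follows the method's own doctest):

    ```python
    from pyspark.mllib.linalg import SparseVector

    v = SparseVector.parse("(4, [0, 3], [1.0, 5.5])")
    print(type(v).__name__)  # SparseVector, as the corrected docstring states
    ```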
    
    Author: Miles Yucht <miles@databricks.com>
    
    Closes #11213 from mgyucht/fix-sparsevector-docs.
    
    (cherry picked from commit 827ed1c)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    mgyucht authored and srowen committed Feb 16, 2016
    Configuration menu
    Copy the full SHA
    d950891 View commit details
    Browse the repository at this point in the history

Commits on Feb 17, 2016

  1. [SPARK-13279] Remove O(n^2) operation from scheduler.

    This commit removes an unnecessary duplicate check in addPendingTask that meant
    that scheduling a task set took time proportional to (# tasks)^2.
    
    Author: Sital Kedia <skedia@fb.com>
    
    Closes #11175 from sitalkedia/fix_stuck_driver.
    
    (cherry picked from commit 1e1e31e)
    Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
    Sital Kedia authored and kayousterhout committed Feb 17, 2016
    Configuration menu
    Copy the full SHA
    98354ca View commit details
    Browse the repository at this point in the history
  2. [SPARK-13350][DOCS] Config doc updated to state that PYSPARK_PYTHON's…

    … default is "python2.7"
    
    Author: Christopher C. Aycock <chris@chrisaycock.com>
    
    Closes #11239 from chrisaycock/master.
    
    (cherry picked from commit a7c74d7)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    chrisaycock authored and JoshRosen committed Feb 17, 2016
    Configuration menu
    Copy the full SHA
    66106a6 View commit details
    Browse the repository at this point in the history

Commits on Feb 18, 2016

  1. [SPARK-13371][CORE][STRING] TaskSetManager.dequeueSpeculativeTask com…

    …pares Option and String directly.
    
    ## What changes were proposed in this pull request?
    
    Fix some comparisons between unequal types that cause IJ warnings and in at least one case a likely bug (TaskSetManager)
    
    ## How was this patch tested?
    
    Running Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11253 from srowen/SPARK-13371.
    
    (cherry picked from commit 7856253)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    srowen authored and Andrew Or committed Feb 18, 2016
    Configuration menu
    Copy the full SHA
    16f35c4 View commit details
    Browse the repository at this point in the history

Commits on Feb 22, 2016

  1. [SPARK-12546][SQL] Change default number of open parquet files

    A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs.  The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms.  As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11308 from marmbrus/parquetWriteOOM.
    
    (cherry picked from commit 173aa94)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    marmbrus committed Feb 22, 2016
    Configuration menu
    Copy the full SHA
    699644c View commit details
    Browse the repository at this point in the history

Commits on Feb 23, 2016

  1. [SPARK-13298][CORE][UI] Escape "label" to avoid DAG being broken by s…

    …ome special character
    
    ## What changes were proposed in this pull request?
    
    When there are some special characters (e.g., `"`, `\`) in `label`, DAG will be broken. This patch just escapes `label` to avoid DAG being broken by some special characters
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11309 from zsxwing/SPARK-13298.
    
    (cherry picked from commit a11b399)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    zsxwing authored and Andrew Or committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    85e6a22 View commit details
    Browse the repository at this point in the history
  2. [SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec

    In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value is an instance of `SessionState`. See the exception below.
    
    spark-sql> !echo "test";
    Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
    	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
    	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
    	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
    	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:606)
    	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
    	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes #9589 from adrian-wang/clicommand.
    
    (cherry picked from commit 5d80fac)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    
    Conflicts:
    	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
    adrian-wang authored and marmbrus committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    f7898f9 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    40d11d0 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    152252f View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    2902798 View commit details
    Browse the repository at this point in the history
  6. [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, …

    …false) fix for branch-1.6
    
    https://issues.apache.org/jira/browse/SPARK-13359
    
    Author: Earthson Lu <Earthson.Lu@gmail.com>
    
    Closes #11237 from Earthson/SPARK-13359.
    Earthson authored and mengxr committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    d31854d View commit details
    Browse the repository at this point in the history
  7. [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply

    `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #11226 from mengxr/SPARK-13355.
    
    (cherry picked from commit 764ca18)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    mengxr committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    0784e02 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

    ## What changes were proposed in this pull request?
    
    This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
    
    This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.
    
    ```
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql import types
    
    schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
    
    a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
    b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
    
    c = a.unionAll(b)
    ```
    
    ## How was this patch tested?
    
    Tested using two unit tests in sql/test.py and the DataFrameSuite.
    
    Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
    
    rxin
    
    Author: Franklyn D'souza <franklynd@gmail.com>
    
    Closes #11333 from damnMeddlingKid/udt-union-patch.
    damnMeddlingKid authored and rxin committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    573a2c9 View commit details
    Browse the repository at this point in the history

Commits on Feb 24, 2016

  1. [SPARK-13390][SQL][BRANCH-1.6] Fix the issue that Iterator.map().toSe…

    …q is not Serializable
    
    ## What changes were proposed in this pull request?
    
    `scala.collection.Iterator`'s methods (e.g., map, filter) will return an `AbstractIterator` which is not Serializable. E.g.,
    ```Scala
    scala> val iter = Array(1, 2, 3).iterator.map(_ + 1)
    iter: Iterator[Int] = non-empty iterator
    
    scala> println(iter.isInstanceOf[Serializable])
    false
    ```
    If we call something like `Iterator.map(...).toSeq`, it will create a `Stream` that contains a non-serializable `AbstractIterator` field and make the `Stream` be non-serializable.
    
    This PR uses `toArray` instead of `toSeq` to fix such issue in `def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11334 from zsxwing/SPARK-13390.
    zsxwing authored and srowen committed Feb 24, 2016
    Configuration menu
    Copy the full SHA
    06f4fce View commit details
    Browse the repository at this point in the history
  2. [SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in …

    …PR builder even if a PR only changes sql/core
    
    ## What changes were proposed in this pull request?
    
    `HiveCompatibilitySuite` should still run in the PR builder even if a PR only changes sql/core. So, I am going to remove the `ExtendedHiveTest` annotation from `HiveCompatibilitySuite`.
    
    https://issues.apache.org/jira/browse/SPARK-13475
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #11351 from yhuai/SPARK-13475.
    
    (cherry picked from commit bc35380)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    yhuai committed Feb 24, 2016
    Configuration menu
    Copy the full SHA
    fe71cab View commit details
    Browse the repository at this point in the history

Commits on Feb 25, 2016

  1. [SPARK-13482][MINOR][CONFIGURATION] Make consistency of the configura…

    …iton named in TransportConf.
    
    `spark.storage.memoryMapThreshold` has two kinds of values: one is 2*1024*1024 as an integer and the other is '2m' as a string.
    "2m" is recommended in the documentation, but it goes wrong if the code path reaches `TransportConf#memoryMapBytes`.
    
    [Jira](https://issues.apache.org/jira/browse/SPARK-13482)
    
    Author: huangzhaowei <carlmartinmax@gmail.com>
    
    Closes #11360 from SaintBacchus/SPARK-13482.
    
    (cherry picked from commit 264533b)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    SaintBacchus authored and rxin committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    8975996 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13473][SQL] Don't push predicate through project with nondeter…

    …ministic field(s)
    
    ## What changes were proposed in this pull request?
    
    Predicates shouldn't be pushed through project with nondeterministic field(s).
    
    See graphframes/graphframes#23 and SPARK-13473 for more details.
    
    This PR targets master, branch-1.6, and branch-1.5.
    
    ## How was this patch tested?
    
    A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. The optimized query plan shouldn't change in this case.
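
    A sketch of the plan shape in question (hypothetical column names; the actual test is written against the Catalyst plans directly):

    ```python
    from pyspark.sql import functions as F

    projected = df.select(F.col("id"), F.rand().alias("r"))  # nondeterministic output field
    filtered = projected.filter(F.col("r") > 0.5)            # must stay above the Project
    ```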
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.
    
    (cherry picked from commit 3fa6491)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    liancheng authored and cloud-fan committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    3cc938a View commit details
    Browse the repository at this point in the history
  3. [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large …

    …DataFrames
    
    Change line 113 of QuantileDiscretizer.scala to
    
    `val requiredSamples = math.max(numBins * numBins, 10000.0)`
    
    so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
    
    Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.
    
    Author: Oliver Pierson <ocp@gatech.edu>
    Author: Oliver Pierson <opierson@umd.edu>
    
    Closes #11319 from oliverpierson/SPARK-13444.
    
    (cherry picked from commit 6f8e835)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Oliver Pierson authored and srowen committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    cb869a1 View commit details
    Browse the repository at this point in the history
  4. [SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method

    ## What changes were proposed in this pull request?
    
    Instead of using the result of File.listFiles() directly, which may be null and lead to an NPE, check for null first. If it is null, log a warning instead.
    
    ## How was this patch tested?
    
    Ran the ./dev/run-tests locally
    Tested manually on a cluster
    
    Author: Terence Yim <terence@cask.co>
    
    Closes #11337 from chtyim/fixes/SPARK-13441-null-check.
    
    (cherry picked from commit fae88af)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    chtyim authored and srowen committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    1f03163 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated

    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes #11311 from mgummelt/document_csv.
    
    (cherry picked from commit c98a93d)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Michael Gummelt authored and srowen committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    e3802a7 View commit details
    Browse the repository at this point in the history
  6. [SPARK-12316] Wait a minutes to avoid cycle calling.

    When the application ends, the AM cleans the staging dir.
    But if the driver then triggers a delegation token update, it cannot find the right token file and ends up calling the method 'updateCredentialsIfRequired' in an endless cycle.
    This leads to a driver StackOverflowError.
    https://issues.apache.org/jira/browse/SPARK-12316
    
    Author: huangzhaowei <carlmartinmax@gmail.com>
    
    Closes #10475 from SaintBacchus/SPARK-12316.
    
    (cherry picked from commit 5fcf4c2)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    SaintBacchus authored and Tom Graves committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    5f7440b View commit details
    Browse the repository at this point in the history
  7. Revert "[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits o…

    …n large DataFrames"
    
    This reverts commit cb869a1.
    mengxr committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    d59a08f View commit details
    Browse the repository at this point in the history
  8. [SPARK-12874][ML] ML StringIndexer does not protect itself from colum…

    …n name duplication
    
    ## What changes were proposed in this pull request?
    ML StringIndexer does not protect itself from column name duplication.
    
    We should still improve the way we validate the schema of `StringIndexer` and `StringIndexerModel`; however, that is better addressed in another issue.
    
    ## How was this patch tested?
    unit test
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #11370 from yu-iskw/SPARK-12874.
    
    (cherry picked from commit 14e2700)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    yu-iskw authored and mengxr committed Feb 25, 2016
    Configuration menu
    Copy the full SHA
    abe8f99 View commit details
    Browse the repository at this point in the history

Commits on Feb 26, 2016

  1. [SPARK-13454][SQL] Allow users to drop a table with a name starting w…

    …ith an underscore.
    
    ## What changes were proposed in this pull request?
    
    This change adds a workaround to allow users to drop a table with a name starting with an underscore. Without this patch, we can create such a table, but we cannot drop it. The reason is that Hive's parser unquotes a quoted identifier (see https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g#L453). So, when we issue a drop table command to Hive, a table name starting with an underscore is actually not quoted. Then, Hive complains because it does not support a table name starting with an underscore without using backticks (underscores are allowed as long as they are not the first character, though).
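
    A sketch of the case covered by the new test (hypothetical table name):

    ```python
    sqlContext.sql("CREATE TABLE `_tmp_events` (id INT)")
    sqlContext.sql("DROP TABLE `_tmp_events`")  # previously failed because Hive saw the name unquoted
    ```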
    
    ## How was this patch tested?
    
    Add a test to make sure we can drop a table with a name starting with an underscore.
    
    https://issues.apache.org/jira/browse/SPARK-13454
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #11349 from yhuai/fixDropTable.
    yhuai committed Feb 26, 2016
    Configuration menu
    Copy the full SHA
    a57f87e View commit details
    Browse the repository at this point in the history

Commits on Feb 27, 2016

  1. [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifac…

    …ts to home.apache.org
    
    Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11350 from JoshRosen/update-release-scripts-for-apache-home.
    
    (cherry picked from commit f77dc4e)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    JoshRosen committed Feb 27, 2016
    Configuration menu
    Copy the full SHA
    8a43c3b View commit details
    Browse the repository at this point in the history
  2. Update CHANGES.txt and spark-ec2 and R package versions for 1.6.1

    This patch updates a few more 1.6.0 version numbers for the 1.6.1 release candidate.
    
    Verified this by running
    
    ```
    git grep "1\.6\.0" | grep -v since | grep -v deprecated | grep -v Since | grep -v versionadded | grep 1.6.0
    ```
    
    and inspecting the output.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11407 from JoshRosen/version-number-updates.
    JoshRosen committed Feb 27, 2016
    Configuration menu
    Copy the full SHA
    eb6f6da View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    15de51c View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    dcf60d7 View commit details
    Browse the repository at this point in the history

Commits on Feb 29, 2016

  1. [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map…

    … string datatypes to Oracle VARCHAR datatype
    
    ## What changes were proposed in this pull request?
    
    This pull request is for the SPARK-12941 fix, creating a data type mapping to Oracle for the DataFrame data type "StringType". This PR is for the master branch fix, whereas another PR has already been tested against branch 1.4.
    
    ## How was this patch tested?
    
    This patch was tested using the Oracle docker image; a new integration suite was created for it. The oracle.jdbc jar was supposed to be downloaded from the Maven repository, but since no JDBC jar was available there, the jar was downloaded manually from the Oracle site, installed locally, and the tests were run against it. So, for a SparkQA test run, the ojdbc jar might need to be placed manually in the local Maven repository (com/oracle/ojdbc6/11.2.0.2.0).
    
    Author: thomastechs <thomas.sebastian@tcs.com>
    
    Closes #11306 from thomastechs/master.
    
    (cherry picked from commit 8afe491)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    thomastechs authored and yhuai committed Feb 29, 2016
    Configuration menu
    Copy the full SHA
    fedb813 View commit details
    Browse the repository at this point in the history

Commits on Mar 3, 2016

  1. [SPARK-13465] Add a task failure listener to TaskContext

    ## What changes were proposed in this pull request?
    
    TaskContext supports task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails.
    
    ## How was this patch tested?
    
    New unit test case and integration test case covering the code path
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11478 from davies/add_failure_1.6.
    Davies Liu authored and davies committed Mar 3, 2016
    Configuration menu
    Copy the full SHA
    1ce2c12 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13601] call failure callbacks before writer.close()

    In order to tell the OutputStream whether the task has failed or not, we should call the failure callbacks BEFORE calling writer.close().
    
    Added new unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11450 from davies/callback.
    davies committed Mar 3, 2016
    Configuration menu
    Copy the full SHA
    fa86dc4 View commit details
    Browse the repository at this point in the history

Commits on Mar 4, 2016

  1. [SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions

    Fix race conditions when cleanup files.
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11507 from davies/flaky.
    
    (cherry picked from commit d062587)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    
    Conflicts:
    	sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala
    Davies Liu authored and davies committed Mar 4, 2016
    Configuration menu
    Copy the full SHA
    b3a5129 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recy…

    …cled
    
    ## What changes were proposed in this pull request?
    
    `sendRpcSync` should copy the response content because the underlying buffer will be recycled and reused.
    
    ## How was this patch tested?
    
    Jenkins unit tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11499 from zsxwing/SPARK-13652.
    
    (cherry picked from commit 465c665)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Mar 4, 2016
    Configuration menu
    Copy the full SHA
    51c676e View commit details
    Browse the repository at this point in the history
  3. [SPARK-11515][ML] QuantileDiscretizer should take random seed

    cc jkbradley
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #9535 from yu-iskw/SPARK-11515.
    
    (cherry picked from commit 574571c)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    yu-iskw authored and srowen committed Mar 4, 2016
    Configuration menu
    Copy the full SHA
    5a27129 View commit details
    Browse the repository at this point in the history
  4. [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map…

    … string datatypes to Oracle VARCHAR datatype mapping
    
    A test suite was added for the bug fix SPARK-12941, covering the mapping of StringType to the corresponding Oracle type.
    
    Manual tests were done.
    
    Author: thomastechs <thomas.sebastian@tcs.com>
    Author: THOMAS SEBASTIAN <thomas.sebastian@tcs.com>
    
    Closes #11489 from thomastechs/thomastechs-12941-master-new.
    
    (cherry picked from commit f6ac7c3)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    
    Conflicts:
    	sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
    thomastechs authored and yhuai committed Mar 4, 2016
    Configuration menu
    Copy the full SHA
    528e373 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large …

    …DataFrames
    
    ## What changes were proposed in this pull request?
    
    Change line 113 of QuantileDiscretizer.scala to
    
    `val requiredSamples = math.max(numBins * numBins, 10000.0)`
    
    so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
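
    The arithmetic pitfall, illustrated with plain integer division (hypothetical numbers, not the Scala code itself):

    ```python
    num_bins = 10
    count = 1000000                                        # dataset.count()
    required_samples = max(num_bins * num_bins, 10000)     # an Int, as before the fix
    print(required_samples // count)                       # 0 -> sampling fraction of zero
    required_samples = max(num_bins * num_bins, 10000.0)   # a Double, as in the fix
    print(required_samples / count)                        # 0.01
    ```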
    
    ## How was this patch tested?
    Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.
    
    Author: Oliver Pierson <ocp@gatech.edu>
    Author: Oliver Pierson <opierson@umd.edu>
    
    Closes #11319 from oliverpierson/SPARK-13444.
    Oliver Pierson authored and mengxr committed Mar 4, 2016
    Configuration menu
    Copy the full SHA
    f0cc511 View commit details
    Browse the repository at this point in the history
  6. Commit ffaf7c0

Commits on Mar 6, 2016

  1. [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads
    
    ## What changes were proposed in this pull request?
    
    Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
    
    ## How was this patch tested?
    
    Manually test in the shell.
    
    Before this patch:
    ```
    >>> from pyspark.streaming import StreamingContext
    >>> from pyspark.streaming.util import TransformFunction
    >>> ssc = StreamingContext(sc, 1)
    >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
    >>> func.rdd_wrapper(lambda x: x)
    TransformFunction(<function <lambda> at 0x106ac8b18>)
    >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
    >>> func2 = ssc._transformerSerializer.loads(bytes)
    >>> print(func2.func.__module__)
    None
    >>> print(func2.rdd_wrap_func.__module__)
    None
    >>>
    ```
    After this patch:
    ```
    >>> from pyspark.streaming import StreamingContext
    >>> from pyspark.streaming.util import TransformFunction
    >>> ssc = StreamingContext(sc, 1)
    >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
    >>> func.rdd_wrapper(lambda x: x)
    TransformFunction(<function <lambda> at 0x108bf1b90>)
    >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
    >>> func2 = ssc._transformerSerializer.loads(bytes)
    >>> print(func2.func.__module__)
    __main__
    >>> print(func2.rdd_wrap_func.__module__)
    __main__
    >>>
    ```
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11535 from zsxwing/loads-module.
    
    (cherry picked from commit ee913e6)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    zsxwing authored and davies committed Mar 6, 2016
    Commit 704a54c

Commits on Mar 7, 2016

  1. [SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrectly refers to StatefulNetworkWordCount
    
    ## What changes were proposed in this pull request?
    The reference to StatefulNetworkWordCount.scala should be removed from the updateStateByKey documentation until there is an example for updateStateByKey.
    
    ## How was this patch tested?
    Have tested the new documentation with jekyll build.
    
    Author: rmishra <rmishra@pivotal.io>
    
    Closes #11545 from rishitesh/SPARK-13705.
    
    (cherry picked from commit 4b13896)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    rmishra authored and srowen committed Mar 7, 2016
    Commit 18ef2f2
  2. [SPARK-13599][BUILD] remove transitive groovy dependencies from spark-hive and spark-hiveserver (branch 1.6)
    
    ## What changes were proposed in this pull request?
    
    This is just the patch of #11449 cherry picked to branch-1.6; the enforcer and dep/ diffs are cut
    
    Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.
    
    This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
    
    ## How was this patch tested?
    
    1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
    1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
    1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
    1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
    1. Patch applied
    1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
    1. Examined created spark-assembly, verified no org.codehaus packages
    1. Verified that the maven dependency tree no longer references groovy
    
    The `master` version updates the dependency files and an enforcer rule to keep groovy out; this patch strips it out.
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes #11473 from steveloughran/fixes/SPARK-13599-groovy+branch-1.6.
    steveloughran authored and srowen committed Mar 7, 2016
    Commit 2434f16
  3. [MINOR][DOC] improve the doc for "spark.memory.offHeap.size"

    The description of "spark.memory.offHeap.size" in the current documentation does not clearly state that the memory is counted in bytes.
    
    This PR contains a small fix for this tiny issue.
    
    Documentation fix only.
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11561 from CodingCat/master.
    
    (cherry picked from commit a3ec50a)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    CodingCat authored and zsxwing committed Mar 7, 2016
    Commit cf4e62e
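    As a quick illustration of what "counted in bytes" means in practice (a sketch using the standard SparkConf API; the 2 GB figure is arbitrary):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // spark.memory.offHeap.size expects a plain byte count, so 2 GB must be spelled out.
    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString) // 2147483648 bytes
    ```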
  4. [SPARK-13648] Add Hive Cli to classes for isolated classloader

    ## What changes were proposed in this pull request?
    
    Adding the hive-cli classes to the classloader
    
    ## How was this patch tested?
    
    The Hive VersionsSuite tests were run.
    
    This is my original work and I license the work to the project under the project's open source license.
    
    Author: Tim Preece <tim.preece.in.oz@gmail.com>
    
    Closes #11495 from preecet/master.
    
    (cherry picked from commit 46f25c2)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    preecet authored and marmbrus committed Mar 7, 2016
    Commit 695c8a2

Commits on Mar 8, 2016

  1. [SPARK-13711][CORE] Don't call SparkUncaughtExceptionHandler in AppClient as it's in driver
    
    ## What changes were proposed in this pull request?
    
    AppClient runs on the driver side. It should not call `Utils.tryOrExit`, because that sends the exception to SparkUncaughtExceptionHandler and calls `System.exit`. This PR just removed `Utils.tryOrExit`.
    
    ## How was this patch tested?
    
    manual tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11566 from zsxwing/SPARK-13711.
    zsxwing committed Mar 8, 2016
    Commit bace137
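    A generic sketch of the two behaviours being contrasted above; the helper names are illustrative and do not mirror Spark's internal `Utils` API:
    
    ```scala
    import scala.util.control.NonFatal
    
    // "Try or exit" style: route any exception to the process-wide uncaught
    // exception handler, which is expected to terminate the JVM.
    def tryOrExit(block: => Unit): Unit =
      try block catch {
        case t: Throwable =>
          Option(Thread.getDefaultUncaughtExceptionHandler) match {
            case Some(handler) => handler.uncaughtException(Thread.currentThread(), t) // may call System.exit
            case None          => throw t
          }
      }
    
    // Driver-side friendly alternative: log the failure and keep the process alive.
    def tryLogNonFatal(block: => Unit): Unit =
      try block catch {
        case NonFatal(t) => System.err.println(s"Error logged, driver keeps running: $t")
      }
    ```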

Commits on Mar 9, 2016

  1. [SPARK-13755] Escape quotes in SQL plan visualization node labels

    When generating Graphviz DOT files in the SQL query visualization, we need to escape double-quotes inside node labels. This is a followup to #11309, which fixed a similar issue in Spark Core's DAG visualization.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11587 from JoshRosen/graphviz-escaping.
    
    (cherry picked from commit 81f54ac)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    JoshRosen committed Mar 9, 2016
    Commit 8ec4f15
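    A minimal sketch of the escaping being added, assuming nothing about the actual DOT-generation code beyond what the description says; `escapeDotLabel` is a made-up helper:
    
    ```scala
    // Escape embedded double quotes so the DOT parser does not end the label early.
    def escapeDotLabel(label: String): String =
      label.replace("\"", "\\\"") // " -> \"
    
    val plan = """Filter (name = "spark")"""
    val node = s"""  1 [label="${escapeDotLabel(plan)}"];"""
    
    println(node) // 1 [label="Filter (name = \"spark\")"];
    ```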
  2. [SPARK-13631][CORE] Thread-safe getLocationsWithLargestOutputs

    ## What changes were proposed in this pull request?
    
    If a job is being scheduled in one thread and depends on an RDD whose
    shuffle is currently executing in another thread, Spark could throw a
    NullPointerException. This patch synchronizes access to `mapStatuses` and
    skips null status entries (which correspond to in-progress shuffle tasks).
    
    ## How was this patch tested?
    
    Our client code unit test suite, which was reliably reproducing the race
    condition with 10 threads, shows that this fixes it. I have not found a minimal
    test case to add to Spark, but I will attempt to do so if desired.
    
    The same test case was tripping up on SPARK-4454, which was fixed by
    making other DAGScheduler code thread-safe.
    
    shivaram srowen
    
    Author: Andy Sloane <asloane@tetrationanalytics.com>
    
    Closes #11505 from a1k0n/SPARK-13631.
    
    (cherry picked from commit cbff280)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Andy Sloane authored and srowen committed Mar 9, 2016
    Commit 95105b0
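    A sketch of the reader-side pattern the fix describes, with made-up stand-ins for the shared state (these are not the real MapOutputTracker fields or the full scoring logic):
    
    ```scala
    import scala.collection.mutable
    
    final case class Status(location: String, bytes: Long)
    val mapStatuses = mutable.Map[Int, Array[Status]]() // shuffleId -> per-map-task statuses
    
    // Hold the lock while iterating, and skip null entries, which correspond to
    // map tasks whose shuffle output has not been registered yet.
    def locationsByOutputSize(shuffleId: Int): Seq[String] =
      mapStatuses.synchronized {
        mapStatuses.get(shuffleId) match {
          case Some(statuses) =>
            statuses.iterator
              .filter(_ != null) // in-progress tasks have no status yet
              .toSeq
              .groupBy(_.location)
              .mapValues(_.map(_.bytes).sum)
              .toSeq
              .sortBy(-_._2) // largest output first
              .map(_._1)
          case None => Seq.empty
        }
      }
    ```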
  3. [SPARK-13242] [SQL] codegen fallback in case-when if there many branches

    ## What changes were proposed in this pull request?
    
    If there are many branches in a CaseWhen expression, the generated code could go above the 64K bytecode limit for a single Java method and fail to compile. This PR changes it to fall back to interpreted mode if there are more than 20 branches.
    
    ## How was this patch tested?
    
    Add tests
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11606 from davies/fix_when_16.
    Davies Liu authored and davies committed Mar 9, 2016
    Commit bea91a9
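    For context, a many-branch CaseWhen is easy to build up from the public DataFrame API; the helper below only illustrates how such an expression arises, not the fallback itself (which is internal to codegen):
    
    ```scala
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{lit, when}
    
    // Build CASE WHEN ... with `branches` arms (branches must be >= 1).
    // With enough arms, the code generated for this single expression would
    // exceed the 64K bytecode limit of one Java method without the fallback.
    def bucketize(c: Column, branches: Int): Column =
      (0 until branches).foldLeft(null: Column) { (acc, i) =>
        val cond = c >= lit(i * 10) && c < lit((i + 1) * 10)
        if (acc == null) when(cond, lit(i)) else acc.when(cond, lit(i))
      }.otherwise(lit(-1))
    
    // e.g. df.withColumn("bucket", bucketize(col("value"), 50))
    ```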

Commits on Mar 10, 2016

  1. [SPARK-13760][SQL] Fix BigDecimal constructor for FloatType

    ## What changes were proposed in this pull request?
    
    A very minor change for using `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`.
    
    ## How was this patch tested?
    
    N/A
    
    cc yhuai
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11597 from sameeragarwal/bigdecimal.
    
    (cherry picked from commit 926e9c4)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    sameeragarwal authored and yhuai committed Mar 10, 2016
    Commit 8a1bd58
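    A small illustration of the difference (the exact digits printed depend on the float-to-double widening, so treat them as approximate):
    
    ```scala
    // 0.1f widened to Double is not exactly 0.1, so building the BigDecimal from
    // the widened Double keeps the widening error, while BigDecimal.decimal(0.1f)
    // parses the float's own, shorter decimal representation.
    val viaDouble  = BigDecimal(0.1f.toDouble) // picks up extra digits, e.g. 0.100000001...
    val viaDecimal = BigDecimal.decimal(0.1f)  // 0.1
    
    println(viaDouble)
    println(viaDecimal)
    ```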
  2. Commit 60cb270
  3. [SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1

    Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
    Supersedes #11524
    
    Jenkins tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11631 from srowen/SPARK-13663.
    
    (cherry picked from commit 927e22e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Mar 10, 2016
    Commit 07ace27

Commits on Mar 11, 2016

  1. [MINOR][DOC] Fix supported hive version in doc

    ## What changes were proposed in this pull request?
    
    Today, Spark 1.6.1 and the updated docs were released. Unfortunately, there is obsolete Hive version information in the docs: [Building Spark](http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support). This PR fixes the following two lines.
    ```
    -By default Spark will build with Hive 0.13.1 bindings.
    +By default Spark will build with Hive 1.2.1 bindings.
    -# Apache Hadoop 2.4.X with Hive 13 support
    +# Apache Hadoop 2.4.X with Hive 1.2.1 support
    ```
    The `sql/README.md` file also describes this.
    
    ## How was this patch tested?
    
    Manual.
    
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11639 from dongjoon-hyun/fix_doc_hive_version.
    
    (cherry picked from commit 88fa866)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    dongjoon-hyun authored and rxin committed Mar 11, 2016
    Commit 078c714
  2. [SPARK-13327][SPARKR] Added parameter validations for colnames<-

    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
    
    Closes #11220 from olarayej/SPARK-13312-3.
    
    (cherry picked from commit 416e71a)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    Oscar D. Lara Yejas authored and shivaram committed Mar 11, 2016
    Commit db4795a

Commits on Mar 13, 2016

  1. [SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Exceptions
    
    ## What changes were proposed in this pull request?
    Currently, when a java.net.BindException is thrown, it displays the following message:
    
    java.net.BindException: Address already in use: Service '$serviceName' failed after 16 retries!
    
    This change adds port configuration suggestions to the BindException, for example, for the UI, it now displays
    
    java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries! Consider explicitly setting the appropriate port for 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
    
    ## How was this patch tested?
    Manual tests
    
    Author: Bjorn Jonsson <bjornjon@gmail.com>
    
    Closes #11644 from bjornjon/master.
    
    (cherry picked from commit 515e4af)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    bjornjon authored and srowen committed Mar 13, 2016
    Commit 5e08db3
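    Following the suggestion in the new message, the relevant settings can be pinned up front; a small sketch with arbitrary values:
    
    ```scala
    import org.apache.spark.SparkConf
    
    val conf = new SparkConf()
      .setAppName("port-config-example")
      .set("spark.ui.port", "4050")       // explicit port for the 'SparkUI' service
      .set("spark.port.maxRetries", "32") // default is 16 retries
    ```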

Commits on Mar 14, 2016

  1. [SQL] fix typo in DataSourceRegister

    ## What changes were proposed in this pull request?
    fix typo in DataSourceRegister
    
    ## How was this patch tested?
    
    Found when going through the latest code.
    
    Author: Jacky Li <jacky.likun@huawei.com>
    
    Closes #11686 from jackylk/patch-12.
    
    (cherry picked from commit f3daa09)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    jackylk authored and rxin committed Mar 14, 2016
    Commit 3519ce9