
[SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely #14728

Closed
wants to merge 5 commits

Conversation

petermaxlee (Contributor) commented Aug 20, 2016

What changes were proposed in this pull request?

Before this change, FileStreamSource used an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-configurable option called "maxFileAge", defaulting to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that is used to track the files that have been processed.
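For illustration, enabling a custom retention from the user side would look roughly like this (a minimal sketch; the format, path, and "12h" value are placeholders, not part of the patch):

```scala
// Hypothetical usage sketch: tune how long FileStreamSource remembers seen files.
// The option accepts Spark time strings such as "100s", "12h", or "7d".
val stream = spark.readStream
  .format("text")
  .option("maxFileAge", "12h")
  .load("/data/incoming")
```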

How was this patch tested?

Added unit tests for the underlying utility, plus an end-to-end test in FileStreamSourceSuite to validate the purge. Also verified that the new test cases fail when the timeout is set to a very large number.

/**
 * User specified options for file streams.
 */
class FileStreamOptions(@transient private val parameters: Map[String, String])
petermaxlee (Contributor, author):

This is a similar setup to CSVOptions and JSONOptions. I felt it would be easier to track the list of options read by the source here.

zsxwing (Member):

You can remove Serializable and @transient from this class. It's not used on the executor side.
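With that cleanup the declaration would reduce to roughly the following, which matches the class signature reported in the final test build below:

```scala
// Options are read only on the driver, so no Serializable or @transient needed.
class FileStreamOptions(parameters: Map[String, String]) extends Logging
```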

petermaxlee (Contributor, author) commented:

cc @rxin @cloud-fan


/** Maximum age of a file that can be found in this directory, before it is deleted. */
val maxFileAgeMs: Long =
  Utils.timeStringAsMs(parameters.getOrElse("maxFileAge", "24h"))
rxin (Contributor) commented Aug 20, 2016:

24 hours seems too short. Maybe a month or a week?

rxin (Contributor) commented Aug 20, 2016

cc @tdas

SparkQA commented Aug 20, 2016

Test build #64133 has finished for PR 14728 at commit ce1dd9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FileEntry(path: String, timestamp: Timestamp) extends Serializable
    • class SeenFilesMap(maxAgeMs: Long)
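For context, here is a minimal sketch of what an age-bounded seen-files map could look like. The names follow the listing above, but the details (in particular purging relative to the latest file's timestamp rather than the wall clock, discussed further below) are illustrative assumptions, not the patch itself:

```scala
import scala.collection.mutable

// Illustrative sketch only, not the actual Spark implementation.
class SeenFilesMap(maxAgeMs: Long) {
  private val map = new mutable.HashMap[String, Long]

  // Timestamp of the newest file seen so far; the purge horizon is
  // computed relative to this, not to the current system time.
  private var latestTimestamp: Long = 0L

  def add(path: String, timestamp: Long): Unit = {
    map(path) = timestamp
    if (timestamp > latestTimestamp) latestTimestamp = timestamp
  }

  // A file counts as new if it has not been seen before and is not
  // older than the purge horizon.
  def isNewFile(path: String, timestamp: Long): Boolean =
    timestamp >= latestTimestamp - maxAgeMs && !map.contains(path)

  // Drop entries that have aged out, bounding the map's size.
  def purge(): Unit = {
    val threshold = latestTimestamp - maxAgeMs
    map.retain((_, ts) => ts >= threshold)
  }

  def size: Int = map.size
}
```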


/**
 * A very simple source that reads text files from the given directory as they appear.
 *
 * TODO Clean up the metadata files periodically
Contributor:

This TODO still applies, right?

petermaxlee (Contributor, author):

Put it back. Also updated the file with a new test case to make the seen-files map more robust.

SparkQA commented Aug 23, 2016

Test build #64260 has finished for PR 14728 at commit a371f05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

petermaxlee (Contributor, author) commented:

I've updated the default and set it to 1 week.
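Presumably the default in FileStreamOptions then reads something like the following (the exact time string is an assumption):

```scala
val maxFileAgeMs: Long =
  Utils.timeStringAsMs(parameters.getOrElse("maxFileAge", "7d"))
```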

rxin (Contributor) commented Aug 24, 2016

@jerryshao want to review this?

jerryshao (Contributor) commented:

Sure, let me take a look at this.


def apply(paramName: String, paramValue: String): FileStreamOptions = {
  new FileStreamOptions(Map(paramName -> paramValue))
}
Contributor:

Looks like these two apply methods are not used; it would be better to remove them.

SparkQA commented Aug 24, 2016

Test build #64336 has finished for PR 14728 at commit 6817dec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jerryshao (Contributor) commented:

Regarding the definition of maxAge: in the current code it is the maximum age relative to the latest file, but people may misread it as the maximum age relative to the current time, so it would be better to document the exact meaning.

Another concern is that this patch changes the MetadataLog format, which will break upgrade compatibility; users with an existing old metadata log will hit exceptions when running the new code.

petermaxlee (Contributor, author) commented:

Re: format compatibility

My understanding is that structured streaming is not yet production ready and the compatibility guarantee across versions doesn't really apply. Otherwise, we shouldn't use Java serialization here at all and should opt for explicit JSON or protobuf.

SparkQA commented Aug 24, 2016

Test build #64362 has finished for PR 14728 at commit 04a112d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zsxwing (Member) commented Aug 26, 2016

Looks pretty good. Just one comment about Serializable.

SparkQA commented Aug 26, 2016

Test build #64454 has finished for PR 14728 at commit 9a5ed19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileStreamOptions(parameters: Map[String, String]) extends Logging

zsxwing (Member) commented Aug 26, 2016

LGTM. Thanks! Merging into master and 2.0.

asfgit pushed a commit that referenced this pull request Aug 26, 2016
[SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely

## What changes were proposed in this pull request?
Before this change, FileStreamSource used an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-configurable option called "maxFileAge", defaulting to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that is used to track the files that have been processed.

## How was this patch tested?
Added unit tests for the underlying utility, plus an end-to-end test in FileStreamSourceSuite to validate the purge. Also verified that the new test cases fail when the timeout is set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.

(cherry picked from commit 9812f7d)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
asfgit closed this in 9812f7d Aug 26, 2016
rishitesh added a commit to TIBCOSoftware/snappy-spark that referenced this pull request Oct 24, 2016
* [SPARK-16964][SQL] Remove private[hive] from sql.hive.execution package

## What changes were proposed in this pull request?
This PR is a small follow-up to https://github.com/apache/spark/pull/14554. This also widens the visibility of a few (similar) Hive classes.

## How was this patch tested?
No test. Only a visibility change.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14654 from hvanhovell/SPARK-16964-hive.

(cherry picked from commit 8fdc6ce400f9130399fbdd004df48b3ba95bcd6a)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* Revert "[SPARK-16964][SQL] Remove private[hive] from sql.hive.execution package"

This reverts commit 2e2c787bf588e129eaaadc792737fd9d2892939c.

* [SPARK-16964][SQL] Remove private[sql] and private[spark] from sql.execution package [Backport]

## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14554 to branch-2.0.

I have also changed the visibility of a few similar Hive classes.

## How was this patch tested?
(Only a package visibility change)

Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #14652 from hvanhovell/SPARK-16964.

* [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check

Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions are sharing generics with DataFrame functions (hence the problem) so after the renames we need to add new generics, for now.

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14626 from felixcheung/rrddfunctions.

(cherry picked from commit c34b546d674ce186f13d9999b97977bc281cfedf)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SNAPPYDATA] Allow configurable MemoryManager

To plugin SnappyData's Unified Memory Manager (SnappyUnifiedMemoryManager), the
MemoryManager is made configurable using "spark.memory.manager" configuration property
that is set to the SnappyData's manager in embedded mode.

* [SPARK-17089][DOCS] Remove api doc link for mapReduceTriplets operator

## What changes were proposed in this pull request?

Remove the api doc link for mapReduceTriplets operator because in latest api they are remove so when user link to that api they will not get mapReduceTriplets there so its more good to remove than confuse the user.

## How was this patch tested?
Run all the test cases

![screenshot from 2016-08-16 23-08-25](https://cloud.githubusercontent.com/assets/8075390/17709393/8cfbf75a-6406-11e6-98e6-38f7b319d833.png)

Author: sandy <phalodi@gmail.com>

Closes #14669 from phalodi/SPARK-17089.

(cherry picked from commit e28a8c5899c48ff065e2fd3bb6b10c82b4d39c2c)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-17084][SQL] Rename ParserUtils.assert to validate

## What changes were proposed in this pull request?
This PR renames `ParserUtils.assert` to `ParserUtils.validate`. This is done because this method is used to check requirements, and not to check if the program is in an invalid state.

## How was this patch tested?
Simple rename. Compilation should do.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14665 from hvanhovell/SPARK-17084.

(cherry picked from commit 4a2c375be2bcd98cc7e00bea920fd6a0f68a4e14)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [MINOR][DOC] Fix the descriptions for `properties` argument in the documenation for jdbc APIs

## What changes were proposed in this pull request?

This should be credited to mvervuurt. The main purpose of this PR is
 - simply to include the change for the same instance in `DataFrameReader` just to match up.
 - just avoid duplicately verifying the PR (as I already did).

The documentation for both should be the same because both assume the `properties` should be  the same `dict` for the same option.

## How was this patch tested?

Manually building Python documentation.

This will produce the output as below:

- `DataFrameReader`

![2016-08-17 11 12 00](https://cloud.githubusercontent.com/assets/6477701/17722764/b3f6568e-646f-11e6-8b75-4fb672f3f366.png)

- `DataFrameWriter`

![2016-08-17 11 12 10](https://cloud.githubusercontent.com/assets/6477701/17722765/b58cb308-646f-11e6-841a-32f19800d139.png)

Closes #14624

Author: hyukjinkwon <gurwls223@gmail.com>
Author: mvervuurt <m.a.vervuurt@gmail.com>

Closes #14677 from HyukjinKwon/typo-python.

(cherry picked from commit 0f6aa8afaacdf0ceca9c2c1650ca26a5c167ae69)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SNAPPYDATA] Fixing a build issue with gradle in some environments

* [SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB

## What changes were proposed in this pull request?

This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method.

Here is [the original PR](https://github.com/apache/spark/pull/13243) for SPARK-15285. However, it breaks a build with Scala 2.10 since Scala 2.10 does not a case class with large number of members. Thus, it was reverted by [this commit](https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf).

## How was this patch tested?

Added new tests by using `DefinedByConstructorParams` instead of case class for scala-2.10

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #14670 from kiszk/SPARK-15285-2.

(cherry picked from commit 56d86742d2600b8426d75bd87ab3c73332dca1d2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-17102][SQL] bypass UserDefinedGenerator for json format check

## What changes were proposed in this pull request?

We use reflection to convert `TreeNode` to json string, and currently don't support arbitrary object. `UserDefinedGenerator` takes a function object, so we should skip json format test for it, or the tests can be flacky, e.g. `DataFrameSuite.simple explode`, this test always fail with scala 2.10(branch 1.6 builds with scala 2.10 by default), but pass with scala 2.11(master branch builds with scala 2.11 by default).

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14679 from cloud-fan/json.

(cherry picked from commit 928ca1c6d12b23d84f9b6205e22d2e756311f072)
Signed-off-by: Yin Huai <yhuai@databricks.com>

* [SPARK-17096][SQL][STREAMING] Improve exception string reported through the StreamingQueryListener

## What changes were proposed in this pull request?

Currently, the stackTrace (as `Array[StackTraceElements]`) reported through StreamingQueryListener.onQueryTerminated is useless as it has the stack trace of where StreamingQueryException is defined, not the stack trace of underlying exception.  For example, if a streaming query fails because of a / by zero exception in a task, the `QueryTerminated.stackTrace` will have
```
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:211)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```
This is basically useless, as it is location where the StreamingQueryException was defined. What we want is

Here is the right way to reason about what should be posted as through StreamingQueryListener.onQueryTerminated
- The actual exception could either be a SparkException, or an arbitrary exception.
  - SparkException reports the relevant executor stack trace of a failed task as a string in the the exception message. The `Array[StackTraceElements]` returned by `SparkException.stackTrace()` is mostly irrelevant.
  - For any arbitrary exception, the `Array[StackTraceElements]` returned by `exception.stackTrace()` may be relevant.
- When there is an error in a streaming query, it's hard to reason whether the `Array[StackTraceElements]` is useful or not. In fact, it is not clear whether it is even useful to report the stack trace as this array of Java objects. It may be sufficient to report the strack trace as a string, along with the message. This is how Spark reported executor stra
- Hence, this PR simplifies the API by removing the array `stackTrace` from `QueryTerminated`. Instead the `exception` returns a string containing the message and the stack trace of the actual underlying exception that failed the streaming query (i.e. not that of the StreamingQueryException). If anyone is interested in the actual stack trace as an array, can always access them through `streamingQuery.exception` which returns the exception object.

With this change, if a streaming query fails because of a / by zero exception in a task, the `QueryTerminated.exception` will be
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
	at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply$mcII$sp(StreamingQueryListenerSuite.scala:153)
	at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
	at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:226)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
...
```
It contains the relevant executor stack trace. In a case non-SparkException, if the streaming source MemoryStream throws an exception, exception message will have the relevant stack trace.
```
java.lang.RuntimeException: this is the exception message
	at org.apache.spark.sql.execution.streaming.MemoryStream.getBatch(memory.scala:103)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:316)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:313)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:197)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:187)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```

Note that this change in the public `QueryTerminated` class is okay as the APIs are still experimental.

## How was this patch tested?
Unit tests that test whether the right information is present in the exception message reported through QueryTerminated object.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14675 from tdas/SPARK-17096.

(cherry picked from commit d60af8f6aa53373de1333cc642cf2a9d7b39d912)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

* [SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

https://issues.apache.org/jira/browse/SPARK-17038

## What changes were proposed in this pull request?

StreamingSource's lastReceivedBatch_submissionTime, lastReceivedBatch_processingTimeStart, and lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch instead of lastReceivedBatch.

In particular, this makes it impossible to match lastReceivedBatch_records with a batchID/submission time.

This is apparent when looking at StreamingSource.scala, lines 89-94.

## How was this patch tested?

Manually running unit tests on local laptop

Author: Xin Ren <iamshrek@126.com>

Closes #14681 from keypointt/SPARK-17038.

(cherry picked from commit e6bef7d52f0e19ec771fb0f3e96c7ddbd1a6a19b)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

* [SPARK-16995][SQL] TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr

## What changes were proposed in this pull request?

A TreeNodeException is thrown when executing the following minimal example in Spark 2.0.

    import spark.implicits._
    case class test (x: Int, q: Int)

    val d = Seq(1).toDF("x")
    d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, iter) => List[Int]()}.show
    d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, iter) => List[Int]()}.show

The problem is at `FoldablePropagation`. The rule will do `transformExpressions` on `LogicalPlan`. The query above contains a `MapGroups` which has a parameter `dataAttributes:Seq[Attribute]`. One attributes in `dataAttributes` will be transformed to an `Alias(literal(0), _)` in `FoldablePropagation`. `Alias` is not an `Attribute` and causes the error.

We can't easily detect such type inconsistency during transforming expressions. A direct approach to this problem is to skip doing `FoldablePropagation` on object operators as they should not contain such expressions.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14648 from viirya/flat-mapping.

(cherry picked from commit 10204b9d29cd69895f5a606e75510dc64cf2e009)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-16391][SQL] Support partial aggregation for reduceGroups

## What changes were proposed in this pull request?
This patch introduces a new private ReduceAggregator interface that is a subclass of Aggregator. ReduceAggregator only requires a single associative and commutative reduce function. ReduceAggregator is also used to implement KeyValueGroupedDataset.reduceGroups in order to support partial aggregation.

Note that the pull request was initially done by viirya.

## How was this patch tested?
Covered by original tests for reduceGroups, as well as a new test suite for ReduceAggregator.

Author: Reynold Xin <rxin@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14576 from rxin/reduceAggregator.

(cherry picked from commit 1748f824101870b845dbbd118763c6885744f98a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-17117][SQL] 1 / NULL should not fail analysis

## What changes were proposed in this pull request?
This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / NULL" throws an analysis exception:

```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to data type mismatch: differing types in '(1 / NULL)' (int and null).
```

The problem is that division type coercion did not take null type into account.

## How was this patch tested?
A unit test for the type coercion, and a few end-to-end test cases using SQLQueryTestSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14695 from petermaxlee/SPARK-17117.

(cherry picked from commit 68f5087d2107d6afec5d5745f0cb0e9e3bdd6a0b)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

* [MINOR][SPARKR] R API documentation for "coltypes" is confusing

## What changes were proposed in this pull request?

R API documentation for "coltypes" is confusing, found when working on another ticket.

Current version http://spark.apache.org/docs/2.0.0/api/R/coltypes.html, where parameters have 2 "x" which is a duplicate, and also the example is not very clear

![current](https://cloud.githubusercontent.com/assets/3925641/17386808/effb98ce-59a2-11e6-9657-d477d258a80c.png)

![screen shot 2016-08-03 at 5 56 00 pm](https://cloud.githubusercontent.com/assets/3925641/17386884/91831096-59a3-11e6-84af-39890b3d45d8.png)

## How was this patch tested?

Tested manually on local machine. And the screenshots are like below:

![screen shot 2016-08-07 at 11 29 20 pm](https://cloud.githubusercontent.com/assets/3925641/17471144/df36633c-5cf6-11e6-8238-4e32ead0e529.png)

![screen shot 2016-08-03 at 5 56 22 pm](https://cloud.githubusercontent.com/assets/3925641/17386896/9d36cb26-59a3-11e6-9619-6dae29f7ab17.png)

Author: Xin Ren <iamshrek@126.com>

Closes #14489 from keypointt/rExample.

(cherry picked from commit 1203c8415cd11540f79a235e66a2f241ca6c71e4)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-17069] Expose spark.range() as table-valued function in SQL

This adds analyzer rules for resolving table-valued functions, and adds one builtin implementation for range(). The arguments for range() are the same as those of `spark.range()`.

Unit tests.

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #14656 from ericl/sc-4309.

(cherry picked from commit 412dba63b511474a6db3c43c8618d803e604bc6b)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-16947][SQL] Support type coercion and foldable expression for inline tables

This patch improves inline table support with the following:

1. Support type coercion.
2. Support using foldable expressions. Previously only literals were supported.
3. Improve error message handling.
4. Improve test coverage.

Added a new unit test suite ResolveInlineTablesSuite and a new file-based end-to-end test inline-table.sql.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14676 from petermaxlee/SPARK-16947.

(cherry picked from commit f5472dda51b980a726346587257c22873ff708e3)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* HOTFIX: compilation broken due to protected ctor.

(cherry picked from commit b482c09fa22c5762a355f95820e4ba3e2517fb77)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace

JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961

Changed one line of Utils.randomizeInPlace to allow elements to stay in place.

Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.

Author: Nick Lavers <nick.lavers@videoamp.com>

Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.

(cherry picked from commit 5377fc62360d5e9b5c94078e41d10a96e0e8a535)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARK-16994][SQL] Whitelist operators for predicate pushdown

## What changes were proposed in this pull request?
This patch changes predicate pushdown optimization rule (PushDownPredicate) from using a blacklist to a whitelist. That is to say, operators must be explicitly allowed. This approach is more future-proof: previously it was possible for us to introduce a new operator and then render the optimization rule incorrect.

This also fixes the bug that previously we allowed pushing filter beneath limit, which was incorrect. That is to say, before this patch, the optimizer would rewrite
```
select * from (select * from range(10) limit 5) where id > 3

to

select * from range(10) where id > 3 limit 5
```

## How was this patch tested?
- a unit test case in FilterPushdownSuite
- an end-to-end test in limit.sql

Author: Reynold Xin <rxin@databricks.com>

Closes #14713 from rxin/SPARK-16994.

(cherry picked from commit 67e59d464f782ff5f509234212aa072a7653d7bf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-11227][CORE] UnknownHostException can be thrown when NameNode HA is enabled.

## What changes were proposed in this pull request?

If the following conditions are satisfied, executors don't load properties in `hdfs-site.xml` and UnknownHostException can be thrown.

(1) NameNode HA is enabled
(2) spark.eventLogging is disabled or logging path is NOT on HDFS
(3) Using Standalone or Mesos for the cluster manager
(4) There are no code to load `HdfsCondition` class in the driver regardless of directly or indirectly.
(5) The tasks access to HDFS

(There might be some more conditions...)

For example, following code causes UnknownHostException when the conditions above are satisfied.
```
sc.textFile("<path on HDFS>").collect

```

```
java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster
	at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170)
	at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:438)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
	at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
	at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
	at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
	at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:213)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: hacluster
```

But following code doesn't cause the Exception because `textFile` method loads `HdfsConfiguration` indirectly.

```
sc.textFile("<path on HDFS>").collect
```

When a job includes some operations which access to HDFS, the object of `org.apache.hadoop.Configuration` is wrapped by `SerializableConfiguration`,  serialized and broadcasted from driver to executors and each executor deserialize the object with `loadDefaults` false so HDFS related properties should be set before broadcasted.

## How was this patch tested?
Tested manually on my standalone cluster.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #13738 from sarutak/SPARK-11227.

(cherry picked from commit 071eaaf9d2b63589f2e66e5279a16a5a484de6f5)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

* [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning

We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will encounter whole data instead of sampled data. It will bring some inconsistency between original plan (Sample then Project) and optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is an UDF which is supposed to not see the sampled out data, the result of UDF will be incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need  `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14327 from viirya/fix-sample-pushdown.

(cherry picked from commit 7b06a8948fc16d3c14e240fdd632b79ce1651008)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-17113] [SHUFFLE] Job failure due to Executor OOM in offheap mode

## What changes were proposed in this pull request?

This PR fixes executor OOM in offheap mode due to bug in Cooperative Memory Management for UnsafeExternSorter.  UnsafeExternalSorter was checking if memory page is being used by upstream by comparing the base object address of the current page with the base object address of upstream. However, in case of offheap memory allocation, the base object addresses are always null, so there was no spilling happening and eventually the operator would OOM.

Following is the stack trace this issue addresses -
java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)

## How was this patch tested?

Tested by running the failing job.

Author: Sital Kedia <skedia@fb.com>

Closes #14693 from sitalkedia/fix_offheap_oom.

(cherry picked from commit cf0cce90364d17afe780ff9a5426dfcefa298535)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

* [SPARK-17149][SQL] array.sql for testing array related functions

## What changes were proposed in this pull request?
This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including:

- indexing
- array creation
- size
- array_contains
- sort_array

## How was this patch tested?
The patch itself is about adding tests.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14708 from petermaxlee/SPARK-17149.

(cherry picked from commit a117afa7c2d94f943106542ec53d74ba2b5f1058)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-17158][SQL] Change error message for out of range numeric literals

## What changes were proposed in this pull request?

Modifies error message for numeric literals to
Numeric literal <literal> does not fit in range [min, max] for type <T>

## How was this patch tested?

Fixed up the error messages for literals.sql in  SqlQueryTestSuite and re-ran via sbt. Also fixed up error messages in ExpressionParserSuite

Author: Srinath Shankar <srinath@databricks.com>

Closes #14721 from srinathshankar/sc4296.

(cherry picked from commit ba1737c21aab91ff3f1a1737aa2d6b07575e36a3)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-17150][SQL] Support SQL generation for inline tables

## What changes were proposed in this pull request?
This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables.

## How was this patch tested?
Added a test case in LogicalPlanToSQLSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14709 from petermaxlee/SPARK-17150.

(cherry picked from commit 45d40d9f66c666eec6df926db23937589d67225d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-17104][SQL] LogicalRelation.newInstance should follow the semantics of MultiInstanceRelation

## What changes were proposed in this pull request?

Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of object with unique expression ids. Current `LogicalRelation.newInstance()` can cause failure when doing self-join.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14682 from viirya/fix-localrelation.

(cherry picked from commit 31a015572024046f4deaa6cec66bb6fab110f31d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and allow multiple aggregates per column

## What changes were proposed in this pull request?
This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg function. Even though the signature accepts vararg of pairs, the underlying implementation turns the seq into a map, and thus not order preserving nor allowing multiple aggregates per column.

This change also allows users to use this function to run multiple different aggregations for a single column, e.g.
```
agg("age" -> "max", "age" -> "count")
```

## How was this patch tested?
Added a test case in DataFrameAggregateSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14697 from petermaxlee/SPARK-17124.

(cherry picked from commit 9560c8d29542a5dcaaa07b7af9ef5ddcdbb5d14d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-12666][CORE] SparkSubmit packages fix for when 'default' conf doesn't exist in dependent module

## What changes were proposed in this pull request?

Adding a "(runtime)" to the dependency configuration will set a fallback configuration to be used if the requested one is not found.  E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module ivy file and if not found will look for the conf "runtime".  This can help with the case when using "sbt publishLocal" which does not write a "default" conf in the published ivy.xml file.

## How was this patch tested?
used spark-submit with --packages option for a package published locally with no default conf, and a package resolved from Maven central.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.

(cherry picked from commit 9f37d4eac28dd179dd523fa7d645be97bb52af9c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

* [MINOR][R] add SparkR.Rcheck/ and SparkR_*.tar.gz to R/.gitignore

## What changes were proposed in this pull request?

Ignore temp files generated by `check-cran.sh`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #14740 from mengxr/R-gitignore.

(cherry picked from commit ab7143463daf2056736c85e3a943c826b5992623)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [SPARK-16508][SPARKR] Fix CRAN undocumented/duplicated arguments warnings.

This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check.

One left is doc for R `stats::glm` exported in SparkR. To mute that warning, we have to also provide document for all arguments of that non-SparkR function.

Some previous conversation is in #14558.

R unit test and `check-cran.sh` script (with no-test).

Author: Junyang Qian <junyangq@databricks.com>

Closes #14705 from junyangq/SPARK-16508-master.

(cherry picked from commit 01401e965b58f7e8ab615764a452d7d18f1d4bf0)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-17098][SQL] Fix `NullPropagation` optimizer to handle `COUNT(NULL) OVER` correctly

## What changes were proposed in this pull request?

Currently, `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic.

**Before**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
```

**After**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
+----------------------------------------------------------------------------------------------+
|count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----------------------------------------------------------------------------------------------+
|                                                                                             0|
+----------------------------------------------------------------------------------------------+
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14689 from dongjoon-hyun/SPARK-17098.

(cherry picked from commit 91c2397684ab791572ac57ffb2a924ff058bb64f)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

* [SPARK-17115][SQL] decrease the threshold when split expressions

## What changes were proposed in this pull request?

In 2.0, we change the threshold of splitting expressions from 16K to 64K, which cause very bad performance on wide table, because the generated method can't be JIT compiled by default (above the limit of 8K bytecode).

This PR will decrease it to 1K, based on the benchmark results for a wide table with 400 columns of LongType.

It also fix a bug around splitting expression in whole-stage codegen (it should not split them).

## How was this patch tested?

Added benchmark suite.

Author: Davies Liu <davies@databricks.com>

Closes #14692 from davies/split_exprs.

(cherry picked from commit 8d35a6f68d6d733212674491cbf31bed73fada0f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED OPERATIONS]

Changes in  Spark Stuctured Streaming doc in this link
https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations

Author: Jagadeesan <as2@us.ibm.com>

Closes #14715 from jagadeesanas2/SPARK-17085.

(cherry picked from commit bd9655063bdba8836b4ec96ed115e5653e246b65)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARKR][MINOR] Fix Cache Folder Path in Windows

## What changes were proposed in this pull request?

This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.

## How was this patch tested?

Manual test in Windows 7.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14743 from junyangq/SPARKR-FixWindowsInstall.

(cherry picked from commit 209e1b3c0683a9106428e269e5041980b6cc327f)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6

## What changes were proposed in this pull request?

Collect GC discussion in one section, and documenting findings about G1 GC heap region size.

## How was this patch tested?

Jekyll doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #14732 from srowen/SPARK-16320.

(cherry picked from commit 342278c09cf6e79ed4f63422988a6bbd1e7d8a91)
Signed-off-by: Yin Huai <yhuai@databricks.com>

* [SPARKR][MINOR] Add Xiangrui and Felix to maintainers

## What changes were proposed in this pull request?

This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14758 from shivaram/sparkr-maintainers.

(cherry picked from commit 6f3cd36f93c11265449fdce3323e139fec8ab22d)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-17162] Range does not support SQL generation

## What changes were proposed in this pull request?

The range operator previously didn't support SQL generation, which made it not possible to use in views.

## How was this patch tested?

Unit tests.

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #14724 from ericl/spark-17162.

(cherry picked from commit 84770b59f773f132073cd2af4204957fc2d7bf35)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-16508][SPARKR] doc updates and more CRAN check fixes

replace ``` ` ``` in code doc with `\code{thing}`
remove added `...` for drop(DataFrame)
fix remaining CRAN check warnings

create doc with knitr

junyangq

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14734 from felixcheung/rdoccleanup.

(cherry picked from commit 71afeeea4ec8e67edc95b5d504c557c88a2598b9)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-16550][SPARK-17042][CORE] Certain classes fail to deserialize in block manager replication

## What changes were proposed in this pull request?

This is a straightforward clone of JoshRosen 's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be flaking tests so I'm going to leave that for SPARK-17042

## How was this patch tested?

End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch).

Author: Eric Liang <ekl@databricks.com>

Closes #14311 from ericl/spark-16550.

(cherry picked from commit 8e223ea67acf5aa730ccf688802f17f6fc10907c)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.sh

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh` . As this script is also used by Jenkins this means that we will get documentation checks on every PR going forward.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14759 from shivaram/sparkr-cran-jenkins.

(cherry picked from commit 920806ab272ba58a369072a5eeb89df5e9b470a6)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-17182][SQL] Mark Collect as non-deterministic

## What changes were proposed in this pull request?

This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows.

## How was this patch tested?

Existing test cases should be enough.

Author: Cheng Lian <lian@databricks.com>

Closes #14749 from liancheng/spark-17182-non-deterministic-collect.

(cherry picked from commit 2cdd92a7cd6f85186c846635b422b977bdafbcdd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARKR][MINOR] Update R DESCRIPTION file

## What changes were proposed in this pull request?

Update DESCRIPTION

## How was this patch tested?

Run install and CRAN tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14764 from felixcheung/rpackagedescription.

(cherry picked from commit d2b3d3e63e1a9217de6ef507c350308017664a62)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [SPARK-13286] [SQL] add the next expression of SQLException as cause

Some JDBC driver (for example PostgreSQL) does not use the underlying exception as cause, but have another APIs (getNextException) to access that, so it it's included in the error logging, making us hard to find the root cause, especially in batch mode.

This PR will pull out the next exception and add it as cause (if it's different) or suppressed (if there is another different cause).

Can't reproduce this on the default JDBC driver, so did not add a regression test.

Author: Davies Liu <davies@databricks.com>

Closes #14722 from davies/keep_cause.

(cherry picked from commit 9afdfc94f49395e69a7959e881c19d787ce00c3e)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

* [SPARKR][MINOR] Remove reference link for common Windows environment variables

## What changes were proposed in this pull request?

The PR removes reference link in the doc for environment variables for common Windows folders. The cran check gave code 503: service unavailable on the original link.

## How was this patch tested?

Manual check.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14767 from junyangq/SPARKR-RemoveLink.

(cherry picked from commit 8fd63e808e15c8a7e78fef847183c86f332daa91)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

* [MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

(cherry picked from commit 588559911de94bbe0932526ee1e1dd36a581a423)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARK-17194] Use single quotes when generating SQL for string literals

When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single-quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14763 from JoshRosen/SPARK-17194.

(cherry picked from commit bf8ff833e30b39e5e5e35ba8dcac31b79323838c)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

* [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala'

## What changes were proposed in this pull request?
This PR removes implemented functions from comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.

## How was this patch tested?
Manual.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14769 from Sherry302/cleanComment.

(cherry picked from commit b9994ad05628077016331e6b411fbc09017b1e63)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-17186][SQL] remove catalog table type INDEX

## What changes were proposed in this pull request?

Actually Spark SQL doesn't support index, the catalog table type `INDEX` is from Hive. However, most operations in Spark SQL can't handle index table, e.g. create table, alter table, etc.

Logically index table should be invisible to end users, and Hive also generates special table name for index table to avoid users accessing it directly. Hive has special SQL syntax to create/show/drop index tables.

At Spark SQL side, although we can describe index table directly, but the result is unreadable, we should use the dedicated SQL syntax to do it(e.g. `SHOW INDEX ON tbl`). Spark SQL can also read index table directly, but the result is always empty.(Can hive read index table directly?)

This PR remove the table type `INDEX`, to make it clear that Spark SQL doesn't support index currently.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14752 from cloud-fan/minor2.

(cherry picked from commit 52fa45d62a5a0bc832442f38f9e634c5d8e29e08)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [MINOR][BUILD] Fix Java CheckStyle Error

As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release.

Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Manual.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14768 from Sherry302/fixjavastyle.

(cherry picked from commit 673a80d2230602c9e6573a23e35fb0f6b832bfca)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated

## What changes were proposed in this pull request?

In cases when QuantileDiscretizerSuite is called upon a numeric array with duplicated elements,  we will  take the unique elements generated from approxQuantiles as input for Bucketizer.

## How was this patch tested?

An unit test is added in QuantileDiscretizerSuite

QuantileDiscretizer.fit will throw an illegal exception when calling setSplits on a list of splits
with duplicated elements. Bucketizer.setSplits should only accept either a numeric vector of two
or more unique cut points, although that may produce less number of buckets than requested.

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14747 from VinceShieh/SPARK-17086.

(cherry picked from commit 92c0eaf348b42b3479610da0be761013f9d81c54)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARKR][MINOR] Fix doc for show method

## What changes were proposed in this pull request?

The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14776 from junyangq/SPARK-FixShowDoc.

(cherry picked from commit d2932a0e987132c694ed59515b7c77adaad052e6)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

* [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment

## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.

(cherry picked from commit 0b3a4be92ca6b38eef32ea5ca240d9f91f68aa65)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARKR][MINOR] Add more examples to window function docs

## What changes were proposed in this pull request?

This PR adds more examples to window function docs to make them more accessible to the users.

It also fixes default value issues for `lag` and `lead`.

## How was this patch tested?

Manual test, R unit test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.

(cherry picked from commit 18708f76c366c6e01b5865981666e40d8642ac20)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

* [SPARKR][MINOR] Add installation message for remote master mode and improve other messages

## What changes were proposed in this pull request?

This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package on their local machine.

As a clarification, for now, automatic installation will only happen if they start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In remote master mode, the local Spark package is still needed, but we will not trigger the install.spark function, because the versions have to match those on the cluster, which involves more user input. Instead, we try to provide a detailed message that may help the users.

Some of the other messages have also been slightly changed.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14761 from junyangq/SPARK-16579-V1.

(cherry picked from commit 3a60be4b15a5ab9b6e0c4839df99dac7738aa7fe)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

* [SPARK-16216][SQL][BRANCH-2.0] Backport Read/write dateFormat/timestampFormat options for CSV and JSON

## What changes were proposed in this pull request?

This PR backports https://github.com/apache/spark/pull/14279 to 2.0.
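
As a hedged sketch of the backported options (the paths and format patterns here are placeholders, not from the PR):

```scala
import org.apache.spark.sql.types.{DateType, StructField, StructType, TimestampType}

// `spark` is an existing SparkSession.
val schema = StructType(Seq(
  StructField("d", DateType),
  StructField("t", TimestampType)))

// Read: parse dates/timestamps using custom patterns.
val df = spark.read
  .schema(schema)
  .option("dateFormat", "yyyy/MM/dd")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
  .csv("/path/to/input")

// Write: the same options now also apply on the write path, for CSV and JSON.
df.write
  .option("dateFormat", "dd-MM-yyyy")
  .json("/path/to/output")
```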

## How was this patch tested?

Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14799 from HyukjinKwon/SPARK-16216-json-csv-backport.

* [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints

## What changes were proposed in this pull request?

Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
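
A minimal sketch of the behavior (the exact plan shape is an assumption):

```scala
import org.apache.spark.sql.functions.rand

// `spark` is an existing SparkSession. The predicate below is
// non-deterministic, so the optimizer should not infer constraints from it,
// and in particular should not propagate or push it elsewhere in the plan.
val df = spark.range(100).filter(rand() > 0.5)
df.explain(true) // the rand() predicate stays where it was written
```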

## How was this patch tested?

Added a new test in `ConstraintPropagationSuite`.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14795 from sameeragarwal/deterministic-constraints.

(cherry picked from commit ac27557eb622a257abeb3e8551f06ebc72f87133)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* Snap 293 (#1)

* Using the scala collections
The compiler was not able to differentiate between the Scala and Java APIs.

* Minor changes for snappy implementation of executor

This is required because we need a classloader that also looks into the snappy store for the classes.

* Revert "Using the scala collections"

This reverts commit c2ab0c5aa3974337277560fb1c8a9d0c3661ec09.

* [SPARK-17193][CORE] HadoopRDD NPE at DEBUG log level when getLocationInfo == null

## What changes were proposed in this pull request?

Handle null from Hadoop getLocationInfo directly instead of catching (and logging) exception
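
A hedged sketch of the pattern (names are stand-ins, not the actual HadoopRDD code):

```scala
// Wrap the possibly-null Hadoop result in Option instead of catching the
// resulting NullPointerException and logging it at DEBUG level.
def preferredHosts(getLocationInfo: () => Array[String]): Seq[String] =
  Option(getLocationInfo()).map(_.toSeq).getOrElse(Seq.empty)

preferredHosts(() => null)                    // Seq()
preferredHosts(() => Array("host1", "host2")) // Seq("host1", "host2")
```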

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14760 from srowen/SPARK-17193.

(cherry picked from commit 2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b)
Signed-off-by: Sean Owen <sowen@cloudera.com>

* [SPARK-17061][SPARK-17093][SQL] `MapObjects` should make copies of unsafe-backed data

Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) and [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093).

This patch makes `MapObjects` make copies of unsafe-backed data.

Generated code - prior to this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */   convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */   convertedArray1[loopIndex1] = value12;
/* 299 */ }
...
```

Generated code - after this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */   convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */   convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12;
/* 299 */ }
...
```

Add a new test case which would fail without this patch.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14698 from lw-lin/mapobjects-copy.

(cherry picked from commit e0b20f9f24d5c3304bf517a4dcfb0da93be5bc75)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

* Revert "[SPARK-17061][SPARK-17093][SQL] MapObjects` should make copies of unsafe-backed data"

This reverts commit fb1c697143a5bb2df69d9f2c9cbddc4eb526f047.

* [SPARK-17061][SPARK-17093][SQL][BACKPORT] MapObjects should make copies of unsafe-backed data

## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14698 to branch-2.0.

See that PR for more details. All credit should go to lw-lin.

Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Liwei Lin <lwlin7@gmail.com>

Closes #14806 from hvanhovell/SPARK-17061.

* [SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows

### What changes were proposed in this pull request?
This PR fixes an incorrect outer join elimination when a filter's `isNotNull` constraint is unable to filter out all null-supplying rows, for example `isnotnull(coalesce(b#227, c#238))`.

Users can hit this error when they use a `USING`/natural outer join, which is converted to a normal outer join with a `coalesce` expression on the `USING` columns. For example,
```Scala
    val a = Seq((1, 2), (2, 3)).toDF("a", "b")
    val b = Seq((2, 5), (3, 4)).toDF("a", "c")
    val c = Seq((3, 1)).toDF("a", "d")
    val ab = a.join(b, Seq("a"), "fullouter")
    ab.join(c, "a").explain(true)
```
The DataFrame `ab` performs a `USING` full-outer join, which is converted to a normal outer join with a `coalesce` expression. Constraint inference generates a `Filter` with the constraint `isnotnull(coalesce(b#227, c#238))`, which then triggers an incorrect outer join elimination and produces a wrong result:
```
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
   :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
   :  +- Join FullOuter, (a#226 = a#236)
   :     :- Project [_1#223 AS a#226, _2#224 AS b#227]
   :     :  +- LocalRelation [_1#223, _2#224]
   :     +- Project [_1#233 AS a#236, _2#234 AS c#237]
   :        +- LocalRelation [_1#233, _2#234]
   +- Project [_1#243 AS a#246, _2#244 AS d#247]
      +- LocalRelation [_1#243, _2#244]

== Optimized Logical Plan ==
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
   :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
   :  +- Filter isnotnull(coalesce(a#226, a#236))
   :     +- Join FullOuter, (a#226 = a#236)
   :        :- LocalRelation [a#226, b#227]
   :        +- LocalRelation [a#236, c#237]
   +- LocalRelation [a#246, d#247]
```

**A note to the committer**: please also credit dongjoon-hyun, who submitted another PR for fixing this issue: https://github.com/apache/spark/pull/14580

### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14661 from gatorsmile/fixOuterJoinElimination.

(cherry picked from commit d2ae6399ee2f0524b88262735adbbcb2035de8fd)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

* [SPARK-17167][2.0][SQL] Issue Exceptions when Analyze Table on In-Memory Cataloged Tables

### What changes were proposed in this pull request?
Currently, `ANALYZE TABLE` is only supported for Hive-serde tables, and we should issue exceptions in all the other cases. When the tables are data source tables, we already issue an exception. However, when the tables are in-memory cataloged tables, we do not issue any exception.

This PR issues an exception when the table is in-memory cataloged. For example,
```SQL
CREATE TABLE tbl(a INT, b INT) USING parquet
```
`tbl` is a `SimpleCatalogRelation` when Hive support is not enabled.
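
A hedged sketch of the new behavior (the exception type and message are assumptions):

```scala
// `spark` is an existing SparkSession without Hive support.
spark.sql("CREATE TABLE tbl(a INT, b INT) USING parquet")
try {
  // tbl is in-memory cataloged, so analyzing it now fails loudly
  // instead of being silently ignored.
  spark.sql("ANALYZE TABLE tbl COMPUTE STATISTICS")
} catch {
  case e: org.apache.spark.sql.AnalysisException =>
    println(s"ANALYZE rejected: ${e.getMessage}")
}
```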

### How was this patch tested?
Added two test cases. One of them just improves the test coverage when the analyzed table is a data source table.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14781 from gatorsmile/analyzeInMemoryTable2.

* [SPARK-16700][PYSPARK][SQL] create DataFrame from dict/Row with schema

In 2.0, we verify the data type against the schema for every row for safety, but at a performance cost. This PR makes that verification optional.

When we verify the data type for StructType, it does not support all the types supported in schema inference (for example, dict); this PR fixes that to make them consistent.

For a Row object created using named arguments, the fields are sorted by name, so their order may differ from the order in the provided schema; this PR fixes that by ignoring the field order in this case.

Created regression tests for them.

Author: Davies Liu <davies@databricks.com>

Closes #14469 from davies/py_dict.

* [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData

## What changes were proposed in this pull request?

This is a backport of #14673, addressing merge conflicts in package.scala that prevented a cherry-pick to `branch-2.0` when it was merged to `master`.

Since the History Server currently loads all applications' data, it can OOM if too many applications have a significant task count. This trims the retained tasks to `spark.ui.retainedTasks` (default: 100000).
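
For operators who want a smaller footprint, a minimal sketch of tuning the new cap (the value is illustrative, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Lower the cap to trade task-level UI detail for History Server memory.
val conf = new SparkConf().set("spark.ui.retainedTasks", "50000")
```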

## How was this patch tested?

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14794 from ajbozarth/spark15083-branch-2.0.

* [SPARKR][BUILD] ignore cran-check.out under R folder

## What changes were proposed in this pull request?

The R CRAN check generates a cran-check.out file. This file should be ignored by git.

## How was this patch tested?

Tested manually: ran a clean test and `git status` to make sure the file is not tracked by git.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14774 from wangmiao1981/ignore.

(cherry picked from commit 9958ac0ce2b9e451d400604767bef2fe12a3399d)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-17205] Literal.sql should handle Infinity and NaN

This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work.
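
A sketch of the problem and one hedged way out (this is not the actual `Literal.sql` code):

```scala
// Naive approach: suffix the toString() form. Fine for ordinary values,
// but the special values do not form valid SQL numeric tokens.
def naiveDoubleSql(v: Double): String = v.toString + "D"
naiveDoubleSql(1.5)                     // "1.5D"      -- valid SQL
naiveDoubleSql(Double.NaN)              // "NaND"      -- invalid SQL
naiveDoubleSql(Double.PositiveInfinity) // "InfinityD" -- invalid SQL

// One way to special-case them, e.g. through an explicit cast:
def doubleSql(v: Double): String =
  if (v.isNaN || v.isInfinite) s"CAST('$v' AS DOUBLE)" else v.toString + "D"
```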

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14777 from JoshRosen/SPARK-17205.

(cherry picked from commit 3e4c7db4d11c474457e7886a5501108ebab0cf6d)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

* [SPARK-17231][CORE] Avoid building debug or trace log messages unless the respective log level is enabled

This is simply a backport of #14798 to `branch-2.0`. This backport omits the change to `ExternalShuffleBlockHandler.java`. In `branch-2.0`, that file does not contain the log message that was patched in `master`.
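
The guarded-logging pattern being applied, as a hedged Scala sketch (the patched call sites are Java, where the guard must be explicit; Spark's Scala `Logging` trait already avoids the cost via by-name parameters):

```scala
import org.slf4j.{Logger, LoggerFactory}

val log: Logger = LoggerFactory.getLogger("example")

// Stand-in for an expensive-to-build message component.
def expensiveSummary(): String = (1 to 1000).mkString(",")

// Guard the call so the message string is only built when DEBUG is enabled.
if (log.isDebugEnabled) {
  log.debug(s"state = ${expensiveSummary()}")
}
```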

Author: Michael Allman <michael@videoamp.com>

Closes #14811 from mallman/spark-17231-logging_perf_improvements-2.0_backport.

* [SPARK-17242][DOCUMENT] Update links of external dstream projects

## What changes were proposed in this pull request?

Updated links of external dstream projects.

## How was this patch tested?

Just document changes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14814 from zsxwing/dstream-link.

(cherry picked from commit 341e0e778dff8c404b47d34ee7661b658bb91880)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARKR][MINOR] Fix example of spark.naiveBayes

## What changes were proposed in this pull request?

The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14820 from junyangq/SPARK-FixNaiveBayes.

(cherry picked from commit 18832162357282ec81515b5b2ba93747be3ad18b)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

* [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely

## What changes were proposed in this pull request?
Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", default to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.
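
A minimal sketch of using the option (the path is a placeholder):

```scala
// `spark` is an existing SparkSession. Files older than maxFileAge are
// purged from the source's in-memory tracking map, keeping it bounded
// on long-running streams.
val lines = spark.readStream
  .format("text")
  .option("maxFileAge", "7d")
  .load("/data/incoming")
```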

## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.

(cherry picked from commit 9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

* [SPARK-17246][SQL] Add BigDecimal literal

## What changes were proposed in this pull request?
This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number, it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.
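
A minimal sketch of the suffix in use (the printed types are an assumption):

```scala
// `spark` is an existing SparkSession.
spark.sql("SELECT 12.0E10BD AS x, 1.5BD AS y").printSchema()
// root
//  |-- x: decimal(3,-9) (nullable = false)
//  |-- y: decimal(2,1) (nullable = false)
```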

## How was this patch tested?
Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14819 from hvanhovell/SPARK-17246.

(cherry picked from commit a11d10f1826b578ff721c4738224eef2b3c3b9f3)
Signed-off-by: Reynold Xin <rxin@databricks.com>

* [SPARK-17235][SQL] Support purging of old logs in MetadataLog

## What changes were proposed in this pull request?
This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.
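
A hedged sketch of the hook (the signature is inferred from this description; the real trait in Spark may differ):

```scala
trait MetadataLog[T] {
  def add(batchId: Long, metadata: T): Boolean
  def get(batchId: Long): Option[T]
  // New in this patch: remove metadata for batches older than the threshold.
  def purge(thresholdBatchId: Long): Unit
}
```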

## How was this patch test…