
Branch 2.1 #18782

Closed

wants to merge 731 commits
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Jan 27, 2017

  1. [SPARK-19333][SPARKR] Add Apache License headers to R files

    ## What changes were proposed in this pull request?
    
    Add Apache License headers to the R files that were missing them.
    
    ## How was this patch tested?
    
    Manual run to check that the vignettes HTML is created properly
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16709 from felixcheung/rfilelicense.
    
    (cherry picked from commit 385d738)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Jan 27, 2017
    Commit 4002ee9
  2. [SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR
    
    ## What changes were proposed in this pull request?
    
    This mostly affects running a job from the driver in client mode when results are expected to come through stdout (which should be somewhat rare, but possible)
    
    Before:
    ```
    > a <- as.DataFrame(cars)
    > b <- group_by(a, "dist")
    > c <- count(b)
    > sparkR.callJMethod(c$countjc, "explain", TRUE)
    NULL
    ```
    
    After:
    ```
    > a <- as.DataFrame(cars)
    > b <- group_by(a, "dist")
    > c <- count(b)
    > sparkR.callJMethod(c$countjc, "explain", TRUE)
    count#11L
    NULL
    ```
    
    Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side whose output is getting dropped.
    
    ## How was this patch tested?
    
    manual
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16670 from felixcheung/rjvmstdout.
    
    (cherry picked from commit a7ab6f9)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    felixcheung authored and shivaram committed Jan 27, 2017
    Commit 9a49f9a

Commits on Jan 30, 2017

  1. [SPARK-19396][DOC] JDBC Options are Case-Insensitive

    ### What changes were proposed in this pull request?
    JDBC options are case-insensitive after PR #15884 was merged into Spark 2.1; this updates the docs accordingly.
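    
    For example, a minimal sketch (assuming a `SparkSession` named `spark`, e.g. in spark-shell, and hypothetical connection details) showing that option keys that differ only in case resolve to the same JDBC options:
    
    ```scala
    // Hypothetical URL and table name; only the option-key casing matters here.
    val jdbcUrl = "jdbc:postgresql://dbhost:5432/testdb"
    
    // Lower-case option keys ...
    val df1 = spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "public.users")
      .load()
    
    // ... and mixed-case keys are treated the same after #15884.
    val df2 = spark.read.format("jdbc")
      .option("Url", jdbcUrl)
      .option("dbTable", "public.users")
      .load()
    ```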
    
    ### How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #16734 from gatorsmile/fixDocCaseInsensitive.
    
    (cherry picked from commit c0eda7e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Jan 30, 2017
    Commit 445438c

Commits on Jan 31, 2017

  1. [SPARK-19406][SQL] Fix function to_json to respect user-provided options

    ### What changes were proposed in this pull request?
    Currently, the function `to_json` allows users to provide options for generating JSON. However, it does not pass them to `JacksonGenerator`, so the user-provided options are ignored. This PR fixes that. Below is an example.
    
    ```Scala
    val df = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))).toDF("a")
    val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
    df.select(to_json($"a", options)).show(false)
    ```
    The current output is like
    ```
    +--------------------------------------+
    |structtojson(a)                       |
    +--------------------------------------+
    |{"_1":"2015-08-26T18:00:00.000-07:00"}|
    +--------------------------------------+
    ```
    
    After the fix, the output is like
    ```
    +-------------------------+
    |structtojson(a)          |
    +-------------------------+
    |{"_1":"26/08/2015 18:00"}|
    +-------------------------+
    ```
    ### How was this patch tested?
    Added test cases for both `from_json` and `to_json`
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #16745 from gatorsmile/toJson.
    
    (cherry picked from commit f9156d2)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Jan 31, 2017
    Commit 07a1788
  2. [BACKPORT-2.1][SPARKR][DOCS] update R API doc for subset/extract

    ## What changes were proposed in this pull request?
    
    backport #16721 to branch-2.1
    
    ## How was this patch tested?
    
    manual
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16749 from felixcheung/rsubsetdocbackport.
    felixcheung authored and Felix Cheung committed Jan 31, 2017
    Commit e43f161

Commits on Feb 1, 2017

  1. [SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics even if there is no new data in trigger
    
    In Structured Streaming, if a new trigger was skipped because no new data arrived, we suddenly report nothing for the `stateOperator` metrics. We can, however, easily report the metrics from `lastExecution` to ensure continuity of metrics.
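    
    As a hedged sketch of how these metrics are read (assuming a started `StreamingQuery` named `query` with a stateful aggregation): with this change, `stateOperators` stays populated even on a trigger with no new data.
    
    ```scala
    // Assumes a running StreamingQuery `query`; names here are illustrative.
    val progress = query.lastProgress  // org.apache.spark.sql.streaming.StreamingQueryProgress
    
    // Previously empty on a no-data trigger; now carried over from lastExecution.
    progress.stateOperators.foreach { op =>
      println(s"numRowsTotal=${op.numRowsTotal}, numRowsUpdated=${op.numRowsUpdated}")
    }
    println(progress.eventTime)  // event-time stats, when available
    ```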
    
    Regression test in `StreamingQueryStatusAndProgressSuite`
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #16716 from brkyvz/state-agg.
    
    (cherry picked from commit 081b7ad)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    brkyvz authored and tdas committed Feb 1, 2017
    Commit d35a126
  2. [SPARK-19410][DOC] Fix broken links in ml-pipeline and ml-tuning

    ## What changes were proposed in this pull request?
    Fix broken links in ml-pipeline and ml-tuning
    `<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`
    
    ## How was this patch tested?
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #16754 from zhengruifeng/doc_api_fix.
    
    (cherry picked from commit 04ee8cf)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    zhengruifeng authored and srowen committed Feb 1, 2017
    Commit 61cdc8c
  3. [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

    ## What changes were proposed in this pull request?
    
    The killed status was not being copied when building the newTaskInfo object (which drops unnecessary details to reduce memory usage). This patch copies the killed status into the newTaskInfo object, which corrects the status displayed in the Web UI from the wrong status to KILLED.
    
    ## How was this patch tested?
    
    Current behaviour of displaying tasks in stage UI page,
    
    | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    |143	|10	|0	|SUCCESS	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
    |156	|11	|0	|SUCCESS	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
    
    Web UI display after applying the patch,
    
    | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    |143	|10	|0	|KILLED	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  | 0.0 B / 0	| TaskKilled (killed intentionally)|
    |156	|11	|0	|KILLED	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  |0.0 B / 0	| TaskKilled (killed intentionally)|
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes #16725 from devaraj-kavali/SPARK-19377.
    
    (cherry picked from commit df4a27c)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    Devaraj K authored and zsxwing committed Feb 1, 2017
    Commit f946464

Commits on Feb 2, 2017

  1. [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

    ## What changes were proposed in this pull request?
    
    When a connection times out, `ask` may fail with a confusing message:
    
    ```
    17/02/01 23:15:19 INFO Worker: Connecting to master ...
    java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
            at scala.Predef$.require(Predef.scala:224)
            at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
            at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
            at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
            at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
            at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    ```
    
    It's better to provide a meaningful message.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16773 from zsxwing/connect-timeout.
    
    (cherry picked from commit 8303e20)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Feb 2, 2017
    Commit 7c23bd4

Commits on Feb 6, 2017

  1. [SPARK-19472][SQL] Parser should not mistake CASE WHEN(...) for a function call
    
    ## What changes were proposed in this pull request?
    The SQL parser can mistake a `WHEN (...)` used in `CASE` for a function call. This happens in cases like the following:
    ```sql
    select case when (1) + case when 1 > 0 then 1 else 0 end = 2 then 1 else 0 end
    from tb
    ```
    This PR fixes this by re-organizing the case related parsing rules.
    
    ## How was this patch tested?
    Added a regression test to the `ExpressionParserSuite`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #16821 from hvanhovell/SPARK-19472.
    
    (cherry picked from commit cb2677b)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    hvanhovell authored and gatorsmile committed Feb 6, 2017
    Commit f55bd4c

Commits on Feb 7, 2017

  1. [SPARK-19407][SS] defaultFS is taken from FileSystem.get instead of from the URI scheme
    
    ## What changes were proposed in this pull request?
    
    ```
    Caused by: java.lang.IllegalArgumentException: Wrong FS: s3a://**************/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, expected: file:///
    	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
    	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
    	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
    	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    	at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
    	at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
    	at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
    	at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
    	at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
    ```
    
    This can easily be replicated on a Spark standalone cluster by providing a checkpoint location URI with any scheme other than "file://" and not overriding it in the config.
    
    Workaround: pass `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`, or set it in the SparkConf or spark-defaults.conf.
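    
    As a sketch, the same workaround expressed in code (the bucket name is a placeholder):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Equivalent to passing --conf spark.hadoop.fs.defaultFS=s3a://somebucket
    // to spark-submit; "somebucket" is a placeholder.
    val spark = SparkSession.builder()
      .appName("checkpoint-on-s3a")
      .config("spark.hadoop.fs.defaultFS", "s3a://somebucket")
      .getOrCreate()
    ```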
    
    ## How was this patch tested?
    
    existing ut
    
    Author: uncleGen <hustyugm@gmail.com>
    
    Closes #16815 from uncleGen/SPARK-19407.
    
    (cherry picked from commit 7a0a630)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    uncleGen authored and zsxwing committed Feb 7, 2017
    Commit 62fab5b
  2. [SPARK-19444][ML][DOCUMENTATION] Fix imports not being present in documentation
    
    ## What changes were proposed in this pull request?
    
    SPARK-19444 imports not being present in documentation
    
    ## How was this patch tested?
    
    Manual
    
    ## Disclaimer
    
    Contribution is original work and I license the work to the project under the project’s open source license
    
    Author: Aseem Bansal <anshbansal@users.noreply.github.com>
    
    Closes #16789 from anshbansal/patch-1.
    
    (cherry picked from commit aee2bd2)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    anshbansal authored and srowen committed Feb 7, 2017
    Commit dd1abef
  3. [SPARK-18682][SS] Batch Source for Kafka

    Today, you can start a stream that reads from Kafka. However, given Kafka's configurable retention period, it seems like sometimes you might just want to read all of the data that is available now. As such, we should add a version that works with spark.read as well.
    The options should be the same as the streaming Kafka source, with the following differences:
    - startingOffsets should default to earliest, and should not allow latest (which would always be empty).
    - endingOffsets should also be allowed and should default to latest; the same assign JSON format as startingOffsets should also be accepted.
    It would be really good if things like .limit(n) were enough to prevent all the data from being read (this might just work).
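    
    A hedged sketch of such a batch read (assuming a `SparkSession` named `spark`; broker address and topic are placeholders):
    
    ```scala
    // Placeholders for broker and topic names.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")  // default for batch queries
      .option("endingOffsets", "latest")      // default: latest offsets at read time
      .load()
    
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
    ```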
    
    KafkaRelationSuite was added for testing batch queries via KafkaUtils.
    
    Author: Tyson Condie <tcondie@gmail.com>
    
    Closes #16686 from tcondie/SPARK-18682.
    
    (cherry picked from commit 8df4444)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    tcondie authored and zsxwing committed Feb 7, 2017
    Commit e642a07

Commits on Feb 8, 2017

  1. [SPARK-19499][SS] Add more notes in the comments of Sink.addBatch()

    ## What changes were proposed in this pull request?
    
    The addBatch method in the Sink trait is supposed to be a synchronous method, to coordinate with the fault-tolerance design in StreamExecution (unlike the compute() method in DStream).
    
    We need to add more notes in the comments of this method to remind developers of this.
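    
    For illustration only, a sketch of a custom sink (the class name is made up; `Sink` is an internal API) whose `addBatch` stays synchronous, i.e. it returns only after the batch has been fully written:
    
    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink
    
    // Hypothetical sink: addBatch blocks until the batch is durably written,
    // so StreamExecution does not record the batch as committed prematurely.
    class SynchronousConsoleSink extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // Write synchronously; do not hand the work off to a background thread.
        data.collect().foreach(row => println(s"batch=$batchId row=$row"))
      }
    }
    ```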
    
    ## How was this patch tested?
    
    existing tests
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #16840 from CodingCat/SPARK-19499.
    
    (cherry picked from commit d4cd975)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    CodingCat authored and zsxwing committed Feb 8, 2017
    Commit 706d6c1
  2. [MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming doc
    
    In http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream, the first Python example uses `readStream()` instead of `readStream`.
    
    Just removed the parentheses.
    
    Author: manugarri <manuel.garrido.pena@gmail.com>
    
    Closes #16836 from manugarri/fix_kafka_python_doc.
    
    (cherry picked from commit 5a0569c)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    manugarri authored and zsxwing committed Feb 8, 2017
    Commit 4d04029
  3. [SPARK-18609][SPARK-18841][SQL][BACKPORT-2.1] Fix redundant Alias removal in the optimizer
    
    This is a backport of 73ee739
    
    ## What changes were proposed in this pull request?
    The optimizer tries to remove redundant alias-only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies and removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias-only project; in this case the rewrite will break the tree.
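    
    As a hedged illustration (hypothetical table and column names, assuming a `SparkSession` named `spark`) of the query shape described above: a self join on a CTE whose body is an alias-only projection.
    
    ```scala
    import spark.implicits._
    
    // The CTE body is an alias-only projection; the self join duplicates that
    // subtree in the logical plan.
    Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("tbl")
    
    spark.sql("""
      WITH v AS (SELECT id AS key FROM tbl)
      SELECT t1.key, t2.key
      FROM v t1 JOIN v t2 ON t1.key = t2.key
    """).show()
    ```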
    
    This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.
    
    The current tree transformation infrastructure works very well if the transformation at hand requires little or no global contextual information. In this case we need to know both which attributes were not to be moved and which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involve traversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan` to make this much more straightforward; this basically allows you to manually traverse the tree.
    
    ## How was this patch tested?
    I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #16843 from hvanhovell/SPARK-18609-2.1.
    hvanhovell committed Feb 8, 2017
    Commit 71b6eac
  4. [SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for branch-2.1
    
    This is a follow up PR for merging #16758 to spark 2.1 branch
    
    ## What changes were proposed in this pull request?
    
    `mapGroupsWithState` is a new API for arbitrary stateful operations in Structured Streaming, similar to `DStream.mapWithState`
    
    *Requirements*
    - Users should be able to specify a function that can do the following
    - Access the input row corresponding to a key
    - Access the previous state corresponding to a key
    - Optionally, update or remove the state
    - Output any number of new rows (or none at all)
    
    *Proposed API*
    ```
    // ------------ New methods on KeyValueGroupedDataset ------------
    class KeyValueGroupedDataset[K, V] {
      // Scala friendly
      def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
      def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])
      // Java friendly
      def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
      def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
    }
    
    // ------------------- New Java-friendly function classes -------------------
    public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
      R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
    }
    public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
      Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
    }
    
    // ---------------------- Wrapper class for state data ----------------------
    trait KeyedState[S] {
      def exists(): Boolean
      def get(): S              // throws an Exception if state does not exist
      def getOption(): Option[S]
      def update(newState: S): Unit
      def remove(): Unit        // exists() will be false after this
    }
    ```
    
    Key Semantics of the State class
    - The state can be null.
    - If state.remove() is called, then state.exists() will return false, and getOption will return None.
    - After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
    - None of the operations are thread-safe. This is to avoid memory barriers.
    
    *Usage*
    ```
    val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
      val newCount = words.size + runningCount.getOption.getOrElse(0L)
      runningCount.update(newCount)
      (word, newCount)
    }
    
    dataset                                                  // type is Dataset[String]
      .groupByKey[String](w => w)                            // generates KeyValueGroupedDataset[String, String]
      .mapGroupsWithState[Long, (String, Long)](stateFunc)   // returns Dataset[(String, Long)]
    ```
    
    ## How was this patch tested?
    New unit tests.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #16850 from tdas/mapWithState-branch-2.1.
    tdas authored and zsxwing committed Feb 8, 2017
    Commit 502c927

Commits on Feb 9, 2017

  1. [SPARK-19481] [REPL] [MAVEN] Avoid leaking SparkContext in Signaling.cancelOnInterrupt
    
    ## What changes were proposed in this pull request?
    
    `Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes ReplSuite unstable.
    
    This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16825 from zsxwing/SPARK-19481.
    
    (cherry picked from commit 303f00a)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    zsxwing authored and davies committed Feb 9, 2017
    Commit b3fd36a
  2. [SPARK-19509][SQL] Grouping Sets do not respect nullable grouping columns
    
    ## What changes were proposed in this pull request?
    The analyzer currently does not check whether a column used in grouping sets is actually nullable itself. This can cause the nullability of the column to be incorrect, which can cause null pointer exceptions down the line. This PR fixes that by also considering the nullability of the column.
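    
    A hedged sketch (hypothetical table, assuming a `SparkSession` named `spark`) of the kind of query affected: even a non-nullable grouping column produces NULLs for the grouping sets that do not include it, so its output nullability must account for that.
    
    ```scala
    import spark.implicits._
    
    // `id` and `grp` are non-nullable in the input, but grouping sets emit NULL
    // for a column in the sets that do not include it.
    Seq((1, "a"), (2, "b")).toDF("id", "grp").createOrReplaceTempView("t")
    
    spark.sql("""
      SELECT id, grp, count(*) AS cnt
      FROM t
      GROUP BY id, grp GROUPING SETS ((id), (grp))
    """).show()
    ```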
    
    This is only a problem for Spark 2.1 and below. The latest master uses a different approach.
    
    Closes #16874
    
    ## How was this patch tested?
    Added a regression test to `SQLQueryTestSuite.grouping_set`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #16873 from hvanhovell/SPARK-19509.
    Stan Zhai authored and hvanhovell committed Feb 9, 2017
    Commit a3d5300

Commits on Feb 10, 2017

  1. [SPARK-19512][BACKPORT-2.1][SQL] codegen for compare structs fails #16852
    
    ## What changes were proposed in this pull request?
    
    Set currentVars to null in GenerateOrdering.genComparisons before genCode is called. genCode ignores INPUT_ROW if currentVars is not null and in genComparisons we want it to use INPUT_ROW.
    
    ## How was this patch tested?
    
    Added test with 2 queries in WholeStageCodegenSuite
    
    Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
    
    Closes #16875 from bogdanrdc/SPARK-19512-2.1.
    bogdanrdc authored and rxin committed Feb 10, 2017
    Commit ff5818b
  2. [SPARK-19543] from_json fails when the input row is empty

    ## What changes were proposed in this pull request?
    
    Using from_json on a column with an empty string results in: java.util.NoSuchElementException: head of empty list.
    
    This is because `parser.parse(input)` may return `Nil` when `input.trim.isEmpty`
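    
    A hedged reproduction sketch (assuming a `SparkSession` named `spark`; the column values are made up):
    
    ```scala
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import spark.implicits._
    
    val schema = StructType(Seq(StructField("a", StringType)))
    
    // The second row is an empty string, which previously triggered
    // "java.util.NoSuchElementException: head of empty list".
    Seq("""{"a": "x"}""", "").toDF("json")
      .select(from_json($"json", schema).as("parsed"))
      .show()
    ```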
    
    ## How was this patch tested?
    
    Regression test in `JsonExpressionsSuite`
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #16881 from brkyvz/json-fix.
    
    (cherry picked from commit d5593f7)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    brkyvz authored and hvanhovell committed Feb 10, 2017
    Commit 7b5ea00

Commits on Feb 11, 2017

  1. [SPARK-18717][SQL] Make code generation for Scala Map work with immutable.Map also
    
    ## What changes were proposed in this pull request?
    
    Fixes compile errors in generated code when a user has a case class with a `scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since ArrayBasedMapData.toScalaMap returns the immutable version, we can make it work with both.
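    
    A hedged sketch of the affected pattern (class and field names are made up, assuming a `SparkSession` named `spark`):
    
    ```scala
    import spark.implicits._
    
    // A case class field explicitly typed as immutable.Map previously caused
    // compile errors in the generated encoder code.
    case class Record(id: Int, attrs: scala.collection.immutable.Map[String, String])
    
    val ds = Seq(Record(1, Map("k" -> "v"))).toDS()
    ds.show()
    ```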
    
    ## How was this patch tested?
    
    Additional unit tests.
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes #16161 from aray/fix-map-codegen.
    
    (cherry picked from commit 46d30ac)
    Signed-off-by: Cheng Lian <lian@databricks.com>
    aray authored and liancheng committed Feb 11, 2017
    Commit e580bb0

Commits on Feb 12, 2017

  1. [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column
    
    ## What changes were proposed in this pull request?
    
    Fix a bug in the collect method for collecting a timestamp column; the bug can be reproduced with the following code and output:
    
    ```
    library(SparkR)
    sparkR.session(master = "local")
    df <- data.frame(col1 = c(0, 1, 2),
                     col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
    
    sdf1 <- createDataFrame(df)
    print(dtypes(sdf1))
    df1 <- collect(sdf1)
    print(lapply(df1, class))
    
    sdf2 <- filter(sdf1, "col1 > 0")
    print(dtypes(sdf2))
    df2 <- collect(sdf2)
    print(lapply(df2, class))
    ```
    
    As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when an NA is at the top of the column.
    
    This is caused by `do.call(c, list)`: if we convert a list, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.
    
    Therefore, we need to cast the data type of the vector explicitly.
    
    ## How was this patch tested?
    
    The patch can be tested manually with the same code above.
    
    Author: titicaca <fangzhou.yang@hotmail.com>
    
    Closes #16689 from titicaca/sparkr-dev.
    
    (cherry picked from commit bc0a0e6)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    titicaca authored and Felix Cheung committed Feb 12, 2017
    Commit 173c238
  2. [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal k
    
    ## What changes were proposed in this pull request?
    
    Backport fix of #16666
    
    ## How was this patch tested?
    
    Backport unit tests
    
    Author: wm624@hotmail.com <wm624@hotmail.com>
    
    Closes #16761 from wangmiao1981/kmeansport.
    wangmiao1981 authored and Felix Cheung committed Feb 12, 2017
    Commit 06e77e0

Commits on Feb 13, 2017

  1. [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group
    
    ## What changes were proposed in this pull request?
    
    In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer and create a new consumer. In our current implementation, the first consumer and the second consumer would be in the same group (which leads to SPARK-19559), **_violating our intention of the two consumers not being in the same group._**
    
    The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. Then even if `groupId` and `nextId` are increased during the creation of that first consumer, `groupId` and `nextId` would still be initialized to default values in the constructor for the second consumer.
    
    We should make sure that `groupId` and `nextId` are initialized before any consumer is created.
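    
    A purely illustrative sketch of this initialization-order pitfall (the class and field names are made up, not KafkaOffsetReader's actual code):
    
    ```scala
    // Fields declared later in the class body are still at their JVM default
    // values when an earlier initializer runs, and their own initializers then
    // overwrite whatever the earlier code did.
    class OffsetReaderLike {
      // Runs first: nextId is still 0 here, and the increment below is undone
      // when nextId's own initializer runs afterwards.
      private val firstGroupId = { val id = nextId; nextId += 1; s"group-$id" }
    
      private var nextId = 0
    
      def newGroupId(): String = { val id = nextId; nextId += 1; s"group-$id" }
      // firstGroupId == "group-0" and the first newGroupId() also returns "group-0".
    }
    ```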
    
    ## How was this patch tested?
    
    Ran 100 times of `KafkaSourceSuite`; all passed
    
    Author: Liwei Lin <lwlin7@gmail.com>
    
    Closes #16902 from lw-lin/SPARK-19564-.
    
    (cherry picked from commit 2bdbc87)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    lw-lin authored and zsxwing committed Feb 13, 2017
    Commit fe4fcc5
  2. [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is not equal to end indices amount
    
    ### What changes were proposed in this pull request?
    ```
    Liquid Exception: Start indices amount is not equal to end indices amount, see /Users/xiao/IdeaProjects/sparkDelivery/docs/../examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java. in ml-features.md
    ```
    
    The build has been broken since #16789 was merged.
    
    This PR is to fix it.
    
    ## How was this patch tested?
    Manual
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #16908 from gatorsmile/docMLFix.
    
    (cherry picked from commit 855a1b7)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    gatorsmile authored and srowen committed Feb 13, 2017
    Commit a3b6751
  3. [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util

    ## What changes were proposed in this pull request?
    
    Add missing `warnings` import.
    
    ## How was this patch tested?
    
    Manual tests.
    
    Author: zero323 <zero323@users.noreply.github.com>
    
    Closes #16846 from zero323/SPARK-19506.
    
    (cherry picked from commit 5e7cd33)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    zero323 authored and holdenk committed Feb 13, 2017
    Commit ef4fb7e
  4. [SPARK-19542][SS] Delete the temp checkpoint if a query is stopped without errors
    
    ## What changes were proposed in this pull request?
    
    When a query uses a temp checkpoint dir, it's better to delete it if it's stopped without errors.
    
    ## How was this patch tested?
    
    New unit tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16880 from zsxwing/delete-temp-checkpoint.
    
    (cherry picked from commit 3dbff9b)
    Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
    zsxwing authored and brkyvz committed Feb 13, 2017
    Commit c5a7cb0
  5. [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes
    
    ## What changes were proposed in this pull request?
    
    Netty's `MessageToMessageEncoder` uses [Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62) to generate a matcher class, and the implementation calls `Class.forName` to check if this class has already been generated. If `MessageEncoder` or `MessageDecoder` is created in `ExecutorClassLoader.findClass`, it will cause `ClassCircularityError`. This is because loading this Netty-generated class will call `ExecutorClassLoader.findClass` to search for this class, and `ExecutorClassLoader` will try to use RPC to load it, which causes the non-existent matcher class to be loaded again. The JVM reports `ClassCircularityError` to prevent such infinite recursion.
    
    ##### Why it only happens in Maven builds
    
    It's because Maven and SBT have different class loader trees. The Maven build will set a URLClassLoader as the current context class loader to run the tests and expose this issue. The class loader tree is as follows:
    
    ```
    bootstrap class loader ------ ... ----- REPL class loader ---- ExecutorClassLoader
    |
    |
    URLClassLoader
    ```
    
    The SBT build uses the bootstrap class loader directly, and `ReplSuite.test("propagation of local properties")` is the first test in ReplSuite, which happens to load `io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher` into the bootstrap class loader (Note: in the Maven build, it's loaded into URLClassLoader so it cannot be found in ExecutorClassLoader). This issue can be reproduced in SBT as well. Here are the reproduce steps:
    - Enable `hadoop.caller.context.enabled`.
    - Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
    - Ignore `ReplSuite.test("propagation of local properties")`.
    - Run `ReplSuite` using SBT.
    
    This PR just creates a singleton MessageEncoder and MessageDecoder and makes sure they are created before switching to ExecutorClassLoader. TransportContext will be created when creating RpcEnv and that happens before creating ExecutorClassLoader.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16859 from zsxwing/SPARK-17714.
    
    (cherry picked from commit 905fdf0)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Feb 13, 2017
    Commit 328b229
  6. Commit 2968d8c
  7. [SPARK-19529] TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
    
    This patch replaces a single `awaitUninterruptibly()` call with a plain `await()` call in Spark's `network-common` library in order to fix a bug which may cause tasks to be uncancellable.
    
    In Spark's Netty RPC layer, `TransportClientFactory.createClient()` calls `awaitUninterruptibly()` on a Netty future while waiting for a connection to be established. This creates a problem when a Spark task is interrupted while blocking in this call (which can happen in the event of a slow connection which will eventually time out). This has bad impacts on task cancellation when `interruptOnCancel = true`.
    
    As an example of the impact of this problem, I experienced significant numbers of uncancellable "zombie tasks" on a production cluster where several tasks were blocked trying to connect to a dead shuffle server and then continued running as zombies after I cancelled the associated Spark stage. The zombie tasks ran for several minutes with the following stack:
    
    ```
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:460)
    io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607)
    io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
    org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
    org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) => holding Monitor(java.lang.Object1849476028})
    org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
    org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
    org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
    org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
    org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350)
    org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
    org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
    org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
    org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    [...]
    ```
    
    As far as I can tell, `awaitUninterruptibly()` might have been used in order to avoid having to declare that methods throw `InterruptedException` (this code is written in Java, hence the need to use checked exceptions). This patch simply replaces this with a regular, interruptible `await()` call.
    
    This required several interface changes to declare a new checked exception (these are internal interfaces, though, and this change doesn't significantly impact binary compatibility).
    
    An alternative approach would be to wrap `InterruptedException` into `IOException` in order to avoid having to change interfaces. The problem with this approach is that the `network-shuffle` project's `RetryingBlockFetcher` code treats `IOExceptions` as transient failures when deciding whether to retry fetches, so throwing a wrapped `IOException` might cause an interrupted shuffle fetch to be retried, further prolonging the lifetime of a cancelled zombie task.
    
    Note that there are three other `awaitUninterruptibly()` in the codebase, but those calls have a hard 10 second timeout and are waiting on a `close()` operation which is expected to complete near instantaneously, so the impact of uninterruptibility there is much smaller.
    
    Manually.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #16866 from JoshRosen/SPARK-19529.
    
    (cherry picked from commit 1c4d10b)
    Signed-off-by: Cheng Lian <lian@databricks.com>
    JoshRosen authored and liancheng committed Feb 13, 2017
    Commit 5db2347
  8. [SPARK-19520][STREAMING] Do not encrypt data written to the WAL.

    Spark's I/O encryption uses an ephemeral key for each driver instance.
    So driver B cannot decrypt data written by driver A since it doesn't
    have the correct key.
    
    The write ahead log is used for recovery, thus needs to be readable by
    a different driver. So it cannot be encrypted by Spark's I/O encryption
    code.
    
    The BlockManager APIs used by the WAL code to write the data automatically
    encrypt data, so changes are needed so that callers can opt out of
    encryption.
    
    Aside from that, the "putBytes" API in the BlockManager does not do
    encryption, so a separate situation arose where the WAL would write
    unencrypted data to the BM and, when those blocks were read, decryption
    would fail. So the WAL code needs to ask the BM to encrypt that data
    when encryption is enabled; this code is not optimal since it results
    in a (temporary) second copy of the data block in memory, but should be
    OK for now until a more performant solution is added. The non-encryption
    case should not be affected.
    
    Tested with new unit tests, and by running streaming apps that do
    recovery using the WAL data with I/O encryption turned on.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #16862 from vanzin/SPARK-19520.
    
    (cherry picked from commit 0169360)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Feb 13, 2017
    Commit 7fe3543

Commits on Feb 14, 2017

  1. [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTable API calls in the doc
    
    ## What changes were proposed in this pull request?
    
    https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
    In the doc, the calls spark.cacheTable("tableName") and spark.uncacheTable("tableName") actually need to be spark.catalog.cacheTable and spark.catalog.uncacheTable.
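    
    The corrected calls, sketched (the table name is a placeholder and `spark` is an existing `SparkSession`):
    
    ```scala
    // "tableName" is a placeholder for a registered table or view.
    spark.catalog.cacheTable("tableName")
    // ... run queries against the cached table ...
    spark.catalog.uncacheTable("tableName")
    ```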
    
    ## How was this patch tested?
    Built the docs and verified the change shows up fine.
    
    Author: Sunitha Kambhampati <skambha@us.ibm.com>
    
    Closes #16919 from skambha/docChange.
    
    (cherry picked from commit 9b5e460)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    skambha authored and gatorsmile committed Feb 14, 2017
    Commit c8113b0
  2. [SPARK-19501][YARN] Reduce the number of HDFS RPCs during YARN deployment
    
    ## What changes were proposed in this pull request?
    
    As discussed in [JIRA](https://issues.apache.org/jira/browse/SPARK-19501), this patch addresses the problem where too many HDFS RPCs are made when there are many URIs specified in `spark.yarn.jars`, potentially adding hundreds of RTTs to YARN before the application launches. This becomes significant when submitting the application to a non-local YARN cluster (where the RTT may be in order of 100ms, for example). For each URI specified, the current implementation makes at least two HDFS RPCs, for:
    
    - [Calling `getFileStatus()` before uploading each file to the distributed cache in `ClientDistributedCacheManager.addResource()`](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71).
    - [Resolving any symbolic links in each of the file URI](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L377-L379), which repeatedly makes HDFS RPCs until the all symlinks are resolved. (see [`FileContext.resolve(Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java#L2189-L2195), [`FSLinkResolver.resolve(FileContext, Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSLinkResolver.java#L79-L112), and [`AbstractFileSystem.resolvePath()`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/AbstractFileSystem.java#L464-L468).)
    
    The first `getFileStatus` RPC can be removed, using `statCache` populated with the file statuses retrieved with [the previous `globStatus` call](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531).
    
    The second one can be largely reduced by caching the symlink resolution results in a mutable.HashMap. This patch adds a local variable in `yarn.Client.prepareLocalResources()` and passes it as an additional parameter to `yarn.Client.copyFileToRemote`.  [The symlink resolution code was added in 2013](a35472e#diff-b050df3f55b82065803d6e83453b9706R187) and has not changed since. I am assuming that this is still required, but otherwise we can remove using `symlinkCache` and symlink resolution altogether.
    
    ## How was this patch tested?
    
    This patch is based off 8e8afb3, currently the latest YARN patch on master. All tests except a few in spark-hive passed with `./dev/run-tests` on my machine, using JDK 1.8.0_112 on macOS 10.12.3; also tested myself with this modified version of SPARK 2.2.0-SNAPSHOT which performed a normal deployment and execution on a YARN cluster without errors.
    
    Author: Jong Wook Kim <jongwook@nyu.edu>
    
    Closes #16916 from jongwook/SPARK-19501.
    
    (cherry picked from commit ab9872d)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    jongwook authored and Marcelo Vanzin committed Feb 14, 2017
    Commit f837ced
  3. [SPARK-19387][SPARKR] Tests do not run with SparkR source package in CRAN check
    
    ## What changes were proposed in this pull request?
    
    - this is caused by changes in SPARK-18444 and SPARK-18643 whereby we no longer install Spark when `master = ""` (the default), but it is also related to SPARK-18449, since the real `master` value is not known at the time the R code in `sparkR.session` is run (`master` cannot default to "local" since it could be overridden by the spark-submit command line or the Spark config)
    - as a result, while running SparkR as a package in an IDE works fine, the CRAN check does not, since it launches SparkR via a non-interactive script
    - the fix is to add a check to the beginning of each test and vignette; the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better.
    
    ## How was this patch tested?
    
    Tested this by reverting version to 2.1, since it needs to download the release jar with matching version. But since there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests are failing in this config. Will need to port this to branch-2.1 and retest with 2.1 release jar.
    
    manually as:
    ```
    # modify DESCRIPTION to revert version to 2.1.0
    SPARK_HOME=/usr/spark R CMD build pkg
    # run cran check without SPARK_HOME
    R CMD check --as-cran SparkR_2.1.0.tar.gz
    ```
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16720 from felixcheung/rcranchecktest.
    
    (cherry picked from commit a3626ca)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    felixcheung authored and shivaram committed Feb 14, 2017
    Commit 7763b0b

Commits on Feb 15, 2017

  1. [SPARK-19584][SS][DOCS] update structured streaming documentation around batch mode
    
    ## What changes were proposed in this pull request?
    
    Revision to structured-streaming-kafka-integration.md to reflect new Batch query specification and options.
    
    zsxwing tdas
    
    
    Author: Tyson Condie <tcondie@gmail.com>
    
    Closes #16918 from tcondie/kafka-docs.
    
    (cherry picked from commit 447b2b5)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tcondie authored and tdas committed Feb 15, 2017
    Commit 8ee4ec8
  2. [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column

    Add coalesce on DataFrame for reducing the number of partitions without a shuffle, and coalesce on Column.
    
    manual, unit tests
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16739 from felixcheung/rcoalesce.
    
    (cherry picked from commit 671bc08)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Feb 15, 2017
    Commit 6c35399

Commits on Feb 16, 2017

  1. [SPARK-19599][SS] Clean up HDFSMetadataLog

    ## What changes were proposed in this pull request?
    
    SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog.
    
    This PR includes the following changes:
    - ~~Remove the workaround codes for HADOOP-10622.~~ Unfortunately, there is another issue [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) that prevents us from removing the workaround codes.
    - Remove unnecessary `writer: (T, OutputStream) => Unit` and just call `serialize` directly.
    - Remove catching FileNotFoundException.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16932 from zsxwing/metadata-cleanup.
    
    (cherry picked from commit 21b4ba2)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Feb 16, 2017
    Commit 88c43f4
  2. [SPARK-19604][TESTS] Log the start of every Python test

    ## What changes were proposed in this pull request?
    Right now, we only have an info-level log after we finish the tests of a Python test file. We should also log the start of a test, so that if a test is hanging, we can tell which test file is running.
    
    ## How was this patch tested?
    This is a change for python tests.
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #16935 from yhuai/SPARK-19604.
    
    (cherry picked from commit f6c3bba)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    yhuai committed Feb 16, 2017
    Commit b9ab4c0
  3. [SPARK-19603][SS] Fix StreamingQuery explain command

    ## What changes were proposed in this pull request?
    
    `StreamingQuery.explain` doesn't show the correct streaming physical plan right now because `ExplainCommand` receives a runtime batch plan and its `logicalPlan.isStreaming` is always false.
    
    This PR adds `streaming` parameter to `ExplainCommand` to allow `StreamExecution` to specify that it's a streaming plan.
    
    Examples of the explain outputs:
    
    - streaming DataFrame.explain()
    ```
    == Physical Plan ==
    *HashAggregate(keys=[value#518], functions=[count(1)])
    +- StateStoreSave [value#518], OperatorStateId(<unknown>,0,0), Append, 0
       +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
          +- StateStoreRestore [value#518], OperatorStateId(<unknown>,0,0)
             +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                +- Exchange hashpartitioning(value#518, 5)
                   +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                      +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                         +- *MapElements <function1>, obj#517: java.lang.String
                            +- *DeserializeToObject value#513.toString, obj#516: java.lang.String
                               +- StreamingRelation MemoryStream[value#513], [value#513]
    ```
    
    - StreamingQuery.explain(extended = false)
    ```
    == Physical Plan ==
    *HashAggregate(keys=[value#518], functions=[count(1)])
    +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
       +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
          +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
             +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                +- Exchange hashpartitioning(value#518, 5)
                   +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                      +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                         +- *MapElements <function1>, obj#517: java.lang.String
                            +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                               +- LocalTableScan [value#543]
    ```
    
    - StreamingQuery.explain(extended = true)
    ```
    == Parsed Logical Plan ==
    Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
    +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
       +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
          +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
             +- LocalRelation [value#543]
    
    == Analyzed Logical Plan ==
    value: string, count(1): bigint
    Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
    +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
       +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
          +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
             +- LocalRelation [value#543]
    
    == Optimized Logical Plan ==
    Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
    +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
       +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
          +- DeserializeToObject value#543.toString, obj#516: java.lang.String
             +- LocalRelation [value#543]
    
    == Physical Plan ==
    *HashAggregate(keys=[value#518], functions=[count(1)], output=[value#518, count(1)#524L])
    +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
       +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
          +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
             +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
                +- Exchange hashpartitioning(value#518, 5)
                   +- *HashAggregate(keys=[value#518], functions=[partial_count(1)], output=[value#518, count#530L])
                      +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                         +- *MapElements <function1>, obj#517: java.lang.String
                            +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                               +- LocalTableScan [value#543]
    ```
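
    A minimal sketch, assuming a local socket source (host, port, and query shape are illustrative), of a streaming aggregation whose plans resemble the output above, together with both explain variants:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("explain-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical socket source; any streaming source would produce a similar plan shape.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Dataset[String] -> map -> groupBy/count, mirroring the operators in the plans above.
    val counts = lines.as[String].map(_.trim).groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.explain()                 // physical plan only
    query.explain(extended = true)  // parsed, analyzed, optimized and physical plans
    ```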
    
    ## How was this patch tested?
    
    The updated unit test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16934 from zsxwing/SPARK-19603.
    
    (cherry picked from commit fc02ef9)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Feb 16, 2017
    Commit db7adb6
  4. [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by merge

    ## What changes were proposed in this pull request?
    
    fix test broken by git merge for #16739
    
    ## How was this patch tested?
    
    manual
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16950 from felixcheung/fixrtest.
    felixcheung authored and Felix Cheung committed Feb 16, 2017
    Commit 252dd05

Commits on Feb 17, 2017

  1. [SPARK-19622][WEBUI] Fix a http error in a paged table when using a `…

    …Go` button to search.
    
    ## What changes were proposed in this pull request?
    
    The search function of the paged table is not available because we don't skip the hash data of the request path.
    
    ![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png)
    
    ## How was this patch tested?
    
    Tested manually with my browser.
    
    Author: Stan Zhai <zhaishidan@haizhi.com>
    
    Closes #16953 from stanzhai/fix-webui-paged-table.
    
    (cherry picked from commit 021062a)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    stanzhai authored and srowen committed Feb 17, 2017
    Commit 55958bc
  2. [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap

    ## What changes were proposed in this pull request?
    
    Radix sort requires that half of the array be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap will not hold more items than 1/2 of its capacity. It turned out this is not true: the current implementation of append() could leave one more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1).

    This PR fixes the off-by-one bug in BytesToBytesMap.

    This PR also fixes a bug where the array never grows if it fails to grow once (it stays at the initial capacity), introduced by #15722 .
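
    A simplified sketch of the invariant (names are illustrative, not Spark's actual internals): the append path has to check the limit before inserting, so the map never ends up with more than capacity / 2 entries.

    ```scala
    // Radix sort needs half of the backing array free, so a new entry may only be
    // appended while numKeys + 1 <= capacity / 2.
    def canAppend(numKeys: Int, capacity: Int): Boolean =
      numKeys + 1 <= capacity / 2

    // Checking numKeys <= capacity / 2 only after deciding to insert is exactly the
    // off-by-one described above: it allows one entry too many.
    ```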
    
    ## How was this patch tested?
    
    Added regression test.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #16844 from davies/off_by_one.
    
    (cherry picked from commit 3d0c3af)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    Davies Liu authored and davies committed Feb 17, 2017
    Commit 6e3abed
  3. [SPARK-19517][SS] KafkaSource fails to initialize partition offsets

    ## What changes were proposed in this pull request?
    
    This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets.
    
    ## How was this patch tested?
    
    I ran the test suite for spark-sql-kafka-0-10.
    
    Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
    
    Closes #16857 from vitillo/kafka_source_fix.
    vitillo authored and zsxwing committed Feb 17, 2017
    Commit b083ec5

Commits on Feb 20, 2017

  1. [SPARK-19646][CORE][STREAMING] binaryRecords replicates records in sc…

    …ala API
    
    ## What changes were proposed in this pull request?
    
    Use `BytesWritable.copyBytes`, not `getBytes`, because `getBytes` returns the underlying array, which may be reused when repeated reads don't need a different size, as is the case with the binaryRecords APIs.
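
    A small sketch of the difference (the helper names are made up):

    ```scala
    import org.apache.hadoop.io.BytesWritable

    // copyBytes returns a fresh array sized to getLength, so each record is independent.
    def safeRecord(w: BytesWritable): Array[Byte] = w.copyBytes()

    // getBytes returns the reusable backing buffer (possibly longer than getLength);
    // keeping references to it lets later reads overwrite earlier "records".
    def unsafeRecord(w: BytesWritable): Array[Byte] = w.getBytes
    ```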
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #16974 from srowen/SPARK-19646.
    
    (cherry picked from commit d0ecca6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Feb 20, 2017
    Commit 7c371de
  2. [SPARK-19646][BUILD][HOTFIX] Fix compile error from cherry-pick of SP…

    …ARK-19646 into branch 2.1
    
    ## What changes were proposed in this pull request?
    
    Fix compile error from cherry-pick of SPARK-19646 into branch 2.1
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #17003 from srowen/SPARK-19646.2.
    srowen committed Feb 20, 2017
    Commit c331674

Commits on Feb 21, 2017

  1. [SPARK-19626][YARN] Using the correct config to set credentials updat…

    …e time
    
    ## What changes were proposed in this pull request?
    
    In #14065, we introduced a configurable credential manager for Spark running on YARN. Two configs, `spark.yarn.credentials.renewalTime` and `spark.yarn.credentials.updateTime`, were also added: one for the credential renewer and the other for the updater. But we currently query `spark.yarn.credentials.renewalTime` by mistake during credentials updating, where it should actually be `spark.yarn.credentials.updateTime`.
    
    This PR fixes this mistake.
    
    ## How was this patch tested?
    
    existing test
    
    cc jerryshao vanzin
    
    Author: Kent Yao <yaooqinn@hotmail.com>
    
    Closes #16955 from yaooqinn/cred_update.
    
    (cherry picked from commit 7363dde)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    yaooqinn authored and Marcelo Vanzin committed Feb 21, 2017
    Commit 6edf02a

Commits on Feb 22, 2017

  1. [SPARK-19617][SS] Fix the race condition when starting and stopping a…

    … query quickly (branch-2.1)
    
    ## What changes were proposed in this pull request?
    
    Backport #16947 to branch 2.1. Note: we still need to support old Hadoop versions in 2.1.*.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16979 from zsxwing/SPARK-19617-branch-2.1.
    zsxwing committed Feb 22, 2017
    Commit 9a890b5
  2. [SPARK-19652][UI] Do auth checks for REST API access (branch-2.1).

    The REST API has a security filter that performs auth checks
    based on the UI root's security manager. That works fine when
    the UI root is the app's UI, but not when it's the history server.
    
    In the SHS case, all users would be allowed to see all applications
    through the REST API, even if the UI itself wouldn't be available
    to them.
    
    This change adds auth checks for each app access through the API
    too, so that only authorized users can see the app's data.
    
    The change also modifies the existing security filter to use
    `HttpServletRequest.getRemoteUser()`, which is used in other
    places. That is not necessarily the same as the principal's
    name; for example, when using Hadoop's SPNEGO auth filter,
    the remote user strips the realm information, which then matches
    the user name registered as the owner of the application.
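
    A hypothetical helper illustrating the kind of per-application check described above (not the actual filter code; SecurityManager here refers to Spark's internal org.apache.spark.SecurityManager):

    ```scala
    import javax.servlet.http.HttpServletRequest
    import org.apache.spark.SecurityManager

    // True if the remote user reported by the auth filter may view this app's data.
    def canViewApp(req: HttpServletRequest, secMgr: SecurityManager): Boolean = {
      val user = Option(req.getRemoteUser).getOrElse("")
      secMgr.checkUIViewPermissions(user)
    }
    ```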
    
    I also renamed the UIRootFromServletContext trait to a more generic
    name since I'm using it to store more context information now.
    
    Tested manually with an authentication filter enabled.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #17019 from vanzin/SPARK-19652_2.1.
    Marcelo Vanzin committed Feb 22, 2017
    Commit 21afc45

Commits on Feb 23, 2017

  1. [SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[…

    …" takes vector index
    
    ## What changes were proposed in this pull request?
    The `[[` method is supposed to take a single index and return a column. This is different from base R, which takes a vector index. We should check for this and issue a warning or error when a vector index is supplied (which is very likely given the behavior in base R).

    Currently I'm issuing a warning message and just taking the first element of the vector index. We could change this to an error if that's better.
    
    ## How was this patch tested?
    new tests
    
    Author: actuaryzhang <actuaryzhang10@gmail.com>
    
    Closes #17017 from actuaryzhang/sparkRSubsetter.
    
    (cherry picked from commit 7bf0943)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    actuaryzhang authored and Felix Cheung committed Feb 23, 2017
    Commit d30238f
  2. [SPARK-19459][SQL][BRANCH-2.1] Support for nested char/varchar fields…

    … in ORC
    
    ## What changes were proposed in this pull request?
    This is a backport of the two following commits: 78eae7e & de8a03e
    
    This PR adds support for ORC tables with (nested) char/varchar fields.
    
    ## How was this patch tested?
    Added a regression test to `OrcSourceSuite`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #17041 from hvanhovell/SPARK-19459-branch-2.1.
    hvanhovell authored and gatorsmile committed Feb 23, 2017
    Commit 43084b3

Commits on Feb 24, 2017

  1. [SPARK-19691][SQL][BRANCH-2.1] Fix ClassCastException when calculatin…

    …g percentile of decimal column
    
    ## What changes were proposed in this pull request?
    This is a backport of the following commit: 93aa427

    This PR fixes the class-cast exception below:
    ```
    scala> spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect()
     java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number
    	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
    	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
    	at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
    	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
    	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
    	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
    	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
    	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
    	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
    	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
    	at
    ```
    This fix simply converts catalyst values (i.e., `Decimal`) into scala ones by using `CatalystTypeConverters`.
    
    ## How was this patch tested?
    Added a test in `DataFrameSuite`.
    
    Author: Takeshi Yamamuro <yamamuro@apache.org>
    
    Closes #17046 from maropu/SPARK-19691-BACKPORT2.1.
    maropu authored and hvanhovell committed Feb 24, 2017
    Commit 66a7ca2
  2. [SPARK-19707][CORE] Improve the invalid path check for sc.addJar

    ## What changes were proposed in this pull request?
    
    Currently in Spark there are two issues when we add jars with an invalid path:

    * If the jar path is an empty string {--jar ",dummy.jar"}, then Spark will resolve it to the current directory path and add it to the classpath / file server, which is unwanted. This happened in our programmatic way of submitting a Spark application. From my understanding, Spark should defensively filter out such empty paths.
    * If the jar path is invalid (the file doesn't exist), `addJar` doesn't check it and will still add it to the file server; the exception is delayed until the job runs. This local path could actually be checked beforehand, with no need to wait until the task runs. We have a similar check in `addFile`, but `addJar` lacks a similar mechanism (see the sketch below).
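
    A rough sketch of that validation under stated assumptions (`validJarPath` is a made-up helper, not the code added by this PR): filter out empty entries and reject plain local paths that don't point at an existing file.

    ```scala
    import java.io.File

    // Keep only non-empty entries; for plain local paths, require the file to exist.
    def validJarPath(path: String): Boolean = {
      val trimmed = path.trim
      trimmed.nonEmpty && (trimmed.contains("://") || new File(trimmed).isFile)
    }

    // "--jars ,dummy.jar" produces an empty first entry that should be filtered out:
    val usable = ",dummy.jar".split(",").toSeq.filter(validJarPath)
    ```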
    
    ## How was this patch tested?
    
    Add unit test and local manual verification.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17038 from jerryshao/SPARK-19707.
    
    (cherry picked from commit b0a8c16)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    jerryshao authored and Marcelo Vanzin committed Feb 24, 2017
    Commit 6da6a27
  3. [SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-cl…

    …ient
    
    ## What changes were proposed in this pull request?
    
    Because yarn#client resets the `spark.yarn.keytab` configuration to point to the location of the distributed file, if the user still uses the old `SparkConf` to create a `SparkSession` with Hive enabled, it will read the keytab from the path in the distributed cache. This is OK for yarn cluster mode, but in yarn client mode, where the driver runs outside the container, fetching the keytab will fail.

    So we should avoid resetting this configuration in yarn#client and only overwrite it for the AM, so that `spark.yarn.keytab` gives the correct keytab path whether running in client (keytab in the local fs) or cluster (keytab in the distributed cache) mode.
    
    ## How was this patch tested?
    
    Verified in security cluster.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #16923 from jerryshao/SPARK-19038.
    
    (cherry picked from commit a920a43)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    jerryshao authored and Marcelo Vanzin committed Feb 24, 2017
    Commit ed9aaa3

Commits on Feb 25, 2017

  1. [MINOR][DOCS] Fixes two problems in the SQL programing guide page

    ## What changes were proposed in this pull request?
    
    Removed duplicated lines in the SQL Python example and fixed a typo.
    
    ## How was this patch tested?
    
    Searched for other typos in the page to minimize PRs.
    
    Author: Boaz Mohar <boazmohar@gmail.com>
    
    Closes #17066 from boazmohar/doc-fix.
    
    (cherry picked from commit 061bcfb)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    boazmohar authored and gatorsmile committed Feb 25, 2017
    Commit 97866e1

Commits on Feb 26, 2017

  1. [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala imp…

    …lementation
    
    ## What changes were proposed in this pull request?
    Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
    
    ## How was this patch tested?
    Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
    BryanCutler authored and jkbradley committed Feb 26, 2017
    Commit 20a4329
  2. [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to h…

    …andle QueryTerminatedEvent if more then one listeners exists
    
    ## What changes were proposed in this pull request?
    
    Currently, if multiple streaming query listeners exist, when a QueryTerminatedEvent is triggered only one of the listeners is invoked while the rest ignore the event.
    This happens because the streaming query listener bus holds a set of running query ids: when a termination event is triggered, the terminated query id is removed from the set as soon as the first listener has handled the event.
    In this PR, the query id is removed from the set only after all the listeners have handled the event.
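
    A minimal sketch, assuming an active SparkSession named `spark`, of the multi-listener setup this change fixes:

    ```scala
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Two independent listeners; before this change only one of them reliably
    // received the QueryTerminatedEvent for a given query.
    class NamedListener(name: String) extends StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"$name: query ${event.id} terminated")
    }

    spark.streams.addListener(new NamedListener("A"))
    spark.streams.addListener(new NamedListener("B"))
    ```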
    
    ## How was this patch tested?
    
    A test with multiple listeners has been added to StreamingQueryListenerSuite.
    
    Author: Eyal Zituny <eyal.zituny@equalum.io>
    
    Closes #16991 from eyalzit/master.
    
    (cherry picked from commit 9f8e392)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    Eyal Zituny authored and zsxwing committed Feb 26, 2017
    Commit 04fbb9e

Commits on Feb 28, 2017

  1. [SPARK-19748][SQL] refresh function has a wrong order to do cache inv…

    …alidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache
    
    ## What changes were proposed in this pull request?
    
    If we refresh an InMemoryFileIndex with a FileStatusCache, it will first use the FileStatusCache to re-generate cachedLeafFiles etc., and only then call FileStatusCache.invalidateAll.

    The order of these two actions is wrong, which means the refresh does not take effect.
    
    ```
      override def refresh(): Unit = {
        refresh0()
        fileStatusCache.invalidateAll()
      }
    
      private def refresh0(): Unit = {
        val files = listLeafFiles(rootPaths)
        cachedLeafFiles =
          new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
        cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
        cachedPartitionSpec = null
      }
    ```
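
    A sketch of the corrected ordering implied by the description (invalidate first, so the re-listing cannot be served from the stale cache):

    ```scala
    override def refresh(): Unit = {
      fileStatusCache.invalidateAll()
      refresh0()
    }
    ```
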
    ## How was this patch tested?
    unit test added
    
    Author: windpiger <songjun@outlook.com>
    
    Closes #17079 from windpiger/fixInMemoryFileIndexRefresh.
    
    (cherry picked from commit a350bc1)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    windpiger authored and cloud-fan committed Feb 28, 2017
    Commit 4b4c3bf
  2. [SPARK-19677][SS] Committing a delta file atop an existing one should…

    … not fail on HDFS
    
    ## What changes were proposed in this pull request?
    
    HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies:
    
    > Destination exists and is a file
    > Renaming a file atop an existing file is specified as failing, raising an exception.
    >    - Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
    >    - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.
    
    This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output.
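
    A minimal sketch of the guard described above (method and variable names are illustrative, not the provider's actual code):

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Skip the rename when the destination delta file already exists: re-running a batch
    // must produce the same output, so the existing file is already correct.
    def commitDeltaFile(fs: FileSystem, tempDeltaFile: Path, finalDeltaFile: Path): Boolean = {
      if (fs.exists(finalDeltaFile)) true
      else fs.rename(tempDeltaFile, finalDeltaFile)
    }
    ```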
    
    ## How was this patch tested?
    
    This patch was tested by running `StateStoreSuite`.
    
    Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
    
    Closes #17012 from vitillo/fix_rename.
    
    (cherry picked from commit 9734a92)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    vitillo authored and zsxwing committed Feb 28, 2017
    Commit 947c0cd
  3. [SPARK-19769][DOCS] Update quickstart instructions

    ## What changes were proposed in this pull request?
    
    This change addresses the renaming of the `simple.sbt` build file to
    `build.sbt`. Newer versions of the sbt tool are not finding the older
    named file and are looking for the `build.sbt`. The quickstart
    instructions for self-contained applications is updated with this
    change.
    
    ## How was this patch tested?
    
    As this is a relatively minor change of a few words, the markdown was checked for syntax and spelling. Site was built with `SKIP_API=1 jekyll serve` for testing purposes.
    
    Author: Michael McCune <msm@redhat.com>
    
    Closes #17101 from elmiko/spark-19769.
    
    (cherry picked from commit bf5987c)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    elmiko authored and srowen committed Feb 28, 2017
    Commit d887f75

Commits on Mar 1, 2017

  1. [SPARK-19572][SPARKR] Allow to disable hive in sparkR shell

    ## What changes were proposed in this pull request?
    SPARK-15236 did this for the Scala shell; this ticket is for the SparkR shell. This is not only for SparkR itself, but can also benefit downstream projects like Livy, which uses shell.R for its interactive sessions. For now, Livy has no control over whether Hive is enabled or not.
    
    ## How was this patch tested?
    
    Tested it manually, run `bin/sparkR --master local --conf spark.sql.catalogImplementation=in-memory` and verify hive is not enabled.
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #16907 from zjffdu/SPARK-19572.
    
    (cherry picked from commit 7315880)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    zjffdu authored and Felix Cheung committed Mar 1, 2017
    Commit f719ccc
  2. [SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be…

    … folded by FoldablePropagation rule
    
    ## What changes were proposed in this pull request?
    This PR fixes the code in the Optimizer phase where the constant alias columns of an `INNER JOIN` query are folded by the `FoldablePropagation` rule.

    For the following query:
    
    ```
    val sqlA =
      """
        |create temporary view ta as
        |select a, 'a' as tag from t1 union all
        |select a, 'b' as tag from t2
      """.stripMargin
    
    val sqlB =
      """
        |create temporary view tb as
        |select a, 'a' as tag from t3 union all
        |select a, 'b' as tag from t4
      """.stripMargin
    
    val sql =
      """
        |select tb.* from ta inner join tb on
        |ta.a = tb.a and
        |ta.tag = tb.tag
      """.stripMargin
    ```
    
    The tag column is a constant alias column; it's folded by `FoldablePropagation` like this:
    
    ```
    TRACE SparkOptimizer:
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
     Project [a#4, tag#14]                              Project [a#4, tag#14]
    !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = a#4) && (a = a))
        :- Union                                           :- Union
        :  :- Project [a#0, a AS tag#8]                    :  :- Project [a#0, a AS tag#8]
        :  :  +- LocalRelation [a#0]                       :  :  +- LocalRelation [a#0]
        :  +- Project [a#2, b AS tag#9]                    :  +- Project [a#2, b AS tag#9]
        :     +- LocalRelation [a#2]                       :     +- LocalRelation [a#2]
        +- Union                                           +- Union
           :- Project [a#4, a AS tag#14]                      :- Project [a#4, a AS tag#14]
           :  +- LocalRelation [a#4]                          :  +- LocalRelation [a#4]
           +- Project [a#6, b AS tag#15]                      +- Project [a#6, b AS tag#15]
              +- LocalRelation [a#6]                             +- LocalRelation [a#6]
    ```
    
    Finally the Result of Batch Operator Optimizations is:
    
    ```
    Project [a#4, tag#14]                              Project [a#4, tag#14]
    !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = a#4)
    !   :- SubqueryAlias ta, `ta`                          :- Union
    !   :  +- Union                                        :  :- LocalRelation [a#0]
    !   :     :- Project [a#0, a AS tag#8]                 :  +- LocalRelation [a#2]
    !   :     :  +- SubqueryAlias t1, `t1`                 +- Union
    !   :     :     +- Project [a#0]                          :- LocalRelation [a#4, tag#14]
    !   :     :        +- SubqueryAlias grouping              +- LocalRelation [a#6, tag#15]
    !   :     :           +- LocalRelation [a#0]
    !   :     +- Project [a#2, b AS tag#9]
    !   :        +- SubqueryAlias t2, `t2`
    !   :           +- Project [a#2]
    !   :              +- SubqueryAlias grouping
    !   :                 +- LocalRelation [a#2]
    !   +- SubqueryAlias tb, `tb`
    !      +- Union
    !         :- Project [a#4, a AS tag#14]
    !         :  +- SubqueryAlias t3, `t3`
    !         :     +- Project [a#4]
    !         :        +- SubqueryAlias grouping
    !         :           +- LocalRelation [a#4]
    !         +- Project [a#6, b AS tag#15]
    !            +- SubqueryAlias t4, `t4`
    !               +- Project [a#6]
    !                  +- SubqueryAlias grouping
    !                     +- LocalRelation [a#6]
    ```
    
    The condition `tag#8 = tag#14` of the INNER JOIN has been removed, which makes the inner join return wrong data.
    
    After fix:
    
    ```
    === Result of Batch LocalRelation ===
     GlobalLimit 21                                           GlobalLimit 21
     +- LocalLimit 21                                         +- LocalLimit 21
        +- Project [a#4, tag#11]                                 +- Project [a#4, tag#11]
           +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))         +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))
    !         :- SubqueryAlias ta                                      :- Union
    !         :  +- Union                                              :  :- LocalRelation [a#0, tag#8]
    !         :     :- Project [a#0, a AS tag#8]                       :  +- LocalRelation [a#2, tag#9]
    !         :     :  +- SubqueryAlias t1                             +- Union
    !         :     :     +- Project [a#0]                                :- LocalRelation [a#4, tag#11]
    !         :     :        +- SubqueryAlias grouping                    +- LocalRelation [a#6, tag#12]
    !         :     :           +- LocalRelation [a#0]
    !         :     +- Project [a#2, b AS tag#9]
    !         :        +- SubqueryAlias t2
    !         :           +- Project [a#2]
    !         :              +- SubqueryAlias grouping
    !         :                 +- LocalRelation [a#2]
    !         +- SubqueryAlias tb
    !            +- Union
    !               :- Project [a#4, a AS tag#11]
    !               :  +- SubqueryAlias t3
    !               :     +- Project [a#4]
    !               :        +- SubqueryAlias grouping
    !               :           +- LocalRelation [a#4]
    !               +- Project [a#6, b AS tag#12]
    !                  +- SubqueryAlias t4
    !                     +- Project [a#6]
    !                        +- SubqueryAlias grouping
    !                           +- LocalRelation [a#6]
    ```
    
    ## How was this patch tested?
    
    add sql-tests/inputs/inner-join.sql
    All tests passed.
    
    Author: Stan Zhai <zhaishidan@haizhi.com>
    
    Closes #17099 from stanzhai/fix-inner-join.
    
    (cherry picked from commit 5502a9c)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    stanzhai authored and gatorsmile committed Mar 1, 2017
    Commit bbe0d8c
  3. [SPARK-19373][MESOS] Base spark.scheduler.minRegisteredResourceRatio …

    …on registered cores rather than accepted cores
    
    See JIRA
    
    Unit tests, Mesos/Spark integration tests
    
    cc skonto susanxhuynh
    
    Author: Michael Gummelt <mgummeltmesosphere.io>
    
    Closes #17045 from mgummelt/SPARK-19373-registered-resources.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes #17129 from mgummelt/SPARK-19373-registered-resources-2.1.
    Michael Gummelt authored and srowen committed Mar 1, 2017
    Commit 27347b5

Commits on Mar 3, 2017

  1. [SPARK-19750][UI][BRANCH-2.1] Fix redirect issue from http to https

    ## What changes were proposed in this pull request?
    
    If the Spark UI port (4040) is not set, it will choose port number 0, which makes the https port also choose 0. And in the Spark 2.1 code, this https port (0) is used for the redirect, so when the redirect is triggered it points to a wrong URL, like:
    
    ```
    /tmp/temp$ wget http://172.27.25.134:55015
    --2017-02-23 12:13:54--  http://172.27.25.134:55015/
    Connecting to 172.27.25.134:55015... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: https://172.27.25.134:0/ [following]
    --2017-02-23 12:13:54--  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    --2017-02-23 12:13:55--  (try: 2)  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    --2017-02-23 12:13:57--  (try: 3)  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    --2017-02-23 12:14:00--  (try: 4)  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    ```
    
    So instead of using 0 for the redirect, we should pick the bound port.
    
    This issue only exists in Spark 2.1-, and can be reproduced in yarn cluster mode.
    
    ## How was this patch tested?
    
    Current redirect UT doesn't verify this issue, so extend current UT to do correct verification.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17083 from jerryshao/SPARK-19750.
    jerryshao authored and Marcelo Vanzin committed Mar 3, 2017
    Commit 3a7591a
  2. [SPARK-19779][SS] Delete needless tmp file after restart structured s…

    …treaming job
    
    ## What changes were proposed in this pull request?
    
    [SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)
    
    PR #17012 fixed restarting a Structured Streaming application that uses HDFS as its file system, but a problem remains: the tmp file for a delta file is still left behind in HDFS, and Structured Streaming never deletes the tmp file generated when the streaming job is restarted.
    
    ## How was this patch tested?
     unit tests
    
    Author: guifeng <guifengleaf@gmail.com>
    
    Closes #17124 from gf53520/SPARK-19779.
    
    (cherry picked from commit e24f21b)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    gf53520 authored and zsxwing committed Mar 3, 2017
    Commit 1237aae
  3. [SPARK-19797][DOC] ML pipeline document correction

    ## What changes were proposed in this pull request?
    Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
    
    > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
    
    Reason: a Transformer could also be a stage. But only another Estimator will invoke a transform call and pass the data to the next stage. The description in the document misleads ML pipeline users.
    
    ## How was this patch tested?
    This is a tiny modification of **docs/ml-pipelines.md**. I built the modification with jekyll and checked the compiled document.
    
    Author: Zhe Sun <ymwdalex@gmail.com>
    
    Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
    
    (cherry picked from commit 0bac3e4)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    ymwdalex authored and srowen committed Mar 3, 2017
    Commit accbed7
  4. [SPARK-19774] StreamExecution should call stop() on sources when a st…

    …ream fails
    
    ## What changes were proposed in this pull request?
    
    We call stop() on a Structured Streaming Source only when the stream is shut down by a user calling streamingQuery.stop(). We should actually stop all sources when the stream fails as well; otherwise we may leak resources, e.g. connections to Kafka.
    
    ## How was this patch tested?
    
    Unit tests in `StreamingQuerySuite`.
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #17107 from brkyvz/close-source.
    
    (cherry picked from commit 9314c08)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    brkyvz authored and zsxwing committed Mar 3, 2017
    Commit da04d45

Commits on Mar 4, 2017

  1. [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite do…

    …esn't recover the log level
    
    ## What changes were proposed in this pull request?
    
    "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
    
    This PR uses `testQuietly` instead to avoid changing the log level.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17156 from zsxwing/SPARK-19816.
    
    (cherry picked from commit fbc4058)
    Signed-off-by: Yin Huai <yhuai@databricks.com>
    zsxwing authored and yhuai committed Mar 4, 2017
    Commit 664c979

Commits on Mar 6, 2017

  1. [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should…

    … not filter checkpointFilesOfLatestTime with the PATH string.
    
    ## What changes were proposed in this pull request?
    
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/
    
    ```
    sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code
    passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds.
    Last failure message: 8 did not equal 2.
    	at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
    	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
    	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
    	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
    	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
    	at org.apache.spark.streaming.DStreamCheckpointTester$class.generateOutput(CheckpointSuite
    .scala:172)
    	at org.apache.spark.streaming.CheckpointSuite.generateOutput(CheckpointSuite.scala:211)
    ```
    
    The check condition is:
    
    ```
    val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter {
         _.toString.contains(clock.getTimeMillis.toString)
    }
    // Checkpoint files are written twice for every batch interval. So assert that both
    // are written to make sure that both of them have been written.
    assert(checkpointFilesOfLatestTime.size === 2)
    ```
    
    The path string may contain `clock.getTimeMillis.toString`, like `3500`:
    
    ```
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500
                                                           ▲▲▲▲
    ```
    
    So we should check only the file name, not the whole path.
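
    A sketch of the adjusted filter implied by that observation (the helper is made up for illustration):

    ```scala
    import org.apache.hadoop.fs.Path

    // Match on the checkpoint file's name only, so a timestamp embedded in the temp
    // directory path (e.g. "spark-20035007-...") cannot produce false positives.
    def filesOfTime(checkpointFiles: Seq[Path], timeMillis: Long): Seq[Path] =
      checkpointFiles.filter(_.getName.contains(timeMillis.toString))
    ```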
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <hustyugm@gmail.com>
    
    Closes #17167 from uncleGen/flaky-CheckpointSuite.
    
    (cherry picked from commit 207067e)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    uncleGen authored and zsxwing committed Mar 6, 2017
    Commit ca7a7e8

Commits on Mar 7, 2017

  1. [SPARK-19719][SS] Kafka writer for both structured streaming and batc…

    …h queires
    
    ## What changes were proposed in this pull request?
    
    Add a new Kafka Sink and Kafka Relation for writing streaming and batch queries, respectively, to Apache Kafka.
    ### Streaming Kafka Sink
    - When addBatch is called
      - If batchId is greater than the last written batch
        - Write the batch to Kafka
          - The topic will be taken from the record, if present, or from a topic option, which overrides the topic in the record.
      - Else ignore
    
    ### Batch Kafka Sink
    - KafkaSourceProvider will implement CreatableRelationProvider
    - CreatableRelationProvider#createRelation will write the passed-in DataFrame to Kafka
    - The topic will be taken from the record, if present, or from the topic option, which overrides the topic in the record.
    - Save modes Append and ErrorIfExists are supported under identical semantics. Other save modes result in an AnalysisException.
    
    tdas zsxwing
    
    ## How was this patch tested?
    
    ### The following unit tests will be included
    - write to stream with topic field: valid stream write with data that includes an existing topic in the schema
    - write structured streaming aggregation w/o topic field, with default topic: valid stream write with data that does not include a topic field, but the configuration includes a default topic
    - write data with bad schema: various cases of writing data that does not conform to a proper schema e.g., 1. no topic field or default topic, and 2. no value field
    - write data with valid schema but wrong types: data with a complete schema but wrong types e.g., key and value types are integers.
    - write to non-existing topic: write a stream to a topic that does not exist in Kafka, which has been configured to not auto-create topics.
    - write batch to kafka: simple write batch to Kafka, which goes through the same code path as streaming scenario, so validity checks will not be redone here.
    
    ### Examples
    ```scala
    // Structured Streaming
    val writer = inputStringStream.map(s => s.get(0).toString.getBytes()).toDF("value")
     .selectExpr("value as key", "value as value")
     .writeStream
     .format("kafka")
     .option("checkpointLocation", checkpointDir)
     .outputMode(OutputMode.Append)
     .option("kafka.bootstrap.servers", brokerAddress)
     .option("topic", topic)
     .queryName("kafkaStream")
     .start()
    
    // Batch
    val df = spark
     .sparkContext
     .parallelize(Seq("1", "2", "3", "4", "5"))
     .map(v => (topic, v))
     .toDF("topic", "value")
    
    df.write
     .format("kafka")
     .option("kafka.bootstrap.servers",brokerAddress)
     .option("topic", topic)
     .save()
    ```
    
    Author: Tyson Condie <tcondie@gmail.com>
    
    Closes #17043 from tcondie/kafka-writer.
    tcondie authored and tdas committed Mar 7, 2017
    Commit fd6c6d5
  2. [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long

    ## What changes were proposed in this pull request?
    
    Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.
    
    ## How was this patch tested?
    
    Added a new test that fails without the change.
    
    dongjoon-hyun davies Mind taking a look?
    
    The contribution is my original work and I license the work to the project under the project’s open source license.
    
    Author: Jason White <jason.white@shopify.com>
    
    Closes #16896 from JasonMWhite/SPARK-19561.
    
    (cherry picked from commit 6f46846)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    JasonMWhite authored and davies committed Mar 7, 2017
    Commit 711addd

Commits on Mar 8, 2017

  1. [SPARK-19857][YARN] Correctly calculate next credential update time.

    Add parentheses so that both lines form a single statement; also add
    a log message so that the issue becomes more explicit if it shows up
    again.
    
    Tested manually with integration test that exercises the feature.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #17198 from vanzin/SPARK-19857.
    
    (cherry picked from commit 8e41c2e)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Mar 8, 2017
    Commit 551b7bd
  2. Revert "[SPARK-19561] [PYTHON] cast TimestampType.toInternal output t…

    …o long"
    
    This reverts commit 6f46846.
    cloud-fan committed Mar 8, 2017
    Commit cbc3700
  3. [SPARK-19859][SS] The new watermark should override the old one

    ## What changes were proposed in this pull request?
    
    The new watermark should override the old one. Otherwise, we just pick up the first column that has a watermark, which may be unexpected.
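
    A minimal sketch, assuming a streaming DataFrame `df` with timestamp columns `a` and `b`:

    ```scala
    // After this change the most recent withWatermark call wins, so the effective
    // watermark is defined on column "b" with a 5 minute delay.
    val watermarked = df
      .withWatermark("a", "10 minutes")
      .withWatermark("b", "5 minutes")
    ```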
    
    ## How was this patch tested?
    
    The new test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17199 from zsxwing/SPARK-19859.
    
    (cherry picked from commit d8830c5)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Mar 8, 2017
    Commit 3b648a6
  4. [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe

    ## What changes were proposed in this pull request?
    The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
    
    This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
    
    ## How was this patch tested?
    Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #17193 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_1.
    BryanCutler authored and jkbradley committed Mar 8, 2017
    Commit 0ba9ecb
  5. [SPARK-18055][SQL] Use correct mirror in ExpressionEncoder

    Previously, we were using the mirror of the passed-in `TypeTag` when reflecting to build an encoder. This fails when the outer class is built in (i.e. `Seq`'s default mirror is based on the root classloader) but inner classes (i.e. `A` in `Seq[A]`) are defined in the REPL or a library.
    
    This patch changes us to always reflect based on a mirror created using the context classloader.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #17201 from marmbrus/replSeqEncoder.
    
    (cherry picked from commit 314e48a)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    marmbrus authored and cloud-fan committed Mar 8, 2017
    Commit 320eff1
  6. [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files…

    … in combination with maxFileAge in FileStreamSource
    
    ## What changes were proposed in this pull request?
    
    **The Problem**
    There is a file stream source option called maxFileAge which limits how old the files can be, relative to the latest file that has been seen. This is used to limit the files that need to be remembered as "processed". Files older than the latest processed files are ignored. This value is 7 days by default.
    This causes a problem when both
    latestFirst = true
    maxFilesPerTrigger > total files to be processed.
    Here is what happens in all combinations:
    1) latestFirst = false - Since files are processed in order, there won't be any unprocessed file older than the latest processed file. All files will be processed.
    2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is not set, then all old files get processed in the first batch, and so no file is left behind.
    3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch processes the latest X files. That sets the threshold to latest file - maxFileAge, so files older than this threshold will never be considered for processing.
    The bug is with case 3.
    
    **The Solution**
    
    Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are set.
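
    A sketch of the option combination from case 3, assuming an active SparkSession named `spark` (path and values are illustrative); with this change, `maxFileAge` is ignored for such a query:

    ```scala
    val latest = spark.readStream
      .format("text")
      .option("latestFirst", "true")
      .option("maxFilesPerTrigger", 20)   // fewer than the total number of existing files
      .load("/path/to/input")             // hypothetical input directory
    ```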
    
    ## How was this patch tested?
    
    Regression test in `FileStreamSourceSuite`
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #17153 from brkyvz/maxFileAge.
    
    (cherry picked from commit a3648b5)
    Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
    brkyvz committed Mar 8, 2017
    Commit f6c1ad2
  7. Revert "[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful o…

    …perations for branch-2.1"
    
    This reverts commit 502c927.
    zsxwing committed Mar 8, 2017
    Commit 3457c32

Commits on Mar 9, 2017

  1. [MINOR][SQL] The analyzer rules are fired twice for cases when Analys…

    …isException is raised from analyzer.
    
    ## What changes were proposed in this pull request?
    In general we have a checkAnalysis phase which validates the logical plan and throws AnalysisException on semantic errors. However we also can throw AnalysisException from a few analyzer rules like ResolveSubquery.
    
    I found that we fire up the analyzer rules twice for the queries that throw AnalysisException from one of the analyzer rules. This is a very minor fix. We don't have to strictly fix it. I just got confused seeing the rule getting fired two times when I was not expecting it.
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #17214 from dilipbiswal/analyis_twice.
    
    (cherry picked from commit d809cee)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    dilipbiswal authored and gatorsmile committed Mar 9, 2017
    Commit 78cc572
  2. [SPARK-19874][BUILD] Hide API docs for org.apache.spark.sql.internal

    ## What changes were proposed in this pull request?
    
    The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17217 from zsxwing/SPARK-19874.
    
    (cherry picked from commit 029e40b)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Mar 9, 2017
    Configuration menu
    Copy the full SHA
    00859e1 View commit details
    Browse the repository at this point in the history
  3. [SPARK-19859][SS][FOLLOW-UP] The new watermark should override the ol…

    …d one.
    
    ## What changes were proposed in this pull request?
    
    A follow up to SPARK-19859:
    
    - extract the calculation of `delayMs` and reuse it.
    - update EventTimeWatermarkExec
    - use the correct `delayMs` in EventTimeWatermark
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <hustyugm@gmail.com>
    
    Closes #17221 from uncleGen/SPARK-19859.
    
    (cherry picked from commit eeb1d6d)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    uncleGen authored and zsxwing committed Mar 9, 2017
    Commit 0c140c1
  4. [SPARK-19561][SQL] add int case handling for TimestampType

    ## What changes were proposed in this pull request?
    
    Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
    
    These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
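
    An illustrative pattern match mirroring the description (not the exact EvaluatePython code):

    ```scala
    // Py4J delivers Python ints between MIN_INT and MAX_INT as Int and everything
    // else as Long, so both cases must map to microseconds since the epoch.
    def toTimestampMicros(fromPy4j: Any): Long = fromPy4j match {
      case i: Int  => i.toLong   // roughly half an hour on either side of the epoch
      case l: Long => l
      case other   => throw new IllegalArgumentException(s"Unexpected timestamp value: $other")
    }
    ```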
    
    Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
    
    ## How was this patch tested?
    
    Added a new PySpark-side test that fails without the change.
    
    The contribution is my original work and I license the work to the project under the project’s open source license.
    
    Resubmission of #16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun
    
    cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
    
    Author: Jason White <jason.white@shopify.com>
    
    Closes #17200 from JasonMWhite/SPARK-19561.
    
    (cherry picked from commit 206030b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    JasonMWhite authored and cloud-fan committed Mar 9, 2017
    Commit 2a76e24
  5. [SPARK-19861][SS] watermark should not be a negative time.

    ## What changes were proposed in this pull request?
    
    A `watermark` should not be negative. This behavior is invalid, so check it before the query actually runs.
    
    ## How was this patch tested?
    
    add new unit test.
    
    Author: uncleGen <hustyugm@gmail.com>
    Author: dylon <hustyugm@gmail.com>
    
    Closes #17202 from uncleGen/SPARK-19861.
    
    (cherry picked from commit 30b18e6)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    uncleGen authored and zsxwing committed Mar 9, 2017
    Commit ffe65b0

Commits on Mar 10, 2017

  1. [SPARK-19886] Fix reportDataLoss if statement in SS KafkaSource

    ## What changes were proposed in this pull request?
    
    Fix the `throw new IllegalStateException` if statement part.
    
    ## How is this patch tested
    
    Regression test
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #17228 from brkyvz/kafka-cause-fix.
    
    (cherry picked from commit 82138e0)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    brkyvz authored and zsxwing committed Mar 10, 2017
    Commit a59cc36
  2. [SPARK-19891][SS] Await Batch Lock notified on stream execution exit

    ## What changes were proposed in this pull request?
    
    We need to notify the await batch lock when the stream exits early e.g., when an exception has been thrown.
    
    ## How was this patch tested?
    
    Current tests that throw exceptions at runtime will finish faster as a result of this update.
    
    zsxwing
    
    
    Author: Tyson Condie <tcondie@gmail.com>
    
    Closes #17231 from tcondie/kafka-writer.
    
    (cherry picked from commit 501b711)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    tcondie authored and zsxwing committed Mar 10, 2017
    Commit f0d50fd

Commits on Mar 11, 2017

  1. [SPARK-19893][SQL] should not run DataFrame set operations with map type

    In Spark SQL, map types can't be used in equality tests/comparisons, and `Intersect`/`Except`/`Distinct` need equality tests for all columns, so we should not allow map types in `Intersect`/`Except`/`Distinct`.
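
    A minimal sketch of the now-rejected usage, assuming an active SparkSession named `spark` (column name and data are illustrative):

    ```scala
    import org.apache.spark.sql.functions.{col, lit, map}

    // A DataFrame with a single map-typed column "m".
    val df = spark.range(2).select(map(lit("k"), col("id")).as("m"))

    // Equality on map values is undefined, so these now fail analysis instead of running:
    df.distinct()
    df.intersect(df)
    df.except(df)
    ```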
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17236 from cloud-fan/map.
    
    (cherry picked from commit fb9beda)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Mar 11, 2017
    Configuration menu
    Copy the full SHA
    5a2ad43 View commit details
    Browse the repository at this point in the history
  2. [SPARK-19611][SQL] Introduce configurable table schema inference

    Add a new configuration option that allows Spark SQL to infer a case-sensitive schema from a Hive Metastore table's data files when a case-sensitive schema can't be read from the table properties.
    
    - Add spark.sql.hive.caseSensitiveInferenceMode param to SQLConf
    - Add schemaPreservesCase field to CatalogTable (set to false when schema can't
      successfully be read from Hive table props)
    - Perform schema inference in HiveMetastoreCatalog if schemaPreservesCase is
      false, depending on spark.sql.hive.caseSensitiveInferenceMode
    - Add alterTableSchema() method to the ExternalCatalog interface
    - Add HiveSchemaInferenceSuite tests
    - Refactor and move ParquetFileFormat.mergeMetastoreParquetSchema() as
      HiveMetastoreCatalog.mergeWithMetastoreSchema
    - Move schema merging tests from ParquetSchemaSuite to HiveSchemaInferenceSuite
    
    [JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
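
    A sketch of how the new option is expected to be set (the value name is an assumption based on the description above, not confirmed by this patch):

    ```scala
    // Infer a case-sensitive schema from the table's data files when the Hive Metastore
    // schema is all lower case, and save the inferred schema back to the table properties.
    spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
    ```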
    
    The tests in ```HiveSchemaInferenceSuite``` should verify that schema inference is working as expected. ```ExternalCatalogSuite``` has also been extended to cover the new ```alterTableSchema()``` API.
    
    Author: Budde <budde@amazon.com>
    
    Closes #17229 from budde/SPARK-19611-2.1.
    Budde authored and cloud-fan committed Mar 11, 2017
    Configuration menu
    Copy the full SHA
    e481a73 View commit details
    Browse the repository at this point in the history

Commits on Mar 12, 2017

  1. [DOCS][SS] fix structured streaming python example

    ## What changes were proposed in this pull request?
    
    - SS python example: `TypeError: 'xxx' object is not callable`
    - some other doc issue.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <hustyugm@gmail.com>
    
    Closes #17257 from uncleGen/docs-ss-python.
    
    (cherry picked from commit e29a74d)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    uncleGen authored and srowen committed Mar 12, 2017
    Configuration menu
    Copy the full SHA
    f9833c6 View commit details
    Browse the repository at this point in the history

Commits on Mar 13, 2017

  1. [SPARK-19853][SS] uppercase kafka topics fail when startingOffsets ar…

    …e SpecificOffsets
    
    When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase JSON is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer.
    
    KafkaSourceProvider.scala:
    ```
    val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
        case Some("latest") => LatestOffsets
        case Some("earliest") => EarliestOffsets
        case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
        case None => LatestOffsets
      }
    ```
    
    Thanks to cbowden for reporting.
    
    Jenkins
    
    Author: uncleGen <hustyugm@gmail.com>
    
    Closes #17209 from uncleGen/SPARK-19853.
    
    (cherry picked from commit 0a4d06a)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    uncleGen authored and zsxwing committed Mar 13, 2017
    Configuration menu
    Copy the full SHA
    8c46080 View commit details
    Browse the repository at this point in the history

Commits on Mar 14, 2017

  1. [SPARK-19933][SQL] Do not change output of a subquery

    ## What changes were proposed in this pull request?
    The `RemoveRedundantAlias` rule can change the output attributes (the expression id's to be precise) of a query by eliminating the redundant alias producing them. This is no problem for a regular query, but can cause problems for correlated subqueries: The attributes produced by the subquery are used in the parent plan; changing them will break the parent plan.
    
    This PR fixes this by wrapping a subquery in a `Subquery` top level node when it gets optimized. The `RemoveRedundantAlias` rule now recognizes `Subquery` and makes sure that the output attributes of the `Subquery` node are retained.
    
    ## How was this patch tested?
    Added a test case to `RemoveRedundantAliasAndProjectSuite` and added a regression test to `SubquerySuite`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #17278 from hvanhovell/SPARK-19933.
    
    (cherry picked from commit e04c05c)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    hvanhovell committed Mar 14, 2017
    Configuration menu
    Copy the full SHA
    4545782 View commit details
    Browse the repository at this point in the history

Commits on Mar 15, 2017

  1. [SPARK-19887][SQL] dynamic partition keys can be null or empty string

    When a dynamic partition value is null or an empty string, we should write the data to a directory like `a=__HIVE_DEFAULT_PARTITION__`; when we read the data back, we should respect this special directory name and treat it as null.
    
    This is the same behavior of impala, see https://issues.apache.org/jira/browse/IMPALA-252
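
    A minimal sketch of the behavior described above (the output path and column names are hypothetical):

    ```scala
    import spark.implicits._

    // A null dynamic partition value lands in the special default-partition directory ...
    Seq((1, null.asInstanceOf[String]), (2, "x")).toDF("a", "p")
      .write.partitionBy("p").parquet("/tmp/part_demo")
    // on disk: /tmp/part_demo/p=__HIVE_DEFAULT_PARTITION__/ and /tmp/part_demo/p=x/

    // ... and is read back as null rather than as the literal directory name.
    spark.read.parquet("/tmp/part_demo").show()
    ```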
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17277 from cloud-fan/partition.
    
    (cherry picked from commit dacc382)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Mar 15, 2017
    Configuration menu
    Copy the full SHA
    a0ce845 View commit details
    Browse the repository at this point in the history
  2. [SPARK-19944][SQL] Move SQLConf from sql/core to sql/catalyst (branch…

    …-2.1)
    
    ## What changes were proposed in this pull request?
    This patch moves SQLConf from sql/core to sql/catalyst. To minimize the changes, the patch used type alias to still keep CatalystConf (as a type alias) and SimpleCatalystConf (as a concrete class that extends SQLConf).
    
    Motivation for the change is that it is pretty weird to have SQLConf only in sql/core and then we have to duplicate config options that impact optimizer/analyzer in sql/catalyst using CatalystConf.
    
    This is a backport into branch-2.1 to minimize merge conflicts.
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #17301 from rxin/branch-2.1-conf.
    rxin authored and hvanhovell committed Mar 15, 2017
    Configuration menu
    Copy the full SHA
    80ebca6 View commit details
    Browse the repository at this point in the history
  3. [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construct…

    …ion for coalesce/repartition
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction for coalesce/repartition when shuffle is enabled. Currently, it passes `UTF8Deserializer` as-is instead of the `BatchedSerializer` from the copied RDD.
    
    with the file, `text.txt` below:
    
    ```
    a
    b
    
    d
    e
    f
    g
    h
    i
    j
    k
    l
    
    ```
    
    - Before
    
    ```python
    >>> sc.textFile('text.txt').repartition(1).collect()
    ```
    
    ```
    UTF8Deserializer(True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/rdd.py", line 811, in collect
        return list(_load_from_socket(port, self._jrdd_deserializer))
      File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
        yield self.loads(stream)
      File ".../spark/python/pyspark/serializers.py", line 544, in loads
        return s.decode("utf-8") if self.use_unicode else s
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
    ```
    
    - After
    
    ```python
    >>> sc.textFile('text.txt').repartition(1).collect()
    ```
    
    ```
    [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
    ```
    
    ## How was this patch tested?
    
    Unit test in `python/pyspark/tests.py`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #17282 from HyukjinKwon/SPARK-19872.
    
    (cherry picked from commit 7387126)
    Signed-off-by: Davies Liu <davies.liu@gmail.com>
    HyukjinKwon authored and davies committed Mar 15, 2017
    Configuration menu
    Copy the full SHA
    0622546 View commit details
    Browse the repository at this point in the history

Commits on Mar 16, 2017

  1. [SPARK-19329][SQL][BRANCH-2.1] Reading from or writing to a datasourc…

    …e table with a non pre-existing location should succeed
    
    ## What changes were proposed in this pull request?
    
    This is a backport pr of #16672 into branch-2.1.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: windpiger <songjun@outlook.com>
    
    Closes #17317 from windpiger/backport-insertnotexists.
    windpiger authored and gatorsmile committed Mar 16, 2017
    Configuration menu
    Copy the full SHA
    9d032d0 View commit details
    Browse the repository at this point in the history

Commits on Mar 17, 2017

  1. [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQ…

    …L] Backport Three Cache-related PRs to Spark 2.1
    
    ### What changes were proposed in this pull request?
    
    Backport a few cache related PRs:
    
    ---
    [[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](#16493)
    
    Consider the plans inside subquery expressions while looking up cache manager to make
    use of cached data. Currently CacheManager.useCachedData does not consider the
    subquery expressions in the plan.
    
    ---
    [[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](#17064)
    
    Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any), that contain the given data source path.
    
    However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.
    
    ---
    [[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](#17097)
    
    When un-cache a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`
    
    This PR also includes some refactors:
    - use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
    - rename invalidateCache to recacheByPlan, which is more obvious about what it does.
    
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #17319 from gatorsmile/backport-17097.
    gatorsmile authored and cloud-fan committed Mar 17, 2017
    Configuration menu
    Copy the full SHA
    4b977ff View commit details
    Browse the repository at this point in the history
  2. [SPARK-19721][SS][BRANCH-2.1] Good error message for version mismatch…

    … in log files
    
    ## Problem
    
    There are several places where we write out version identifiers in various logs for structured streaming (usually `v1`). However, in the places where we check for this, we throw a confusing error message.
    
    ## What changes were proposed in this pull request?
    
    This patch made two major changes:
    1. added a `parseVersion(...)` method (a sketch follows this list), and based on it fixed version checking in the following places (no other place needed this check):
    ```
    HDFSMetadataLog
      - CompactibleFileStreamLog  ------------> fixed with this patch
        - FileStreamSourceLog  ---------------> inherited the fix of `CompactibleFileStreamLog`
        - FileStreamSinkLog  -----------------> inherited the fix of `CompactibleFileStreamLog`
      - OffsetSeqLog  ------------------------> fixed with this patch
      - anonymous subclass in KafkaSource  ---> fixed with this patch
    ```
    
    2. changed the type of `FileStreamSinkLog.VERSION`, `FileStreamSourceLog.VERSION` etc. from `String` to `Int`, so that we can identify newer versions via `version > 1` instead of `version != "v1"`
        - note this didn't break any backwards compatibility -- we are still writing out `"v1"` and reading back `"v1"`
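
    A standalone sketch of the version-parsing idea from item 1 (the helper's shape is an assumption; only the error wording below is taken from the patch):

    ```scala
    def parseVersion(text: String, maxSupportedVersion: Int): Int = {
      require(text.startsWith("v"), s"Log was probably corrupt, unrecognized version marker: $text")
      val version = text.stripPrefix("v").toInt
      if (version > maxSupportedVersion) {
        throw new IllegalStateException(
          s"UnsupportedLogVersion: maximum supported log version is v$maxSupportedVersion, " +
            s"but encountered v$version. The log file was produced by a newer version of Spark " +
            "and cannot be read by this version. Please upgrade.")
      }
      version
    }
    ```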
    
    ## Exception message with this patch
    ```
    java.lang.IllegalStateException: Failed to read log file /private/var/folders/nn/82rmvkk568sd8p3p8tb33trw0000gn/T/spark-86867b65-0069-4ef1-b0eb-d8bd258ff5b8/0. UnsupportedLogVersion: maximum supported log version is v1, but encountered v99. The log file was produced by a newer version of Spark and cannot be read by this version. Please upgrade.
    	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:202)
    	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:78)
    	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:75)
    	at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:133)
    	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite.withTempDir(OffsetSeqLogSuite.scala:26)
    	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply$mcV$sp(OffsetSeqLogSuite.scala:75)
    	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
    	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
    	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    ```
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Liwei Lin <lwlin7@gmail.com>
    
    Closes #17327 from lw-lin/good-msg-2.1.
    lw-lin authored and zsxwing committed Mar 17, 2017
    Configuration menu
    Copy the full SHA
    710b555 View commit details
    Browse the repository at this point in the history
  3. [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests mor…

    …e stable
    
    ## What changes were proposed in this pull request?
    
    Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed the scheduling delay kept increasing for dozens of seconds locally.
    
    This PR increases the batch interval from 0.5 seconds to 2 seconds to generate less Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17323 from zsxwing/SPARK-19986.
    
    (cherry picked from commit 376d782)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Mar 17, 2017
    Configuration menu
    Copy the full SHA
    5fb7083 View commit details
    Browse the repository at this point in the history

Commits on Mar 18, 2017

  1. [SQL][MINOR] Fix scaladoc for UDFRegistration

    ## What changes were proposed in this pull request?
    
    Fix scaladoc for UDFRegistration
    
    ## How was this patch tested?
    
    local build
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #17337 from jaceklaskowski/udfregistration-scaladoc.
    
    (cherry picked from commit 6326d40)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    jaceklaskowski authored and rxin committed Mar 18, 2017
    Configuration menu
    Copy the full SHA
    780f606 View commit details
    Browse the repository at this point in the history

Commits on Mar 19, 2017

  1. [SPARK-18817][SPARKR][SQL] change derby log output to temp dir

    ## What changes were proposed in this pull request?
    
    Passes R `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM and sets the derby home dir system property so that derby.log is written there.
    
    ## How was this patch tested?
    
    Manually, unit tests
    
    With this, these are relocated to under /tmp
    ```
    # ls /tmp/RtmpG2M0cB/
    derby.log
    ```
    And they are removed automatically when the R session is ended.
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16330 from felixcheung/rderby.
    
    (cherry picked from commit 422aa67)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Mar 19, 2017
    Configuration menu
    Copy the full SHA
    b60f690 View commit details
    Browse the repository at this point in the history

Commits on Mar 20, 2017

  1. [SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj

    ## What changes were proposed in this pull request?
    
    For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order.
    
    For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.
    
    In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method.
    
    ## How was this patch tested?
    
    Corresponding test cases are added.
    
    Author: wangzhenhua <wangzhenhua@huawei.com>
    Author: Zhenhua Wang <wzh_zju@163.com>
    
    Closes #17331 from wzhfy/wrongOrdering.
    
    (cherry picked from commit 965a5ab)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wzhfy authored and cloud-fan committed Mar 20, 2017
    Configuration menu
    Copy the full SHA
    af8bf21 View commit details
    Browse the repository at this point in the history

Commits on Mar 21, 2017

  1. [SPARK-17204][CORE] Fix replicated off heap storage

    (Jira: https://issues.apache.org/jira/browse/SPARK-17204)
    
    ## What changes were proposed in this pull request?
    
    There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.
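
    A minimal sketch of the storage level this fix targets (assumes off-heap memory is enabled via spark.memory.offHeap.enabled and spark.memory.offHeap.size; run in spark-shell):

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Off-heap memory storage replicated to a second executor: both the local copy and the
    // replica should remain usable off-heap buffers, which is what this change ensures.
    val offHeap2x = StorageLevel(useDisk = false, useMemory = true, useOffHeap = true,
      deserialized = false, replication = 2)
    sc.parallelize(1 to 1000).persist(offHeap2x).count()
    ```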
    
    ## How was this patch tested?
    
    `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.
    
    Author: Michael Allman <michael@videoamp.com>
    
    Closes #16499 from mallman/spark-17204-replicated_off_heap_storage.
    
    (cherry picked from commit 7fa116f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Michael Allman authored and cloud-fan committed Mar 21, 2017
    Configuration menu
    Copy the full SHA
    d205d40 View commit details
    Browse the repository at this point in the history
  2. [SPARK-19912][SQL] String literals should be escaped for Hive metasto…

    …re partition pruning
    
    ## What changes were proposed in this pull request?
    
    Since current `HiveShim`'s `convertFilters` does not escape the string literals. There exists the following correctness issues. This PR aims to return the correct result and also shows the more clear exception message.
    
    **BEFORE**
    
    ```scala
    scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")
    
    scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
    +---+
    |  a|
    +---+
    +---+
    
    scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
    java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
    ```
    
    **AFTER**
    
    ```scala
    scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
    +---+
    |  a|
    +---+
    |  2|
    +---+
    
    scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
    java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins test with new test cases.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #17266 from dongjoon-hyun/SPARK-19912.
    
    (cherry picked from commit 21e366a)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dongjoon-hyun authored and cloud-fan committed Mar 21, 2017
    Configuration menu
    Copy the full SHA
    c4c7b18 View commit details
    Browse the repository at this point in the history
  3. [SPARK-20017][SQL] change the nullability of function 'StringToMap' f…

    …rom 'false' to 'true'
    
    ## What changes were proposed in this pull request?
    
    Change the nullability of function `StringToMap` from `false` to `true`.
    
    Author: zhaorongsheng <334362872@qq.com>
    
    Closes #17350 from zhaorongsheng/bug-fix_strToMap_NPE.
    
    (cherry picked from commit 7dbc162)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    zhaorongsheng authored and gatorsmile committed Mar 21, 2017
    Configuration menu
    Copy the full SHA
    a88c88a View commit details
    Browse the repository at this point in the history
  4. [SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle whe…

    …n java is not installed
    
    ## What changes were proposed in this pull request?
    
    When SparkR is installed as an R package there might not be any Java runtime.
    If it is not there, SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell without any notification or message.
    
    ## How was this patch tested?
    
    manually
    
    - [x] need to test on Windows
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #16596 from felixcheung/rcheckjava.
    
    (cherry picked from commit a8877bd)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    felixcheung authored and shivaram committed Mar 21, 2017
    Configuration menu
    Copy the full SHA
    5c18b6c View commit details
    Browse the repository at this point in the history
  5. clarify array_contains function description

    ## What changes were proposed in this pull request?
    
    The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
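
    For illustration (a small sketch, not taken from the patch):

    ```scala
    spark.sql("SELECT array_contains(array(1, 2), 2)").show()               // true
    spark.sql("SELECT array_contains(CAST(NULL AS ARRAY<INT>), 2)").show()  // null
    ```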
    
    ## How was this patch tested?
    
    No testing, since it merely changes a comment.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Will Manning <lwwmanning@gmail.com>
    
    Closes #17380 from lwwmanning/patch-1.
    
    (cherry picked from commit a04dcde)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    lwwmanning authored and rxin committed Mar 21, 2017
    Configuration menu
    Copy the full SHA
    9dfdd2a View commit details
    Browse the repository at this point in the history

Commits on Mar 22, 2017

  1. [SPARK-19980][SQL][BACKPORT-2.1] Add NULL checks in Bean serializer

    ## What changes were proposed in this pull request?
    A Bean serializer in `ExpressionEncoder`  could change values when Beans having NULL. A concrete example is as follows;
    ```
    scala> :paste
    class Outer extends Serializable {
      private var cls: Inner = _
      def setCls(c: Inner): Unit = cls = c
      def getCls(): Inner = cls
    }
    
    class Inner extends Serializable {
      private var str: String = _
      def setStr(s: String): Unit = str = s
      def getStr(): String = str
    }
    
    scala> Seq("""{"cls":null}""", """{"cls": {"str":null}}""").toDF().write.text("data")
    scala> val encoder = Encoders.bean(classOf[Outer])
    scala> val schema = encoder.schema
    scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
    scala> df.show
    +------+
    |   cls|
    +------+
    |[null]|
    |  null|
    +------+
    
    scala> df.map(x => x)(encoder).show()
    +------+
    |   cls|
    +------+
    |[null]|
    |[null]|     // <-- Value changed
    +------+
    ```
    
    This is because the Bean serializer does not have the NULL-check expressions that the serializer of Scala's product types has. Actually, this value change does not happen in Scala's product types;
    
    ```
    scala> :paste
    case class Outer(cls: Inner)
    case class Inner(str: String)
    
    scala> val encoder = Encoders.product[Outer]
    scala> val schema = encoder.schema
    scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
    scala> df.show
    +------+
    |   cls|
    +------+
    |[null]|
    |  null|
    +------+
    
    scala> df.map(x => x)(encoder).show()
    +------+
    |   cls|
    +------+
    |[null]|
    |  null|
    +------+
    ```
    
    This pr added the NULL-check expressions in Bean serializer along with the serializer of Scala's product types.
    
    ## How was this patch tested?
    Added tests in `JavaDatasetSuite`.
    
    Author: Takeshi Yamamuro <yamamuro@apache.org>
    
    Closes #17372 from maropu/SPARK-19980-BACKPORT2.1.
    maropu authored and cloud-fan committed Mar 22, 2017
    Configuration menu
    Copy the full SHA
    a04428f View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    30abb95 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    c4d2b83 View commit details
    Browse the repository at this point in the history
  4. [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it wa…

    …s called on executors.
    
    ## What changes were proposed in this pull request?
    SparkR ```spark.getSparkFiles``` fails when it was called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
    
    ## How was this patch tested?
    Add unit tests, and verify this fix at standalone and yarn cluster.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #17274 from yanboliang/spark-19925.
    
    (cherry picked from commit 478fbc8)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    yanboliang committed Mar 22, 2017
    Configuration menu
    Copy the full SHA
    277ed37 View commit details
    Browse the repository at this point in the history
  5. [SPARK-20021][PYSPARK] Miss backslash in python code

    ## What changes were proposed in this pull request?
    
    Add backslash for line continuation in python code.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <hustyugm@gmail.com>
    Author: dylon <hustyugm@gmail.com>
    
    Closes #17352 from uncleGen/python-example-doc.
    
    (cherry picked from commit facfd60)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    uncleGen authored and srowen committed Mar 22, 2017
    Configuration menu
    Copy the full SHA
    56f997f View commit details
    Browse the repository at this point in the history

Commits on Mar 23, 2017

  1. [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USER instead of …

    …PRINCIPAL in kerberized clusters
    
    ## What changes were proposed in this pull request?
    
    In a kerberized Hadoop cluster, when Spark creates tables, the owner of the tables is filled with the PRINCIPAL string instead of the USER name. This is inconsistent with Hive and causes problems when using [ROLE](https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization) in Hive. We had better fix this.
    
    **BEFORE**
    ```scala
    scala> sql("create table t(a int)").show
    scala> sql("desc formatted t").show(false)
    ...
    |Owner:                      |sparkEXAMPLE.COM                                         |       |
    ```
    
    **AFTER**
    ```scala
    scala> sql("create table t(a int)").show
    scala> sql("desc formatted t").show(false)
    ...
    |Owner:                      |spark                                         |       |
    ```
    
    ## How was this patch tested?
    
    Manually do `create table` and `desc formatted` because this happens in Kerberized clusters.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #17363 from dongjoon-hyun/SPARK-19970-2.
    dongjoon-hyun authored and Marcelo Vanzin committed Mar 23, 2017
    Configuration menu
    Copy the full SHA
    af960e8 View commit details
    Browse the repository at this point in the history

Commits on Mar 24, 2017

  1. [SPARK-19959][SQL] Fix to throw NullPointerException in df[java.lang.…

    …Long].collect
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a `NullPointerException` in the code generated by Catalyst. When we run the following code, we get the `NullPointerException` below. This is because there is no null check for `inputadapter_value`, while `java.lang.Long inputadapter_value` at line 30 may be `null`.
    
    This happens when the type of the DataFrame is a nullable boxed type such as `java.lang.Long` and whole-stage codegen is used. While the physical plan keeps `nullable=true` in `input[0, java.lang.Long, true].longValue`, `BoundReference.doGenCode` ignores `nullable=true`. Thus, no null check is generated and a `NullPointerException` occurs.
    
    This PR checks the nullability and correctly generates nullcheck if needed.
    ```java
    sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
    ```
    
    ```java
    Caused by: java.lang.NullPointerException
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:393)
    ...
    ```
    
    Generated code without this PR
    ```java
    /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 006 */   private Object[] references;
    /* 007 */   private scala.collection.Iterator[] inputs;
    /* 008 */   private scala.collection.Iterator inputadapter_input;
    /* 009 */   private UnsafeRow serializefromobject_result;
    /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
    /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
    /* 012 */
    /* 013 */   public GeneratedIterator(Object[] references) {
    /* 014 */     this.references = references;
    /* 015 */   }
    /* 016 */
    /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 018 */     partitionIndex = index;
    /* 019 */     this.inputs = inputs;
    /* 020 */     inputadapter_input = inputs[0];
    /* 021 */     serializefromobject_result = new UnsafeRow(1);
    /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
    /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
    /* 024 */
    /* 025 */   }
    /* 026 */
    /* 027 */   protected void processNext() throws java.io.IOException {
    /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 030 */       java.lang.Long inputadapter_value = (java.lang.Long)inputadapter_row.get(0, null);
    /* 031 */
    /* 032 */       boolean serializefromobject_isNull = true;
    /* 033 */       long serializefromobject_value = -1L;
    /* 034 */       if (!false) {
    /* 035 */         serializefromobject_isNull = false;
    /* 036 */         if (!serializefromobject_isNull) {
    /* 037 */           serializefromobject_value = inputadapter_value.longValue();
    /* 038 */         }
    /* 039 */
    /* 040 */       }
    /* 041 */       serializefromobject_rowWriter.zeroOutNullBytes();
    /* 042 */
    /* 043 */       if (serializefromobject_isNull) {
    /* 044 */         serializefromobject_rowWriter.setNullAt(0);
    /* 045 */       } else {
    /* 046 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
    /* 047 */       }
    /* 048 */       append(serializefromobject_result);
    /* 049 */       if (shouldStop()) return;
    /* 050 */     }
    /* 051 */   }
    /* 052 */ }
    ```
    
    Generated code with this PR
    
    ```java
    /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 006 */   private Object[] references;
    /* 007 */   private scala.collection.Iterator[] inputs;
    /* 008 */   private scala.collection.Iterator inputadapter_input;
    /* 009 */   private UnsafeRow serializefromobject_result;
    /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
    /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
    /* 012 */
    /* 013 */   public GeneratedIterator(Object[] references) {
    /* 014 */     this.references = references;
    /* 015 */   }
    /* 016 */
    /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 018 */     partitionIndex = index;
    /* 019 */     this.inputs = inputs;
    /* 020 */     inputadapter_input = inputs[0];
    /* 021 */     serializefromobject_result = new UnsafeRow(1);
    /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
    /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
    /* 024 */
    /* 025 */   }
    /* 026 */
    /* 027 */   protected void processNext() throws java.io.IOException {
    /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 030 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    /* 031 */       java.lang.Long inputadapter_value = inputadapter_isNull ? null : ((java.lang.Long)inputadapter_row.get(0, null));
    /* 032 */
    /* 033 */       boolean serializefromobject_isNull = true;
    /* 034 */       long serializefromobject_value = -1L;
    /* 035 */       if (!inputadapter_isNull) {
    /* 036 */         serializefromobject_isNull = false;
    /* 037 */         if (!serializefromobject_isNull) {
    /* 038 */           serializefromobject_value = inputadapter_value.longValue();
    /* 039 */         }
    /* 040 */
    /* 041 */       }
    /* 042 */       serializefromobject_rowWriter.zeroOutNullBytes();
    /* 043 */
    /* 044 */       if (serializefromobject_isNull) {
    /* 045 */         serializefromobject_rowWriter.setNullAt(0);
    /* 046 */       } else {
    /* 047 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
    /* 048 */       }
    /* 049 */       append(serializefromobject_result);
    /* 050 */       if (shouldStop()) return;
    /* 051 */     }
    /* 052 */   }
    /* 053 */ }
    ```
    
    ## How was this patch tested?
    
    Added new test suites in `DataFrameSuites`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #17302 from kiszk/SPARK-19959.
    
    (cherry picked from commit bb823ca)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Mar 24, 2017
    Configuration menu
    Copy the full SHA
    92f0b01 View commit details
    Browse the repository at this point in the history

Commits on Mar 25, 2017

  1. [SPARK-19674][SQL] Ignore driver accumulator updates that don't belong to …
    
    [SPARK-19674][SQL] Ignore driver accumulator updates that don't belong to the execution when merging all accumulator updates
    
    N.B. This is a backport to branch-2.1 of #17009.
    
    ## What changes were proposed in this pull request?
    In SQLListener.getExecutionMetrics, driver accumulator updates that don't belong to the execution should be ignored when merging all accumulator updates, to prevent a NoSuchElementException.
    
    ## How was this patch tested?
    Updated unit test.
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes #17418 from mallman/spark-19674-backport_2.1.
    carsonwang authored and cloud-fan committed Mar 25, 2017
    Configuration menu
    Copy the full SHA
    d989434 View commit details
    Browse the repository at this point in the history

Commits on Mar 26, 2017

  1. [SPARK-20086][SQL] CollapseWindow should not collapse dependent adjac…

    …ent windows
    
    ## What changes were proposed in this pull request?
    The `CollapseWindow` rule is currently too aggressive when collapsing adjacent windows. It also collapses windows in which the parent produces a column that is consumed by the child; this creates an invalid window which will fail at runtime.
    
    This PR fixes this by adding a check for dependent adjacent windows to the `CollapseWindow` rule.
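
    A minimal sketch of a dependent pair of adjacent windows that must not be collapsed (all data and column names are hypothetical):

    ```scala
    import spark.implicits._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val df = Seq(("a", 1, 10), ("a", 2, 20)).toDF("k", "t", "v")
    val byKey = Window.partitionBy("k").orderBy("t")
    // The second window consumes "s", which is produced by the first one, so the two
    // adjacent Window operators are dependent and collapsing them would be invalid.
    df.withColumn("s", sum("v").over(byKey))
      .withColumn("r", rank().over(Window.partitionBy("k").orderBy("s")))
      .show()
    ```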
    
    ## How was this patch tested?
    Added a new test case to `CollapseWindowSuite`
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #17432 from hvanhovell/SPARK-20086.
    
    (cherry picked from commit 617ab64)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    hvanhovell committed Mar 26, 2017
    Configuration menu
    Copy the full SHA
    b6d348e View commit details
    Browse the repository at this point in the history

Commits on Mar 27, 2017

  1. [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two m…

    …inor build fixes
    
    ## What changes were proposed in this pull request?
    
    The master snapshot publisher builds are currently broken due to two minor build issues:
    
    1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
    2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
    
    ## How was this patch tested?
    
    The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #17437 from JoshRosen/spark-20102.
    
    (cherry picked from commit 314cf51)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    JoshRosen committed Mar 27, 2017
    Configuration menu
    Copy the full SHA
    4056191 View commit details
    Browse the repository at this point in the history

Commits on Mar 28, 2017

  1. [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuin…

    …g of tokens in yarn client mode
    
    ## What changes were proposed in this pull request?
    
    In the current Spark on YARN code, we obtain tokens from the provided services, but we do not add these tokens to the current user's credentials. This makes all subsequent operations against these services still require a TGT rather than delegation tokens, which is unnecessary since we already got the tokens; it also leads to failures in user impersonation scenarios, because the TGT is granted to the real user, not the proxy user.
    
    So this change puts all the tokens into the current UGI, so that subsequent operations against these services honor the tokens rather than the TGT; this also handles the proxy user issue mentioned above.
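
    A simplified sketch of the idea (not the exact Spark code path):

    ```scala
    import org.apache.hadoop.security.{Credentials, UserGroupInformation}

    val creds = new Credentials()
    // ... obtain delegation tokens from the configured services into `creds` ...

    // Register the obtained tokens with the current UGI so that subsequent calls to those
    // services authenticate with the tokens instead of requiring a TGT.
    UserGroupInformation.getCurrentUser.addCredentials(creds)
    ```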
    
    ## How was this patch tested?
    
    Local verified in secure cluster.
    
    vanzin tgravescs mridulm  dongjoon-hyun please help to review, thanks a lot.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17335 from jerryshao/SPARK-19995.
    
    (cherry picked from commit 17eddb3)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    jerryshao authored and Marcelo Vanzin committed Mar 28, 2017
    Configuration menu
    Copy the full SHA
    4bcb7d6 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20125][SQL] Dataset of type option of map does not work

    When we build the deserializer expression for map type, we use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this; this PR fixes it.
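
    A minimal sketch of the shape that previously failed (the case class is hypothetical; run in spark-shell):

    ```scala
    import spark.implicits._

    case class Rec(m: Option[Map[String, Int]])

    // An Option-wrapped immutable Map used to trip the too-strict ObjectType check;
    // with this fix it should round-trip through the encoder.
    Seq(Rec(Some(Map("a" -> 1))), Rec(None)).toDS().collect()
    ```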
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17454 from cloud-fan/map.
    
    (cherry picked from commit d4fac41)
    Signed-off-by: Cheng Lian <lian@databricks.com>
    cloud-fan authored and liancheng committed Mar 28, 2017
    Configuration menu
    Copy the full SHA
    fd2e406 View commit details
    Browse the repository at this point in the history
  3. [SPARK-14536][SQL][BACKPORT-2.1] fix to handle null value in array ty…

    …pe column for postgres.
    
    ## What changes were proposed in this pull request?
    JDBC read fails with an NPE due to a missing null check for the array data type if the source table has null values in the array column: for null values, ResultSet.getArray() returns null.
    This PR adds a null-safe check on the ResultSet.getArray() result before invoking methods on the Array object.
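
    A simplified sketch of the null-safe pattern described above (not the exact Spark code):

    ```scala
    import java.sql.ResultSet

    def readArrayColumn(rs: ResultSet, pos: Int): AnyRef = {
      val sqlArray = rs.getArray(pos)
      // ResultSet.getArray returns null for SQL NULL, so guard before dereferencing it.
      if (sqlArray == null) null else sqlArray.getArray
    }
    ```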
    
    ## How was this patch tested?
    Updated the PostgresIntegration test suite to test null values. Ran docker integration tests on my laptop.
    
    Author: sureshthalamati <suresh.thalamati@gmail.com>
    
    Closes #17460 from sureshthalamati/jdbc_array_null_fix_spark_2.1-SPARK-14536.
    sureshthalamati authored and gatorsmile committed Mar 28, 2017
    Configuration menu
    Copy the full SHA
    e669dd7 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    02b165d View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    4964dbe View commit details
    Browse the repository at this point in the history
  6. [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails…

    … for uppercase impurity type Gini
    
    Fix bug: DecisionTreeModel can't recognize Impurity "Gini" when loading
    
    TODO:
    + [x] add unit test
    + [x] fix the bug
    
    Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
    
    Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
    
    (cherry picked from commit 7d432af)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    facaiy authored and jkbradley committed Mar 28, 2017
    Configuration menu
    Copy the full SHA
    3095480 View commit details
    Browse the repository at this point in the history

Commits on Mar 29, 2017

  1. [SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to simplify dri…

    …ver side metric updates
    
    ## What changes were proposed in this pull request?
    It is not super intuitive how to update SQLMetric on the driver side. This patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, and adds documentation to make it more obvious.
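
    A sketch of the intended call pattern (the argument list and the surrounding names `driverMetric`, `rowCount`, and `sparkContext` are assumptions based on the description, not verified against this patch):

    ```scala
    import org.apache.spark.sql.execution.metric.SQLMetrics

    // On the driver, after computing a driver-side metric value for the running SQL execution:
    val executionId = sparkContext.getLocalProperty("spark.sql.execution.id")
    driverMetric.add(rowCount)
    SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, Seq(driverMetric))
    ```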
    
    ## How was this patch tested?
    Updated a test case to use this method.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #17464 from rxin/SPARK-20134.
    
    (cherry picked from commit 9712bd3)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    rxin committed Mar 29, 2017
    Configuration menu
    Copy the full SHA
    f8c1b3e View commit details
    Browse the repository at this point in the history
  2. [SPARK-20059][YARN] Use the correct classloader for HBaseCredentialPr…

    …ovider
    
    ## What changes were proposed in this pull request?
    
    Currently we use the system classloader to find HBase jars; if they are specified via `--jars`, this fails with a ClassNotFound issue. So this changes to use the child classloader instead.
    
    It also puts the added jars and the main jar into the classpath of the submitted application in yarn cluster mode; otherwise HBase jars specified with `--jars` are never honored in cluster mode, and fetching tokens on the client side always fails.
    
    ## How was this patch tested?
    
    Unit test and local verification.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17388 from jerryshao/SPARK-20059.
    
    (cherry picked from commit c622a87)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    jerryshao authored and Marcelo Vanzin committed Mar 29, 2017
    Configuration menu
    Copy the full SHA
    103ff54 View commit details
    Browse the repository at this point in the history

Commits on Mar 31, 2017

  1. [SPARK-20164][SQL] AnalysisException not tolerant of null query plan.

    The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`. Or when someone throws an `AnalysisException` with a null query plan (which should not happen).
    `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
    The fix is to add a `null` check in `getMessage`.
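
    A simplified, standalone sketch of the null-tolerant behavior (not the exact patch):

    ```scala
    // Append the plan to the message only when a plan is actually present, so a
    // deserialized exception whose transient plan is null no longer throws an NPE.
    def messageWithPlan(message: String, plan: AnyRef): String = {
      val planAnnotation = Option(plan).map(p => s";\n$p").getOrElse("")
      message + planAnnotation
    }
    ```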
    
    - Unit test
    
    Author: Kunal Khamar <kkhamar@outlook.com>
    
    Closes #17486 from kunalkhamar/spark-20164.
    
    (cherry picked from commit 254877c)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    kunalkhamar authored and gatorsmile committed Mar 31, 2017
    Configuration menu
    Copy the full SHA
    6a1b2eb View commit details
    Browse the repository at this point in the history
  2. [SPARK-20084][CORE] Remove internal.metrics.updatedBlockStatuses from…

    … history files.
    
    ## What changes were proposed in this pull request?
    
    Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
    
    ## How was this patch tested?
    
    Current History UI tests cover use of the history file.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
    
    (cherry picked from commit c4c03ee)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    rdblue authored and Marcelo Vanzin committed Mar 31, 2017
    Configuration menu
    Copy the full SHA
    e3cec18 View commit details
    Browse the repository at this point in the history

Commits on Apr 2, 2017

  1. [SPARK-19999][BACKPORT-2.1][CORE] Workaround JDK-8165231 to identify …

    …PPC64 architectures as supporting unaligned access
    
    ## What changes were proposed in this pull request?
    
    This PR is backport of #17472 to Spark 2.1
    
    java.nio.Bits.unaligned() does not return true for the ppc64le arch.
    see [https://bugs.openjdk.java.net/browse/JDK-8165231](https://bugs.openjdk.java.net/browse/JDK-8165231)
    Check architecture in Platform.java
    
    ## How was this patch tested?
    
    unit test
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #17509 from kiszk/branch-2.1.
    kiszk authored and srowen committed Apr 2, 2017
    Configuration menu
    Copy the full SHA
    968eace View commit details
    Browse the repository at this point in the history

Commits on Apr 3, 2017

  1. [SPARK-20197][SPARKR][BRANCH-2.1] CRAN check fail with package instal…

    …lation
    
    ## What changes were proposed in this pull request?
    
    Test failed because SPARK_HOME is not set before Spark is installed.
    Also current directory is not == SPARK_HOME when tests are run with R CMD check, unlike in Jenkins, so disable that test for now. (that would also disable the test in Jenkins - so this change should not be ported to master as-is.)
    
    ## How was this patch tested?
    
    Manual run R CMD check
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #17515 from felixcheung/rcrancheck.
    felixcheung authored and Felix Cheung committed Apr 3, 2017
    Configuration menu
    Copy the full SHA
    ca14410 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks…

    … rendering markdown
    
    ## What changes were proposed in this pull request?
    
    It seems several non-breaking spaces were inserted into several `.md`s, and they appear to break the rendering of those markdown files.
    
    These are different. For example, this can be checked via `python` as below:
    
    ```python
    >>> " "
    '\xc2\xa0'
    >>> " "
    ' '
    ```
    
    _Note that it seems this PR description automatically replaces non-breaking spaces into normal spaces. Please open a `vi` and copy and paste it into `python` to verify this (do not copy the characters here)._
    
    I checked the output below in Safari and Chrome on Mac OS, and Internet Explorer on Windows 10.
    
    **Before**
    
    ![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
    ![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)
    
    **After**
    
    ![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
    ![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)
    
    ## How was this patch tested?
    
    Manually checking.
    
    These instances were found via
    
    ```
    grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
    ```
    
    in Mac OS.
    
    It seems there are several more instances, as below:
    
    ```
    ./docs/sql-programming-guide.md:        │   ├── ...
    ./docs/sql-programming-guide.md:        │   │
    ./docs/sql-programming-guide.md:        │   ├── country=US
    ./docs/sql-programming-guide.md:        │   │   └── data.parquet
    ./docs/sql-programming-guide.md:        │   ├── country=CN
    ./docs/sql-programming-guide.md:        │   │   └── data.parquet
    ./docs/sql-programming-guide.md:        │   └── ...
    ./docs/sql-programming-guide.md:            ├── ...
    ./docs/sql-programming-guide.md:            │
    ./docs/sql-programming-guide.md:            ├── country=US
    ./docs/sql-programming-guide.md:            │   └── data.parquet
    ./docs/sql-programming-guide.md:            ├── country=CN
    ./docs/sql-programming-guide.md:            │   └── data.parquet
    ./docs/sql-programming-guide.md:            └── ...
    ./sql/core/src/test/README.md:│   ├── *.avdl                  # Testing Avro IDL(s)
    ./sql/core/src/test/README.md:│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
    ./sql/core/src/test/README.md:│   ├── gen-avro.sh             # Script used to generate Java code for Avro
    ./sql/core/src/test/README.md:│   └── gen-thrift.sh           # Script used to generate Java code for Thrift
    ```
    
    These seem to be generated via the `tree` command, which inserts non-breaking spaces. They do not appear to cause any problem for rendering within code blocks, and I did not fix them, to reduce the overhead of manually replacing them whenever the output is regenerated via the `tree` command in the future.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #17517 from HyukjinKwon/non-breaking-space.
    
    (cherry picked from commit 364b0db)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    HyukjinKwon authored and srowen committed Apr 3, 2017
    Configuration menu
    Copy the full SHA
    77700ea View commit details
    Browse the repository at this point in the history

Commits on Apr 4, 2017

  1. [SPARK-20190][APP-ID] applications//jobs' in rest api,status should b…

    …e [running|s…
    
    …ucceeded|failed|unknown]
    
    ## What changes were proposed in this pull request?
    
    For '/applications/[app-id]/jobs' in the REST API, the status values should be '[running|succeeded|failed|unknown]'.
    The currently documented values are '[complete|succeeded|failed]', but '/applications/[app-id]/jobs?status=complete' makes the server return 'HTTP ERROR 404'.
    This change adds '?status=running' and '?status=unknown', matching the enum:
    
        public enum JobExecutionStatus {
          RUNNING,
          SUCCEEDED,
          FAILED,
          UNKNOWN;
        }
    
    ## How was this patch tested?
    
     manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
    
    Closes #17507 from guoxiaolongzte/SPARK-20190.
    
    (cherry picked from commit c95fbea)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    guoxiaolongzte authored and srowen committed Apr 4, 2017
    Configuration menu
    Copy the full SHA
    f9546da View commit details
    Browse the repository at this point in the history
  2. [SPARK-20191][YARN] Create wrapper for RackResolver so tests can overr…

    …ide it.
    
    Current test code tries to override the RackResolver used by setting
    configuration params, but because YARN libs statically initialize the
    resolver the first time it's used, that means that those configs don't
    really take effect during Spark tests.
    
    This change adds a wrapper class that easily allows tests to override the
    behavior of the resolver for the Spark code that uses it.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #17508 from vanzin/SPARK-20191.
    
    (cherry picked from commit 0736980)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Apr 4, 2017
    Configuration menu
    Copy the full SHA
    00c1248 View commit details
    Browse the repository at this point in the history

Commits on Apr 5, 2017

  1. [SPARK-20042][WEB UI] Fix log page buttons for reverse proxy mode

    With spark.ui.reverseProxy=true, full path URLs like /log will point to
    the master web endpoint, which serves the worker UI as a reverse proxy.
    To access a REST endpoint on the worker in reverse proxy mode, the
    leading /proxy/"target"/ part of the base URI must be retained.
    
    Added logic to log-view.js to handle this, similar to executorspage.js
    
    Patch was tested manually
    
    Author: Oliver Köth <okoeth@de.ibm.com>
    
    Closes #17370 from okoethibm/master.
    
    (cherry picked from commit 6f09dc7)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    okoethibm authored and srowen committed Apr 5, 2017
    Configuration menu
    Copy the full SHA
    efc72dc View commit details
    Browse the repository at this point in the history
  2. [SPARK-20223][SQL] Fix typo in tpcds q77.sql

    ## What changes were proposed in this pull request?
    
    Fix typo in tpcds q77.sql
    
    ## How was this patch tested?
    
    N/A
    
    Author: wangzhenhua <wangzhenhua@huawei.com>
    
    Closes #17538 from wzhfy/typoQ77.
    
    (cherry picked from commit a2d8d76)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    wzhfy authored and gatorsmile committed Apr 5, 2017
    Configuration menu
    Copy the full SHA
    2b85e05 View commit details
    Browse the repository at this point in the history

Commits on Apr 6, 2017

  1. [SPARK-20214][ML] Make sure converted csc matrix has sorted indices

    ## What changes were proposed in this pull request?
    
    `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee that the converted csc matrix has sorted indices, so a failure happens when you do something like this:
    
        from scipy.sparse import lil_matrix
        lil = lil_matrix((4, 1))
        lil[1, 0] = 1
        lil[3, 0] = 2
        _convert_to_vector(lil.todok())
    
        File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
          return SparseVector(l.shape[0], csc.indices, csc.data)
        File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
          % (self.indices[i], self.indices[i + 1]))
        TypeError: Indices 3 and 1 are not strictly increasing
    
    A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
    
        >>> from scipy.sparse import lil_matrix
        >>> lil = lil_matrix((4, 1))
        >>> lil[1, 0] = 1
        >>> lil[3, 0] = 2
        >>> dok = lil.todok()
        >>> csc = dok.tocsc()
        >>> csc.has_sorted_indices
        0
        >>> csc.indices
        array([3, 1], dtype=int32)
    
    I checked the source code of scipy. The only way to guarantee sorted indices is to round-trip via `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #17532 from viirya/make-sure-sorted-indices.
    
    (cherry picked from commit 1220605)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    viirya authored and jkbradley committed Apr 6, 2017
    Configuration menu
    Copy the full SHA
    fb81a41 View commit details
    Browse the repository at this point in the history

Commits on Apr 7, 2017

  1. [SPARK-20218][DOC][APP-ID] '/applications/[app-id]/stages' in REST API: add description
    
    ## What changes were proposed in this pull request?
    
    1. '/applications/[app-id]/stages' in the REST API should add the description '?status=[active|complete|pending|failed] list only stages in the state.'

    Without this description, users of this API do not know how to use the status parameter to filter the stage list.

    2. '/applications/[app-id]/stages/[stage-id]' in the REST API should remove the redundant description '?status=[active|complete|pending|failed] list only stages in the state.',
    because a single stage is already determined by its stage-id.
    
    code:
      @GET
      def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
        val listener = ui.jobProgressListener
        val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
        val adjStatuses = {
          if (statuses.isEmpty()) {
            Arrays.asList(StageStatus.values(): _*)
          } else {
            statuses
          }
        };
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
    
    Closes #17534 from guoxiaolongzte/SPARK-20218.
    
    (cherry picked from commit 9e0893b)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    郭小龙 10207633 authored and srowen committed Apr 7, 2017
    Configuration menu
    Copy the full SHA
    7791120 View commit details
    Browse the repository at this point in the history

Commits on Apr 8, 2017

  1. [SPARK-20246][SQL] should not push predicate down through aggregate with non-deterministic expressions
    
    ## What changes were proposed in this pull request?
    
    Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`.
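
    As an illustration (hypothetical data and column names), a filter on a non-deterministic aggregate output must stay above the aggregate, because pushing it down would change how many rows `rand()` sees:

    ```
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("no-pushdown-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    // rand() is non-deterministic: evaluating it before the aggregate would see a
    // different number of input rows and therefore produce different values.
    val result = df.groupBy($"key")
      .agg(sum($"value").as("total"), rand().as("r"))
      .filter($"r" > 0.5)

    result.explain(true)  // the Filter should remain above the Aggregate
    ```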
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17562 from cloud-fan/filter.
    
    (cherry picked from commit 7577e9c)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Apr 8, 2017
    Configuration menu
    Copy the full SHA
    fc242cc View commit details
    Browse the repository at this point in the history
  2. [SPARK-20262][SQL] AssertNotNull should throw NullPointerException

    AssertNotNull currently throws RuntimeException. It should throw NullPointerException, which is more specific.
    
    N/A
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #17573 from rxin/SPARK-20262.
    
    (cherry picked from commit e1afc4d)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    rxin authored and gatorsmile committed Apr 8, 2017
    Configuration menu
    Copy the full SHA
    658b358 View commit details
    Browse the repository at this point in the history

Commits on Apr 9, 2017

  1. [SPARK-20260][MLLIB] String interpolation required for error message

    ## What changes were proposed in this pull request?
    This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:
    
    ```
    Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
    ```
    (note the literal `$current` instead of the interpolated value)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Vijay Ramesh <vramesh@demandbase.com>
    
    Closes #17572 from vijaykramesh/master.
    
    (cherry picked from commit 261eaf5)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    vijaykramesh authored and srowen committed Apr 9, 2017
    Configuration menu
    Copy the full SHA
    43a7fca View commit details
    Browse the repository at this point in the history

Commits on Apr 10, 2017

  1. [SPARK-20264][SQL] asm should be non-test dependency in sql/core

    ## What changes were proposed in this pull request?
    The sql/core module currently declares asm as a test-scope dependency. It should actually be a normal dependency, since the core module defines it and it is pulled in transitively. This occasionally confuses IntelliJ.
    
    ## How was this patch tested?
    N/A - This is a build change.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #17574 from rxin/SPARK-20264.
    
    (cherry picked from commit 7bfa05e)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    rxin authored and gatorsmile committed Apr 10, 2017
    Configuration menu
    Copy the full SHA
    1a73046 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20280][CORE] FileStatusCache Weigher integer overflow

    ## What changes were proposed in this pull request?
    
    Weigher.weigh needs to return Int but it is possible for an Array[FileStatus] to have size > Int.maxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor.
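
    A rough sketch of the scaling idea (illustrative names and numbers, not Spark's exact code):

    ```
    object WeigherSketch {
      // Weigh entries in units of 32 bytes so that very large entries still fit in an Int;
      // the cache's maximumWeight is divided by the same factor.
      private val ScaleFactor = 32L

      def weigh(estimatedSizeInBytes: Long): Int =
        math.min(estimatedSizeInBytes / ScaleFactor, Int.MaxValue.toLong).toInt

      def main(args: Array[String]): Unit = {
        val huge = 3L * Int.MaxValue   // would overflow a plain Int weight
        println(weigh(huge))           // fits comfortably after scaling
      }
    }
    ```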
    
    ## How was this patch tested?
    New test in FileIndexSuite
    
    Author: Bogdan Raducanu <bogdan@databricks.com>
    
    Closes #17591 from bogdanrdc/SPARK-20280.
    
    (cherry picked from commit f6dd8e0)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    bogdanrdc authored and hvanhovell committed Apr 10, 2017
    Configuration menu
    Copy the full SHA
    bc7304e View commit details
    Browse the repository at this point in the history
  3. [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds
    
    ## What changes were proposed in this pull request?
    
    Saw the following failure locally:
    
    ```
    Traceback (most recent call last):
      File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
        self._test_func(input, func, expected, sort=True, input2=input2)
      File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
        self.assertEqual(expected, result)
    AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
    
    First list contains 3 additional elements.
    First extra element 0:
    [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
    
    + []
    - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
    -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
    -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
    ```
    
    It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120
    
    It's because when the machine is overloaded, the timeout is not enough. This PR just increases the timeout to 30 seconds.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17597 from zsxwing/SPARK-20285.
    
    (cherry picked from commit f9a50ba)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Apr 10, 2017
    Configuration menu
    Copy the full SHA
    489c1f3 View commit details
    Browse the repository at this point in the history

Commits on Apr 11, 2017

  1. [SPARK-18555][SQL] DataFrameNaFunctions.fill messes up original values in long integers
    
    ## What changes were proposed in this pull request?
    
       DataSet.na.fill(0), when used on a DataSet that has a long value column, changes the original long values.

       The reason is that the type of the fill function's parameter is Double, and the numeric columns are always cast to double (`fillCol[Double](f, value)`).
    ```
      def fill(value: Double, cols: Seq[String]): DataFrame = {
        val columnEquals = df.sparkSession.sessionState.analyzer.resolver
        val projections = df.schema.fields.map { f =>
          // Only fill if the column is part of the cols list.
          if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
            fillCol[Double](f, value)
          } else {
            df.col(f.name)
          }
        }
        df.select(projections : _*)
      }
    ```
    
     For example:
    ```
    scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
    df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
    
    scala> df.show
    +-------------------+-------------------+
    |                  a|                  b|
    +-------------------+-------------------+
    |                  1|                  2|
    |                 -1|                 -2|
    |9123146099426677101|9123146560113991650|
    +-------------------+-------------------+
    
    scala> df.na.fill(0).show
    +-------------------+-------------------+
    |                  a|                  b|
    +-------------------+-------------------+
    |                  1|                  2|
    |                 -1|                 -2|
    |9123146099426676736|9123146560113991680|
    +-------------------+-------------------+
     ```
    
    the original values changed [which is not the expected result]:
    ```
     9123146099426677101 -> 9123146099426676736
     9123146560113991650 -> 9123146560113991680
    ```
    
    ## How was this patch tested?
    
    unit test added.
    
    Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
    
    Closes #15994 from windpiger/nafillMissupOriginalValue.
    
    (cherry picked from commit 508de38)
    Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
    root authored and dbtsai committed Apr 11, 2017
    Configuration menu
    Copy the full SHA
    b26f2c2 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double
    
    ## What changes were proposed in this pull request?
    
    This bug was partially addressed in SPARK-18555 #15994, but the root cause isn't completely solved. This bug is pretty critical since it changes the member id in Long in our application if the member id can not be represented by Double losslessly when the member id is very big.
    
    Here is an example how this happens, with
    ```
          Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
            (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
    ```
    the logical plan will be
    ```
    == Analyzed Logical Plan ==
    a: bigint, b: double
    Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
    +- Project [_1#229L AS a#232L, _2#230 AS b#233]
       +- LocalRelation [_1#229L, _2#230]
    ```
    
    Note that even the value is not null, Spark will cast the Long into Double first. Then if it's not null, Spark will cast it back to Long which results in losing precision.
    
    The behavior should be that the original value should not be changed if it's not null, but Spark will change the value which is wrong.
    
    With the PR, the logical plan will be
    ```
    == Analyzed Logical Plan ==
    a: bigint, b: double
    Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
    +- Project [_1#229L AS a#232L, _2#230 AS b#233]
       +- LocalRelation [_1#229L, _2#230]
    ```
    which behaves correctly without changing the original Long values and also avoids extra cost of unnecessary casting.
    
    ## How was this patch tested?
    
    unit test added.
    
    +cc srowen rxin cloud-fan gatorsmile
    
    Thanks.
    
    Author: DB Tsai <dbt@netflix.com>
    
    Closes #17577 from dbtsai/fixnafill.
    
    (cherry picked from commit 1a0bc41)
    Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
    DB Tsai authored and dbtsai committed Apr 11, 2017
    Configuration menu
    Copy the full SHA
    f40e44d View commit details
    Browse the repository at this point in the history
  3. [SPARK-17564][TESTS] Fix flaky RequestTimeoutIntegrationSuite.furtherRequestsDelay
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the following failure:
    ```
    sbt.ForkMain$ForkError: java.lang.AssertionError: null
    	at org.junit.Assert.fail(Assert.java:86)
    	at org.junit.Assert.assertTrue(Assert.java:41)
    	at org.junit.Assert.assertTrue(Assert.java:52)
    	at org.apache.spark.network.RequestTimeoutIntegrationSuite.furtherRequestsDelay(RequestTimeoutIntegrationSuite.java:230)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:497)
    	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    	at org.junit.runners.Suite.runChild(Suite.java:128)
    	at org.junit.runners.Suite.runChild(Suite.java:27)
    	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    	at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    It happens several times per month on [Jenkins](http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.network.RequestTimeoutIntegrationSuite&test_name=furtherRequestsDelay). The failure is because `callback1` may not be called before `assertTrue(callback1.failure instanceof IOException);`. It's pretty easy to reproduce this error by adding a sleep before this line: https://github.com/apache/spark/blob/379b0b0bbdbba2278ce3bcf471bd75f6ffd9cf0d/common/network-common/src/test/java/org/apache/spark/network/RequestTimeoutIntegrationSuite.java#L267
    
    The fix is straightforward: just use the latch to wait until `callback1` is called.
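
    A sketch of the latch pattern (illustrative, not the suite's exact code):

    ```
    import java.util.concurrent.{CountDownLatch, TimeUnit}

    // The callback counts the latch down once it has recorded its failure,
    // so the test can wait for it before asserting.
    class RecordingCallback {
      val done = new CountDownLatch(1)
      @volatile var failure: Throwable = _

      def onFailure(t: Throwable): Unit = {
        failure = t
        done.countDown()
      }
    }

    // In the test, instead of asserting immediately:
    // assert(callback1.done.await(60, TimeUnit.SECONDS))
    // assert(callback1.failure.isInstanceOf[java.io.IOException])
    ```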
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17599 from zsxwing/SPARK-17564.
    
    (cherry picked from commit 734dfbf)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    zsxwing authored and rxin committed Apr 11, 2017
    Configuration menu
    Copy the full SHA
    8eb71b8 View commit details
    Browse the repository at this point in the history
  4. [SPARK-18555][MINOR][SQL] Fix the @Since tag when backporting from 2.2 branch into 2.1 branch
    
    ## What changes were proposed in this pull request?
    
    Fix the since tag when backporting critical bugs (SPARK-18555) from 2.2 branch into 2.1 branch.
    
    ## How was this patch tested?
    
    N/A
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: DB Tsai <dbtsai@dbtsai.com>
    
    Closes #17600 from dbtsai/branch-2.1.
    dbtsai committed Apr 11, 2017
    Configuration menu
    Copy the full SHA
    03a42c0 View commit details
    Browse the repository at this point in the history

Commits on Apr 12, 2017

  1. [SPARK-20291][SQL] NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)
    
    ## What changes were proposed in this pull request?
    
    `NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`.
    
    This causes a mismatch in the output type when the input type is float.

    Adding an extra rule in TypeCoercion resolves this issue.
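
    For illustration (made-up data), this is the kind of expression affected; with the fix the output type should stay FloatType instead of being widened to Double:

    ```
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("nanvl-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0f, Float.NaN).toDF("f")
    // nanvl(float column, null): the output type should match the float input.
    df.select(nanvl($"f", lit(null)).as("v")).printSchema()
    ```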
    
    ## How was this patch tested?
    
    unit tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: DB Tsai <dbt@netflix.com>
    
    Closes #17606 from dbtsai/fixNaNvl.
    
    (cherry picked from commit 8ad63ee)
    Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
    DB Tsai authored and dbtsai committed Apr 12, 2017
    Configuration menu
    Copy the full SHA
    46e212d View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide

    ## What changes were proposed in this pull request?
    
    1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
    2. Omitted colon in Output Model section.
    
    ## How was this patch tested?
    
    None.
    
    Author: Lee Dongjin <dongjin@apache.org>
    
    Closes #17564 from dongjinleekr/feature/fix-programming-guide.
    
    (cherry picked from commit b938438)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    dongjinleekr authored and srowen committed Apr 12, 2017
    Configuration menu
    Copy the full SHA
    b2970d9 View commit details
    Browse the repository at this point in the history
  3. [SPARK-20296][TRIVIAL][DOCS] Count distinct error message for streaming

    ## What changes were proposed in this pull request?
    Update count distinct error message for streaming datasets/dataframes to match current behavior. These aggregations are not yet supported, regardless of whether the dataset/dataframe is aggregated.
    
    Author: jtoka <jason.tokayer@gmail.com>
    
    Closes #17609 from jtoka/master.
    
    (cherry picked from commit 2e1fd46)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    jtoka authored and srowen committed Apr 12, 2017
    Configuration menu
    Copy the full SHA
    dbb6d1b View commit details
    Browse the repository at this point in the history
  4. [SPARK-20304][SQL] AssertNotNull should not include path in string representation
    
    ## What changes were proposed in this pull request?
    AssertNotNull's toString/simpleString dumps the entire walkedTypePath. walkedTypePath is used for error message reporting and shouldn't be part of the output.
    
    ## How was this patch tested?
    Manually tested.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #17616 from rxin/SPARK-20304.
    
    (cherry picked from commit 5408553)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    rxin authored and gatorsmile committed Apr 12, 2017
    Configuration menu
    Copy the full SHA
    7e0ddda View commit details
    Browse the repository at this point in the history

Commits on Apr 13, 2017

  1. [SPARK-20131][CORE] Don't use this lock in StandaloneSchedulerBackend.stop
    
    ## What changes were proposed in this pull request?
    
    `o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` is flaky because there is a potential deadlock in StandaloneSchedulerBackend which causes an `await` timeout. Here is the related stack trace:
    ```
    "Thread-31" #211 daemon prio=5 os_prio=31 tid=0x00007fedd4808000 nid=0x16403 waiting on condition [0x00007000239b7000]
       java.lang.Thread.State: TIMED_WAITING (parking)
    	at sun.misc.Unsafe.park(Native Method)
    	- parking to wait for  <0x000000079b49ca10> (a scala.concurrent.impl.Promise$CompletionLatch)
    	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
    	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
    	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
    	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
    	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:402)
    	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:213)
    	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
    	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
    	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
    	at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:517)
    	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
    	at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1921)
    	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1302)
    	at org.apache.spark.SparkContext.stop(SparkContext.scala:1920)
    	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:708)
    	at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43$$anonfun$apply$mcV$sp$66$$anon$3.run(StreamingContextSuite.scala:827)
    
    "dispatcher-event-loop-3" #18 daemon prio=5 os_prio=31 tid=0x00007fedd603a000 nid=0x6203 waiting for monitor entry [0x0000700003be4000]
       java.lang.Thread.State: BLOCKED (on object monitor)
    	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:253)
    	- waiting to lock <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
    	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:124)
    	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    This PR removes `synchronized` and changes `stopping` to an AtomicBoolean to make stop idempotent and fix the deadlock.
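
    A sketch of the idempotent-stop pattern described above (illustrative class, not the real backend):

    ```
    import java.util.concurrent.atomic.AtomicBoolean

    class StoppableBackendSketch {
      private val stopping = new AtomicBoolean(false)

      // No `synchronized` on stop(): the compare-and-set guarantees the shutdown
      // runs exactly once, and dispatcher threads never block on a lock held here.
      def stop(): Unit = {
        if (stopping.compareAndSet(false, true)) {
          shutDownInternals()
        }
      }

      private def shutDownInternals(): Unit = {
        // release executors, stop RPC endpoints, etc.
      }
    }
    ```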
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17610 from zsxwing/SPARK-20131.
    
    (cherry picked from commit c5f1cc3)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Apr 13, 2017
    Configuration menu
    Copy the full SHA
    be36c2f View commit details
    Browse the repository at this point in the history
  2. [SPARK-19924][SQL][BACKPORT-2.1] Handle InvocationTargetException for all Hive Shim
    
    ### What changes were proposed in this pull request?
    
    This is to backport the PR #17265 to Spark 2.1 branch.
    
    ---
    Since we are using a shim for most Hive metastore APIs, the exceptions thrown by the underlying methods of Method.invoke() are wrapped in `InvocationTargetException`. Instead of handling them one by one, we should handle all of them in `withClient`. If any of them is missed, the error message can look unfriendly. For example, below is an example of dropping a table.
    
    ```
    Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.lang.reflect.InvocationTargetException was thrown.
    ScalaTestFailureLocation: org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14 at (ExternalCatalogSuite.scala:193)
    org.scalatest.exceptions.TestFailedException: Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.lang.reflect.InvocationTargetException was thrown.
    	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
    	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
    	at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
    	at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply$mcV$sp(ExternalCatalogSuite.scala:193)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
    	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
    	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(ExternalCatalogSuite.scala:40)
    	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite.runTest(ExternalCatalogSuite.scala:40)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
    	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
    	at org.scalatest.Suite$class.run(Suite.scala:1424)
    	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
    	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
    	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
    	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
    	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
    	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
    	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
    	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
    	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
    	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
    	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
    	at org.scalatest.tools.Runner$.run(Runner.scala:883)
    	at org.scalatest.tools.Runner.run(Runner.scala)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
    Caused by: java.lang.reflect.InvocationTargetException
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.sql.hive.client.Shim_v0_14.dropTable(HiveShim.scala:736)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply$mcV$sp(HiveClientImpl.scala:451)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply(HiveClientImpl.scala:451)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply(HiveClientImpl.scala:451)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:287)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.dropTable(HiveClientImpl.scala:450)
    	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply$mcV$sp(HiveExternalCatalog.scala:456)
    	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply(HiveExternalCatalog.scala:454)
    	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply(HiveExternalCatalog.scala:454)
    	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:94)
    	at org.apache.spark.sql.hive.HiveExternalCatalog.dropTable(HiveExternalCatalog.scala:454)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply$mcV$sp(ExternalCatalogSuite.scala:194)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply(ExternalCatalogSuite.scala:194)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply(ExternalCatalogSuite.scala:194)
    	at org.scalatest.Assertions$class.intercept(Assertions.scala:997)
    	... 57 more
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found)
    	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1038)
    	... 79 more
    Caused by: NoSuchObjectException(message:db2.unknown_table table not found)
    	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
    	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    	at com.sun.proxy.$Proxy10.get_table(Unknown Source)
    	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
    	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
    	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:952)
    	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:904)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    	at com.sun.proxy.$Proxy11.dropTable(Unknown Source)
    	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1035)
    	... 79 more
    ```
    
    After unwrapping the exception, the message is like
    ```
    org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found);
    org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found);
    	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
    	at org.apache.spark.sql.hive.HiveExternalCatalog.dropTable(HiveExternalCatalog.scala:460)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply$mcV$sp(ExternalCatalogSuite.scala:193)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
    	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    ...
    ```
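
    A hedged sketch of the central unwrapping idea (simplified; the real `withClient` rethrows a Spark exception with more context):

    ```
    import java.lang.reflect.InvocationTargetException

    // Wrap every reflective Hive call in one place and surface the underlying
    // cause instead of the opaque InvocationTargetException.
    def withClient[T](body: => T): T = {
      try {
        body
      } catch {
        case e: InvocationTargetException =>
          val cause = Option(e.getCause).getOrElse(e)
          throw new RuntimeException(cause.getMessage, cause)
      }
    }
    ```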
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #17627 from gatorsmile/backport-17265.
    gatorsmile authored and cloud-fan committed Apr 13, 2017
    Configuration menu
    Copy the full SHA
    98ae548 View commit details
    Browse the repository at this point in the history
  3. [SPARK-19946][TESTS][BACKPORT-2.1] DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
    
    ## What changes were proposed in this pull request?
    Backport for PR #17292
    DebugFilesystem.assertNoOpenStreams throws an exception with a cause exception that actually shows the code line which leaked the stream.
    
    ## How was this patch tested?
    New test in SparkContextSuite to check there is a cause exception.
    
    Author: Bogdan Raducanu <bogdan@databricks.com>
    
    Closes #17632 from bogdanrdc/SPARK-19946-BRANCH2.1.
    bogdanrdc authored and hvanhovell committed Apr 13, 2017
    Configuration menu
    Copy the full SHA
    bca7ce2 View commit details
    Browse the repository at this point in the history

Commits on Apr 14, 2017

  1. [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams thread race

    ## What changes were proposed in this pull request?
    
    Synchronize access to openStreams map.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Bogdan Raducanu <bogdan@databricks.com>
    
    Closes #17592 from bogdanrdc/SPARK-20243.
    bogdanrdc authored and hvanhovell committed Apr 14, 2017
    Configuration menu
    Copy the full SHA
    6f715c0 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    2ed19cf View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    2a3e50e View commit details
    Browse the repository at this point in the history

Commits on Apr 17, 2017

  1. [SPARK-20335][SQL][BACKPORT-2.1] Children expressions of Hive UDF impact the determinism of Hive UDF
    
    ### What changes were proposed in this pull request?
    
    This PR is to backport #17635 to Spark 2.1
    
    ---
    ```JAVA
      /**
       * Certain optimizations should not be applied if UDF is not deterministic.
       * Deterministic UDF returns same result each time it is invoked with a
       * particular input. This determinism just needs to hold within the context of
       * a query.
       *
       * return true if the UDF is deterministic
       */
      boolean deterministic() default true;
    ```
    
    Based on the definition of [UDFType](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFType.java#L42-L50), when Hive UDF's children are non-deterministic, Hive UDF is also non-deterministic.
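
    A minimal sketch of that rule (illustrative, not Spark's expression class):

    ```
    // A Hive UDF call is deterministic only when the UDF itself is declared
    // deterministic and all of its child expressions are deterministic.
    final case class HiveUdfCallSketch(
        udfDeclaredDeterministic: Boolean,
        childrenDeterministic: Seq[Boolean]) {
      def deterministic: Boolean =
        udfDeclaredDeterministic && childrenDeterministic.forall(identity)
    }

    // Example: a deterministic UDF applied to rand() is treated as non-deterministic.
    // HiveUdfCallSketch(udfDeclaredDeterministic = true, Seq(false)).deterministic == false
    ```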
    
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #17652 from gatorsmile/backport-17635.
    gatorsmile authored and cloud-fan committed Apr 17, 2017
    Configuration menu
    Copy the full SHA
    efa11a4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20349][SQL] ListFunctions returns duplicate functions after using persistent functions
    
    ### What changes were proposed in this pull request?
    The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
    
    It would be better if the `SessionCatalog` API could de-duplicate the records, instead of each API caller doing it. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR parses it using our parser interface and then de-duplicates the names.
    
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #17646 from gatorsmile/showFunctions.
    
    (cherry picked from commit 01ff035)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    gatorsmile committed Apr 17, 2017
    Configuration menu
    Copy the full SHA
    7aad057 View commit details
    Browse the repository at this point in the history
  3. [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.

    This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping.
    A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead.
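
    For illustration, the intended translation of that example checked against Java's regex engine (inputs are made up):

    ```
    import java.util.regex.Pattern

    // In the LIKE pattern '%\\%', the escaped backslash matches a literal backslash
    // and the trailing '%' keeps its wildcard meaning, so the expected regex is '.*\\.*'.
    val expectedRegex = """.*\\.*"""

    println(Pattern.matches(expectedRegex, """abc\def"""))  // true: contains a literal backslash
    println(Pattern.matches(expectedRegex, "abcdef"))       // false: no backslash present
    ```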
    
    ---
    
    Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMSs and their variations with respect to certain standard features.
    
    | RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
    | --- | --- | --- | --- |
    | [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
    | [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
    | [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
    | [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
    | [PostgreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
    | [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
    | Current Spark | _, % | \ | yes |
    
    [1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket.
    
    The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
    1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
       PostgreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
       According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid.
       _Proposed new behaviour in Spark: throw AnalysisException_
    2. [x] Empty input, e.g. `'' like ''`
       Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
    3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
       Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
       According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
       _Proposed new behaviour in Spark: throw AnalysisException_
    
    The current specification is also described in the operator's source code in this patch.
    
    Extra case in regex unit tests.
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    This patch had conflicts when merged, resolved by
    Committer: Reynold Xin <rxin@databricks.com>
    
    Closes #15398 from jodersky/SPARK-17647.
    
    (cherry picked from commit e5fee3e)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    jodersky authored and rxin committed Apr 17, 2017
    Configuration menu
    Copy the full SHA
    db9517c View commit details
    Browse the repository at this point in the history
  4. [HOTFIX] Fix compilation.

    rxin committed Apr 17, 2017
    Configuration menu
    Copy the full SHA
    622d7a8 View commit details
    Browse the repository at this point in the history

Commits on Apr 18, 2017

  1. [SPARK-20349][SQL][REVERT-BRANCH2.1] ListFunctions returns duplicate functions after using persistent functions
    
    Revert the changes of #17646 made in Branch 2.1, because it breaks the build. It needs the parser interface, but SessionCatalog in branch 2.1 does not have it.
    
    ### What changes were proposed in this pull request?
    
    The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
    
    It would be better if the `SessionCatalog` API could de-duplicate the records, instead of each API caller doing it. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR parses it using our parser interface and then de-duplicates the names.
    
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #17661 from gatorsmile/compilationFix17646.
    gatorsmile authored and rxin committed Apr 18, 2017
    Configuration menu
    Copy the full SHA
    3808b47 View commit details
    Browse the repository at this point in the history
  2. [SPARK-17647][SQL][FOLLOWUP][MINOR] fix typo

    ## What changes were proposed in this pull request?
    
    fix typo
    
    ## How was this patch tested?
    
    manual
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #17663 from felixcheung/likedoctypo.
    
    (cherry picked from commit b0a1e93)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Apr 18, 2017
    Configuration menu
    Copy the full SHA
    a4c1ebc View commit details
    Browse the repository at this point in the history

Commits on Apr 19, 2017

  1. [SPARK-20359][SQL] Avoid unnecessary execution in EliminateOuterJoin optimization that can lead to NPE
    
    Avoid unnecessary execution that can lead to an NPE in EliminateOuterJoin, and add a test in DataFrameSuite to confirm the NPE is no longer thrown.
    
    ## What changes were proposed in this pull request?
    Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed.
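
    A sketch of the lazy-evaluation idea (names borrowed from the description; the class itself is illustrative):

    ```
    // Making the two checks lazy vals means each one is computed only if the
    // optimizer actually needs it, so a side whose evaluation would throw an
    // NPE is never touched unless required.
    class EliminateOuterJoinChecks(left: () => Boolean, right: () => Boolean) {
      lazy val leftHasNonNullPredicate: Boolean = left()
      lazy val rightHasNonNullPredicate: Boolean = right()
    }
    ```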
    
    ## How was this patch tested?
    
    Added a test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in the catalyst project would be better, but I am unsure how to do this.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Koert Kuipers <koert@tresata.com>
    
    Closes #17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join.
    
    (cherry picked from commit 608bf30)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    koertkuipers authored and cloud-fan committed Apr 19, 2017
    Configuration menu
    Copy the full SHA
    171bf65 View commit details
    Browse the repository at this point in the history

Commits on Apr 20, 2017

  1. [MINOR][SS] Fix a missing space in UnsupportedOperationChecker error message
    
    ## What changes were proposed in this pull request?
    
    Also went through the same file to ensure other string concatenation are correct.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17691 from zsxwing/fix-error-message.
    
    (cherry picked from commit 39e303a)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Apr 20, 2017
    Configuration menu
    Copy the full SHA
    9e5dc82 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20409][SQL] fail early if aggregate function in GROUP BY

    ## What changes were proposed in this pull request?
    
    It's illegal to have an aggregate function in GROUP BY, and we should fail at the analysis phase if this happens.
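
    For illustration (hypothetical temp view), this is the query shape that should now be rejected during analysis:

    ```
    import org.apache.spark.sql.{AnalysisException, SparkSession}

    val spark = SparkSession.builder().master("local[*]").appName("groupby-agg-demo").getOrCreate()
    spark.range(10).selectExpr("id % 2 AS key", "id AS value").createOrReplaceTempView("t")

    // An aggregate function inside GROUP BY is illegal and should fail
    // at analysis time rather than later in planning or execution.
    try {
      spark.sql("SELECT key, sum(value) FROM t GROUP BY sum(value)").collect()
    } catch {
      case e: AnalysisException => println(s"Rejected at analysis time: ${e.getMessage}")
    }
    ```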
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17704 from cloud-fan/minor.
    cloud-fan authored and hvanhovell committed Apr 20, 2017
    Configuration menu
    Copy the full SHA
    66e7a8f View commit details
    Browse the repository at this point in the history

Commits on Apr 21, 2017

  1. Small rewording about history server use case

    Hello
    PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful only to YARN or Mesos.
    
    Author: Hervé <dud225@users.noreply.github.com>
    
    Closes #17709 from dud225/patch-1.
    
    (cherry picked from commit 3476799)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    dud225 authored and srowen committed Apr 21, 2017
    Configuration menu
    Copy the full SHA
    fb0351a View commit details
    Browse the repository at this point in the history

Commits on Apr 22, 2017

  1. [SPARK-20407][TESTS][BACKPORT-2.1] ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test
    
    ## What changes were proposed in this pull request?
    
    SharedSQLContext.afterEach now calls DebugFilesystem.assertNoOpenStreams inside eventually.
    SQLTestUtils withTempDir calls waitForTasksToFinish before deleting the directory.
    
    ## How was this patch tested?
    New test but marked as ignored because it takes 30s. Can be unignored for review.
    
    Author: Bogdan Raducanu <bogdan@databricks.com>
    
    Closes #17720 from bogdanrdc/SPARK-20407-BACKPORT2.1.
    bogdanrdc authored and gatorsmile committed Apr 22, 2017
    Configuration menu
    Copy the full SHA
    ba50580 View commit details
    Browse the repository at this point in the history

Commits on Apr 24, 2017

  1. [SPARK-20450][SQL] Unexpected first-query schema inference cost with 2.1.1
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 where Spark silently fails to read case-sensitive fields missing a case-sensitive schema in the table properties. The fix is to detect this situation, infer the schema, and write the case-sensitive schema into the metastore.
    
    However this can incur an unexpected performance hit the first time such a problematic table is queried (and there is a high false-positive rate here since most tables don't actually have case-sensitive fields).
    
    This PR changes the default to NEVER_INFER (same behavior as 2.1.0). In 2.2, we can consider leaving the default to INFER_AND_SAVE.
    
    ## How was this patch tested?
    
    Unit tests.
    
    Author: Eric Liang <ekl@databricks.com>
    
    Closes #17749 from ericl/spark-20450.
    ericl authored and hvanhovell committed Apr 24, 2017
    Configuration menu
    Copy the full SHA
    d99b49b View commit details
    Browse the repository at this point in the history

Commits on Apr 25, 2017

  1. [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit
    
    ## What changes were proposed in this pull request?
    
    In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits.
    
    To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.
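
    A rough sketch of the pruning step (top-level type check only; illustrative rather than Spark's exact logic):

    ```
    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.MapType

    // Build the deterministic sort order for randomSplit from the columns that
    // can actually be sorted, dropping map-typed columns; if nothing remains,
    // the dataset would be materialized instead.
    def sortableColumns(df: DataFrame): Seq[Column] =
      df.schema.fields.toSeq.collect {
        case f if !f.dataType.isInstanceOf[MapType] => col(f.name)
      }
    ```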
    
    ## How was this patch tested?
    
    Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test dataframes with mapTypes and nested mapTypes.
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #17751 from sameeragarwal/randomsplit2.
    
    (cherry picked from commit 31345fd)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    sameeragarwal authored and cloud-fan committed Apr 25, 2017
    Configuration menu
    Copy the full SHA
    4279665 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20455][DOCS] Fix Broken Docker IT Docs

    ## What changes were proposed in this pull request?
    
    Just added the Maven `test` goal.
    
    ## How was this patch tested?
    
    No test needed, just a trivial documentation fix.
    
    Author: Armin Braun <me@obrown.io>
    
    Closes #17756 from original-brownbear/SPARK-20455.
    
    (cherry picked from commit c8f1219)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    original-brownbear authored and srowen committed Apr 25, 2017
    Configuration menu
    Copy the full SHA
    65990fc View commit details
    Browse the repository at this point in the history
  3. [SPARK-20404][CORE] Using Option(name) instead of Some(name)

    Using Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following
    ```
    sparkContext.accumulator(0, null)
    ```
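
    For illustration only, a tiny standalone Scala sketch of the distinction the fix relies on (no Spark required):

    ```scala
    object OptionVsSome {
      def main(args: Array[String]): Unit = {
        val name: String = null
        println(Some(name))    // Some(null): a "present" value that can still NPE later
        println(Option(name))  // None: null is normalized away at the boundary
      }
    }
    ```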
    
    Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
    
    Closes #17740 from szhem/SPARK-20404-null-acc-names.
    
    (cherry picked from commit 0bc7a90)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    szhem authored and srowen committed Apr 25, 2017
    Configuration menu
    Copy the full SHA
    2d47e1a View commit details
    Browse the repository at this point in the history
  4. [SPARK-20239][CORE][2.1-BACKPORT] Improve HistoryServer's ACL mechanism

    Current SHS (Spark History Server) has two different ACLs:
    
    * ACL of the base URL, controlled by "spark.acls.enabled" or "spark.ui.acls.enabled". With this enabled, only users configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started the SHS, can list all the applications; otherwise none of them can be listed. This also affects the REST APIs that list the summary of all apps and of a single app.
    * Per-application ACL, controlled by "spark.history.ui.acls.enabled". With this enabled, only the history admin user and the user/group who ran the app can access its details.
    
    With these two ACLs, we may encounter several unexpected behaviors:
    
    1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app.
    2. If the base URL's ACL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A".
    3. Changes to the Live UI's ACL will also affect the History UI's ACL, since they share the same conf file.

    These unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage them all.
    
    So, to improve the SHS's ACL mechanism, this PR proposes to:
    
    1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for history server.
    2. Check permission for event-log download REST API.
    
    With this PR:
    
    1. Admin user could see/download the list of all applications, as well as application details.
    2. A normal user can see the list of all applications, but can only download and check the details of applications accessible to them.
    
    New UTs are added, also verified in real cluster.
    
    CC tgravescs vanzin please help to review, this PR changes the semantics you did previously. Thanks a lot.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17755 from jerryshao/SPARK-20239-2.1-backport.
    jerryshao authored and Marcelo Vanzin committed Apr 25, 2017
    Configuration menu
    Copy the full SHA
    359382c View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    267aca5 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    8460b09 View commit details
    Browse the repository at this point in the history

Commits on Apr 26, 2017

  1. [SPARK-20439][SQL][BACKPORT-2.1] Fix Catalog API listTables and getTa…

    …ble when failed to fetch table metadata
    
    ### What changes were proposed in this pull request?
    
    This PR is to backport #17730 to Spark 2.1
    ---
    `spark.catalog.listTables` and `spark.catalog.getTable` do not work if we are unable to retrieve table metadata for any reason (e.g., the table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table, just without the description and tableType.
    
    ### How was this patch tested?
    Added a test case
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #17760 from gatorsmile/backport-17730.
    gatorsmile authored and cloud-fan committed Apr 26, 2017
    Configuration menu
    Copy the full SHA
    6696ad0 View commit details
    Browse the repository at this point in the history

Commits on Apr 28, 2017

  1. [SPARK-20496][SS] Bug in KafkaWriter Looks at Unanalyzed Plans

    ## What changes were proposed in this pull request?
    
    We didn't enforce analyzed plans in Spark 2.1 when writing out to Kafka.
    
    ## How was this patch tested?
    
    New unit test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Bill Chambers <bill@databricks.com>
    
    Closes #17804 from anabranch/SPARK-20496-2.
    
    (cherry picked from commit 733b81b)
    Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
    Bill Chambers authored and brkyvz committed Apr 28, 2017
    Configuration menu
    Copy the full SHA
    5131b0a View commit details
    Browse the repository at this point in the history

Commits on May 1, 2017

  1. [SPARK-20517][UI] Fix broken history UI download link

    The download link in history server UI is concatenated with:
    
    ```
     <td><a href="{{uiroot}}/api/v1/applications/{{id}}/{{num}}/logs" class="btn btn-info btn-mini">Download</a></td>
    ```
    
    Here the `num` field represents the number of attempts, which does not match what the REST API expects. In the REST API, if the attempt id does not exist the URL should be `api/v1/applications/<id>/logs`; otherwise it should be `api/v1/applications/<id>/<attemptId>/logs`. Using `<num>` to represent `<attemptId>` leads to a "no such app" error.
    
    Manual verification.
    
    CC ajbozarth can you please review this change, since you added this feature before? Thanks!
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17795 from jerryshao/SPARK-20517.
    
    (cherry picked from commit ab30590)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    jerryshao authored and Marcelo Vanzin committed May 1, 2017
    Configuration menu
    Copy the full SHA
    868b4a1 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20540][CORE] Fix unstable executor requests.

    There are two problems fixed in this commit. First, the
    ExecutorAllocationManager sets a timeout to avoid requesting executors
    too often. However, the next deadline is always computed from the previous
    deadline plus the timeout interval, not from the current time. If the call
    is delayed by locking for more than that interval, the deadline falls
    permanently behind and the manager will request more executors on every
    run. This seems to be the main cause of SPARK-20540.
    
    The second problem is that the total number of requested executors is
    not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates
    the value based on the current status of 3 variables: the number of
    known executors, the number of executors that have been killed, and the
    number of pending executors. But, the number of pending executors is
    never less than 0, even though there may be more known than requested.
    When executors are killed and not replaced, this can cause the request
    sent to YARN to be incorrect because there were too many executors due
    to the scheduler's state being slightly out of date. This is fixed by tracking
    the currently requested size explicitly.
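
    A standalone sketch of the shape of both fixes (the class and field names below are illustrative, not the actual ExecutorAllocationManager or scheduler backend internals):

    ```scala
    class AllocationSketch(intervalMs: Long) {
      private var deadline = 0L        // next time a request may be sent
      private var requestedTotal = 0   // explicit record of what was asked of the cluster manager

      def maybeRequest(wantedTotal: Int): Boolean = synchronized {
        val now = System.currentTimeMillis()
        if (now < deadline) {
          false                        // too soon; do nothing
        } else {
          // Fix 1: anchor the next deadline to the clock, not to the old deadline,
          // so a delayed call cannot leave the deadline permanently in the past.
          deadline = now + intervalMs
          // Fix 2: remember the requested total instead of re-deriving it from
          // known / pending / killed executor counts.
          requestedTotal = wantedTotal
          true
        }
      }

      def currentTarget: Int = synchronized(requestedTotal)
    }
    ```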
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #17813 from rdblue/SPARK-20540-fix-dynamic-allocation.
    
    (cherry picked from commit 2b2dd08)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    rdblue authored and Marcelo Vanzin committed May 1, 2017
    Configuration menu
    Copy the full SHA
    5915588 View commit details
    Browse the repository at this point in the history

Commits on May 3, 2017

  1. [SPARK-20558][CORE] clear InheritableThreadLocal variables in SparkCo…

    …ntext when stopping it
    
    ## What changes were proposed in this pull request?
    
    To better understand this problem, let's take a look at an example first:
    ```
    object Main {
      def main(args: Array[String]): Unit = {
        var t = new Test
        new Thread(new Runnable {
          override def run() = {}
        }).start()
        println("first thread finished")
    
        t.a = null
        t = new Test
        new Thread(new Runnable {
          override def run() = {}
        }).start()
      }
    
    }
    
    class Test {
      var a = new InheritableThreadLocal[String] {
        override protected def childValue(parent: String): String = {
          println("parent value is: " + parent)
          parent
        }
      }
      a.set("hello")
    }
    ```
    The result is:
    ```
    parent value is: hello
    first thread finished
    parent value is: hello
    parent value is: hello
    ```
    
    Once an `InheritableThreadLocal` has been set, child threads will inherit its value as long as it has not been GCed, so setting the variable that holds the `InheritableThreadLocal` to `null` doesn't work as we expect.

    In `SparkContext`, we have an `InheritableThreadLocal` for local properties; we should clear it when stopping the `SparkContext`, or all future child threads will still inherit it, copy the properties, and waste memory.

    This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548 , which creates/stops `SparkContext` many times and eventually leaves many `InheritableThreadLocal` instances alive, causing OOM when starting new threads in the internal thread pools.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17833 from cloud-fan/core.
    
    (cherry picked from commit b946f31)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed May 3, 2017
    Configuration menu
    Copy the full SHA
    d10b0f6 View commit details
    Browse the repository at this point in the history

Commits on May 5, 2017

  1. [SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode

    ## What changes were proposed in this pull request?
    
    Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error.
    
    ## How was this patch tested?
    
    Existing unit tests, manual spark-shell testing with posix mode on
    
    Author: jyu00 <jessieyu@us.ibm.com>
    
    Closes #17852 from jyu00/master.
    
    (cherry picked from commit 5773ab1)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    jyu00 authored and srowen committed May 5, 2017
    Configuration menu
    Copy the full SHA
    179f537 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20613] Remove excess quotes in Windows executable

    ## What changes were proposed in this pull request?
    
    Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark.
    
    '""C:\Program' is not recognized as an internal or external command, operable program or batch file.
    
    ## How was this patch tested?
    
    Tested manually on Windows 10.
    
    Author: Jarrett Meyer <jarrettmeyer@gmail.com>
    
    Closes #17861 from jarrettmeyer/fix-windows-cmd.
    
    (cherry picked from commit b9ad2d1)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    jarrettmeyer authored and Felix Cheung committed May 5, 2017
    Configuration menu
    Copy the full SHA
    2a7f5da View commit details
    Browse the repository at this point in the history
  3. [SPARK-20603][SS][TEST] Set default number of topic partitions to 1 t…

    …o reduce the load
    
    ## What changes were proposed in this pull request?
    
    I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create the Kafka internal topic `__consumer_offsets`. As Kafka creates this topic lazily, the topic creation happens in the first test `deserialization of initial offset with Spark 2.1.0` and causes it to time out.
    
    This PR changes `offsets.topic.num.partitions` from the default value 50 to 1 to make creating `__consumer_offsets` (50 partitions -> 1 partition) much faster.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #17863 from zsxwing/fix-kafka-flaky-test.
    
    (cherry picked from commit bd57882)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed May 5, 2017
    Configuration menu
    Copy the full SHA
    704b249 View commit details
    Browse the repository at this point in the history
  4. [SPARK-20616] RuleExecutor logDebug of batch results should show diff…

    … to start of batch
    
    ## What changes were proposed in this pull request?
    
    Due to a likely typo, the logDebug message printing the diff of query plans shows a diff to the initial plan, not a diff to the start of the batch.
    
    ## How was this patch tested?
    
    Now the debug message prints the diff between start and end of batch.
    
    Author: Juliusz Sompolski <julek@databricks.com>
    
    Closes #17875 from juliuszsompolski/SPARK-20616.
    
    (cherry picked from commit 5d75b14)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    juliuszsompolski authored and rxin committed May 5, 2017
    Configuration menu
    Copy the full SHA
    a1112c6 View commit details
    Browse the repository at this point in the history

Commits on May 9, 2017

  1. [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsEx…

    …ception
    
    ## What changes were proposed in this pull request?
    
    Added a check for the number of defined values. Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero.
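
    A minimal standalone sketch of the added guard (simplified; the real `SparseVector.argmax` also compares the stored maximum against the implicit zeros):

    ```scala
    def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
      if (size == 0) {
        -1                                              // empty vector: no argmax
      } else if (values.isEmpty) {
        0                                               // only implicit zeros are present
      } else {
        indices(values.indices.maxBy(i => values(i)))   // old code ran this even when values was empty
      }
    }
    ```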
    
    ## How was this patch tested?
    
    Tests were added to the existing VectorsSuite to cover this case.
    
    Author: Jon McLean <jon.mclean@atsid.com>
    
    Closes #17877 from jonmclean/vectorArgmaxIndexBug.
    
    (cherry picked from commit be53a78)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Jon McLean authored and srowen committed May 9, 2017
    Configuration menu
    Copy the full SHA
    f7a91a1 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Pyt…

    …hon version
    
    ## What changes were proposed in this pull request?
    
    Drop the hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions, we can look at making different packages or similar.
    
    ## How was this patch tested?
    
    Ran `make-distribution` locally
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
    
    (cherry picked from commit 1b85bcd)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed May 9, 2017
    Configuration menu
    Copy the full SHA
    12c937e View commit details
    Browse the repository at this point in the history

Commits on May 10, 2017

  1. [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars is null when …

    …calling createJoinKey
    
    ## What changes were proposed in this pull request?
    
    The following SQL query causes an `IndexOutOfBoundsException` when `LIMIT > 1310720`:
    ```sql
    CREATE TABLE tab1(int int, int2 int, str string);
    CREATE TABLE tab2(int int, int2 int, str string);
    INSERT INTO tab1 values(1,1,'str');
    INSERT INTO tab1 values(2,2,'str');
    INSERT INTO tab2 values(1,1,'str');
    INSERT INTO tab2 values(2,3,'str');
    
    SELECT
      count(*)
    FROM
      (
        SELECT t1.int, t2.int2
        FROM (SELECT * FROM tab1 LIMIT 1310721) t1
        INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
        ON (t1.int = t2.int AND t1.int2 = t2.int2)
      ) t;
    ```
    
    This pull request fixes this issue.
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuming Wang <wgyumg@gmail.com>
    
    Closes #17920 from wangyum/SPARK-17685.
    
    (cherry picked from commit 771abeb)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    wangyum authored and hvanhovell committed May 10, 2017
    Configuration menu
    Copy the full SHA
    50f28df View commit details
    Browse the repository at this point in the history
  2. [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggrega…

    …te without grouping
    
    The query
    
    ```
    SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
    ```
    
    should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows.
    
    This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead:
    
    An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows.
    
    If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
    
    The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
    
    This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
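
    A standalone sketch of the corrected condition (`AggSketch` is an illustrative stand-in, not Catalyst's Aggregate node):

    ```scala
    final case class AggSketch(groupingExpressions: Seq[String], aggregateExpressions: Seq[String])

    def canPropagateEmpty(agg: AggSketch, childIsEmpty: Boolean): Boolean = {
      // The old check looked at agg.aggregateExpressions.nonEmpty. Only a *grouped*
      // aggregate over an empty input produces no rows; a global aggregate
      // (no grouping expressions) always emits exactly one row.
      childIsEmpty && agg.groupingExpressions.nonEmpty
    }
    ```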
    
    - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
    - Updated unit tests in `PropagateEmptyRelationSuite`.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.
    
    (cherry picked from commit a90c5cd)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    JoshRosen authored and cloud-fan committed May 10, 2017
    Configuration menu
    Copy the full SHA
    8e09789 View commit details
    Browse the repository at this point in the history
  3. [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsisten…

    …cy should use values not Params
    
    ## What changes were proposed in this pull request?
    
    - Replace `getParam` calls with `getOrDefault` calls.
    - Fix exception message to avoid unintended `TypeError`.
    - Add unit tests
    
    ## How was this patch tested?
    
    New unit tests.
    
    Author: zero323 <zero323@users.noreply.github.com>
    
    Closes #17891 from zero323/SPARK-20631.
    
    (cherry picked from commit 804949c)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    zero323 authored and yanboliang committed May 10, 2017
    Configuration menu
    Copy the full SHA
    69786ea View commit details
    Browse the repository at this point in the history
  4. [SPARK-20688][SQL] correctly check analysis for scalar sub-queries

    In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, as later we will call `plan.output` which is invalid if `plan` is not resolved.
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #17930 from cloud-fan/tmp.
    
    (cherry picked from commit 789bdbe)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed May 10, 2017
    Configuration menu
    Copy the full SHA
    bdc08ab View commit details
    Browse the repository at this point in the history
  5. [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ …

    …repeated arg.
    
    ## What changes were proposed in this pull request?
    
    There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error.
    
    This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
    
    This fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
    
    ## How was this patch tested?
    
    New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #17927 from JoshRosen/SPARK-20685.
    
    (cherry picked from commit 8ddbc43)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    JoshRosen authored and gatorsmile committed May 10, 2017
    Configuration menu
    Copy the full SHA
    92a71a6 View commit details
    Browse the repository at this point in the history

Commits on May 12, 2017

  1. [SPARK-20665][SQL] "Bround" and "Round" function return NULL

       spark-sql>select bround(12.3, 2);
       spark-sql>NULL
    For this case, the expected result is 12.3, but it is null.
    So, when the second parameter is bigger than the decimal's scale, the result is not what we expected.
    The "round" function has the same problem. This PR solves the problem for both of them.
    
    unit test cases in MathExpressionsSuite and MathFunctionsSuite
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes #17906 from 10110346/wip_lx_0509.
    
    (cherry picked from commit 2b36eb6)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    10110346 authored and cloud-fan committed May 12, 2017
    Configuration menu
    Copy the full SHA
    6e89d57 View commit details
    Browse the repository at this point in the history
  2. [SPARK-17424] Fix unsound substitution bug in ScalaReflection.

    ## What changes were proposed in this pull request?
    
    This method gets a type's primary constructor and fills in type parameters with concrete types. For example, `MapPartitions[T, U] -> MapPartitions[Int, String]`. This substitution fails when the actual type args are empty because they are still unknown. Instead, when there are no resolved types to substitute, this now returns the original args with unresolved type parameters.
    ## How was this patch tested?
    
    This doesn't affect substitutions where the type args are determined. With this fix, our case where the actual type args are empty now runs successfully.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #15062 from rdblue/SPARK-17424-fix-unsound-reflect-substitution.
    
    (cherry picked from commit b236933)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and cloud-fan committed May 12, 2017
    Configuration menu
    Copy the full SHA
    95de467 View commit details
    Browse the repository at this point in the history

Commits on May 15, 2017

  1. [SPARK-20705][WEB-UI] The sort function can not be used in the master…

    … page when you use Firefox or Google Chrome.
    
    ## What changes were proposed in this pull request?
    When you open the master page in Firefox or Google Chrome, the browser console shows an error, but IE has no problem.
    e.g.
    ![error](https://cloud.githubusercontent.com/assets/26266482/25946143/74467a5c-367c-11e7-8f9f-d3585b1aea88.png)
    
    My Firefox version is 48.0.2.
    My Google Chrome version  is 49.0.2623.75 m.
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
    Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
    Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
    
    Closes #17952 from guoxiaolongzte/SPARK-20705.
    
    (cherry picked from commit 99d5799)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    guoxiaolong authored and srowen committed May 15, 2017
    Configuration menu
    Copy the full SHA
    62969e9 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20735][SQL][TEST] Enable cross join in TPCDSQueryBenchmark

    ## What changes were proposed in this pull request?
    
    Since [SPARK-17298](https://issues.apache.org/jira/browse/SPARK-17298), some queries (q28, q61, q77, q88, q90) in the test suites fail with a message "_Use the CROSS JOIN syntax to allow cartesian products between these relations_".
    
    This benchmark is used as a reference model for Spark TPC-DS, so this PR aims to enable the correct configuration in `TPCDSQueryBenchmark.scala`.
    
    ## How was this patch tested?
    
    Manual. (Run TPCDSQueryBenchmark)
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #17977 from dongjoon-hyun/SPARK-20735.
    
    (cherry picked from commit bbd163d)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    dongjoon-hyun authored and gatorsmile committed May 15, 2017
    Configuration menu
    Copy the full SHA
    14b6a9d View commit details
    Browse the repository at this point in the history

Commits on May 17, 2017

  1. [SPARK-20769][DOC] Incorrect documentation for using Jupyter notebook

    ## What changes were proposed in this pull request?
    
    SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS=notebook from documentation to use pyspark with Jupyter notebook. This patch corrects the documentation error.
    
    ## How was this patch tested?
    
    Tested invocation locally with
    ```bash
    PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark
    ```
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes #18001 from aray/patch-1.
    
    (cherry picked from commit 1995417)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    aray authored and srowen committed May 17, 2017
    Configuration menu
    Copy the full SHA
    ba35c6b View commit details
    Browse the repository at this point in the history

Commits on May 18, 2017

  1. [SPARK-20796] the location of start-master.sh in spark-standalone.md …

    …is wrong
    
    [https://issues.apache.org/jira/browse/SPARK-20796](https://issues.apache.org/jira/browse/SPARK-20796)
    the location of start-master.sh in spark-standalone.md should be "sbin/start-master.sh" rather than "bin/start-master.sh".
    
    Author: liuzhaokun <liu.zhaokun@zte.com.cn>
    
    Closes #18027 from liu-zhaokun/sbin.
    
    (cherry picked from commit 99452df)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    liu-zhaokun authored and srowen committed May 18, 2017
    Configuration menu
    Copy the full SHA
    e06d936 View commit details
    Browse the repository at this point in the history

Commits on May 19, 2017

  1. [SPARK-20798] GenerateUnsafeProjection should check if a value is nul…

    …l before calling the getter
    
    ## What changes were proposed in this pull request?
    
    GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior.
    
    Example of code generated before:
    ```scala
    /* 059 */         final UTF8String fieldName = value.getUTF8String(0);
    /* 060 */         if (value.isNullAt(0)) {
    /* 061 */           rowWriter1.setNullAt(0);
    /* 062 */         } else {
    /* 063 */           rowWriter1.write(0, fieldName);
    /* 064 */         }
    ```
    
    Example of code generated now:
    ```scala
    /* 060 */         boolean isNull1 = value.isNullAt(0);
    /* 061 */         UTF8String value1 = isNull1 ? null : value.getUTF8String(0);
    /* 062 */         if (isNull1) {
    /* 063 */           rowWriter1.setNullAt(0);
    /* 064 */         } else {
    /* 065 */           rowWriter1.write(0, value1);
    /* 066 */         }
    ```
    
    ## How was this patch tested?
    
    Adds GenerateUnsafeProjectionSuite.
    
    Author: Ala Luszczak <ala@databricks.com>
    
    Closes #18030 from ala/fix-generate-unsafe-projection.
    
    (cherry picked from commit ce8edb8)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    ala authored and hvanhovell committed May 19, 2017
    Configuration menu
    Copy the full SHA
    e326de4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20759] SCALA_VERSION in _config.yml should be consistent with …

    …pom.xml
    
    [https://issues.apache.org/jira/browse/SPARK-20759](https://issues.apache.org/jira/browse/SPARK-20759)
    SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with pom.xml.
    
    Author: liuzhaokun <liu.zhaokun@zte.com.cn>
    
    Closes #17992 from liu-zhaokun/new.
    
    (cherry picked from commit dba2ca2)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    liu-zhaokun authored and srowen committed May 19, 2017
    Configuration menu
    Copy the full SHA
    c53fe79 View commit details
    Browse the repository at this point in the history
  3. [SPARK-20781] the location of Dockerfile in docker.properties.templat…

    … is wrong
    
    [https://issues.apache.org/jira/browse/SPARK-20781](https://issues.apache.org/jira/browse/SPARK-20781)
    the location of Dockerfile in docker.properties.template should be "../external/docker/spark-mesos/Dockerfile"
    
    Author: liuzhaokun <liu.zhaokun@zte.com.cn>
    
    Closes #18013 from liu-zhaokun/dockerfile_location.
    
    (cherry picked from commit 749418d)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    liu-zhaokun authored and srowen committed May 19, 2017
    Configuration menu
    Copy the full SHA
    e9804b3 View commit details
    Browse the repository at this point in the history

Commits on May 22, 2017

  1. [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when convert…

    …ing from Breeze sparse matrix
    
    ## What changes were proposed in this pull request?
    
    When two Breeze SparseMatrices are operated on, the result matrix may contain extra provisional 0 values in its rowIndices and data arrays. This makes them inconsistent with the colPtrs data, but Breeze gets away with the inconsistency by keeping a counter of the valid data.

    In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, but these might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using their counter of active data.

    This method is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.
    
    See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
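
    A standalone sketch of the slicing idea (`CscSketch` is a simplified stand-in for Breeze's CSC representation, not the actual `Matrices.fromBreeze` code):

    ```scala
    final case class CscSketch(colPtrs: Array[Int], rowIndices: Array[Int],
                               data: Array[Double], activeSize: Int)

    def trimmedArrays(m: CscSketch): (Array[Int], Array[Double]) =
      if (m.rowIndices.length > m.activeSize)
        // only the first activeSize entries are valid; drop the provisional zeros
        (m.rowIndices.take(m.activeSize), m.data.take(m.activeSize))
      else
        (m.rowIndices, m.data)
    ```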
    
    ## How was this patch tested?
    
    Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
    
    Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
    
    Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
    Author: Ignacio Bermudez Corrales <icorrales@splunk.com>
    
    Closes #17940 from ghoto/bug-fix/SPARK-20687.
    
    (cherry picked from commit 06dda1d)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    ghoto authored and srowen committed May 22, 2017
    Configuration menu
    Copy the full SHA
    c3a986b View commit details
    Browse the repository at this point in the history
  2. [SPARK-20756][YARN] yarn-shuffle jar references unshaded guava

    and contains scala classes
    
    ## What changes were proposed in this pull request?
    This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava classes already provided in the jar.
    
    Also, it explicitly excludes scala classes from being added to the jar.
    
    ## How was this patch tested?
    Ran unit tests on the module and they passed.
    javap now returns the expected result - a reference to the shaded guava under `org/spark_project` (previously this referred to `com.google...`):
    ```
    javap -cp common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar -c org/apache/spark/network/yarn/YarnShuffleService | grep Lists
          57: invokestatic  #138                // Method org/spark_project/guava/collect/Lists.newArrayList:()Ljava/util/ArrayList;
    ```
    
    Guava is still shaded in the jar:
    ```
    jar -tf common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar | grep guava | head
    META-INF/maven/com.google.guava/
    META-INF/maven/com.google.guava/guava/
    META-INF/maven/com.google.guava/guava/pom.properties
    META-INF/maven/com.google.guava/guava/pom.xml
    org/spark_project/guava/
    org/spark_project/guava/annotations/
    org/spark_project/guava/annotations/Beta.class
    org/spark_project/guava/annotations/GwtCompatible.class
    org/spark_project/guava/annotations/GwtIncompatible.class
    org/spark_project/guava/annotations/VisibleForTesting.class
    ```
    (not sure if the above META-INF/* is a problem or not)
    
    I took this jar, deployed it on a yarn cluster with shuffle service enabled, and made sure the YARN node managers came up. An application with a shuffle was run and it succeeded.
    
    Author: Mark Grover <mark@apache.org>
    
    Closes #17990 from markgrover/spark-20756.
    
    (cherry picked from commit 3630911)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    markgrover authored and Marcelo Vanzin committed May 22, 2017
    Configuration menu
    Copy the full SHA
    f5ef076 View commit details
    Browse the repository at this point in the history

Commits on May 23, 2017

  1. [SPARK-20763][SQL][BACKPORT-2.1] The function of month and day re…

    …turn the value which is not we expected.
    
    What changes were proposed in this pull request?
    
    This PR is to backport #17997 to Spark 2.1
    
    When the date is before "1582-10-04", the month and day functions return values that are not what we expected.
    How was this patch tested?
    
    unit tests
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes #18054 from 10110346/wip-lx-0522.
    10110346 authored and ueshin committed May 23, 2017
    Configuration menu
    Copy the full SHA
    f4538c9 View commit details
    Browse the repository at this point in the history

Commits on May 24, 2017

  1. [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape i…

    …n LogisticRegressionModel
    
    ## What changes were proposed in this pull request?
    
    Fixed a TypeError with Python 3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, Python 3 uses float division for `/`, so we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
    
    ## How was this patch tested?
    
    Existing tests run using python3 and numpy 1.12.
    
    Author: Bago Amirbekian <bago@databricks.com>
    
    Closes #18081 from MrBago/BF-py3floatbug.
    
    (cherry picked from commit bc66a77)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    MrBago authored and yanboliang committed May 24, 2017
    Configuration menu
    Copy the full SHA
    13adc0f View commit details
    Browse the repository at this point in the history
  2. [SPARK-20848][SQL] Shutdown the pool after reading parquet files

    ## What changes were proposed in this pull request?
    
    From JIRA: on each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state and never stopped, which leads to unbounded growth in the number of threads.
    
    We should shutdown the pool after reading parquet files.
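
    The shape of the fix, as a minimal standalone sketch using plain java.util.concurrent (not Spark's internals):

    ```scala
    import java.util.concurrent.ForkJoinPool

    // Create the pool, use it, and always shut it down, so no idle worker
    // threads are leaked on each read.
    def withForkJoinPool[T](parallelism: Int)(body: ForkJoinPool => T): T = {
      val pool = new ForkJoinPool(parallelism)
      try body(pool)
      finally pool.shutdown()   // previously the pool was simply abandoned
    }
    ```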
    
    ## How was this patch tested?
    
    Added a test to ParquetFileFormatSuite.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18073 from viirya/SPARK-20848.
    
    (cherry picked from commit f72ad30)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed May 24, 2017
    Configuration menu
    Copy the full SHA
    2f68631 View commit details
    Browse the repository at this point in the history

Commits on May 25, 2017

  1. [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and comple…

    …tion iterator read lock release
    
    This is a backport PR of  #18076 to 2.1.
    
    ## What changes were proposed in this pull request?
    
    When a TaskContext is not propagated properly to all child threads for the task, as in the cases reported in this issue, we fail to get the TID from the TaskContext, which leaves us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.
    
    ## How was this patch tested?
    
    Add new failing regression test case in `RDDSuite`.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18099 from jiangxb1987/completion-iterator-2.1.
    jiangxb1987 authored and cloud-fan committed May 25, 2017
    Configuration menu
    Copy the full SHA
    c3302e8 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet…

    … files
    
    ## What changes were proposed in this pull request?
    
    This is a follow-up to #18073. Taking a safer approach to shutdown the pool to prevent possible issue. Also using `ThreadUtils.newForkJoinPool` instead to set a better thread name.
    
    ## How was this patch tested?
    
    Manually test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18100 from viirya/SPARK-20848-followup.
    
    (cherry picked from commit 6b68d61)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed May 25, 2017
    Configuration menu
    Copy the full SHA
    7015f6f View commit details
    Browse the repository at this point in the history
  3. [SPARK-20250][CORE] Improper OOM error when a task been killed while …

    …spilling data
    
    Currently, when a task is calling spill() but receives a kill request from the driver (e.g., for a speculative task), the `TaskMemoryManager` throws an `OOM` exception. Also, we don't catch fatal exceptions when the error is caused by `Thread.interrupt`. So for `ClosedByInterruptException`, we should throw a `RuntimeException` instead of an `OutOfMemoryError`.
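
    A minimal standalone sketch of the exception mapping (illustrative only; the real change lives in TaskMemoryManager's spill path):

    ```scala
    import java.nio.channels.ClosedByInterruptException

    def spillOrFail(spill: () => Long): Long =
      try spill()
      catch {
        case e: ClosedByInterruptException =>
          // The task was killed mid-spill; surface a regular task failure
          // instead of pretending the JVM ran out of memory.
          throw new RuntimeException("task interrupted while spilling", e)
      }
    ```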
    
    https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK
    
    Existing unit tests.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes #18090 from ConeyLiu/SPARK-20250.
    
    (cherry picked from commit 731462a)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and cloud-fan committed May 25, 2017
    Configuration menu
    Copy the full SHA
    7fc2347 View commit details
    Browse the repository at this point in the history
  4. [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to exam…

    …ples project
    
    ## What changes were proposed in this pull request?
    
    Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`.
    
    ## How was this patch tested?
    
    manually tested it.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18101 from zsxwing/add-missing-example-dep.
    
    (cherry picked from commit 98c3852)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed May 25, 2017
    Configuration menu
    Copy the full SHA
    4f6fccf View commit details
    Browse the repository at this point in the history

Commits on May 26, 2017

  1. [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position af…

    …ter FileChannel.transferTo
    
    ## What changes were proposed in this pull request?
    
    A long time ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer related to `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the write to try to discover the bug earlier.

    However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it.

    https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find the root cause after adding this position check.
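
    A standalone sketch of the kind of position check being added (a plain FileChannel copy, not the UnsafeShuffleWriter code itself):

    ```scala
    import java.nio.channels.FileChannel

    // Copy `length` bytes from the start of `src` into `dst`, then verify that
    // `dst` actually advanced by `length`.
    def copyWithPositionCheck(src: FileChannel, dst: FileChannel, length: Long): Unit = {
      val initialPos = dst.position()
      var copied = 0L
      while (copied < length) {
        copied += src.transferTo(copied, length - copied, dst)
      }
      val finalPos = dst.position()
      require(finalPos == initialPos + length,
        s"destination advanced to $finalPos, expected ${initialPos + length}; " +
          "possible FileChannel.transferTo position bug")
    }
    ```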
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #18091 from cloud-fan/shuffle.
    
    (cherry picked from commit d9ad789)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed May 26, 2017
    Configuration menu
    Copy the full SHA
    6e6adcc View commit details
    Browse the repository at this point in the history

Commits on May 27, 2017

  1. [SPARK-20843][CORE] Add a config to set driver terminate timeout

    ## What changes were proposed in this pull request?
    
    Add a `worker` configuration to set how long to wait before forcibly killing the driver.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18126 from zsxwing/SPARK-20843.
    
    (cherry picked from commit 6c1dbd6)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed May 27, 2017
    Configuration menu
    Copy the full SHA
    ebd72f4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-20393][WEB UI] Strengthen Spark to prevent XSS vulnerabilities

    Add stripXSS and stripXSSMap to Spark Core's UIUtils, and call these functions at any point where getParameter is called against an HttpServletRequest.
    
    Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.
    
    Author: NICHOLAS T. MARION <nmarion@us.ibm.com>
    
    Closes #17686 from n-marion/xss-fix.
    
    (cherry picked from commit b512233)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    n-marion authored and srowen committed May 27, 2017
    Configuration menu
    Copy the full SHA
    38f37c5 View commit details
    Browse the repository at this point in the history

Commits on May 31, 2017

  1. [SPARK-20275][UI] Do not display "Completed" column for in-progress a…

    …pplications
    
    ## What changes were proposed in this pull request?
    
    The current HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of unnecessarily showing this incorrect completed date, this PR proposes to make the column invisible for in-progress applications.

    The reason for only making this column invisible rather than deleting the field is that the data is fetched through the REST API, and in the REST API the format is as shown below, in which `endTime` matches `endTimeEpoch`. So instead of changing the REST API and breaking backward compatibility, this takes the simple route of only hiding the column.
    
    ```
    [ {
      "id" : "local-1491805439678",
      "name" : "Spark shell",
      "attempts" : [ {
        "startTime" : "2017-04-10T06:23:57.574GMT",
        "endTime" : "1969-12-31T23:59:59.999GMT",
        "lastUpdated" : "2017-04-10T06:23:57.574GMT",
        "duration" : 0,
        "sparkUser" : "",
        "completed" : false,
        "startTimeEpoch" : 1491805437574,
        "endTimeEpoch" : -1,
        "lastUpdatedEpoch" : 1491805437574
      } ]
    } ]
    ```
    
    Here is UI before changed:
    
    <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">
    
    And after:
    
    <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png">
    
    ## How was this patch tested?
    
    Manual verification.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #17588 from jerryshao/SPARK-20275.
    
    (cherry picked from commit 52ed9b2)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jerryshao authored and cloud-fan committed May 31, 2017
    Configuration menu
    Copy the full SHA
    4640086 View commit details
    Browse the repository at this point in the history

Commits on Jun 1, 2017

  1. [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateExcep…

    …tion
    
    ## What changes were proposed in this pull request?
    
    `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18168 from zsxwing/SPARK-20940.
    
    (cherry picked from commit 24db358)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jun 1, 2017
    Configuration menu
    Copy the full SHA
    dade85f View commit details
    Browse the repository at this point in the history
  2. [SPARK-20922][CORE] Add whitelist of classes that can be deserialized…

    … by the launcher.
    
    Blindly deserializing classes using Java serialization opens the code up to
    issues in other libraries, since just deserializing data from a stream may
    end up executing code (think readObject()).
    
    Since the launcher protocol is pretty self-contained, there's just a handful
    of classes it legitimately needs to deserialize, and they're in just two
    packages, so add a filter that throws errors if classes from any other
    package show up in the stream.
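
    A standalone sketch of such a package-based filter (the allowed-package handling below is illustrative, not the launcher's actual list or class names):

    ```scala
    import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

    class FilteredObjectInputStream(in: InputStream, allowedPackages: Seq[String])
        extends ObjectInputStream(in) {

      override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
        val name = desc.getName
        val ok = name.startsWith("java.lang.") || allowedPackages.exists(p => name.startsWith(p))
        if (!ok) {
          // Reject anything outside the expected packages before it is instantiated.
          throw new InvalidClassException(name, "class not allowed in this protocol stream")
        }
        super.resolveClass(desc)
      }
    }
    ```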
    
    This also maintains backwards compatibility (the updated launcher code can
    still communicate with the backend code in older Spark releases).
    
    Tested with new and existing unit tests.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18166 from vanzin/SPARK-20922.
    
    (cherry picked from commit 8efc6e9)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Jun 1, 2017
    Configuration menu
    Copy the full SHA
    772a9b9 View commit details
    Browse the repository at this point in the history
  3. [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches.

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18178 from vanzin/SPARK-20922-hotfix.
    Marcelo Vanzin committed Jun 1, 2017
    Configuration menu
    Copy the full SHA
    0b25a7d View commit details
    Browse the repository at this point in the history

Commits on Jun 3, 2017

  1. [SPARK-20974][BUILD] we should run REPL tests if SQL module has code …

    …changes
    
    ## What changes were proposed in this pull request?
    
    REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #18191 from cloud-fan/test.
    
    (cherry picked from commit 864d94f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Jun 3, 2017
    Configuration menu
    Copy the full SHA
    afab855 View commit details
    Browse the repository at this point in the history

Commits on Jun 8, 2017

  1. [SPARK-20914][DOCS] Javadoc contains code that is invalid

    ## What changes were proposed in this pull request?
    
    Fix Java, Scala Dataset examples in scaladoc, which didn't compile.
    
    ## How was this patch tested?
    
    Existing compilation/test
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #18215 from srowen/SPARK-20914.
    
    (cherry picked from commit 847efe1)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jun 8, 2017
    Configuration menu
    Copy the full SHA
    03cc18b View commit details
    Browse the repository at this point in the history

Commits on Jun 13, 2017

  1. [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive ta…

    …bles with many partitions
    
    ## What changes were proposed in this pull request?
    
    Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #18216 from srowen/SPARK-20920.
    
    (cherry picked from commit 7b7c85e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jun 13, 2017
    Configuration menu
    Copy the full SHA
    58a8a37 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTran…

    …sferServiceSuite
    
    ## What changes were proposed in this pull request?
    
    The default value for `spark.port.maxRetries` is 100,
    but we use 10 in the suite file.
    So we change it to 100 to avoid test failure.
    
    ## How was this patch tested?
    No test
    
    Author: DjvuLee <lihu@bytedance.com>
    
    Closes #18280 from djvulee/NettyTestBug.
    
    (cherry picked from commit b36ce2a)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    DjvuLee authored and srowen committed Jun 13, 2017
    Configuration menu
    Copy the full SHA
    ee0e74e View commit details
    Browse the repository at this point in the history

Commits on Jun 14, 2017

  1. [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decim…

    …al Values when the Input is BigDecimal between -1.0 and 1.0
    
    ### What changes were proposed in this pull request?
    
    This PR is to backport #18244 to 2.2
    
    ---
    
    The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0.

    A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, based on [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal's precision follows the database decimal standard, which is the total number of digits, including both those to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
    
    Before this PR, the following queries failed:
    ```SQL
    select 1 > 0.0001
    select floor(0.0001)
    select ceil(0.0001)
    ```
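
    To illustrate the mismatch with plain Java `BigDecimal` (a standalone sketch, not the Spark `Decimal` conversion code):

    ```scala
    object DecimalPrecisionSketch {
      def main(args: Array[String]): Unit = {
        val bd = new java.math.BigDecimal("0.0001")
        println(bd.precision())                        // 1: digits of the unscaled value
        println(bd.scale())                            // 4: digits after the decimal point
        println(math.max(bd.precision(), bd.scale()))  // 4: a SQL DECIMAL here needs precision >= scale
      }
    }
    ```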
    
    ### How was this patch tested?
    Added test cases.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #18297 from gatorsmile/backport18244.
    
    (cherry picked from commit 6265119)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 14, 2017
    Configuration menu
    Copy the full SHA
    a890466 View commit details
    Browse the repository at this point in the history

Commits on Jun 15, 2017

  1. [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.r…

    …dd.LocalCheckpointSuite.missing checkpoint block fails with informative message
    
    ## What changes were proposed in this pull request?
    
    Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too long, we fail the assertion in this test case.
    The failure can be easily reproduced if we sleep for a while before removing the block in BlockManagerSlaveEndpoint.receiveAndReply().
    
    ## How was this patch tested?
    N/A
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18314 from jiangxb1987/LocalCheckpointSuite.
    
    (cherry picked from commit 7dc3e69)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jiangxb1987 authored and cloud-fan committed Jun 15, 2017
    Configuration menu
    Copy the full SHA
    62f2b80 View commit details
    Browse the repository at this point in the history

Commits on Jun 16, 2017

  1. [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the chil…

    …dren node.
    
    ## What changes were proposed in this pull request?
    
    As the function name and comments of `TreeNode.mapChildren` indicate, the function should apply to all of the node's current children. So the code below should check whether the value being mapped is actually a child node.
    
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes #18284 from ConeyLiu/treenode.
    
    (cherry picked from commit 87ab0ce)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and cloud-fan committed Jun 16, 2017
    Configuration menu
    Copy the full SHA
    915a201 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21114][TEST][2.1] Fix test failure in Spark 2.1/2.0 due to nam…

    …e mismatch
    
    ## What changes were proposed in this pull request?
    There is a name mismatch between 2.1/2.0 and 2.2. Thus, the test cases failed after we backported a fix to 2.1/2.0. This PR fixes the issue.
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #18319 from gatorsmile/fixDecimal.
    gatorsmile authored and cloud-fan committed Jun 16, 2017
    Configuration menu
    Copy the full SHA
    0ebb3b8 View commit details
    Browse the repository at this point in the history

Commits on Jun 19, 2017

  1. [SPARK-19688][STREAMING] Not to read spark.yarn.credentials.file fr…

    …om checkpoint.
    
    ## What changes were proposed in this pull request?
    
    Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.
    
    ## How was this patch tested?
    
    Manual tested with 1.6.3 and 2.1.1.
    I didn't test this with master because of some compile problems, but I think the result will be the same.
    
    ## Notice
    
    This should be merged into maintenance branches too.
    
    jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)
    
    Author: saturday_s <shi.indetail@gmail.com>
    
    Closes #18230 from saturday-shi/SPARK-21008.
    
    (cherry picked from commit e92ffe6)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    saturday_s authored and Marcelo Vanzin committed Jun 19, 2017
    Configuration menu
    Copy the full SHA
    a44c118 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "s…

    …park.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
    
    ## What changes were proposed in this pull request?
    
    When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows:
    ```
    spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
    spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
    ```
    The staging dir cannot be deleted, and the following message is shown:
    ```
    java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
    ```
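    A hedged sketch of the fix's idea using the Hadoop FileSystem API (not Spark's exact cleanup code): resolve the FileSystem from the staging path itself instead of from `fs.defaultFS`, so the delete is issued against the cluster that actually holds the staging directory.
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hedged sketch: the FileSystem must come from the staging path, not the default FS.
    def deleteStagingDir(stagingDir: String, hadoopConf: Configuration): Boolean = {
      val stagingPath = new Path(stagingDir)
      // Path.getFileSystem resolves the staging URI against its own cluster,
      // avoiding the "Wrong FS" IllegalArgumentException from the default FS.
      val fs = stagingPath.getFileSystem(hadoopConf)
      fs.delete(stagingPath, true)  // recursive delete of the staging directory
    }
    ```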
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: sharkdtu <sharkdtu@tencent.com>
    
    Closes #18352 from sharkdtu/master.
    
    (cherry picked from commit 3d4d11a)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    sharkdtu authored and Marcelo Vanzin committed Jun 19, 2017
    Configuration menu
    Copy the full SHA
    7799f35 View commit details
    Browse the repository at this point in the history

Commits on Jun 20, 2017

  1. [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream sou…

    …rce are in a wrong table - version to fix 2.1
    
    ## What changes were proposed in this pull request?
    
    The descriptions for several options of the File Source for Structured Streaming appeared under the File Sink description instead.
    
    This commit continues PR #18342 and targets the documentation fixes for Spark version 2.1.
    
    ## How was this patch tested?
    
    Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide.
    
    zsxwing This is the PR to fix version 2.1 as discussed in PR #18342
    
    Author: assafmendelson <assaf.mendelson@gmail.com>
    
    Closes #18363 from assafmendelson/spark-21123-for-spark2.1.
    assafmendelson authored and zsxwing committed Jun 20, 2017
    Configuration menu
    Copy the full SHA
    8923bac View commit details
    Browse the repository at this point in the history

Commits on Jun 22, 2017

  1. [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Po…

    …ol Limit - Class Splitting
    
    ## What changes were proposed in this pull request?
    
    This is a backport to Spark 2.1.x of the class-splitting feature for excessively large generated code, as merged in #18075.
    
    ## How was this patch tested?
    
    The same test provided in #18075 is included in this patch.
    
    Author: ALeksander Eskilson <alek.eskilson@cerner.com>
    
    Closes #18354 from bdrillard/class_splitting_2.1.
    ALeksander Eskilson authored and cloud-fan committed Jun 22, 2017
    Configuration menu
    Copy the full SHA
    6b37c86 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21167][SS] Decode the path generated by File sink to handle sp…

    …ecial characters
    
    ## What changes were proposed in this pull request?
    
    Decode the path generated by File sink to handle special characters.
    
    ## How was this patch tested?
    
    The added unit test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18381 from zsxwing/SPARK-21167.
    
    (cherry picked from commit d66b143)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jun 22, 2017
    Configuration menu
    Copy the full SHA
    1a98d5d View commit details
    Browse the repository at this point in the history

Commits on Jun 23, 2017

  1. [SPARK-21181] Release byteBuffers to suppress netty error messages

    ## What changes were proposed in this pull request?
    We explicitly call release on the byteBufs used to Base64-encode the string, which suppresses the memory leak error message reported by Netty and makes the logs less confusing for the user.
    
    ### Changes proposed in this fix
    By explicitly invoking release on the byteBufs we decrement the internal reference counts of the wrapped byteBufs. When the GC kicks in, they are reclaimed as before; Netty simply no longer reports memory leak error messages because the internal reference counts are now 0.
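    A hedged sketch of the release pattern using Netty's public API (not necessarily Spark's exact code path): release both the wrapped source buffer and the Base64-encoded buffer once the resulting String has been materialized.
    ```scala
    import java.nio.charset.StandardCharsets
    import io.netty.buffer.Unpooled
    import io.netty.handler.codec.base64.Base64

    // Hedged sketch: dropping the reference counts to 0 keeps Netty's leak detector quiet.
    def encodeBase64(bytes: Array[Byte]): String = {
      val src = Unpooled.wrappedBuffer(bytes)
      val encoded = Base64.encode(src)
      try {
        encoded.toString(StandardCharsets.UTF_8)
      } finally {
        src.release()      // decrement the wrapped buffer's reference count
        encoded.release()  // decrement the encoded buffer's reference count
      }
    }
    ```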
    
    ## How was this patch tested?
    Ran a few spark-applications and examined the logs. The error message no longer appears.
    
    Original PR was opened against branch-2.1 => #18392
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes #18407 from dhruve/master.
    
    (cherry picked from commit 1ebe7ff)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    dhruve authored and Marcelo Vanzin committed Jun 23, 2017
    Configuration menu
    Copy the full SHA
    f8fd3b4 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method

    ## What changes were proposed in this pull request?
    
    * Following the first few examples in this file, the remaining methods should also be methods of `df.na`, not `df` (see the short example below).
    * Filled in some missing parentheses
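    A short, hedged usage example of the documented API (assumes a local SparkSession; not taken from the docs themselves):
    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("na-example").getOrCreate()
    import spark.implicits._

    val df = Seq((Some(1), "a"), (None, "b")).toDF("num", "str")
    df.na.fill(0).show()   // correct: fill() is a method of df.na (DataFrameNaFunctions)
    df.na.drop().show()    // correct: drops rows that contain nulls
    // df.fill(0)          // would not compile: the Dataset itself has no fill() method
    ```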
    
    ## How was this patch tested?
    
    N/A
    
    Author: Ong Ming Yang <me@ongmingyang.com>
    
    Closes #18398 from ongmingyang/master.
    
    (cherry picked from commit 4cc6295)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    ongmingyang authored and gatorsmile committed Jun 23, 2017
    Configuration menu
    Copy the full SHA
    bcaf06c View commit details
    Browse the repository at this point in the history

Commits on Jun 24, 2017

  1. [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types…

    … in read path
    
    This PR is to revert some code changes in the read path of #14377. The original fix is #17830
    
    When merging this PR, please give credit to gaborfeher
    
    Added a test case to OracleIntegrationSuite.scala
    
    Author: Gabor Feher <gabor.feher@lynxanalytics.com>
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #18408 from gatorsmile/OracleType.
    Gabor Feher authored and gatorsmile committed Jun 24, 2017
    Configuration menu
    Copy the full SHA
    f12883e View commit details
    Browse the repository at this point in the history
  2. [SPARK-21159][CORE] Don't try to connect to launcher in standalone cl…

    …uster mode.
    
    Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
    the same scheduler implementation is used, and if it tries to connect to the
    launcher it will fail. So fix the scheduler so it only tries that in client mode;
    cluster mode applications will be correctly launched and will work, but monitoring
    through the launcher handle will not be available.
    
    Tested by running a cluster mode app with "SparkLauncher.startApplication".
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18397 from vanzin/SPARK-21159.
    
    (cherry picked from commit bfd73a7)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jun 24, 2017
    Configuration menu
    Copy the full SHA
    6750db3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct

    ### What changes were proposed in this pull request?
    ```SQL
    CREATE TABLE `tab1`
    (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
    USING parquet
    
    INSERT INTO `tab1`
    SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
    
    SELECT custom_fields.id, custom_fields.value FROM tab1
    ```
    
    The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always reuse the same `GenericInternalRow` object when doing the cast.
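    A hedged illustration of the reuse problem in plain Scala (not Spark's actual cast code): writing every array element through one shared mutable object makes all slots reference that same object, so only the last value survives.
    ```scala
    // Hedged sketch: reusing one mutable buffer per element vs. allocating a fresh one.
    final case class MutableStruct(var id: Long, var value: String)

    val input = Seq((1L, "a"), (2L, "b"))

    val shared = MutableStruct(0L, "")
    val broken = input.map { case (i, v) => shared.id = i; shared.value = v; shared }
    println(broken.map(s => (s.id, s.value)))  // List((2,b), (2,b)) -- last struct repeated

    val fixed = input.map { case (i, v) => MutableStruct(i, v) }  // fresh object per element
    println(fixed.map(s => (s.id, s.value)))   // List((1,a), (2,b))
    ```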
    
    ### How was this patch tested?
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #18412 from gatorsmile/castStruct.
    
    (cherry picked from commit 2e1586f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 24, 2017
    Configuration menu
    Copy the full SHA
    0d6b701 View commit details
    Browse the repository at this point in the history

Commits on Jun 25, 2017

  1. Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Con…

    …stant Pool Limit - Class Splitting"
    
    This reverts commit 6b37c86.
    cloud-fan committed Jun 25, 2017
    Configuration menu
    Copy the full SHA
    26f4f34 View commit details
    Browse the repository at this point in the history

Commits on Jun 30, 2017

  1. [SPARK-21176][WEB UI] Limit number of selector threads for admin ui p…

    …roxy servlets to 8
    
    ## What changes were proposed in this pull request?
    Please see also https://issues.apache.org/jira/browse/SPARK-21176
    
    This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2).
    The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
    Once jetty/jetty.project#1643 is available, the code could be cleaned up to avoid the method override.
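    A hedged sketch of the override against the Jetty 9 API (not necessarily Spark's exact code):
    ```scala
    import org.eclipse.jetty.client.HttpClient
    import org.eclipse.jetty.client.http.HttpClientTransportOverHTTP
    import org.eclipse.jetty.proxy.ProxyServlet

    // Hedged sketch: cap the proxy's selector threads at 8 instead of Jetty's
    // default of (number of processors / 2).
    class BoundedSelectorProxyServlet extends ProxyServlet {
      override def newHttpClient(): HttpClient = {
        new HttpClient(new HttpClientTransportOverHTTP(8), null)  // 8 selector threads, no TLS
      }
    }
    ```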
    
    I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR?
    
    ## How was this patch tested?
    The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
    
    gurvindersingh zsxwing can you please review the change?
    
    Author: IngoSchuster <ingo.schuster@de.ibm.com>
    Author: Ingo Schuster <ingo.schuster@de.ibm.com>
    
    Closes #18437 from IngoSchuster/master.
    
    (cherry picked from commit 88a536b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    IngoSchuster authored and cloud-fan committed Jun 30, 2017
    Configuration menu
    Copy the full SHA
    083adb0 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spi…

    …lling
    
    ## What changes were proposed in this pull request?
    `WindowExec` currently stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) improperly during aggregation, by keeping a reference to the actual input data in the buffer used by `GeneratedMutableProjections`. Things go wrong when the input object (or its backing bytes) is reused for other purposes, which can happen in window functions once they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to odd corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` and `MAX`.
    
    This was not seen before because the spilling logic rarely performed actual spills and instead used an in-memory page. That page was not cleaned up during window processing, which ensured the unsafe objects pointed to their own dedicated memory locations. This changed with #16909; after that PR Spark spills more eagerly.
    
    This PR provides a surgical fix because we are close to releasing Spark 2.2. The change simply makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more refined solution at a later point.
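    A hedged sketch of the defensive-copy idea (simplified; not the actual `WindowExec` change): any code that retains a row handed out by a row-reusing iterator has to copy it first.
    ```scala
    import org.apache.spark.sql.catalyst.InternalRow

    // Hedged sketch: FIRST/LAST/MIN/MAX-style buffers keep (parts of) their input,
    // so a row coming from a reusing iterator (e.g. a spill reader) must be copied.
    def firstRow(rows: Iterator[InternalRow]): Option[InternalRow] = {
      if (rows.hasNext) Some(rows.next().copy())  // copy() detaches the row from the reused buffer
      else None
    }
    ```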
    
    ## How was this patch tested?
    Added a regression test to `DataFrameWindowFunctionsSuite`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #18470 from hvanhovell/SPARK-21258.
    
    (cherry picked from commit e2f32ee)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    hvanhovell authored and cloud-fan committed Jun 30, 2017
    Configuration menu
    Copy the full SHA
    d995dac View commit details
    Browse the repository at this point in the history
  3. Revert "[SPARK-21258][SQL] Fix WindowExec complex object aggregation …

    …with spilling"
    
    This reverts commit d995dac.
    cloud-fan committed Jun 30, 2017
    Configuration menu
    Copy the full SHA
    3ecef24 View commit details
    Browse the repository at this point in the history

Commits on Jul 5, 2017

  1. [SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more la…

    …zily
    
    ## What changes were proposed in this pull request?
    
    `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
    
    This PR aims to restore the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is that users can start `spark-shell` and use RDD operations without any problems.
    
    **BEFORE**
    ```scala
    $ bin/spark-shell
    java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
    ...
    Caused by: org.apache.spark.sql.AnalysisException:
        org.apache.hadoop.hive.ql.metadata.HiveException:
           MetaException(message:java.security.AccessControlException:
              Permission denied: user=spark, access=READ,
                 inode="/apps/hive/warehouse":hive:hdfs:drwx------
    ```
    As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user.
    
    **AFTER**
    ```scala
    $ bin/spark-shell
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.2-SNAPSHOT
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sc.range(0, 10, 1).count()
    res0: Long = 10
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1.
    dongjoon-hyun authored and cloud-fan committed Jul 5, 2017
    Configuration menu
    Copy the full SHA
    8f1ca69 View commit details
    Browse the repository at this point in the history

Commits on Jul 6, 2017

  1. [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream

    ## What changes were proposed in this pull request?
    
    Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
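    A hedged, simplified sketch of the offset arithmetic for a byte[]-backed UnsafeRow (an assumed helper, not the exact patch): the write has to start at `baseOffset - Platform.BYTE_ARRAY_OFFSET` within the backing array rather than at 0, otherwise rows whose array has a non-zero offset are written from the wrong position.
    ```scala
    import org.apache.spark.unsafe.Platform

    // Hedged sketch: compute where the row's bytes begin inside its backing byte array.
    def offsetInBytes(baseObject: AnyRef, baseOffset: Long): Long = baseObject match {
      case _: Array[Byte] => baseOffset - Platform.BYTE_ARRAY_OFFSET  // on-heap: skip the array header
      case _              => 0L  // off-heap memory: copied through a buffer instead (not shown)
    }
    ```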
    
    ## How was this patch tested?
    
    Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
    
    Author: Sumedh Wale <swale@snappydata.io>
    
    Closes #18535 from sumwale/SPARK-21312.
    
    (cherry picked from commit 14a3bb3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Sumedh Wale authored and cloud-fan committed Jul 6, 2017
    Configuration menu
    Copy the full SHA
    7f7b63b View commit details
    Browse the repository at this point in the history

Commits on Jul 9, 2017

  1. [SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderS…

    …uite should clean up stopped sessions.
    
    ## What changes were proposed in this pull request?
    
    `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind stopped `SparkContext`s that interfere with other test suites using `SharedSQLContext`.
    
    Recently, the master branch has been failing consecutively.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
    
    ## How was this patch tested?
    
    Pass Jenkins with an updated suite.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #18572 from dongjoon-hyun/SPARK-21345-BRANCH-2.1.
    dongjoon-hyun authored and cloud-fan committed Jul 9, 2017
    Configuration menu
    Copy the full SHA
    5e2bfd5 View commit details
    Browse the repository at this point in the history

Commits on Jul 10, 2017

  1. [SPARK-21083][SQL][BRANCH-2.1] Store zero size and row count when ana…

    …lyzing empty table
    
    ## What changes were proposed in this pull request?
    
    We should be able to store a zero size and row count after analyzing an empty table.
    This is a backport for 9fccc36.
    
    ## How was this patch tested?
    
    Added new test.
    
    Author: Zhenhua Wang <wzh_zju@163.com>
    
    Closes #18577 from wzhfy/analyzeEmptyTable-2.1.
    wzhfy authored and cloud-fan committed Jul 10, 2017
    Configuration menu
    Copy the full SHA
    2c28462 View commit details
    Browse the repository at this point in the history

Commits on Jul 15, 2017

  1. [SPARK-21344][SQL] BinaryType comparison does signed byte array compa…

    …rison
    
    ## What changes were proposed in this pull request?
    
    This PR fixes an incorrect comparison for `BinaryType` by enabling unsigned comparison and unsigned prefix generation for byte arrays. The previous implementation used signed operations.
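    A hedged, plain-Scala illustration of why signed byte comparison is wrong for binary data (not Spark's generated ordering): a byte of 0x80 should compare greater than 0x7f, but as a signed byte it is -128.
    ```scala
    val a: Byte = 0x7f.toByte  // 127
    val b: Byte = 0x80.toByte  // 128 as an unsigned byte, but -128 when treated as signed

    // Signed comparison orders b before a -- wrong for lexicographic byte order.
    println(java.lang.Byte.compare(a, b) > 0)                   // true (a > b under signed rules)
    // Unsigned comparison gives the expected order: a < b.
    println(java.lang.Integer.compare(a & 0xff, b & 0xff) < 0)  // true
    ```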
    
    ## How was this patch tested?
    
    Added a test suite in `OrderingSuite`.
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #18571 from kiszk/SPARK-21344.
    
    (cherry picked from commit ac5d5d7)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    kiszk authored and gatorsmile committed Jul 15, 2017
    Configuration menu
    Copy the full SHA
    ca4d2aa View commit details
    Browse the repository at this point in the history

Commits on Jul 18, 2017

  1. [SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCat…

    …alyst should be global
    
    ## What changes were proposed in this pull request?
    
    This PR is a backport of #18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1.
    
    The issue happens in `ExternalMapToCatalyst`. For example, the following code creates an `ExternalMapToCatalyst` to convert a Scala Map to the Catalyst map format.
    
    ```
    val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
    val ds = spark.createDataset(data)
    ```
    The `valueConverter` in `ExternalMapToCatalyst` looks like:
    
    ```
    if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
    ```
    There is a `CreateNamedStruct` expression (`named_struct`) that creates a row from `InnerData.name` and `InnerData.value`, which are referred to by `ExternalMapToCatalyst_value52`.
    
    Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions the local variable can no longer be accessed.
    
    ## How was this patch tested?
    
    Added a new test case to `DatasetPrimitiveSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #18627 from kiszk/SPARK-21391.
    kiszk authored and cloud-fan committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    a9efce4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21332][SQL] Incorrect result type inferred for some decimal ex…

    …pressions
    
    ## What changes were proposed in this pull request?
    
    This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
    
    ```
        val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
        val sc = spark.sparkContext
        val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
        val df = spark.createDataFrame(rdd, inputSchema)
    
        // Works correctly since no nested decimal expression is involved
        // Expected result type: (26, 6) * (26, 6) = (38, 12)
        df.select($"col" * $"col").explain(true)
        df.select($"col" * $"col").printSchema()
    
        // Gives a wrong result since there is a nested decimal expression that should be visited first
        // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
        df.select($"col" * $"col" * $"col").explain(true)
        df.select($"col" * $"col" * $"col").printSchema()
    ```
    
    The example above gives the following output:
    
    ```
    // Correct result without sub-expressions
    == Parsed Logical Plan ==
    'Project [('col * 'col) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    (col * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- (col * col): decimal(38,12) (nullable = true)
    
    // Incorrect result with sub-expressions
    == Parsed Logical Plan ==
    'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    ((col * col) * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- ((col * col) * col): decimal(38,12) (nullable = true)
    ```
    
    ## How was this patch tested?
    
    This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #18583 from aokolnychyi/spark-21332.
    
    (cherry picked from commit 0be5fb4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    aokolnychyi authored and gatorsmile committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    caf32b3 View commit details
    Browse the repository at this point in the history

Commits on Jul 19, 2017

  1. [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results fai…

    …lures in some cases
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
    
    This issue can be reproduced by the following example:
    
    ```
    val spark = SparkSession
       .builder()
       .appName("smj-codegen")
       .master("local")
       .config("spark.sql.autoBroadcastJoinThreshold", "1")
       .getOrCreate()
    val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
    val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
    val df = df1.join(df2, df1("key") === df2("key"))
       .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
       .select("int")
       df.show()
    ```
    
    To conclude, the issue happens when:
    (1) the SortMergeJoin condition contains CodegenFallback expressions, and
    (2) in the physical plan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.
    
    This patch fixes the logic in `CollapseCodegenStages` rule.
    
    ## How was this patch tested?
    Unit test and manual verification in our cluster.
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
    
    (cherry picked from commit 6b6dd68)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    DonnyZone authored and cloud-fan committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    ac20693 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21446][SQL] Fix setAutoCommit never executed

    ## What changes were proposed in this pull request?
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
    `options.asConnectionProperties` cannot contain fetchsize, because fetchsize is a Spark-only option and Spark-only options are excluded from the connection properties.
    So the properties passed to beforeFetch are changed from `options.asConnectionProperties.asScala.toMap` to `options.asProperties.asScala.toMap`.
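    A hedged sketch of the difference (assumes Spark's internal `JDBCOptions` API as named above): `asConnectionProperties` filters out Spark-only keys such as fetchsize, so dialect hooks like PostgreSQL's `beforeFetch` must receive `asProperties` instead.
    ```scala
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

    // Hedged sketch: pick the option view that still contains Spark-only keys.
    def beforeFetchParams(options: JDBCOptions): Map[String, String] = {
      // options.asConnectionProperties.asScala.toMap  // wrong: fetchsize has been stripped
      options.asProperties.asScala.toMap               // right: keeps fetchsize for the dialect
    }
    ```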
    
    ## How was this patch tested?
    
    Author: DFFuture <albert.zhang23@gmail.com>
    
    Closes #18665 from DFFuture/sparksql_pg.
    
    (cherry picked from commit c972918)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DFFuture authored and gatorsmile committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    9498798 View commit details
    Browse the repository at this point in the history

Commits on Jul 28, 2017

  1. [SPARK-21306][ML] OneVsRest should support setWeightCol

    ## What changes were proposed in this pull request?
    
    add `setWeightCol` method for OneVsRest.
    
    `weightCol` is ignored if the classifier doesn't inherit the `HasWeightCol` trait.
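    A hedged usage sketch of the new setter (the weight column name is a hypothetical column assumed to exist in the training DataFrame):
    ```scala
    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // Hedged sketch: weightCol is forwarded here because LogisticRegression mixes in
    // HasWeightCol; for a classifier without that trait the setting would be ignored.
    val ovr = new OneVsRest()
      .setClassifier(new LogisticRegression())
      .setWeightCol("weight")  // hypothetical weight column in the training data
    ```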
    
    ## How was this patch tested?
    
    + [x] add a unit test.
    
    Author: Yan Facai (颜发才) <facai.yan@gmail.com>
    
    Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
    
    (cherry picked from commit a5a3189)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    facaiy authored and yanboliang committed Jul 28, 2017
    Configuration menu
    Copy the full SHA
    8520d7c View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    258ca40 View commit details
    Browse the repository at this point in the history

Commits on Jul 29, 2017

  1. [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically…

    … by its canonicalized child
    
    ## What changes were proposed in this pull request?
    
    When `RuntimeReplaceable` has aliases as parameters (aliases added for nested fields), those aliases are not among the children expressions, so they can't be cleaned up by the analyzer rule `CleanupAliases`.
    
    An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.
    
    Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim the aliases out by simply transforming the expressions in `CleanupAliases`.
    
    If we wanted to replace the non-children aliases in `RuntimeReplaceable`, we would need to add more code to `RuntimeReplaceable` and modify all of its expressions. That makes the interface ugly IMO.
    
    Considering those aliases will be replaced later during optimization and so do no harm, this patch chooses to simply override `canonicalized` in `RuntimeReplaceable`.
    
    One concern is about `CleanupAliases`, because it actually cannot clean up ALL aliases inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18761 from viirya/SPARK-21555.
    
    (cherry picked from commit 9c8109e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    viirya authored and gatorsmile committed Jul 29, 2017
    Configuration menu
    Copy the full SHA
    78f7cdf View commit details
    Browse the repository at this point in the history

Commits on Aug 1, 2017

  1. [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.

    Handle the case where the server closes the socket before the full message
    has been written by the client.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18727 from vanzin/SPARK-21522.
    
    (cherry picked from commit b133501)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Aug 1, 2017
    Configuration menu
    Copy the full SHA
    b31b302 View commit details
    Browse the repository at this point in the history

Commits on Aug 3, 2017

  1. [SPARK-12717][PYTHON][BRANCH-2.1] Adding thread-safe broadcast pickle…

    … registry
    
    ## What changes were proposed in this pull request?
    
    When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the Python command from another thread. This PR introduces a thread-safe pickled registry using thread-local storage, so that when the Python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread has its own view of the pickle registry from which to retrieve and clear the broadcast variables it used.
    
    ## How was this patch tested?
    
    Added a unit test that causes this race condition using another thread.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #18825 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717-2_1.
    BryanCutler authored and HyukjinKwon committed Aug 3, 2017
    Configuration menu
    Copy the full SHA
    d93e45b View commit details
    Browse the repository at this point in the history

Commits on Aug 4, 2017

  1. [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC tab…

    …le with extreme values on the partition column
    
    ## What changes were proposed in this pull request?
    
    An overflow of the difference between the bounds on the partitioning column leads to no data being read. This patch checks for this overflow.
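    A hedged illustration of the overflow in plain Scala (not the patch itself): with extreme Long bounds the difference wraps around, so the stride computation sees a nonsensical range.
    ```scala
    val lowerBound = Long.MinValue
    val upperBound = Long.MaxValue

    println(upperBound - lowerBound)                  // -1: the Long subtraction overflows
    println(BigInt(upperBound) - BigInt(lowerBound))  // 18446744073709551615: the real range
    ```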
    
    ## How was this patch tested?
    
    New unit test.
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes #18800 from aray/SPARK-21330.
    
    (cherry picked from commit 25826c7)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    aray authored and srowen committed Aug 4, 2017
    Configuration menu
    Copy the full SHA
    734b144 View commit details
    Browse the repository at this point in the history