
Branch 2.2 #20569 (Closed)

wants to merge 585 commits into from
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Jul 6, 2017

  1. [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream

    ## What changes were proposed in this pull request?
    
    Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and trigger an EXCHANGE in writes.
    
    ## How was this patch tested?
    
    Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
    
    Author: Sumedh Wale <swale@snappydata.io>
    
    Closes #18535 from sumwale/SPARK-21312.
    
    (cherry picked from commit 14a3bb3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Sumedh Wale authored and cloud-fan committed Jul 6, 2017
    Commit: 6e1081c
  2. [SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp checkpoint dir should be deleted
    
    ## What changes were proposed in this pull request?
    
    Stopping a query while it is being initialized can throw an interrupt exception, in which case temporary checkpoint directories will not be deleted and the test will fail.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #18442 from tdas/DatastreamReaderWriterSuite-fix.
    
    (cherry picked from commit 60043f2)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tdas committed Jul 6, 2017
    Commit: 4e53a4e

Commits on Jul 7, 2017

  1. [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

    ## What changes were proposed in this pull request?
    
    A few changes to the Structured Streaming documentation:
    - Clarify that the entire streaming input table is not materialized
    - Add information for Ganglia
    - Add the Kafka sink to the main docs
    - Remove a couple of leftover experimental tags
    - Add more associated reading material and talk videos.
    
    In addition, #16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those links (cc sameeragarwal, cloud-fan).
    - Added a redirect to avoid breaking internal and possibly external links.
    - Removed unnecessary redirect pages that had been there since the separate Scala, Java, and Python programming guides were merged together in 2013 or 2014.
    
    ## How was this patch tested?
    
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #18485 from tdas/SPARK-21267.
    
    (cherry picked from commit 0217dfd)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    tdas authored and zsxwing committed Jul 7, 2017
    Commit: 576fd4c

Commits on Jul 8, 2017

  1. [SPARK-21069][SS][DOCS] Add rate source to programming guide.

    ## What changes were proposed in this pull request?
    
    SPARK-20979 added a new structured streaming source: the rate source. This patch adds the corresponding documentation to the programming guide.
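
    As an illustrative sketch (not part of the original patch; assumes an active SparkSession `spark`), the rate source is read like any other streaming source, with `rowsPerSecond` as one of its options:

    ```
    // Generates rows with `timestamp` and `value` columns at the configured rate.
    val rateDF = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
    ```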
    
    ## How was this patch tested?
    
    Tested by running jekyll locally.
    
    Author: Prashant Sharma <prashant@apache.org>
    Author: Prashant Sharma <prashsh1@in.ibm.com>
    
    Closes #18562 from ScrapCodes/spark-21069/rate-source-docs.
    
    (cherry picked from commit d0bfc67)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    ScrapCodes authored and zsxwing committed Jul 8, 2017
    Commit: ab12848
  2. [SPARK-21228][SQL][BRANCH-2.2] InSet incorrect handling of structs

    ## What changes were proposed in this pull request?
    
    This is a backport of #18455.
    When the data type is a struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
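
    An illustrative repro sketch (not from the PR; the view and field names are invented) of the kind of query this covers:

    ```
    // Struct values compared via IN: with this change the comparison uses interpreted
    // ordering instead of hash-based equality.
    spark.range(3).selectExpr("named_struct('a', id, 'b', id + 1) AS s").createOrReplaceTempView("t")
    spark.sql("SELECT * FROM t WHERE s IN (named_struct('a', 0L, 'b', 1L), named_struct('a', 1L, 'b', 2L))").show()
    ```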
    
    ## How was this patch tested?
    New test in SQLQuerySuite.
    
    Author: Bogdan Raducanu <bogdan@databricks.com>
    
    Closes #18563 from bogdanrdc/SPARK-21228-BRANCH2.2.
    bogdanrdc authored and cloud-fan committed Jul 8, 2017
    Commit: 7d0b1c9
  3. [SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should clean up stopped sessions.
    
    `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfering with other test suites that use `SharedSQLContext`.
    
    Recently, the master branch has been failing consecutively.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
    
    Pass Jenkins with an updated suite.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #18567 from dongjoon-hyun/SPARK-SESSION.
    
    (cherry picked from commit 0b8dd2d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dongjoon-hyun authored and cloud-fan committed Jul 8, 2017
    Commit: a64f108
  4. [SPARK-20342][CORE] Update task accumulators before sending task end event.
    
    This makes sure that listeners get updated task information; otherwise it's
    possible to write incomplete task information into event logs, for example,
    making the information in a replayed UI inconsistent with the original
    application.
    
    Added a new unit test to try to detect the problem, but it's not guaranteed
    to fail since it's a race; still, it fails pretty reliably for me without the
    scheduler changes.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18393 from vanzin/SPARK-20342.try2.
    
    (cherry picked from commit 9131bdb)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jul 8, 2017
    Commit: c8d7855
  5. [SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffleToMem.
    
    ## What changes were proposed in this pull request?
    
    In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the documentation.
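
    For illustration only (the threshold value is an arbitrary example, not a recommendation), the setting being documented is supplied like any other Spark configuration:

    ```
    import org.apache.spark.SparkConf

    // Shuffle requests larger than this threshold are fetched to disk instead of memory;
    // as noted above, old external shuffle services do not support this.
    val conf = new SparkConf()
      .set("spark.reducer.maxReqSizeShuffleToMem", "200m")
    ```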
    
    Author: jinxing <jinxing6042@126.com>
    
    Closes #18566 from jinxing64/SPARK-21343.
    
    (cherry picked from commit 062c336)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jinxing authored and cloud-fan committed Jul 8, 2017
    Commit: 964332b

Commits on Jul 9, 2017

  1. [SPARK-21083][SQL][BRANCH-2.2] Store zero size and row count when analyzing empty table
    
    ## What changes were proposed in this pull request?
    
    We should be able to store zero size and row count after analyzing an empty table.
    This is a backport of 9fccc36.
    
    ## How was this patch tested?
    
    Added new test.
    
    Author: Zhenhua Wang <wangzhenhua@huawei.com>
    
    Closes #18575 from wzhfy/analyzeEmptyTable-2.2.
    wzhfy authored and cloud-fan committed Jul 9, 2017
    Commit: 3bfad9d

Commits on Jul 10, 2017

  1. [SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher.
    
    When `RetryingBlockFetcher` retries fetching blocks, there could be two `DownloadCallback`s downloading the same content to the same target file, which could cause `ShuffleBlockFetcherIterator` to read a partial result.
    
    This PR proposes to create and delete the tmp files in `OneForOneBlockFetcher`.
    
    Author: jinxing <jinxing6042@126.com>
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes #18565 from jinxing64/SPARK-21342.
    
    (cherry picked from commit 6a06c4b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jinxing authored and cloud-fan committed Jul 10, 2017
    Commit: 40fd0ce
  2. [SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows

    ## What changes were proposed in this pull request?
    
    Updating numOutputRows metric was missing from one return path of LeftAnti SortMergeJoin.
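
    For illustration (a sketch, not from the PR; assumes an active SparkSession `spark`), a left anti join of the kind whose metric this touches:

    ```
    // With broadcast joins disabled this plans as a SortMergeJoin with the LeftAnti join type;
    // its "number of output rows" metric appears in the SQL tab of the UI.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val left = spark.range(100).toDF("id")
    val right = spark.range(50).toDF("id")
    left.join(right, Seq("id"), "left_anti").count()
    ```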
    
    ## How was this patch tested?
    
    Non-zero output rows manually seen in metrics.
    
    Author: Juliusz Sompolski <julek@databricks.com>
    
    Closes #18494 from juliuszsompolski/SPARK-21272.
    juliuszsompolski authored and gatorsmile committed Jul 10, 2017
    Commit: a05edf4

Commits on Jul 11, 2017

  1. [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*

    ## What changes were proposed in this pull request?
    
    Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18593 from zsxwing/SPARK-21369.
    
    (cherry picked from commit 833eab2)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    zsxwing authored and cloud-fan committed Jul 11, 2017
    Commit: edcd9fb
  2. [SPARK-21366][SQL][TEST] Add sql test for window functions

    ## What changes were proposed in this pull request?
    
    Add SQL tests for window functions, and also remove unnecessary test cases in `WindowQuerySuite`.
    
    ## How was this patch tested?
    
    Added `window.sql` and the corresponding output file.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18591 from jiangxb1987/window.
    
    (cherry picked from commit 66d2168)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jiangxb1987 authored and cloud-fan committed Jul 11, 2017
    Commit: 399aa01

Commits on Jul 12, 2017

  1. [SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting
    
    There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask); as a result, the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retried task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retried task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
    
    Implemented a unit test that verifies the task is blacklisted before it is added to the pending task list. Ran the unit test without the fix and it fails; ran it with the fix and it passes.
    
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes #18427 from ericvandenbergfb/blacklistFix.
    
    ## What changes were proposed in this pull request?
    
    This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.
    
    ## How was this patch tested?
    
    Ran TaskSetManagerSuite tests locally.
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes #18604 from jsoltren/branch-2.2.
    Eric Vandenberg authored and cloud-fan committed Jul 12, 2017
    Commit: cb6fc89

Commits on Jul 13, 2017

  1. [SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader
    
    ## What changes were proposed in this pull request?
    
    `ClassLoader` will preferentially load classes from its `parent`. Only when `parent` is null or the parent load fails will it call the overridden `findClass` function. To avoid potential issues caused by loading classes with an inappropriate class loader, we should set the `parent` of this `ClassLoader` to null, so that we can fully control which class loader is used.
    
    This is a take-over of #17074; the primary author of this PR is taroplus.
    
    #17074 should be closed after this PR gets merged.
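
    A minimal sketch of the delegation behavior described above (illustrative only, not Spark's ExecutorClassLoader):

    ```
    // With a null parent, loadClass consults the bootstrap loader and then this loader's own
    // findClass, so the subclass fully controls how non-bootstrap classes are resolved.
    class NullParentLoader extends ClassLoader(null) {
      override def findClass(name: String): Class[_] = {
        println(s"findClass asked to resolve $name") // a real loader would fetch bytes and call defineClass
        super.findClass(name)                        // default behavior: throw ClassNotFoundException
      }
    }

    // java.lang.String is resolved by the bootstrap loader, so findClass is not consulted here.
    new NullParentLoader().loadClass("java.lang.String")
    ```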
    
    ## How was this patch tested?
    
    Add test case in `ExecutorClassLoaderSuite`.
    
    Author: Kohki Nishio <taroplus@me.com>
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18614 from jiangxb1987/executor_classloader.
    
    (cherry picked from commit e08d06b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    taroplus authored and cloud-fan committed Jul 13, 2017
    Commit: 39eba30
  2. Revert "[SPARK-18646][REPL] Set parent classloader as null for Execut…

    …orClassLoader"
    
    This reverts commit 39eba30.
    cloud-fan committed Jul 13, 2017
    Commit: cf0719b

Commits on Jul 14, 2017

  1. [SPARK-21376][YARN] Fix yarn client token expire issue when cleaning the staging files in long running scenario
    
    ## What changes were proposed in this pull request?
    
    This issue happens in long-running applications with yarn cluster mode: yarn#client doesn't sync tokens with the AM, so it always keeps the initial token. That token may expire in a long-running scenario, so when yarn#client tries to clean up the staging directory after the application finishes, it uses the expired token and hits a token-expiration issue.
    
    ## How was this patch tested?
    
    Manual verification in a secure cluster.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #18617 from jerryshao/SPARK-21376.
    
    (cherry picked from commit cb8d5cc)
    jerryshao authored and Marcelo Vanzin committed Jul 14, 2017
    Commit: bfe3ba8

Commits on Jul 15, 2017

  1. [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a wrong comparison for `BinaryType` by enabling unsigned comparison and unsigned prefix generation for byte arrays of `BinaryType`. The previous implementation used signed operations.
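
    An illustrative repro sketch (not from the PR; assumes a spark-shell session so that `spark.implicits` are available):

    ```
    import spark.implicits._

    // Under signed byte comparison, Array(-1: Byte) (0xFF) sorts before Array(1: Byte);
    // unsigned binary ordering should place 0xFF after 0x01.
    val df = Seq(Array(1.toByte), Array(-1.toByte)).toDF("b")
    df.orderBy("b").show()
    ```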
    
    ## How was this patch tested?
    
    Added a test suite in `OrderingSuite`.
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #18571 from kiszk/SPARK-21344.
    
    (cherry picked from commit ac5d5d7)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    kiszk authored and gatorsmile committed Jul 15, 2017
    Commit: 1cb4369
  2. [SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector
    
    ## What changes were proposed in this pull request?
    
    Update internal references from programming-guide to rdd-programming-guide
    
    See apache/spark-website@5ddf243 and #18485 (comment)
    
    Let's keep the redirector even if it's problematic to build, but not rely on it internally.
    
    ## How was this patch tested?
    
    (Doc build)
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #18625 from srowen/SPARK-21267.2.
    
    (cherry picked from commit 74ac1fb)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jul 15, 2017
    Commit: 8e85ce6

Commits on Jul 17, 2017

  1. [SPARK-21321][SPARK CORE] Spark very verbose on shutdown

    ## What changes were proposed in this pull request?
    
    The current code is very verbose on shutdown.
    
    The change I propose is to lower the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException).
    
    ## How was this patch tested?
    
    Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300 GB of data.
    
    Author: John Lee <jlee2@yahoo-inc.com>
    
    Closes #18547 from yoonlee95/SPARK-21321.
    
    (cherry picked from commit 0e07a29)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    John Lee authored and Tom Graves committed Jul 17, 2017
    Commit: 0ef98fd

Commits on Jul 18, 2017

  1. [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions
    
    ## What changes were proposed in this pull request?
    
    This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
    
    ```
        val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
        val sc = spark.sparkContext
        val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
        val df = spark.createDataFrame(rdd, inputSchema)
    
        // Works correctly since no nested decimal expression is involved
        // Expected result type: (26, 6) * (26, 6) = (38, 12)
        df.select($"col" * $"col").explain(true)
        df.select($"col" * $"col").printSchema()
    
        // Gives a wrong result since there is a nested decimal expression that should be visited first
        // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
        df.select($"col" * $"col" * $"col").explain(true)
        df.select($"col" * $"col" * $"col").printSchema()
    ```
    
    The example above gives the following output:
    
    ```
    // Correct result without sub-expressions
    == Parsed Logical Plan ==
    'Project [('col * 'col) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    (col * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- (col * col): decimal(38,12) (nullable = true)
    
    // Incorrect result with sub-expressions
    == Parsed Logical Plan ==
    'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    ((col * col) * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- ((col * col) * col): decimal(38,12) (nullable = true)
    ```
    
    ## How was this patch tested?
    
    This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #18583 from aokolnychyi/spark-21332.
    
    (cherry picked from commit 0be5fb4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    aokolnychyi authored and gatorsmile committed Jul 18, 2017
    Commit: 83bdb04
  2. [SPARK-21445] Make IntWrapper and LongWrapper in UTF8String Serializable

    ## What changes were proposed in this pull request?
    
    Making those two classes Serializable will avoid serialization issues like the one below:
    ```
    Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
    Serialization stack:
        - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@326450e)
        - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
        - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, <function1>)
    ```
    
    ## How was this patch tested?
    
    - [x] Manual testing
    - [ ] Unit test
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #18660 from brkyvz/serializableutf8.
    
    (cherry picked from commit 26cd2ca)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    brkyvz authored and cloud-fan committed Jul 18, 2017
    Commit: 99ce551
  3. [SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values with dot
    
    ## What changes were proposed in this pull request?
    
    When we list partitions from the Hive metastore with a partial partition spec, we expect exact matching according to the partition values. However, Hive treats dot specially and matches any single character for a dot. We should do an extra filter to drop unexpected partitions.
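
    An illustrative sketch of the scenario (not from the PR; assumes a Hive-enabled SparkSession, and the table and values are invented):

    ```
    // The partition value '1.1' contains a dot; listing partitions for p='1.1' should return
    // only that partition, not also p='1x1' via Hive's single-character matching of the dot.
    spark.sql("CREATE TABLE pt (i INT) PARTITIONED BY (p STRING) STORED AS PARQUET")
    spark.sql("ALTER TABLE pt ADD PARTITION (p='1.1')")
    spark.sql("ALTER TABLE pt ADD PARTITION (p='1x1')")
    spark.sql("SHOW PARTITIONS pt PARTITION (p='1.1')").show(truncate = false)
    ```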
    
    ## How was this patch tested?
    
    new regression test.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #18671 from cloud-fan/hive.
    
    (cherry picked from commit f18b905)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Jul 18, 2017
    Commit: df061fd

Commits on Jul 19, 2017

  1. [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM.

    ## What changes were proposed in this pull request?
    
    In `SlidingWindowFunctionFrame`, the current code adds to the buffer all rows for which the input row value is equal to or less than the output row's upper bound, then drops from the buffer all rows for which the input row value is smaller than the output row's lower bound.
    This can result in a very large buffer even though the window is small.
    For example:
    ```
    select a, b, sum(a)
    over (partition by b order by a range between 1000000 following and 1000001 following)
    from table
    ```
    We can refine the logic and just add the qualified rows into buffer.
    
    ## How was this patch tested?
    Manual test:
    Run sql
    `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10`
    against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2 GB, containing 200k lines.
    Configure the executor with 2 GB of memory.
    With the change in this PR, it works fine. Without this change, the exception below will be thrown.
    ```
    MemoryError: Java heap space
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
    	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
    	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
    	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    	at org.apache.spark.scheduler.Task.run(Task.scala:108)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ```
    
    Author: jinxing <jinxing6042@126.com>
    
    Closes #18634 from jinxing64/SPARK-21414.
    
    (cherry picked from commit 4eb081c)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jinxing authored and cloud-fan committed Jul 19, 2017
    Commit: 5a0a76f
  2. [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
    
    This issue can be reproduced by the following example:
    
    ```
    val spark = SparkSession
       .builder()
       .appName("smj-codegen")
       .master("local")
       .config("spark.sql.autoBroadcastJoinThreshold", "1")
       .getOrCreate()
    val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
    val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
    val df = df1.join(df2, df1("key") === df2("key"))
       .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
       .select("int")
       df.show()
    ```
    
    To conclude, the issue happens when:
    (1) the SortMergeJoin condition contains CodegenFallback expressions, and
    (2) in the physical plan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.
    
    This patch fixes the logic in `CollapseCodegenStages` rule.
    
    ## How was this patch tested?
    Unit test and manual verification in our cluster.
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
    
    (cherry picked from commit 6b6dd68)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    DonnyZone authored and cloud-fan committed Jul 19, 2017
    Commit: 4c212ee
  3. [SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class
    
    ## What changes were proposed in this pull request?
    
    Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However, passing intervals to ProcessingTime causes deprecation warnings during compilation. This cannot be avoided entirely, because even though it is deprecated as a public API, ProcessingTime instances are used internally in TriggerExecutor. This PR minimizes the warnings by removing uses of ProcessingTime from tests as much as possible.
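
    For reference, a sketch of the non-deprecated form (illustrative; the rate source and console sink are just placeholders):

    ```
    import org.apache.spark.sql.streaming.Trigger

    // Preferred replacement for the deprecated ProcessingTime("10 seconds") helper.
    val query = spark.readStream.format("rate").load()
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
    ```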
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #18678 from tdas/SPARK-21464.
    
    (cherry picked from commit 70fe99d)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tdas committed Jul 19, 2017
    Commit: 86cd3c0
  4. [SPARK-21446][SQL] Fix setAutoCommit never executed

    ## What changes were proposed in this pull request?
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
    options.asConnectionProperties cannot contain fetchsize, because fetchsize belongs to the Spark-only options, and Spark-only options are excluded from the connection properties.
    So change the properties passed to beforeFetch from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap.
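
    For illustration (a sketch, not from the PR; the URL and table name are placeholders), a JDBC read where `fetchsize` is a Spark-only option:

    ```
    // fetchsize is consumed by Spark's JDBC source and must be visible to the dialect's
    // beforeFetch hook (which, for PostgreSQL, disables autocommit so fetchsize takes effect).
    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/test")
      .option("dbtable", "my_table")
      .option("fetchsize", "1000")
      .load()
    ```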
    
    ## How was this patch tested?
    
    Author: DFFuture <albert.zhang23@gmail.com>
    
    Closes #18665 from DFFuture/sparksql_pg.
    
    (cherry picked from commit c972918)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DFFuture authored and gatorsmile committed Jul 19, 2017
    Commit: 308bce0
  5. [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith
    
    ## What changes were proposed in this pull request?
    
    Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset<Tuple2>, left_semi and left_anti are invalid, as they only return values from one of the datasets instead of from both.
    
    ## How was this patch tested?
    
    I ran the following code :
    ```
    public static void main(String[] args) {
    	SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    	Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
    	Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);
    
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
    }
    ```
    which tests all the different join types, and the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith did fine. The Bean class was just a java bean with a single int field, x.
    
    Author: Corey Woodfield <coreywoodfield@gmail.com>
    
    Closes #18462 from coreywoodfield/master.
    
    (cherry picked from commit 8cd9cdf)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    coreywoodfield authored and gatorsmile committed Jul 19, 2017
    Commit: 9949fed

Commits on Jul 21, 2017

  1. [SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

    For configurations with external shuffle enabled, we have observed that if a very large number of blocks are being fetched from a remote host, it puts the NodeManager under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the number of map outputs being fetched from a given remote address. The changes applied here are applicable both when external shuffle is enabled and when it is disabled.
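
    For illustration only (the value is an arbitrary example), the new setting is supplied like any other Spark configuration:

    ```
    import org.apache.spark.SparkConf

    // Limit how many map output blocks may be fetched in flight from a single remote address.
    val conf = new SparkConf()
      .set("spark.reducer.maxBlocksInFlightPerAddress", "100")
    ```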
    
    Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values (10, 20, 50, 100). The job ran fine and there is no change in the output. (I will update the metrics related to the NM in some time.)
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes #18487 from dhruve/impr/SPARK-21243.
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes #18691 from dhruve/branch-2.2.
    dhruve authored and Tom Graves committed Jul 21, 2017
    Commit: 88dccda
  2. [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.

    Update the Quickstart and RDD programming guides to mention pip.
    
    Built docs locally.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
    
    (cherry picked from commit cc00e99)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed Jul 21, 2017
    Commit: da403b9

Commits on Jul 23, 2017

  1. [SPARK-20904][CORE] Don't report task failures to driver during shutdown.
    
    Executors run a thread pool with daemon threads to run tasks. This means
    that those threads remain active when the JVM is shutting down, meaning
    those tasks are affected by code that runs in shutdown hooks.
    
    So if a shutdown hook messes with something that the task is using (e.g.
    an HDFS connection), the task will fail and will report that failure to
    the driver. That will make the driver mark the task as failed regardless
    of what caused the executor to shut down. So, for example, if YARN pre-empted
    that executor, the driver would consider that task failed when it should
    instead ignore the failure.
    
    This change avoids reporting failures to the driver when shutdown hooks
    are executing; this fixes the YARN preemption accounting, and doesn't really
    change things much for other scenarios, other than reporting a more generic
    error ("Executor lost") when the executor shuts down unexpectedly - which
    is arguably more correct.
    
    Tested with a hacky app running on spark-shell that tried to cause failures
    only when shutdown hooks were running, verified that preemption didn't cause
    the app to fail because of task failures exceeding the threshold.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18594 from vanzin/SPARK-20904.
    
    (cherry picked from commit cecd285)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jul 23, 2017
    Commit: 62ca13d

Commits on Jul 25, 2017

  1. [SPARK-21383][YARN] Fix the YarnAllocator allocates more Resource

    When NodeManagers are launching executors, the `missing` value can exceed the real value if the launch is slow, which can lead YARN to allocate more resources than needed.
    
    We add `numExecutorsRunning` when calculating `missing` to avoid this.
    
    Tested by experiment.
    
    Author: DjvuLee <lihu@bytedance.com>
    
    Closes #18651 from djvulee/YarnAllocate.
    
    (cherry picked from commit 8de080d)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    DjvuLee authored and Marcelo Vanzin committed Jul 25, 2017
    Commit: e5ec339
  2. [SPARK-21447][WEB UI] Spark history server fails to render compressed inprogress history file in some cases.
    
    Add failure handling for the EOFException that can be thrown during decompression of an in-progress Spark history file, treating it the same as the case where the last line can't be parsed.
    
    ## What changes were proposed in this pull request?
    
    Failure handling for the case of an EOFException thrown within the ReplayListenerBus.replay method, analogous to the JSON parse failure case. This path can arise with compressed in-progress history files, since an incomplete compression block could be read (not flushed by the writer on a block boundary). See the stack trace of this occurrence in the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-21447).
    
    ## How was this patch tested?
    
    Added a unit test that specifically targets validating the failure handling path appropriately when maybeTruncated is true and false.
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes #18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
    
    (cherry picked from commit 06a9793)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Eric Vandenberg authored and Marcelo Vanzin committed Jul 25, 2017
    Commit: c91191b

Commits on Jul 26, 2017

  1. [SPARK-21494][NETWORK] Use correct app id when authenticating to external service.
    
    There was some code based on the old SASL handler in the new auth client that
    was incorrectly using the SASL user as the user to authenticate against the
    external shuffle service. This caused the external service to not be able to
    find the correct secret to authenticate the connection, failing the connection.
    
    In the course of debugging, I found that some log messages from the YARN shuffle
    service were a little noisy, so I silenced some of them, and also added a couple
    of new ones that helped find this issue. On top of that, I found that a check
    in the code that records app secrets was wrong, causing more log spam and also
    using an O(n) operation instead of an O(1) call.
    
    Also added a new integration suite for the YARN shuffle service with auth on,
    and verified it failed before, and passes now.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18706 from vanzin/SPARK-21494.
    
    (cherry picked from commit 300807c)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Jul 26, 2017
    Commit: 1bfd1a8

Commits on Jul 27, 2017

  1. [SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API

    ## What changes were proposed in this pull request?
    
    This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:
    
    ```
    spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
    spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
    spark.range(1).withColumnRenamed("id", "x").sort('id) // works
    spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
    org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
    ```
    The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #18740 from aokolnychyi/spark-21538.
    
    (cherry picked from commit f44ead8)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    aokolnychyi authored and gatorsmile committed Jul 27, 2017
    Commit: 06b2ef0

Commits on Jul 28, 2017

  1. [SPARK-21306][ML] OneVsRest should support setWeightCol

    ## What changes were proposed in this pull request?
    
    add `setWeightCol` method for OneVsRest.
    
    `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.
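
    A usage sketch (illustrative; the weight column name is hypothetical):

    ```
    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // The weight column is forwarded to the base classifier only if it supports weights.
    val ovr = new OneVsRest()
      .setClassifier(new LogisticRegression())
      .setWeightCol("weight")
    ```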
    
    ## How was this patch tested?
    
    + [x] add an unit test.
    
    Author: Yan Facai (颜发才) <facai.yan@gmail.com>
    
    Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
    
    (cherry picked from commit a5a3189)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    facaiy authored and yanboliang committed Jul 28, 2017
    Commit: 9379031

Commits on Jul 29, 2017

  1. [SPARK-21508][DOC] Fix example code provided in Spark Streaming Documentation
    
    ## What changes were proposed in this pull request?
    
    JIRA ticket : [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)
    
    This corrects a mistake in the example code provided in the 'Spark Streaming Custom Receivers' documentation.
    doc link : https://spark.apache.org/docs/latest/streaming-custom-receivers.html
    
    ```
    
    // Assuming ssc is the StreamingContext
    val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
    val words = lines.flatMap(_.split(" "))
    ...
    ```
    
    instead of `lines.flatMap(_.split(" "))`
    it should be `customReceiverStream.flatMap(_.split(" "))`
    
    ## How was this patch tested?
    this documentation change is tested manually by jekyll build , running below commands
    ```
    jekyll build
    jekyll serve --watch
    ```
    screen-shots provided below
    ![screenshot1](https://user-images.githubusercontent.com/8828470/28744636-a6de1ac6-7482-11e7-843b-ff84b5855ec0.png)
    ![screenshot2](https://user-images.githubusercontent.com/8828470/28744637-a6def496-7482-11e7-9512-7f4bbe027c6a.png)
    
    Author: Remis Haroon <Remis.Haroon@insdc01.pwc.com>
    
    Closes #18770 from remisharoon/master.
    
    (cherry picked from commit c143820)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Remis Haroon authored and srowen committed Jul 29, 2017
    Commit: df6cd35
  2. [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child
    
    ## What changes were proposed in this pull request?
    
    When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.
    
    An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.
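
    A hypothetical repro shape for the pattern described above (illustrative; the schema and names are invented):

    ```
    // nvl over a nested field appears in both the select list and the grouping expression;
    // the aliases added for the nested field must canonicalize to the same expression.
    spark.range(3).selectExpr("named_struct('foo1', CAST(id AS STRING)) AS foo").createOrReplaceTempView("t")
    spark.sql("SELECT nvl(foo.foo1, 'value') AS c, count(*) FROM t GROUP BY nvl(foo.foo1, 'value')").show()
    ```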
    
    Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim the aliases out by simply transforming the expressions in `CleanupAliases`.
    
    If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.
    
    Considering that those aliases will be replaced later during optimization and are therefore harmless, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.
    
    One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18761 from viirya/SPARK-21555.
    
    (cherry picked from commit 9c8109e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    viirya authored and gatorsmile committed Jul 29, 2017
    Commit: 24a9bac
  3. [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary
    
    ## What changes were proposed in this pull request?
    
    Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values; this can cause wrong results and we should fix it.
    
    Furthermore, we should accept any legal literal value as a range frame boundary. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.
    
    This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c
    
    After this been merged, we can close #16818 .
    
    ## How was this patch tested?
    
    Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18540 from jiangxb1987/rangeFrame.
    
    (cherry picked from commit 92d8563)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    jiangxb1987 authored and gatorsmile committed Jul 29, 2017
    Commit: 66fa6bd

Commits on Jul 30, 2017

  1. Revert "[SPARK-19451][SQL] rangeBetween method should accept Long val…

    …ue as boundary"
    
    This reverts commit 66fa6bd.
    gatorsmile committed Jul 30, 2017
    Commit: e2062b9

Commits on Aug 1, 2017

  1. [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.

    Handle the case where the server closes the socket before the full message
    has been written by the client.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18727 from vanzin/SPARK-21522.
    
    (cherry picked from commit b133501)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Aug 1, 2017
    Commit: 1745434
  2. [SPARK-21593][DOCS] Fix 2 rendering errors on configuration page

    ## What changes were proposed in this pull request?
    
    Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355.
    
    ## How was this patch tested?
    
    Manually built and viewed docs with jekyll
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #18793 from srowen/SPARK-21593.
    
    (cherry picked from commit b1d59e6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Aug 1, 2017
    Commit: 79e5805
  3. [SPARK-21339][CORE] spark-shell --packages option does not add jars to classpath on windows
    
    The --packages option jars are added to the classpath with the "file:///" scheme. On Unix this is not a problem, since the scheme contains the Unix path separator, which separates the jar name from its location in the classpath. On Windows, the jar file is not resolved from the classpath because of the scheme.
    
    Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
    Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar
    
    With this PR, we avoid adding the 'file://' scheme to the packages jar files.
    
    I have verified manually in Windows and Unix environments; with the change, the jar is added to the classpath like below:
    
    Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar
    Unix : /home/<user>/.ivy2/jars/<jar-name>.jar
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes #18708 from devaraj-kavali/SPARK-21339.
    
    (cherry picked from commit 58da1a2)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Devaraj K authored and Marcelo Vanzin committed Aug 1, 2017
    Commit: 67c60d7

Commits on Aug 2, 2017

  1. [SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats

    ## What changes were proposed in this pull request?
    
    This PR fixed a potential overflow issue in EventTimeStats.
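
    A sketch of the overflow-safe shape of such a statistic (illustrative of the general technique, not Spark's exact code):

    ```
    // Track a running average instead of accumulating a raw sum of event times,
    // so the accumulator cannot overflow a Long even for long-running streams.
    case class RunningEventTimeStats(var max: Long = Long.MinValue,
                                     var min: Long = Long.MaxValue,
                                     var avg: Double = 0.0,
                                     var count: Long = 0L) {
      def add(eventTimeMs: Long): Unit = {
        max = math.max(max, eventTimeMs)
        min = math.min(min, eventTimeMs)
        count += 1
        avg += (eventTimeMs - avg) / count
      }
    }
    ```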
    
    ## How was this patch tested?
    
    The new unit tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18803 from zsxwing/avg.
    
    (cherry picked from commit 7f63e85)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Aug 2, 2017
    Commit: 397f904
  2. [SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key
    
    ## What changes were proposed in this pull request?
    
    Right now, when the watermark column is not one of the `dropDuplicates` keys, the query crashes. This PR fixes the issue.
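
    The shape of the query this covers (illustrative sketch; the rate source and column names are just placeholders):

    ```
    // The watermark column ("eventTime") is not among the dropDuplicates keys.
    val deduped = spark.readStream.format("rate").load()
      .withColumnRenamed("timestamp", "eventTime")
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("value")
    ```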
    
    ## How was this patch tested?
    
    The new unit test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18822 from zsxwing/SPARK-21546.
    
    (cherry picked from commit 0d26b3a)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Aug 2, 2017
    Commit: 467ee8d

Commits on Aug 3, 2017

  1. [SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle registry
    
    ## What changes were proposed in this pull request?
    
    When using PySpark broadcast variables in a multi-threaded environment,  `SparkContext._pickled_broadcast_vars` becomes a shared resource.  A race condition can occur when broadcast variables that are pickled from one thread get added to the shared ` _pickled_broadcast_vars` and become part of the python command from another thread.  This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used.
    
    ## How was this patch tested?
    
    Added a unit test that causes this race condition using another thread.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #18823 from BryanCutler/branch-2.2.
    BryanCutler authored and HyukjinKwon committed Aug 3, 2017
    Commit: 690f491
  2. Fix Java SimpleApp spark application

    ## What changes were proposed in this pull request?
    
    Add missing import and missing parentheses to invoke `SparkSession::text()`.
    
    ## How was this patch tested?
    
    Built and ran the code for this application, and ran jekyll locally per docs/README.md.
    
    Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov>
    
    Closes #18795 from christiam/master.
    
    (cherry picked from commit dd72b10)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    christiam authored and srowen committed Aug 3, 2017
    Commit: 1bcfa2a

Commits on Aug 4, 2017

  1. [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column
    
    ## What changes were proposed in this pull request?
    
    An overflow of the difference of bounds on the partitioning column leads to no data being read. This
    patch checks for this overflow.
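
    An illustrative read of the failing shape (a sketch; the URL and table are placeholders):

    ```
    // upperBound - lowerBound overflows a Long when the bounds span nearly the full Long range,
    // which previously produced empty partitions instead of data.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/test")
      .option("dbtable", "t")
      .option("partitionColumn", "id")
      .option("lowerBound", Long.MinValue.toString)
      .option("upperBound", Long.MaxValue.toString)
      .option("numPartitions", "8")
      .load()
    ```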
    
    ## How was this patch tested?
    
    New unit test.
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes #18800 from aray/SPARK-21330.
    
    (cherry picked from commit 25826c7)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    aray authored and srowen committed Aug 4, 2017
    Commit: f9aae8e

Commits on Aug 5, 2017

  1. [SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as group-by ordinal
    
    ## What changes were proposed in this pull request?
    
    create temporary view data as select * from values
    (1, 1),
    (1, 2),
    (2, 1),
    (2, 2),
    (3, 1),
    (3, 2)
    as data(a, b);
    
    `select 3, 4, sum(b) from data group by 1, 2;`
    `select 3 as c, 4 as d, sum(b) from data group by c, d;`
    When running these two cases, the following exception occurred:
    `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`
    
    The cause of this failure:
    If an aggregateExpression is an integer literal, then after the group expression is replaced with this aggregateExpression, the groupExpression is still considered an ordinal.
    
    The solution:
    This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
    
    ## How was this patch tested?
    Added unit test case
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes #18779 from 10110346/groupby.
    
    (cherry picked from commit 894d5a4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    10110346 authored and gatorsmile committed Aug 5, 2017
    Configuration menu
    Copy the full SHA
    841bc2f View commit details
    Browse the repository at this point in the history

Commits on Aug 6, 2017

  1. [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null

    ## What changes were proposed in this pull request?
    
    Calling `SQLContext.getConf(key, null)` for a key that is not defined in the conf and has no default value throws an NPE. It happens only when the conf entry has a value converter.
    
    Added null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue)
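
    A minimal sketch of the contract in the title, via the PySpark wrapper (the key name is made up and an active `spark` session is assumed):

    ```python
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(spark.sparkContext)
    # For a key that is not defined in the conf, passing an explicit None default
    # should simply return None rather than failing with an NPE on the JVM side.
    value = sqlContext.getConf("spark.some.undefined.key", None)
    assert value is None
    ```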
    
    ## How was this patch tested?
    Added unit test
    
    Author: vinodkc <vinod.kc.in@gmail.com>
    
    Closes #18852 from vinodkc/br_Fix_SPARK-21588.
    
    (cherry picked from commit 1ba967b)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    vinodkc authored and gatorsmile committed Aug 6, 2017
    Configuration menu
    Copy the full SHA
    098aaec View commit details
    Browse the repository at this point in the history

Commits on Aug 7, 2017

  1. [SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWrit…

    …er.commitAndGet called
    
    ## What changes were proposed in this pull request?
    
    We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet is called.
    When `revertPartialWritesAndClose` is called, we decrease the written-record count in `ShuffleWriteMetrics`. However, we currently decrease it all the way to zero, which is wrong; we should only subtract the records written after the last `commitAndGet` call.
    
    ## How was this patch tested?
    Modified existing test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes #18830 from ConeyLiu/DiskBlockObjectWriter.
    
    (cherry picked from commit 534a063)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and cloud-fan committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    7a04def View commit details
    Browse the repository at this point in the history
  2. [SPARK-21647][SQL] Fix SortMergeJoin when using CROSS

    ### What changes were proposed in this pull request?
    author: BoleynSu
    closes #18836
    
    ```Scala
    val df = Seq((1, 1)).toDF("i", "j")
    df.createOrReplaceTempView("T")
    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
      sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
        "cross join T t2 where t2.i = t1.i").explain(true)
    }
    ```
    The above code could cause the following exception:
    ```
    SortMergeJoinExec should not take Cross as the JoinType
    java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType
    	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
    ```
    
    Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue.
    
    ### How was this patch tested?
    Modified the two existing test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    Author: Boleyn Su <boleyn.su@gmail.com>
    
    Closes #18863 from gatorsmile/pr-18836.
    
    (cherry picked from commit bbfd6b5)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    4f0eb0c View commit details
    Browse the repository at this point in the history
  3. [SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with di…

    …sabled FS cache
    
    This PR replaces #18623 to do some clean up.
    
    Closes #18623
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    Author: Andrey Taptunov <taptunov@amazon.com>
    
    Closes #18848 from zsxwing/review-pr18623.
    Andrey Taptunov authored and zsxwing committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    43f9c84 View commit details
    Browse the repository at this point in the history
  4. [SPARK-21565][SS] Propagate metadata in attribute replacement.

    ## What changes were proposed in this pull request?
    
    Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes.
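
    For context, a minimal streaming query whose watermark depends on that metadata (the rate source is used only to keep the sketch self-contained):

    ```python
    from pyspark.sql.functions import window

    # withWatermark tags the "timestamp" column with event-time metadata; that
    # metadata has to survive attribute replacement across incremental executions,
    # otherwise the EventTimeWatermark operator cannot find its column.
    events = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", 10)
              .load()
              .withWatermark("timestamp", "10 minutes")
              .groupBy(window("timestamp", "5 minutes"))
              .count())

    query = (events.writeStream
             .outputMode("update")
             .format("console")
             .start())
    ```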
    
    ## How was this patch tested?
    new unit test, which was verified to fail before the fix
    
    Author: Jose Torres <joseph-torres@databricks.com>
    
    Closes #18840 from joseph-torres/SPARK-21565.
    
    (cherry picked from commit cce25b3)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    Jose Torres authored and zsxwing committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    fa92a7b View commit details
    Browse the repository at this point in the history
  5. [SPARK-21648][SQL] Fix confusing assert failure in JDBC source when p…

    …arallel fetching parameters are not properly provided.
    
    ### What changes were proposed in this pull request?
    ```SQL
    CREATE TABLE mytesttable1
    USING org.apache.spark.sql.jdbc
      OPTIONS (
      url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
      dbtable 'mytesttable1',
      paritionColumn 'state_id',
      lowerBound '0',
      upperBound '52',
      numPartitions '53',
      fetchSize '10000'
    )
    ```
    
    The above option name `paritionColumn` is wrong. That means users did not provide a value for `partitionColumn`. In such a case, users hit a confusing error.
    
    ```
    AssertionError: assertion failed
    java.lang.AssertionError: assertion failed
    	at scala.Predef$.assert(Predef.scala:156)
    	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
    	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
    ```
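
    For reference, the equivalent read with the option spelled correctly, sketched with the PySpark reader (connection details are placeholders):

    ```python
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/testdb?user=user&password=pass")
          .option("dbtable", "mytesttable1")
          .option("partitionColumn", "state_id")   # correct spelling
          .option("lowerBound", "0")
          .option("upperBound", "52")
          .option("numPartitions", "53")
          .option("fetchsize", "10000")
          .load())
    ```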
    
    ### How was this patch tested?
    Added a test case
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #18864 from gatorsmile/jdbcPartCol.
    
    (cherry picked from commit baf5cac)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    a1c1199 View commit details
    Browse the repository at this point in the history

Commits on Aug 8, 2017

  1. [SPARK-21567][SQL] Dataset should work with type alias

    If we create a type alias for a type that works with Dataset, the alias itself doesn't work with Dataset.
    
    A reproducible case looks like:
    
        object C {
          type TwoInt = (Int, Int)
          def tupleTypeAlias: TwoInt = (1, 1)
        }
    
        Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))
    
    It throws an exception like:
    
        type T1 is not a class
        scala.ScalaReflectionException: type T1 is not a class
          at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
          ...
    
    This patch uses the dealiased type in many places in `ScalaReflection` to fix it.
    
    Added test case.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18813 from viirya/SPARK-21567.
    
    (cherry picked from commit ee13041)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Aug 8, 2017
    Configuration menu
    Copy the full SHA
    86609a9 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    e87ffca View commit details
    Browse the repository at this point in the history

Commits on Aug 9, 2017

  1. [SPARK-21503][UI] Spark UI shows incorrect task status for a killed E…

    …xecutor Process
    
    The executor tab on the Spark UI shows a task as completed when the executor process running that task is killed using the kill command.
    Added the case ExecutorLostFailure, which was previously missing; without it the default case was executed and the task was marked as completed. The new case covers situations where the executor's connection to the Spark driver was lost, e.g. because the executor process was killed or the network connection dropped.
    
    ## How was this patch tested?
    Manually Tested the fix by observing the UI change before and after.
    Before:
    <img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png">
    After:
    <img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png">
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: pgandhi <pgandhi@yahoo-inc.com>
    Author: pgandhi999 <parthkgandhi9@gmail.com>
    
    Closes #18707 from pgandhi999/master.
    
    (cherry picked from commit f016f5c)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    pgandhi authored and cloud-fan committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    d023314 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in …

    …strong wolfe line search
    
    ## What changes were proposed in this pull request?
    
    Update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
    scalanlp/breeze#651
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes #18797 from WeichenXu123/update-breeze.
    
    (cherry picked from commit b35660d)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    WeichenXu123 authored and yanboliang committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    7446be3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the…

    … return value
    
    Same PR as #18799 but for branch 2.2. Main discussion the other PR.
    --------
    
    When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query not to generate a batch metadata file, this behavior hides it, allowing the query to keep running and eventually delete metadata logs, making the issue hard to debug.
    
    This PR ensures that places calling HDFSMetadataLog.get always check the return value.
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18890 from tdas/SPARK-21596-2.2.
    zsxwing committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    f6d56d2 View commit details
    Browse the repository at this point in the history
  4. [SPARK-21663][TESTS] test("remote fetch below max RPC message size") …

    …should call masterTracker.stop() in MapOutputTrackerSuite
    
    Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn>
    
    ## What changes were proposed in this pull request?
    After unit tests end, there should be a call to masterTracker.stop() to free resources;
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    Run Unit tests;
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: 10087686 <wang.jiaochun@zte.com.cn>
    
    Closes #18867 from wangjiaochun/mapout.
    
    (cherry picked from commit 6426adf)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangjiaochun authored and cloud-fan committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    3ca55ea View commit details
    Browse the repository at this point in the history

Commits on Aug 11, 2017

  1. [SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog

    ## What changes were proposed in this pull request?
    This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption.
    
    ## How was this patch tested?
    Removed the test case.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #18912 from rxin/remove-getTableOption.
    
    (cherry picked from commit 584c7f1)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    rxin committed Aug 11, 2017
    Configuration menu
    Copy the full SHA
    c909496 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21595] Separate thresholds for buffering and spilling in Exter…

    …nalAppendOnlyUnsafeRowArray
    
    ## What changes were proposed in this pull request?
    
    [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to the default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for the WINDOW operator. The old behaviour of the WINDOW operator (pre #16909) would hold data in an array for the first 4096 records, after which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was a paucity of memory due to excessive consumers).

    Currently the switch from in-memory to `UnsafeExternalSorter` and the `UnsafeExternalSorter` spilling to disk for `ExternalAppendOnlyUnsafeRowArray` are controlled by a single threshold. This PR aims to separate them to allow more granular control.
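
    As a sketch, the two thresholds can then be tuned independently from a session; the config names below follow this change, and the values are purely illustrative:

    ```python
    # Rows buffered in memory before switching to UnsafeExternalSorter, and the
    # number of rows the sorter may hold before spilling to disk.
    spark.conf.set("spark.sql.windowExec.buffer.in.memory.threshold", 4096)
    spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", 2097152)
    ```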
    
    ## How was this patch tested?
    
    Added unit tests
    
    Author: Tejas Patil <tejasp@fb.com>
    
    Closes #18843 from tejasapatil/SPARK-21595.
    
    (cherry picked from commit 9443999)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    tejasapatil authored and hvanhovell committed Aug 11, 2017
    Configuration menu
    Copy the full SHA
    406eb1c View commit details
    Browse the repository at this point in the history

Commits on Aug 14, 2017

  1. [SPARK-21563][CORE] Fix race condition when serializing TaskDescripti…

    …ons and adding jars
    
    ## What changes were proposed in this pull request?
    
    Fix the race condition when serializing TaskDescriptions and adding jars by keeping the set of jars and files for a TaskSet constant across the lifetime of the TaskSet. Otherwise TaskDescription serialization can produce an invalid result when new files/jars are added concurrently while the TaskDescription is being serialized.
    
    ## How was this patch tested?
    
    Additional unit test ensures jars/files contained in the TaskDescription remain constant throughout the lifetime of the TaskSet.
    
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes #18913 from ash211/SPARK-21563.
    
    (cherry picked from commit 6847e93)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ash211 authored and cloud-fan committed Aug 14, 2017
    Configuration menu
    Copy the full SHA
    7b98077 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21696][SS] Fix a potential issue that may generate partial sna…

    …pshot files
    
    ## What changes were proposed in this pull request?
    
    Directly writing a snapshot file may generate a partial file. This PR changes it to write to a temp file then rename to the target file.
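
    The pattern, sketched generically in Python (this mirrors the intent, not Spark's internal code):

    ```python
    import os
    import tempfile

    def write_snapshot_atomically(data: bytes, final_path: str) -> None:
        """Write to a temp file in the same directory, then rename it into place,
        so a reader never observes a partially written snapshot."""
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path))
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
        except Exception:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
    ```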
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18928 from zsxwing/SPARK-21696.
    
    (cherry picked from commit 282f00b)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Aug 14, 2017
    Configuration menu
    Copy the full SHA
    48bacd3 View commit details
    Browse the repository at this point in the history

Commits on Aug 15, 2017

  1. [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are…

    … successfully removed
    
    ## What changes were proposed in this pull request?
    
    We put the staging path to delete into the `FileSystem` deleteOnExit cache in case the path can't be successfully removed. But when we do successfully remove the path, we don't remove it from the cache. We should, to avoid the cache growing continuously.
    
    ## How was this patch tested?
    
    Added a test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18934 from viirya/SPARK-21721.
    
    (cherry picked from commit 4c3cf1c)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    viirya authored and gatorsmile committed Aug 15, 2017
    Configuration menu
    Copy the full SHA
    d9c8e62 View commit details
    Browse the repository at this point in the history

Commits on Aug 16, 2017

  1. [SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)

    Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon
    
    (Maybe the usage should be forbidden when writing, in a major version change?).
    
    Manual test that loading and writing LibSVM files works fine, both with and without the numFeatures option.
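
    A sketch of that round trip (paths are placeholders, and an active `spark` session is assumed):

    ```python
    # "numFeatures" is honoured when reading LibSVM data; on write it is now
    # ignored instead of failing with "key not found: numFeatures".
    df = (spark.read.format("libsvm")
          .option("numFeatures", "780")
          .load("data/mllib/sample_libsvm_data.txt"))

    (df.write.format("libsvm")
       .option("numFeatures", "780")
       .save("/tmp/libsvm_out"))
    ```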
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>
    
    Closes #18872 from ProtD/master.
    
    (cherry picked from commit 8321c14)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Jan Vrsovsky authored and srowen committed Aug 16, 2017
    Configuration menu
    Copy the full SHA
    f1accc8 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21656][CORE] spark dynamic allocation should not idle timeout …

    …executors when tasks still to run
    
    ## What changes were proposed in this pull request?
    
    Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark let them go when they were idle but really still needed. I have seen this when the scheduler was waiting for node locality, which takes longer than the default idle timeout. In these jobs the number of executors drops very low (fewer than 10) while there are still around 80,000 tasks to run.
    We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks to be run.
    
    ## How was this patch tested?
    
    Tested by manually adding executors to the `executorIdsToBeRemoved` list and seeing if those executors were removed when there are a lot of tasks and a high `numExecutorsTarget` value.
    
    Code used
    
    In  `ExecutorAllocationManager.start()`
    
    ```
        start_time = clock.getTimeMillis()
    ```
    
    In `ExecutorAllocationManager.schedule()`
    ```
        val executorIdsToBeRemoved = ArrayBuffer[String]()
        if ( now > start_time + 1000 * 60 * 2) {
          logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
          start_time +=  1000 * 60 * 100
          var counter = 0
          for (x <- executorIds) {
            counter += 1
            if (counter == 2) {
              counter = 0
              executorIdsToBeRemoved += x
            }
          }
        }
    ```

    Author: John Lee <jlee2@yahoo-inc.com>
    
    Closes #18874 from yoonlee95/SPARK-21656.
    
    (cherry picked from commit adf005d)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    John Lee authored and Tom Graves committed Aug 16, 2017
    Configuration menu
    Copy the full SHA
    f5ede0d View commit details
    Browse the repository at this point in the history
  3. [SPARK-18464][SQL][BACKPORT] support old table which doesn't store sc…

    …hema in table properties
    
    backport #18907 to branch 2.2
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #18963 from cloud-fan/backport.
    cloud-fan authored and gatorsmile committed Aug 16, 2017
    Configuration menu
    Copy the full SHA
    2a96975 View commit details
    Browse the repository at this point in the history

Commits on Aug 18, 2017

  1. [SPARK-21739][SQL] Cast expression should initialize timezoneId when …

    …it is called statically to convert something into TimestampType
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
    
    This issue is caused by introducing TimeZoneAwareExpression.
    When the **Cast** expression converts something into TimestampType, it should be resolved with setting `timezoneId`. In general, it is resolved in LogicalPlan phase.
    
    However, there are still some places that use Cast expression statically to convert datatypes without setting `timezoneId`. In such cases,  `NoSuchElementException: None.get` will be thrown for TimestampType.
    
    This PR is proposed to fix the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
    
    ## How was this patch tested?
    
    unit test
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #18960 from DonnyZone/spark-21739.
    
    (cherry picked from commit 310454b)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DonnyZone authored and gatorsmile committed Aug 18, 2017
    Configuration menu
    Copy the full SHA
    fdea642 View commit details
    Browse the repository at this point in the history

Commits on Aug 20, 2017

  1. [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFT…

    …SurvivalRegression
    
    ## What changes were proposed in this pull request?
    
    The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol.
    
    ## How was this patch tested?
    
    Manually.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Cédric Pelvet <cedric.pelvet@gmail.com>
    
    Closes #18980 from sharp-pixel/master.
    
    (cherry picked from commit 73e04ec)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    sharp-pixel authored and srowen committed Aug 20, 2017
    Configuration menu
    Copy the full SHA
    6c2a38a View commit details
    Browse the repository at this point in the history
  2. [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when…

    … paths are successfully removed
    
    ## What changes were proposed in this pull request?
    
    Fix a typo in test.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19005 from viirya/SPARK-21721-followup.
    
    (cherry picked from commit 28a6cca)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Aug 20, 2017
    Configuration menu
    Copy the full SHA
    0f640e9 View commit details
    Browse the repository at this point in the history

Commits on Aug 21, 2017

  1. [SPARK-21617][SQL] Store correct table metadata when altering schema …

    …in Hive metastore.
    
    For Hive tables, the current "replace the schema" code is the correct
    path, except that an exception in that path should result in an error, and
    not in retrying in a different way.
    
    For data source tables, Spark may generate a non-compatible Hive table;
    but for that to work with Hive 2.1, the detection of data source tables needs
    to be fixed in the Hive client, to also consider the raw tables used by code
    such as `alterTableSchema`.
    
    Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18849 from vanzin/SPARK-21617.
    
    (cherry picked from commit 84b5b16)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    Marcelo Vanzin authored and gatorsmile committed Aug 21, 2017
    Configuration menu
    Copy the full SHA
    526087f View commit details
    Browse the repository at this point in the history

Commits on Aug 24, 2017

  1. [SPARK-21805][SPARKR] Disable R vignettes code on Windows

    ## What changes were proposed in this pull request?
    
    Code in vignettes requires winutils on Windows to run. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable running the code (the resulting vignettes will not have output from the code, but the text and code are still there).
    
    fix * checking re-building of vignette outputs ... WARNING
    and
    > %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
    
    ## How was this patch tested?
    
    jenkins, appveyor, r-hub
    
    before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
    after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #19016 from felixcheung/rvigwind.
    
    (cherry picked from commit 43cbfad)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Aug 24, 2017
    Configuration menu
    Copy the full SHA
    236b2f4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21826][SQL] outer broadcast hash join should not throw NPE

    This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 .
    
    Non-equal join condition should only be applied when the equal-join condition matches.
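
    A minimal sketch of the query shape being fixed (data is illustrative):

    ```python
    from pyspark.sql.functions import broadcast, col

    left = spark.createDataFrame([(1, 10), (2, 20)], ["k", "v"])
    right = spark.createDataFrame([(1, 5)], ["k", "w"])

    # Outer broadcast join with an extra non-equal condition: rows whose equal-join
    # keys match but fail the non-equal condition must still come back with nulls
    # on the right side, not trigger an NPE.
    joined = left.join(
        broadcast(right),
        (left["k"] == right["k"]) & (col("w") > 100),
        "left_outer")
    joined.show()
    ```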
    
    regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19036 from cloud-fan/bug.
    
    (cherry picked from commit 2dd37d8)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    cloud-fan authored and hvanhovell committed Aug 24, 2017
    Configuration menu
    Copy the full SHA
    a585367 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureS…

    …td contains zero (backport PR for 2.2)
    
    ## What changes were proposed in this pull request?
    
    This is backport PR of #18896
    
    Fix a bug where MLOR does not work correctly when featureStd contains zero.

    We can reproduce the bug with a dataset whose features include a zero-variance column; it generates a wrong result (all coefficients become 0):
    ```
        val multinomialDatasetWithZeroVar = {
          val nPoints = 100
          val coefficients = Array(
            -0.57997, 0.912083, -0.371077,
            -0.16624, -0.84355, -0.048509)
    
          val xMean = Array(5.843, 3.0)
          val xVariance = Array(0.6856, 0.0)  // including zero variance
    
          val testData = generateMultinomialLogisticInput(
            coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
    
          val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
          df.cache()
          df
        }
    ```
    ## How was this patch tested?
    
    testcase added.
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes #19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.
    WeichenXu123 authored and jkbradley committed Aug 24, 2017
    Configuration menu
    Copy the full SHA
    2b4bd79 View commit details
    Browse the repository at this point in the history

Commits on Aug 28, 2017

  1. [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.vari…

    …ance generate negative result
    
    Because of numerical error, MultivariateOnlineSummarizer.variance can generate a negative variance.

    **This is a serious bug**: many algorithms in MLlib use the standard deviation computed from `sqrt(variance)`, so a negative variance produces NaN and crashes the whole algorithm.

    We can reproduce this bug with the following code:
    ```
        val summarizer1 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.7)
        val summarizer2 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
        val summarizer3 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.5)
        val summarizer4 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
    
        val summarizer = summarizer1
          .merge(summarizer2)
          .merge(summarizer3)
          .merge(summarizer4)
    
        println(summarizer.variance(0))
    ```
    This PR fixes the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and several places in `WeightedLeastSquares`.
    
    test cases added.
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes #19029 from WeichenXu123/fix_summarizer_var_bug.
    
    (cherry picked from commit 0456b40)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    WeichenXu123 authored and srowen committed Aug 28, 2017
    Configuration menu
    Copy the full SHA
    0d4ef2f View commit details
    Browse the repository at this point in the history
  2. [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config …

    …for launching daemons like History Server
    
    History Server launch uses SparkClassCommandBuilder for launching the server. It is observed that SPARK_CLASSPATH has been removed and deprecated. For spark-submit this takes a different route, and spark.driver.extraClassPath takes care of specifying additional jars in the classpath that were previously specified in SPARK_CLASSPATH. Right now the only way to specify additional jars for launching daemons such as the history server is SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html), but this I presume is a distribution classpath. It would be nice to have a similar config like spark.driver.extraClassPath for launching daemons such as the history server.
    
    Added new environment variable SPARK_DAEMON_CLASSPATH to set classpath for launching daemons. Tested and verified for History Server and Standalone Mode.
    
    ## How was this patch tested?
    Initially, the history server start script would fail because it could not find the required jars for launching the server in the java classpath. The same was true for running Master and Worker in standalone mode. By adding the environment variable SPARK_DAEMON_CLASSPATH to the java classpath, both kinds of daemons (History Server, standalone daemons) start up and run.
    
    Author: pgandhi <pgandhi@yahoo-inc.com>
    Author: pgandhi999 <parthkgandhi9@gmail.com>
    
    Closes #19047 from pgandhi999/master.
    
    (cherry picked from commit 24e6c18)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    pgandhi authored and Tom Graves committed Aug 28, 2017
    Configuration menu
    Copy the full SHA
    59bb7eb View commit details
    Browse the repository at this point in the history

Commits on Aug 29, 2017

  1. [SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resour…

    …ces in yarn client mode
    
    ## What changes were proposed in this pull request?
    
    This is a backport PR to fix issue of re-uploading remote resource in yarn client mode. The original PR is #18962.
    
    ## How was this patch tested?
    
    Tested in local UT.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #19074 from jerryshao/SPARK-21714-2.2-backport.
    jerryshao authored and Marcelo Vanzin committed Aug 29, 2017
    Configuration menu
    Copy the full SHA
    59529b2 View commit details
    Browse the repository at this point in the history
  2. Revert "[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remot…

    …e resources in yarn client mode"
    
    This reverts commit 59529b2.
    Marcelo Vanzin committed Aug 29, 2017
    Configuration menu
    Copy the full SHA
    917fe66 View commit details
    Browse the repository at this point in the history

Commits on Aug 30, 2017

  1. [SPARK-21254][WEBUI] History UI performance fixes

    ## This is a backport of PR #18783 to the latest released branch 2.2.
    
    ## What changes were proposed in this pull request?
    
    As described in the JIRA ticket, the History page takes ~1 min to load when the number of jobs is 10k+.
    Most of the time is currently spent on DOM manipulations and the additional costs they imply (browser repaints and reflows).
    PR's goal is not to change any behavior but to optimize time of History UI rendering:
    
    1. The most costly operation is setting `innerHTML` for the `duration` column within a loop, which is [extremely unperformant](https://jsperf.com/jquery-append-vs-html-list-performance/24). [Refactoring](criteo-forks@b7e56ee) this helped to get page load time **down to 10-15s**

    2. The second big gain, bringing page load time **down to 4s**, [was achieved](criteo-forks@3630ca2) by detaching the table's DOM before parsing it with the DataTables jQuery plugin.

    3. Another chunk of improvements ([1](criteo-forks@aeeeeb5), [2](criteo-forks@e25be9a), [3](criteo-forks@9169707)) was focused on removing unnecessary DOM manipulations that in total contributed ~250ms to page load time.
    
    ## How was this patch tested?
    
    Tested by existing Selenium tests in `org.apache.spark.deploy.history.HistoryServerSuite`.
    
    Changes were also tested on Criteo's spark-2.1 fork with 20k+ number of rows in the table, reducing load time to 4s.
    
    Author: Dmitry Parfenchik <d.parfenchik@criteo.com>
    
    Closes #18860 from 2ooom/history-ui-perf-fix-2.2.
    2ooom authored and srowen committed Aug 30, 2017
    Configuration menu
    Copy the full SHA
    a6a9944 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resour…

    …ces in yarn client mode
    
    ## What changes were proposed in this pull request?
    
    This is a backport PR to fix issue of re-uploading remote resource in yarn client mode. The original PR is #18962.
    
    ## How was this patch tested?
    
    Tested in local UT.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #19074 from jerryshao/SPARK-21714-2.2-backport.
    jerryshao authored and Marcelo Vanzin committed Aug 30, 2017
    Configuration menu
    Copy the full SHA
    d10c9dc View commit details
    Browse the repository at this point in the history
  3. [SPARK-21834] Incorrect executor request in case of dynamic allocation

    ## What changes were proposed in this pull request?
    
    The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself.
    
    ## How was this patch tested?
    
    Ran a job on the cluster and made sure the executor request is correct
    
    Author: Sital Kedia <skedia@fb.com>
    
    Closes #19081 from sitalkedia/skedia/oss_fix_executor_allocation.
    
    (cherry picked from commit 6949a9c)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Sital Kedia authored and Marcelo Vanzin committed Aug 30, 2017
    Configuration menu
    Copy the full SHA
    14054ff View commit details
    Browse the repository at this point in the history

Commits on Sep 1, 2017

  1. [SPARK-21884][SPARK-21477][BACKPORT-2.2][SQL] Mark LocalTableScanExec…

    …'s input data transient
    
    This PR is to backport #18686 for resolving the issue in #19094
    
    ---
    
    ## What changes were proposed in this pull request?
    This PR marks the parameters `rows` and `unsafeRow` of LocalTableScanExec transient, which avoids serializing the unneeded objects.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19101 from gatorsmile/backport-21477.
    gatorsmile committed Sep 1, 2017
    Configuration menu
    Copy the full SHA
    50f86e1 View commit details
    Browse the repository at this point in the history

Commits on Sep 4, 2017

  1. [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScan…

    …Exec with sun.io.serialization.extendedDebugInfo=true
    
    ## What changes were proposed in this pull request?
    
    If no SparkConf is available to Utils.redact, simply don't redact.
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19123 from srowen/SPARK-21418.
    
    (cherry picked from commit ca59445)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    srowen authored and hvanhovell committed Sep 4, 2017
    Configuration menu
    Copy the full SHA
    fb1b5f0 View commit details
    Browse the repository at this point in the history

Commits on Sep 5, 2017

  1. [SPARK-21925] Update trigger interval documentation in docs with beha…

    …vior change in Spark 2.2
    
    Forgot to update docs with behavior change.
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #19138 from brkyvz/trigger-doc-fix.
    
    (cherry picked from commit 8c954d2)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    brkyvz authored and tdas committed Sep 5, 2017
    Configuration menu
    Copy the full SHA
    1f7c486 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOC] Update Partition Discovery section to enumerate all av…

    …ailable file sources
    
    ## What changes were proposed in this pull request?
    
    All built-in data sources support `Partition Discovery`. We should update the document to make this benefit clear to users.
    
    **AFTER**
    
    <img width="906" alt="1" src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png">
    
    ## How was this patch tested?
    
    ```
    SKIP_API=1 jekyll serve --watch
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #19139 from dongjoon-hyun/partitiondiscovery.
    
    (cherry picked from commit 9e451bc)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    dongjoon-hyun authored and gatorsmile committed Sep 5, 2017
    Configuration menu
    Copy the full SHA
    7da8fbf View commit details
    Browse the repository at this point in the history

Commits on Sep 6, 2017

  1. [SPARK-21924][DOCS] Update structured streaming programming guide doc

    ## What changes were proposed in this pull request?
    
    Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." as follow "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." under the programming structured streaming programming guide.
    
    Author: Riccardo Corbella <r.corbella@reply.it>
    
    Closes #19137 from riccardocorbella/bugfix.
    
    (cherry picked from commit 4ee7dfe)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Riccardo Corbella authored and srowen committed Sep 6, 2017
    Configuration menu
    Copy the full SHA
    9afab9a View commit details
    Browse the repository at this point in the history
  2. [SPARK-21901][SS] Define toString for StateOperatorProgress

    ## What changes were proposed in this pull request?
    
    Just `StateOperatorProgress.toString` + few formatting fixes
    
    ## How was this patch tested?
    
    Local build. Waiting for OK from Jenkins.
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
    
    (cherry picked from commit fa0092b)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    jaceklaskowski authored and zsxwing committed Sep 6, 2017
    Configuration menu
    Copy the full SHA
    342cc2a View commit details
    Browse the repository at this point in the history

Commits on Sep 7, 2017

  1. Fixed pandoc dependency issue in python/setup.py

    ## Problem Description
    
    When pyspark is listed as a dependency of another package, installing
    the other package will cause an install failure in pyspark. When the
    other package is being installed, pyspark's setup_requires requirements
    are installed including pypandoc. Thus, the exception handling on
    setup.py:152 does not work because the pypandoc module is indeed
    available. However, the pypandoc.convert() function fails if pandoc
    itself is not installed (in our use cases it is not). This raises an
    OSError that is not handled, and setup fails.
    
    The following is a sample failure:
    ```
    $ which pandoc
    $ pip freeze | grep pypandoc
    pypandoc==1.4
    $ pip install pyspark
    Collecting pyspark
      Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
        100% |████████████████████████████████| 188.3MB 16.8MB/s
        Complete output from command python setup.py egg_info:
        Maybe try:
    
            sudo apt-get install pandoc
        See http://johnmacfarlane.net/pandoc/installing.html
        for installation options
        ---------------------------------------------------------------
    
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module>
            long_description = pypandoc.convert('README.md', 'rst')
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert
            outputfile=outputfile, filters=filters)
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input
            _ensure_pandoc_path()
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path
            raise OSError("No pandoc was found: either install pandoc and add it\n"
        OSError: No pandoc was found: either install pandoc and add it
        to your PATH or or call pypandoc.download_pandoc(...) or
        install pypandoc wheels with included pandoc.
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/
    ```
    
    ## What changes were proposed in this pull request?
    
    This change simply adds an additional exception handler for the OSError
    that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed.
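
    The pattern, roughly (a sketch of the intent rather than the exact setup.py code):

    ```python
    try:
        import pypandoc
        long_description = pypandoc.convert('README.md', 'rst')
    except (ImportError, OSError):
        # pypandoc missing, or pypandoc present but the pandoc binary not installed
        long_description = open('README.md').read()
    ```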
    
    ## How was this patch tested?
    
    I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel.
    
    Here is the output
    
    ```
    $ pip freeze | grep pypandoc
    pypandoc==1.4
    $ which pandoc
    $ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
    Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
    Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0)
    Installing collected packages: pyspark
    Successfully installed pyspark-2.3.0.dev0
    ```
    
    Author: Tucker Beck <tucker.beck@rentrakmail.com>
    
    Closes #18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.
    
    (cherry picked from commit aad2125)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    Tucker Beck authored and HyukjinKwon committed Sep 7, 2017
    Configuration menu
    Copy the full SHA
    49968de View commit details
    Browse the repository at this point in the history
  2. [SPARK-21890] Credentials not being passed to add the tokens

    ## What changes were proposed in this pull request?
    I observed this while running an Oozie job trying to connect to HBase via Spark.
    It looks like the creds are not being passed in https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release.
    More Info as to why it fails on secure grid:
    Oozie client gets the necessary tokens the application needs before launching. It passes those tokens along to the oozie launcher job (MR job) which will then actually call the Spark client to launch the spark app and pass the tokens along.
    The oozie launcher job cannot get anymore tokens because all it has is tokens ( you can't get tokens with tokens, you need tgt or keytab).
    The error here is because the launcher job runs the Spark Client to submit the spark job but the spark client doesn't see that it already has the hdfs tokens so it tries to get more, which ends with the exception.
    There was a change with SPARK-19021 to generalize the hdfs credentials provider that changed it so we don't pass the existing credentials into the call to get tokens so it doesn't realize it already has the necessary tokens.
    
    https://issues.apache.org/jira/browse/SPARK-21890
    Modified to pass creds to get delegation tokens
    
    ## How was this patch tested?
    Manual testing on our secure cluster
    
    Author: Sanket Chintapalli <schintap@yahoo-inc.com>
    
    Closes #19103 from redsanket/SPARK-21890.
    Sanket Chintapalli authored and Marcelo Vanzin committed Sep 7, 2017
    Configuration menu
    Copy the full SHA
    0848df1 View commit details
    Browse the repository at this point in the history

Commits on Sep 8, 2017

  1. [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should s…

    …top SparkContext.
    
    ## What changes were proposed in this pull request?
    
    `pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in the test, which might affect the following tests.
    This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.
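
    The clean-up pattern, sketched (test and app names are illustrative):

    ```python
    import unittest

    from pyspark import SparkContext

    class FreshContextTest(unittest.TestCase):
        def test_creates_and_stops_its_own_context(self):
            sc = SparkContext("local[2]", "fresh-context-test")
            try:
                self.assertEqual(sc.parallelize(range(3)).count(), 3)
            finally:
                sc.stop()  # always stop, so later tests can create their own context
    ```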
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #19158 from ueshin/issues/SPARK-21950.
    
    (cherry picked from commit 57bc1e9)
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
    ueshin committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    4304d0b View commit details
    Browse the repository at this point in the history
  2. [SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing

    dongjoon-hyun HyukjinKwon
    
    Error in PySpark example code:
    /examples/src/main/python/ml/estimator_transformer_param_example.py
    
    The original Scala code says
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)
    
    The parent is lr
    
    There is no method for accessing parent as is done in Scala.
    
    This code has been tested in Python, and returns values consistent with Scala
    
    ## What changes were proposed in this pull request?
    
    Proposing to call the lr variable instead of model1 or model2
    
    ## How was this patch tested?
    
    This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines.
    
    The output for model2 in PySpark should be
    
    {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True}
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: MarkTab marktab.net <marktab@users.noreply.github.com>
    
    Closes #19152 from marktab/branch-2.2.
    marktab authored and srowen committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    781a1f8 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21936][SQL][2.2] backward compatibility test framework for Hiv…

    …eExternalCatalog
    
    backport #19148 to 2.2
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19163 from cloud-fan/test.
    cloud-fan authored and gatorsmile committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    08cb06a View commit details
    Browse the repository at this point in the history
  4. [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table…

    …" in InMemoryCatalogedDDLSuite
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
    Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.
    
    ## How was this patch tested?
    
    Use existing test case
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19159 from kiszk/SPARK-21946.
    
    (cherry picked from commit 8a4f228)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    kiszk authored and gatorsmile committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    9ae7c96 View commit details
    Browse the repository at this point in the history
  5. [SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "met…

    …astore_db" before listing files in R tests
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying.
    
    ## How was this patch tested?
    
    Manually running multiple times R tests via `./R/run-tests.sh`.
    
    **Before**
    
    Second run:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ....................................................................................................1234.......................
    
    Failed -------------------------------------------------------------------------
    1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
    length(list1) not equal to length(list2).
    1/1 mismatches
    [1] 25 - 23 == 2
    
    2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
    sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
    10/25 mismatches
    x[16]: "metastore_db"
    y[16]: "pkg"
    
    x[17]: "pkg"
    y[17]: "R"
    
    x[18]: "R"
    y[18]: "README.md"
    
    x[19]: "README.md"
    y[19]: "run-tests.sh"
    
    x[20]: "run-tests.sh"
    y[20]: "SparkR_2.2.0.tar.gz"
    
    x[21]: "metastore_db"
    y[21]: "pkg"
    
    x[22]: "pkg"
    y[22]: "R"
    
    x[23]: "R"
    y[23]: "README.md"
    
    x[24]: "README.md"
    y[24]: "run-tests.sh"
    
    x[25]: "run-tests.sh"
    y[25]: "SparkR_2.2.0.tar.gz"
    
    3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
    length(list1) not equal to length(list2).
    1/1 mismatches
    [1] 25 - 23 == 2
    
    4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
    sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
    10/25 mismatches
    x[16]: "metastore_db"
    y[16]: "pkg"
    
    x[17]: "pkg"
    y[17]: "R"
    
    x[18]: "R"
    y[18]: "README.md"
    
    x[19]: "README.md"
    y[19]: "run-tests.sh"
    
    x[20]: "run-tests.sh"
    y[20]: "SparkR_2.2.0.tar.gz"
    
    x[21]: "metastore_db"
    y[21]: "pkg"
    
    x[22]: "pkg"
    y[22]: "R"
    
    x[23]: "R"
    y[23]: "README.md"
    
    x[24]: "README.md"
    y[24]: "run-tests.sh"
    
    x[25]: "run-tests.sh"
    y[25]: "SparkR_2.2.0.tar.gz"
    
    DONE ===========================================================================
    ```
    
    **After**
    
    Second run:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................
    ```
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #18335 from HyukjinKwon/SPARK-21128.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19166 from felixcheung/rbackport21128.
    HyukjinKwon authored and Felix Cheung committed Sep 8, 2017
    9876821

Commits on Sep 9, 2017

  1. [SPARK-21954][SQL] JacksonUtils should verify MapType's value type in…

    …stead of key type
    
    ## What changes were proposed in this pull request?
    
    `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
    
    Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
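    
    A minimal, hedged illustration of why only the value type matters (the output path and app name below are placeholders, not from the patch): the `map` SQL function builds a map<int,string> column, and writing it out as JSON stringifies the keys.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{lit, map}
    
    // Keys are rendered via toString when the map is serialized to JSON, so a non-string
    // key type is fine as long as the *value* type is JSON-representable.
    val spark = SparkSession.builder().appName("map-json-demo").master("local[*]").getOrCreate()
    val df = spark.range(1).select(map(lit(1), lit("a"), lit(2), lit("b")).as("m"))
    df.write.mode("overwrite").json("/tmp/map-json-demo")   // rows like {"m":{"1":"a","2":"b"}}
    ```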
    
    ## How was this patch tested?
    
    Added tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19167 from viirya/test-jacksonutils.
    
    (cherry picked from commit 6b45d7e)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    viirya authored and HyukjinKwon committed Sep 9, 2017
    182478e

Commits on Sep 10, 2017

  1. [SPARK-20098][PYSPARK] dataType's typeName fix

    ## What changes were proposed in this pull request?
    The `typeName` classmethod has been fixed by using a type -> typeName map.
    
    ## How was this patch tested?
    local build
    
    Author: Peter Szalai <szalaipeti.vagyok@gmail.com>
    
    Closes #17435 from szalai1/datatype-gettype-fix.
    
    (cherry picked from commit 520d92a)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    szalai1 authored and HyukjinKwon committed Sep 10, 2017
    b1b5a7f

Commits on Sep 12, 2017

  1. [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.

    ## What changes were proposed in this pull request?
    
    Fixed wrong documentation for Mean Absolute Error.
    
    Even though the code is correct for the MAE:
    
    ```scala
    Since("1.2.0")
      def meanAbsoluteError: Double = {
        summary.normL1(1) / summary.count
      }
    ```
    In the documentation the division by N is missing.
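    
    For reference, the intended definition (standard formulation, not quoted from the patch) is:
    
    ```
    \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|
    ```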
    
    ## How was this patch tested?
    
    All of spark tests were run.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: FavioVazquez <favio.vazquezp@gmail.com>
    Author: faviovazquez <favio.vazquezp@gmail.com>
    Author: Favio André Vázquez <favio.vazquezp@gmail.com>
    
    Closes #19190 from FavioVazquez/mae-fix.
    
    (cherry picked from commit e2ac2f1)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    FavioVazquez authored and srowen committed Sep 12, 2017
    10c6836
  2. [DOCS] Fix unreachable links in the document

    ## What changes were proposed in this pull request?
    
    Recently, I found two unreachable links in the document and fixed them.
    Because these are small documentation changes, I haven't filed a JIRA issue, but please let me know if you think one is needed.
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    
    Closes #19195 from sarutak/fix-unreachable-link.
    
    (cherry picked from commit 9575582)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    sarutak authored and srowen committed Sep 12, 2017
    63098dc
  3. [SPARK-18608][ML] Fix double caching

    ## What changes were proposed in this pull request?
    `df.rdd.getStorageLevel` => `df.storageLevel`
    
    using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.
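    
    A hedged sketch of the pattern the change moves to (the helper name is illustrative; `Dataset.storageLevel` and `persist` are real API): `dataset.rdd` materializes a new RDD whose storage level is always NONE, so the old check could never see an already-cached Dataset and would cache it a second time.
    
    ```scala
    import org.apache.spark.sql.Dataset
    import org.apache.spark.storage.StorageLevel
    
    // Returns true if this call took ownership of caching the input.
    def cacheIfNeeded(ds: Dataset[_]): Boolean = {
      // Old check: ds.rdd.getStorageLevel == StorageLevel.NONE  (always NONE, even when cached)
      val handlePersistence = ds.storageLevel == StorageLevel.NONE
      if (handlePersistence) ds.persist(StorageLevel.MEMORY_AND_DISK)
      handlePersistence
    }
    ```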
    
    Previous discussion in other PRs: #19107, #17014
    
    ## How was this patch tested?
    existing tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #19197 from zhengruifeng/double_caching.
    
    (cherry picked from commit c5f9b89)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    zhengruifeng authored and jkbradley committed Sep 12, 2017
    b606dc1

Commits on Sep 13, 2017

  1. [SPARK-21980][SQL] References in grouping functions should be indexed…

    … with semanticEquals
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-21980
    
    This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations.
    
    The problem can be reproduced by:
    
    `val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
     df.cube("a").agg(grouping("A")).show()`
    
    ## How was this patch tested?
    unit tests
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #19202 from DonnyZone/ResolveGroupingAnalytics.
    
    (cherry picked from commit 21c4450)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DonnyZone authored and gatorsmile committed Sep 13, 2017
    3a692e3

Commits on Sep 14, 2017

  1. [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.

    ## What changes were proposed in this pull request?
    #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #19220 from yanboliang/SPARK-18608.
    
    (cherry picked from commit c76153c)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    yanboliang committed Sep 14, 2017
    51e5a82

Commits on Sep 17, 2017

  1. [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

    ## What changes were proposed in this pull request?
    (edited)
    Fixes a bug introduced in #16121
    
    In PairDeserializer convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists so this should not have a performance impact, but this is needed when repeated `zip`'s are done.
    
    ## How was this patch tested?
    
    Additional unit test
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes #19226 from aray/SPARK-21985.
    
    (cherry picked from commit 6adf67d)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    aray authored and HyukjinKwon committed Sep 17, 2017
    42852bb

Commits on Sep 18, 2017

  1. [SPARK-21953] Show both memory and disk bytes spilled if either is pr…

    …esent
    
    As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden.
    
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes #19164 from ash211/patch-3.
    
    (cherry picked from commit 6308c65)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ash211 authored and cloud-fan committed Sep 18, 2017
    309c401
  2. [SPARK-22043][PYTHON] Improves error message for show_profiles and du…

    …mp_profiles
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to improve error message from:
    
    ```
    >>> sc.show_profiles()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
        self.profiler_collector.show_profiles()
    AttributeError: 'NoneType' object has no attribute 'show_profiles'
    >>> sc.dump_profiles("/tmp/abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
        self.profiler_collector.dump_profiles(path)
    AttributeError: 'NoneType' object has no attribute 'dump_profiles'
    ```
    
    to
    
    ```
    >>> sc.show_profiles()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
        raise RuntimeError("'spark.python.profile' configuration must be set "
    RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
    >>> sc.dump_profiles("/tmp/abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
        raise RuntimeError("'spark.python.profile' configuration must be set "
    RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
    ```
    
    ## How was this patch tested?
    
    Unit tests added in `python/pyspark/tests.py` and manual tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19260 from HyukjinKwon/profile-errors.
    
    (cherry picked from commit 7c72662)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    HyukjinKwon committed Sep 18, 2017
    a86831d
  3. [SPARK-22047][TEST] ignore HiveExternalCatalogVersionsSuite

    ## What changes were proposed in this pull request?
    
    As reported in https://issues.apache.org/jira/browse/SPARK-22047 , HiveExternalCatalogVersionsSuite is failing frequently, let's disable this test suite to unblock other PRs, I'm looking into the root cause.
    
    ## How was this patch tested?
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19264 from cloud-fan/test.
    
    (cherry picked from commit 894a756)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Sep 18, 2017
    48d6aef

Commits on Sep 19, 2017

  1. [SPARK-22047][FLAKY TEST] HiveExternalCatalogVersionsSuite

    ## What changes were proposed in this pull request?
    
    This PR tries to download Spark for each test run, to make sure each test run is absolutely isolated.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19265 from cloud-fan/test.
    
    (cherry picked from commit 10f45b3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Sep 19, 2017
    d0234eb
  2. [SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala

    The current implementation of processingRate-total uses the wrong metric:
    it mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond.
    
    ## What changes were proposed in this pull request?
    Adjust processingRate-total from using inputRowsPerSecond to processedRowsPerSecond
    
    ## How was this patch tested?
    
    Built spark from source with proposed change and tested output with correct parameter. Before change the csv metrics file for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala the processingRate-total csv file displayed the correct metric.
    <img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png">
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Taaffy <32072374+Taaffy@users.noreply.github.com>
    
    Closes #19268 from Taaffy/patch-1.
    
    (cherry picked from commit 1bc17a6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Taaffy authored and srowen committed Sep 19, 2017
    6764408

Commits on Sep 20, 2017

  1. [SPARK-22076][SQL] Expand.projections should not be a Stream

    ## What changes were proposed in this pull request?
    
    Spark with Scala 2.10 fails with a group by cube:
    ```
    spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
    spark.sql("select 1 from rollup_bug group by rollup ()").show
    ```
    
    It can be traced back to #15484 , which made `Expand.projections` a lazy `Stream` for group by cube.
    
    In scala 2.10 `Stream` captures a lot of stuff, and in this case it captures the entire query plan which has some un-serializable parts.
    
    This change is also good for master branch, to reduce the serialized size of `Expand.projections`.
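    
    An illustrative sketch only (names are made up; this is not the Expand code): a Stream built with `map` keeps a closure for its unevaluated tail, so serializing it can drag in whatever that closure references. Forcing a strict collection, as the patch does for `Expand.projections`, avoids the capture.
    
    ```scala
    case class GroupingSet(columns: Seq[String])
    
    def projectionsLazy(sets: Seq[GroupingSet]): Stream[Seq[String]] =
      sets.toStream.map(_.columns)        // tail (and the closure producing it) evaluated on demand
    
    def projectionsStrict(sets: Seq[GroupingSet]): IndexedSeq[Seq[String]] =
      sets.map(_.columns).toIndexedSeq    // fully materialized up front, nothing left to capture
    ```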
    
    ## How was this patch tested?
    
    manually verified with Spark with Scala 2.10.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19289 from cloud-fan/bug.
    
    (cherry picked from commit ce6a71e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Sep 20, 2017
    5d10586
  2. [SPARK-21384][YARN] Spark + YARN fails with LocalFileSystem as defaul…

    …t FS
    
    ## What changes were proposed in this pull request?
    
    When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging dir (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization.
    
    With this change, the client always copies the files to the remote staging directory when the source scheme is "file".
    
    ## How was this patch tested?
    
    I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes #19141 from devaraj-kavali/SPARK-21384.
    
    (cherry picked from commit 55d5fa7)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Devaraj K authored and Marcelo Vanzin committed Sep 20, 2017
    401ac20

Commits on Sep 21, 2017

  1. [SPARK-21928][CORE] Set classloader on SerializerManager's private kryo

    ## What changes were proposed in this pull request?
    
    We have to make sure that SerializerManager's private instance of
    kryo also uses the right classloader, regardless of the current thread
    classloader.  In particular, this fixes serde during remote cache
    fetches, as those occur in netty threads.
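    
    A hedged sketch of the general pattern (not the SerializerManager code itself): Kryo resolves classes through whatever classloader it was given, so the instance has to be pointed at the application classloader explicitly rather than relying on the calling thread's context loader.
    
    ```scala
    import com.esotericsoftware.kryo.Kryo
    
    // Build a Kryo instance that deserializes with the application classloader,
    // even when invoked from a netty I/O thread with a different context loader.
    def newKryo(appClassLoader: ClassLoader): Kryo = {
      val kryo = new Kryo()
      kryo.setClassLoader(appClassLoader)
      kryo
    }
    ```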
    
    ## How was this patch tested?
    
    Manual tests & existing suite via jenkins.  I haven't been able to reproduce this in a unit test, because when a remote RDD partition can be fetched, there is a warning message and then the partition is just recomputed locally.  I manually verified the warning message is no longer present.
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes #19280 from squito/SPARK-21928_ser_classloader.
    
    (cherry picked from commit b75bd17)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    squito authored and Marcelo Vanzin committed Sep 21, 2017
    765fd92

Commits on Sep 22, 2017

  1. [SPARK-22094][SS] processAllAvailable should check the query state

    `processAllAvailable` should also check the query state and if the query is stopped, it should return.
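    
    A simplified sketch of the guard described above (the callbacks are stand-ins for the real query internals): the blocking wait should bail out as soon as the query is no longer active instead of waiting for progress that will never come.
    
    ```scala
    // Poll until everything available is processed, but give up once the query stops.
    def processAllAvailableSketch(isActive: () => Boolean, allProcessed: () => Boolean): Unit = {
      while (isActive() && !allProcessed()) {
        Thread.sleep(10)   // the real implementation waits on a lock/condition instead of polling
      }
    }
    ```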
    
    The new unit test.
    
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes #19314 from zsxwing/SPARK-22094.
    
    (cherry picked from commit fedf696)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    zsxwing committed Sep 22, 2017
    090b987
  2. [SPARK-22072][SPARK-22071][BUILD] Improve release build scripts

    ## What changes were proposed in this pull request?
    
    Check JDK version (with javac) and use SPARK_VERSION for publish-release
    
    ## How was this patch tested?
    
    Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled)
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #19312 from holdenk/improve-release-scripts-r2.
    
    (cherry picked from commit 8f130ad)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed Sep 22, 2017
    de6274a

Commits on Sep 23, 2017

  1. [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

    ## What changes were proposed in this pull request?
    
    Fix for setup of `SPARK_JARS_DIR` on Windows as it looks for `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. RELEASE file is not included in the `pip` build of PySpark.
    
    ## How was this patch tested?
    
    Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).
    
    Author: Jakub Nowacki <j.s.nowacki@gmail.com>
    
    Closes #19310 from jsnowacki/master.
    
    (cherry picked from commit c11f24a)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    jsnowacki authored and HyukjinKwon committed Sep 23, 2017
    c0a34a9
  2. [SPARK-22092] Reallocation in OffHeapColumnVector.reserveInternal cor…

    …rupts struct and array data
    
    `OffHeapColumnVector.reserveInternal()` will only copy already inserted values during reallocation if `data != null`. In vectors containing arrays or structs this is incorrect, since the field `data` is not used there at all. We need to check `nulls` instead.
    
    Adds new tests to `ColumnVectorSuite` that reproduce the errors.
    
    Author: Ala Luszczak <ala@databricks.com>
    
    Closes #19323 from ala/port-vector-realloc.
    ala authored and hvanhovell committed Sep 23, 2017
    1a829df
  3. [SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between string…

    …s and timestamps in partition column
    
    ## What changes were proposed in this pull request?
    
    This PR backports 04975a6 into branch-2.2.
    
    ## How was this patch tested?
    
    Unit tests in `ParquetPartitionDiscoverySuite`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19333 from HyukjinKwon/SPARK-22109-backport-2.2.
    HyukjinKwon authored and ueshin committed Sep 23, 2017
    211d81b

Commits on Sep 25, 2017

  1. [SPARK-22107] Change as to alias in python quickstart

    ## What changes were proposed in this pull request?
    
    Updated docs so that a line of python in the quick start guide executes. Closes #19283
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: John O'Leary <jgoleary@gmail.com>
    
    Closes #19326 from jgoleary/issues/22107.
    
    (cherry picked from commit 20adf9a)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    John O'Leary authored and HyukjinKwon committed Sep 25, 2017
    8acce00
  2. [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace

    ## What changes were proposed in this pull request?
    
    MemoryStore.evictBlocksToFreeSpace acquires write locks for all the
    blocks it intends to evict up front.  If there is a failure to evict
    blocks (eg., some failure dropping a block to disk), then we have to
    release the lock.  Otherwise the lock is never released and an executor
    trying to get the lock will wait forever.
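    
    A generic sketch of the locking discipline the fix enforces (the lock/unlock/drop callbacks are hypothetical, not the MemoryStore API): whatever was locked up front but not successfully dropped must be unlocked on the failure path.
    
    ```scala
    def evictBlocks(candidates: Seq[String],
                    lock: String => Unit,
                    unlock: String => Unit,
                    drop: String => Unit): Unit = {
      candidates.foreach(lock)                 // acquire all write locks up front
      var dropped = Set.empty[String]
      try {
        candidates.foreach { id => drop(id); dropped += id }
      } finally {
        // On failure, release the locks of the blocks that were never dropped,
        // otherwise other threads block on them forever.
        (candidates.toSet -- dropped).foreach(unlock)
      }
    }
    ```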
    
    ## How was this patch tested?
    
    Added unit test.
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes #19311 from squito/SPARK-22083.
    
    (cherry picked from commit 2c5b9b1)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    squito authored and Marcelo Vanzin committed Sep 25, 2017
    9836ea1
  3. [SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive…

    … warehouse directory
    
    ## What changes were proposed in this pull request?
    During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory.
    
    ## How was this patch tested?
    Ran full suite of tests locally, verified that they pass.
    
    Author: Greg Owen <greg@databricks.com>
    
    Closes #19341 from GregOwen/SPARK-22120.
    
    (cherry picked from commit ce20478)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    Greg Owen authored and gatorsmile committed Sep 25, 2017
    b0f30b5

Commits on Sep 27, 2017

  1. [SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking…

    … Cartesian products
    
    Back port #19362 to branch-2.2
    
    ## What changes were proposed in this pull request?
    
    When inferring constraints from children, Join's condition can be simplified as None.
    For example,
    ```
    val testRelation = LocalRelation('a.int)
    val x = testRelation.as("x")
    val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
    x.join.where($"x.a" === $"y.a")
    ```
    The plan will become
    ```
    Join Inner
    :- LocalRelation <empty>, [a#23]
    +- LocalRelation <empty>, [a#224]
    ```
    And the Cartesian products check will throw exception for above plan.
    
    Propagate empty relation before checking Cartesian products, and the issue is resolved.
    
    ## How was this patch tested?
    
    Unit test
    
    Author: Wang Gengliang <ltnwgl@gmail.com>
    
    Closes #19366 from gengliangwang/branch-2.2.
    gengliangwang authored and hvanhovell committed Sep 27, 2017
    a406473

Commits on Sep 28, 2017

  1. [SPARK-22140] Add TPCDSQuerySuite

    ## What changes were proposed in this pull request?
    Now, we are not running TPC-DS queries as regular test cases. Thus, we need to add a test suite using empty tables for ensuring the new code changes will not break them. For example, optimizer/analyzer batches should not exceed the max iteration.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19361 from gatorsmile/tpcdsQuerySuite.
    
    (cherry picked from commit 9244957)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Sep 28, 2017
    42e1727
  2. [SPARK-22135][MESOS] metrics in spark-dispatcher not being registered…

    … properly
    
    ## What changes were proposed in this pull request?
    
    Fix a trivial bug in how metrics are registered in the mesos dispatcher. The bug resulted in creating a new registry each time the metricRegistry() method was called.
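    
    A hedged sketch of the bug pattern and the fix (class names are illustrative, not the dispatcher source):
    
    ```scala
    import com.codahale.metrics.MetricRegistry
    
    class MetricsSourceBuggy {
      // A fresh registry on every call: anything registered earlier is silently lost.
      def metricRegistry(): MetricRegistry = new MetricRegistry()
    }
    
    class MetricsSourceFixed {
      private val registry = new MetricRegistry()   // created once and reused
      def metricRegistry(): MetricRegistry = registry
    }
    ```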
    
    ## How was this patch tested?
    
    Verified manually on local mesos setup
    
    Author: Paul Mackles <pmackles@adobe.com>
    
    Closes #19358 from pmackles/SPARK-22135.
    
    (cherry picked from commit f20be4d)
    Signed-off-by: jerryshao <sshao@hortonworks.com>
    Paul Mackles authored and jerryshao committed Sep 28, 2017
    12a74e3
  3. [SPARK-22143][SQL][BRANCH-2.2] Fix memory leak in OffHeapColumnVector

    This is a backport of 02bb068.
    
    ## What changes were proposed in this pull request?
    `WriteableColumnVector` does not close its child column vectors. This can create memory leaks for `OffHeapColumnVector` where we do not clean up the memory allocated by a vectors children. This can be especially bad for string columns (which uses a child byte column vector).
    
    ## How was this patch tested?
    I have updated the existing tests to always use both on-heap and off-heap vectors. Testing and diagnosis was done locally.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #19378 from hvanhovell/SPARK-22143-2.2.
    hvanhovell committed Sep 28, 2017
    8c5ab4e

Commits on Sep 29, 2017

  1. [SPARK-22129][SPARK-22138] Release script improvements

    ## What changes were proposed in this pull request?
    
    Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts as well.
    
    ## How was this patch tested?
    
    Rolled 2.1.2 RC2
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #19359 from holdenk/SPARK-22129-fix-signing.
    
    (cherry picked from commit ecbe416)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed Sep 29, 2017
    8b2d838
  2. [SPARK-22161][SQL] Add Impala-modified TPC-DS queries

    ## What changes were proposed in this pull request?
    
    Added IMPALA-modified TPCDS queries to TPC-DS query suites.
    
    - Ref: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19386 from gatorsmile/addImpalaQueries.
    
    (cherry picked from commit 9ed7394)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Sep 29, 2017
    ac9a0f6
  3. [SPARK-22146] FileNotFoundException while reading ORC files containin…

    …g special characters
    
    ## What changes were proposed in this pull request?
    
    Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
    This PR aims to fix the problem.
    
    ## How was this patch tested?
    
    Added UT.
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    Author: Marco Gaido <mgaido@hortonworks.com>
    
    Closes #19368 from mgaido91/SPARK-22146.
    mgaido91 authored and gatorsmile committed Sep 29, 2017
    7bf25e0

Commits on Oct 2, 2017

  1. [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc

    ## What changes were proposed in this pull request?
    
    When zinc is running, the pwd might be in the root of the project. A quick solution to this is to not go a level up in case we are in the root rather than in root/core/. If we are in the root everything works fine; if we are in core, add a script which goes a level up and runs from there.
    
    ## How was this patch tested?
    
    set -x in the SparkR install scripts.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.
    
    (cherry picked from commit 8fab799)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed Oct 2, 2017
    b9adddb

Commits on Oct 3, 2017

  1. [SPARK-22158][SQL][BRANCH-2.2] convertMetastore should not ignore tab…

    …le property
    
    ## What changes were proposed in this pull request?
    
    From the beginning, **convertMetastoreOrc** has ignored table properties and used an empty map instead. This PR fixes that. **convertMetastoreParquet** also ignores them.
    
    ```scala
    val options = Map[String, String]()
    ```
    
    - [SPARK-14070: HiveMetastoreCatalog.scala](https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650)
    - [Master branch: HiveStrategies.scala](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197
    )
    
    ## How was this patch tested?
    
    Pass the Jenkins with an updated test suite.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #19417 from dongjoon-hyun/SPARK-22158-BRANCH-2.2.
    dongjoon-hyun authored and gatorsmile committed Oct 3, 2017
    3c30be5
  2. [SPARK-22178][SQL] Refresh Persistent Views by REFRESH TABLE Command

    ## What changes were proposed in this pull request?
    The underlying tables of persistent views are not refreshed when users issue the REFRESH TABLE command against the persistent views.
    
    ## How was this patch tested?
    Added a test case
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19405 from gatorsmile/refreshView.
    
    (cherry picked from commit e65b6b7)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Oct 3, 2017
    5e3f254
  3. [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

    ## What changes were proposed in this pull request?
    
    Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call.
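    
    A generic sketch of the race and the safer lookup (the cache here is hypothetical, not HadoopRDD's actual code): with soft references the GC may clear an entry between a `containsKey` check and the `get`, so the value has to be fetched once and null-checked.
    
    ```scala
    import java.lang.ref.SoftReference
    import java.util.concurrent.ConcurrentHashMap
    
    val cache = new ConcurrentHashMap[String, SoftReference[AnyRef]]()
    
    def getOrCreate(key: String, create: () => AnyRef): AnyRef = {
      val ref = cache.get(key)                              // single lookup, no containsKey
      val cached = if (ref != null) ref.get() else null     // may be null if GC cleared it
      if (cached != null) {
        cached
      } else {
        val value = create()
        cache.put(key, new SoftReference[AnyRef](value))
        value
      }
    }
    ```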
    
    ## How was this patch tested?
    
    Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.
    
    Author: Sahil Takiar <stakiar@cloudera.com>
    
    Closes #19413 from sahilTakiar/master.
    
    (cherry picked from commit e36ec38)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Sahil Takiar authored and Marcelo Vanzin committed Oct 3, 2017
    81232ce

Commits on Oct 5, 2017

  1. [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping c…

    …olumns
    
    ## What changes were proposed in this pull request?
    
    Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This becomes a problem when running `EnsureRequirements`, and as a result `gapply` in R can't work on empty grouping columns.
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19436 from viirya/fix-flatmapinr-distribution.
    
    (cherry picked from commit ae61f18)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    viirya authored and HyukjinKwon committed Oct 5, 2017
    8a4e7dd

Commits on Oct 7, 2017

  1. [SPARK-21549][CORE] Respect OutputFormats with no output directory pr…

    …ovided
    
    ## What changes were proposed in this pull request?
    
    Fix for https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.
    
    Since version 2.2 Spark does not respect OutputFormat with no output paths provided.
    The examples of such formats are [Cassandra OutputFormat](https://github.com/finn-no/cassandra-hadoop/blob/08dfa3a7ac727bb87269f27a1c82ece54e3f67e6/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc. which do not have an ability to rollback the results written to an external systems on job failure.
    
    A provided output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems.
    
    This pull request prevents accessing `absPathStagingDir` method that causes the error described in SPARK-21549 unless there are files to rename in `addedAbsPathFiles`.
    
    ## How was this patch tested?
    
    Unit tests
    
    Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
    
    Closes #19294 from szhem/SPARK-21549-abs-output-commits.
    
    (cherry picked from commit 2030f19)
    Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
    szhem authored and mridulm committed Oct 7, 2017
    0d3f166

Commits on Oct 9, 2017

  1. [SPARK-22218] spark shuffle services fails to update secret on app re…

    …-attempts
    
    This patch fixes application re-attempts when running Spark on YARN using the external shuffle service with security on.  Currently executors will fail to launch on any application re-attempt when launched on a NodeManager that had an executor from the first attempt.  The reason for this is that we aren't updating the secret key after the first application attempt.  The fix here is to simply remove the containsKey check to see if it already exists; this way, we always add it and make sure it's the most recent secret.  Similarly, remove the containsKey check on the remove path, since it's just an extra check that isn't really needed.
    
    Note this worked before Spark 2.2 because the check used to be contains (which was looking for the value) rather than containsKey, so it never matched and the new secret was always added.
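    
    A hedged sketch of the change (class and field names are illustrative, not the actual shuffle service code): always overwrite the stored secret so a re-attempt's key replaces the one from the first attempt.
    
    ```scala
    import java.util.concurrent.ConcurrentHashMap
    
    class AppSecretStore {
      private val secrets = new ConcurrentHashMap[String, String]()
    
      def registerApp(appId: String, secret: String): Unit = {
        // Previously guarded by `if (!secrets.containsKey(appId))`, which kept the stale
        // secret from the first attempt alive; an unconditional put fixes re-attempts.
        secrets.put(appId, secret)
      }
    
      def unregisterApp(appId: String): Unit = {
        secrets.remove(appId)   // remove is already a no-op when absent; no containsKey needed
      }
    }
    ```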
    
    The patch was tested on a 10-node cluster, and a unit test was added.
    The test run was a wordcount where the output directory already existed.  With the bug present, the application attempt failed with the max number of executor failures, which were all SaslExceptions.  With the fix present, the application re-attempts fail with "directory already exists", or, when you remove the directory between attempts, the re-attempts succeed.
    
    Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com>
    
    Closes #19450 from tgravescs/SPARK-22218.
    
    (cherry picked from commit a74ec6d)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Thomas Graves authored and Marcelo Vanzin committed Oct 9, 2017
    c5889b5

Commits on Oct 12, 2017

  1. [SPARK-21907][CORE][BACKPORT 2.2] oom during spill

    back-port #19181 to branch-2.2.
    
    ## What changes were proposed in this pull request?
    1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907)
    2. a fix for the root cause of the issue.
    
    `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset`, which may trigger another spill.
    When this happens the `array` member is already de-allocated but still referenced by the code, which causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`.
    This patch introduces a reproduction in a test case and a fix: the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array.
    
    ## How was this patch tested?
    introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`.
    
    Author: Eyal Farago <eyal@nrgene.com>
    
    Closes #19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2.
    Eyal Farago authored and hvanhovell committed Oct 12, 2017
    cd51e2c

Commits on Oct 13, 2017

  1. [SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommi…

    …tters
    
    ## What changes were proposed in this pull request?
    
    `ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`
    
    This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this.
    
    Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it checks to see if the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If true, and the committer class isn't a parquet committer, it raises a RuntimeException with an error message.
    
    (It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.)
    
    Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch makes the code consistent with the docs, adding tests to verify.
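    
    A hedged usage sketch (the committer class name is a placeholder; the keys used are Spark's documented `spark.sql.parquet.output.committer.class` option and the `parquet.enable.summary-metadata` flag mentioned above): route a Parquet write through a custom committer and keep summary metadata disabled, since the summary + non-Parquet-committer combination is rejected.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("committer-demo").master("local[*]").getOrCreate()
    
    // Use a custom OutputCommitter implementation (placeholder class name).
    spark.conf.set("spark.sql.parquet.output.committer.class",
      "com.example.MarkingFileOutputCommitter")
    // Summary metadata must stay off for a committer that is not a ParquetOutputCommitter.
    spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    
    spark.range(10).write.mode("overwrite").parquet("/tmp/committer-demo")
    ```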
    
    ## How was this patch tested?
    
    The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.
    
    | committer | summary | outcome |
    |-----------|---------|---------|
    | parquet   | true    | success |
    | parquet   | false   | success |
    | marking   | false   | success with marker |
    | marking   | true    | exception |
    
    All tests are happy.
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes #19448 from steveloughran/cloud/SPARK-22217-committer.
    steveloughran authored and HyukjinKwon committed Oct 13, 2017
    cfc04e0
  2. [SPARK-22252][SQL][2.2] FileFormatWriter should respect the input que…

    …ry schema
    
    ## What changes were proposed in this pull request?
    
    #18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts #18386 and picks the patch from #19483 to fix SPARK-21165.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19484 from cloud-fan/bug.
    cloud-fan authored and gatorsmile committed Oct 13, 2017
    c9187db
  3. [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read…

    … ORC table instead of ORC file schema
    
    Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema.
    
    Pass the newly added test case.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #19470 from dongjoon-hyun/SPARK-18355.
    
    (cherry picked from commit e6e3600)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dongjoon-hyun authored and cloud-fan committed Oct 13, 2017
    30d5c9f

Commits on Oct 14, 2017

  1. [SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerat…

    …ors.
    
    ## What changes were proposed in this pull request?
    
    When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (#16361), the double-quotes around the names remained and the names became something like `"((java.lang.String) references[1])"`.
    
    ```java
    /* 055 */     private int maxSteps = 2;
    /* 056 */     private int numRows = 0;
    /* 057 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType);
    /* 058 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType);
    /* 059 */     private Object emptyVBase;
    ```
    
    We should remove the double-quotes to refer the values in `references` properly:
    
    ```java
    /* 055 */     private int maxSteps = 2;
    /* 056 */     private int numRows = 0;
    /* 057 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType);
    /* 058 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType);
    /* 059 */     private Object emptyVBase;
    ```
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #19491 from ueshin/issues/SPARK-22273.
    
    (cherry picked from commit e0503a7)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    ueshin authored and gatorsmile committed Oct 14, 2017
    acbad83

Commits on Oct 16, 2017

  1. [SPARK-21549][CORE] Respect OutputFormats with no/invalid output dire…

    …ctory provided
    
    ## What changes were proposed in this pull request?
    
    PR #19294 added support for nulls - but Spark 2.1 handled other error cases where the path argument can be invalid.
    Namely:
    
    * empty string
    * URI parse exception while creating Path
    
    This is resubmission of PR #19487, which I messed up while updating my repo.
    
    ## How was this patch tested?
    
    Enhanced test to cover new support added.
    
    Author: Mridul Muralidharan <mridul@gmail.com>
    
    Closes #19497 from mridulm/master.
    
    (cherry picked from commit 13c1559)
    Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
    mridulm committed Oct 16, 2017
    6b6761e
  2. [SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessa…

    …ry shuffle
    
    `ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle.
    
    Added Jenkins test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19501 from viirya/SPARK-22223.
    
    (cherry picked from commit 0ae9649)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Oct 16, 2017
    0f060a2

Commits on Oct 17, 2017

  1. [SPARK-22249][SQL] isin with empty list throws exception on cached Da…

    …taFrame
    
    ## What changes were proposed in this pull request?
    
    As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it.
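    
    A hedged reproduction sketch of the scenario described in the JIRA (values and app name are arbitrary): an empty IN list on a cached DataFrame should simply match nothing.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("isin-empty").master("local[*]").getOrCreate()
    import spark.implicits._
    
    val df = Seq(1, 2, 3).toDF("id").cache()
    df.count()                          // materialize the cached plan
    df.filter($"id".isin()).show()      // empty IN list: threw before the fix, now returns no rows
    ```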
    
    ## How was this patch tested?
    
    Added UT.
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #19494 from mgaido91/SPARK-22249.
    
    (cherry picked from commit 8148f19)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    mgaido91 authored and srowen committed Oct 17, 2017
    71d1cb6
  2. [SPARK-22271][SQL] mean overflows and returns null for some decimal v…

    …ariables
    
    ## What changes were proposed in this pull request?
    
    In Average.scala, it has
    ```
      override lazy val evaluateExpression = child.dataType match {
        case DecimalType.Fixed(p, s) =>
          // increase the precision and scale to prevent precision loss
          val dt = DecimalType.bounded(p + 14, s + 4)
          Cast(Cast(sum, dt) / Cast(count, dt), resultType)
        case _ =>
          Cast(sum, resultType) / Cast(count, resultType)
      }
    
      def setChild (newchild: Expression) = {
        child = newchild
      }
    
    ```
    It is possible that `Cast(count, dt)` will make the precision of the decimal number bigger than 38, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38,0).
    ## How was this patch tested?
    In DataFrameSuite, I will add a test case.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Huaxin Gao <huaxing@us.ibm.com>
    
    Closes #19496 from huaxingao/spark-22271.
    
    (cherry picked from commit 28f9f3f)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    
    # Conflicts:
    #	sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
    huaxingao authored and gatorsmile committed Oct 17, 2017
    c423091

Commits on Oct 18, 2017

  1. [SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty i…

    …n the optimizer
    
    ## What changes were proposed in this pull request?
    
    This PR addresses the comments by gatorsmile on [the previous PR](#19494).
    
    ## How was this patch tested?
    
    Previous UT and added UT.
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP.
    
    (cherry picked from commit 1f25d86)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    mgaido91 authored and gatorsmile committed Oct 18, 2017
    010b50c

Commits on Oct 19, 2017

  1. [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator

    Backport of #18752 (https://issues.apache.org/jira/browse/SPARK-21551)
    
    (cherry picked from commit 9d3c664)
    
    Author: peay <peay@protonmail.com>
    
    Closes #19512 from FRosner/branch-2.2.
    peay authored and HyukjinKwon committed Oct 19, 2017
    f8c83fd

Commits on Oct 23, 2017

  1. [SPARK-22319][CORE][BACKPORT-2.2] call loginUserFromKeytab before acc…

    …essing hdfs
    
    In SparkSubmit, call loginUserFromKeytab before attempting to make RPC calls to the NameNode.
    
    Same as #19540, but for branch-2.2.
    
    Manually tested for master as described in #19540.
    
    Author: Steven Rand <srand@palantir.com>
    
    Closes #19554 from sjrand/SPARK-22319-branch-2.2.
    
    Change-Id: Ic550a818fd6a3f38b356ac48029942d463738458
    sjrand authored and jerryshao committed Oct 23, 2017
    bf8163f

Commits on Oct 24, 2017

  1. [SPARK-21936][SQL][FOLLOW-UP] backward compatibility test framework f…

    …or HiveExternalCatalog
    
    ## What changes were proposed in this pull request?
    
    Adjust Spark download in test to use Apache mirrors and respect its load balancer, and use Spark 2.1.2. This follows on a recent PMC list thread about removing the cloudfront download rather than update it further.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19564 from srowen/SPARK-21936.2.
    
    (cherry picked from commit 8beeaed)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    srowen authored and cloud-fan committed Oct 24, 2017
    daa838b

Commits on Oct 25, 2017

  1. [SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptCo…

    …nnections
    
    ## What changes were proposed in this pull request?
    This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action, ensuring that the latter has been scheduled before the former gets a chance to cancel it.
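    
    A hedged sketch of the reordering (names are illustrative, not the LauncherServer internals): schedule the timeout before starting the handler thread, so the thread can always find a scheduled task to cancel.
    
    ```scala
    import java.util.{Timer, TimerTask}
    
    def acceptConnection(timer: Timer, timeoutMs: Long, clientThread: Thread): TimerTask = {
      val timeout = new TimerTask { override def run(): Unit = clientThread.interrupt() }
      timer.schedule(timeout, timeoutMs)   // 1) schedule the timeout first
      clientThread.start()                 // 2) only then start the client handler thread
      timeout                              // the handler cancels this once the handshake completes
    }
    ```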
    
    ## How was this patch tested?
    Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue.
    
    Author: Andrea zito <andrea.zito@u-hopper.com>
    
    Closes #19217 from nivox/SPARK-21991.
    
    (cherry picked from commit 6ea8a56)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Andrea zito authored and Marcelo Vanzin committed Oct 25, 2017
    4c1a868
  2. [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp …

    …files
    
    Prior to this commit getAllBlocks implicitly assumed that the directories
    managed by the DiskBlockManager contain only the files corresponding to
    valid block IDs. In reality, this assumption was violated during shuffle,
    which produces temporary files in the same directory as the resulting
    blocks. As a result, calls to getAllBlocks during shuffle were unreliable.
    
    The fix could be made more efficient, but this is probably good enough.
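    
    A hedged sketch of a tolerant listing (simplified, and assuming `BlockId.apply` rejects unrecognized file names): skip files whose names do not parse as block IDs, such as shuffle temp files, instead of failing.
    
    ```scala
    import org.apache.spark.storage.BlockId
    import scala.util.Try
    
    def allBlocks(fileNames: Seq[String]): Seq[BlockId] =
      fileNames.flatMap(name => Try(BlockId(name)).toOption)   // drop temp/unknown files
    ```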
    
    `DiskBlockManagerSuite`
    
    Author: Sergei Lebedev <s.lebedev@criteo.com>
    
    Closes #19458 from superbobry/block-id-option.
    
    (cherry picked from commit b377ef1)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Sergei Lebedev authored and cloud-fan committed Oct 25, 2017
    9ed6404
  3. [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cau…

    …se by test dataset not deterministic)
    
    ## What changes were proposed in this pull request?
    
    Fix the NaiveBayes unit test occasionally failing:
    Set the seed for `BrzMultinomial.sample` so that `generateNaiveBayesInput` outputs a deterministic dataset.
    (If we do not set the seed, the generated dataset will be random, and the model may exceed the tolerance in the test, which triggers this failure.)
    
    ## How was this patch tested?
    
    Manually ran the tests multiple times and checked that each time the output models contain the same values.
    
    Author: WeichenXu <weichen.xu@databricks.com>
    
    Closes #19558 from WeichenXu123/fix_nb_test_seed.
    
    (cherry picked from commit 841f1d7)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    WeichenXu123 authored and jkbradley committed Oct 25, 2017
    35725f7
  4. [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint

    ## What changes were proposed in this pull request?
    
    Fix java lint
    
    ## How was this patch tested?
    
    Run `./dev/lint-java`
    
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes #19574 from ash211/aash/fix-java-lint.
    
    (cherry picked from commit 5433be4)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    ash211 authored and Marcelo Vanzin committed Oct 25, 2017
    d2dc175

Commits on Oct 26, 2017

  1. [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

    ## What changes were proposed in this pull request?
    
    This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138c.
    
    Simply, it casts `character` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)`, in primitive type conversion.
    
    ## How was this patch tested?
    
    Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19551 from HyukjinKwon/SPARK-17902.
    
    (cherry picked from commit a83d8d5)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    HyukjinKwon committed Oct 26, 2017
    24fe7cc
  2. [SPARK-22328][CORE] ClosureCleaner should not miss referenced supercl…

    …ass fields
    
    When the given closure uses some fields defined in a superclass, `ClosureCleaner` can't figure them out and doesn't set them properly, so those fields end up null.
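    
    A hedged illustration of the scenario (class names are made up): the closure passed to `map` reads a field declared on the superclass, which is the case the cleaner previously mishandled.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    abstract class Base extends Serializable {
      val prefix: String = "row-"          // field lives on the superclass
    }
    
    class Job extends Base {
      def run(): Array[String] = {
        val spark = SparkSession.builder().appName("superclass-field").master("local[*]").getOrCreate()
        // `prefix` is defined on Base, not Job; before the fix it could end up null here.
        spark.sparkContext.parallelize(1 to 3).map(i => prefix + i).collect()
      }
    }
    ```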
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19556 from viirya/SPARK-22328.
    
    (cherry picked from commit 4f8dc6b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Oct 26, 2017
    Commit: a607ddc

Commits on Oct 27, 2017

  1. [SPARK-22355][SQL] Dataset.collect is not threadsafe

    It's possible that users create a `Dataset` and call `collect` on this `Dataset` from many threads at the same time. Currently `Dataset#collect` just calls `encoder.fromRow` to convert Spark rows to objects of type T, and this encoder is per-Dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row.

    This PR fixes the problem by creating a new projection when calling `Dataset#collect`, so that there is a re-usable row for each method call, instead of one per Dataset.
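    
    A hedged repro sketch (not from the PR), assuming a running `SparkSession` named `spark`; the case class and sizes are made up:
    
    ```scala
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global
    import spark.implicits._
    
    case class Rec(id: Long, name: String)
    val ds = spark.range(0, 1000).map(i => Rec(i, s"name-$i"))
    
    // Several threads collect the same Dataset concurrently; with the fix each call
    // deserializes with its own projection instead of sharing one re-usable row.
    val results = Await.result(Future.sequence((1 to 8).map(_ => Future(ds.collect()))), 2.minutes)
    ```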
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19577 from cloud-fan/encoder.
    
    (cherry picked from commit 5c3a1f3)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Oct 27, 2017
    Commit: 2839280
  2. [SPARK-22356][SQL] data source table should support overlapped column…

    …s between data and partition schema
    
    This is a regression introduced by #14207. After Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at read path. However, there is one special case: overlapped columns between data and partition. This case breaks the assumption about the table schema that there is no overlap between the data and partition schema and that partition columns are at the end. The result is that, for Spark 2.1, the table scan has an incorrect schema that puts partition columns at the end. For Spark 2.2, we added a check in CatalogTable to validate the table schema, which fails in this case.

    To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapped columns are detected, i.e. store an empty schema in the metastore.
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19579 from cloud-fan/bug2.
    cloud-fan authored and gatorsmile committed Oct 27, 2017
    Commit: cb54f29

Commits on Oct 29, 2017

  1. [SPARK-19727][SQL][FOLLOWUP] Fix for round function that modifies ori…

    …ginal column
    
    ## What changes were proposed in this pull request?
    
    This is a followup of #17075 , to fix the bug in codegen path.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19576 from cloud-fan/bug.
    
    (cherry picked from commit 7fdacbc)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Oct 29, 2017
    Commit: cac6506

Commits on Oct 30, 2017

  1. [SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests

    This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp`
    
    ## How was this patch tested?
    Tested manually on a clean EC2 machine
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #19589 from shivaram/sparkr-tmpdir-clean.
    
    (cherry picked from commit 1fe2761)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    shivaram committed Oct 30, 2017
    Commit: f973587
  2. [SPARK-22291][SQL] Conversion error when transforming array types of …

    …uuid, inet and cidr to StringType in PostgreSQL
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the conversion error when transforming array types of `uuid`, `inet` and `cidr` to `StringType` in PostgreSQL.
    
    ## How was this patch tested?
    
    Added test in `PostgresIntegrationSuite`.
    
    Author: Jen-Ming Chung <jenmingisme@gmail.com>
    
    Closes #19604 from jmchung/SPARK-22291-FOLLOWUP.
    jmchung authored and cloud-fan committed Oct 30, 2017
    Commit: 7f8236c

Commits on Oct 31, 2017

  1. [SPARK-19611][SQL][FOLLOWUP] set dataSchema correctly in HiveMetastor…

    …eCatalog.convertToLogicalRelation
    
    We made a mistake in #16944 . In `HiveMetastoreCatalog#inferIfNeeded` we infer the data schema, merge with full schema, and return the new full schema. At caller side we treat the full schema as data schema and set it to `HadoopFsRelation`.
    
    This doesn't cause any problem because both parquet and orc can work with a wrong data schema that has extra columns, but it's better to fix this mistake.
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19615 from cloud-fan/infer.
    
    (cherry picked from commit 4d9ebf3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Oct 31, 2017
    Commit: dd69ac6
  2. [SPARK-22333][SQL][BACKPORT-2.2] timeFunctionCall(CURRENT_DATE, CURRE…

    …NT_TIMESTAMP) has conflicts with columnReference
    
    ## What changes were proposed in this pull request?
    
    This is a backport pr of #19559
    for branch-2.2
    
    ## How was this patch tested?
    unit tests
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #19606 from DonnyZone/branch-2.2.
    DonnyZone authored and gatorsmile committed Oct 31, 2017
    Commit: ab87a92

Commits on Nov 2, 2017

  1. [MINOR][DOC] automatic type inference supports also Date and Timestamp

    ## What changes were proposed in this pull request?
    
    Easy fix in the documentation, which reports that only numeric and string types are supported in type inference for partition columns, while Date and Timestamp have also been supported since 2.1.0, thanks to SPARK-17388.
    
    ## How was this patch tested?
    
    n/a
    
    Author: Marco Gaido <mgaido@hortonworks.com>
    
    Closes #19628 from mgaido91/SPARK-22398.
    
    (cherry picked from commit b04eefa)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    mgaido91 authored and HyukjinKwon committed Nov 2, 2017
    Commit: c311c5e
  2. [SPARK-22306][SQL][2.2] alter table schema should not erase the bucke…

    …ting metadata at hive side
    
    ## What changes were proposed in this pull request?
    
    When we alter a table schema, we set the new schema on the Spark `CatalogTable`, convert it to a hive table, and finally call `hive.alterTable`. This causes a problem in Spark 2.2, because hive bucketing metadata is not recognized by Spark, which means a Spark `CatalogTable` representing a hive table is always non-bucketed, and when we convert it to a hive table and call `hive.alterTable`, the original hive bucketing metadata will be removed.
    
    To fix this bug, we should read out the raw hive table metadata, update its schema, and call `hive.alterTable`. By doing this we can guarantee only the schema is changed, and nothing else.
    
    Note that this bug doesn't exist in the master branch, because we've added hive bucketing support and the hive bucketing metadata can be recognized by Spark. I think we should merge this PR to master too, for code cleanup and to reduce the difference between the master and 2.2 branches for backporting.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19622 from cloud-fan/infer.
    cloud-fan committed Nov 2, 2017
    Commit: 4074ed2

Commits on Nov 5, 2017

  1. [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

    It's not safe in all cases to push down a LIMIT below a FULL OUTER
    JOIN. If the limit is pushed to one side of the FOJ, the physical
    join operator can not tell if a row in the non-limited side would have a
    match in the other side.
    
    *If* the join operator guarantees that unmatched tuples from the limited
    side are emitted before any unmatched tuples from the other side,
    pushing down the limit is safe. But this is impractical for some join
    implementations, e.g. SortMergeJoin.
    
    For now, disable limit pushdown through a FULL OUTER JOIN, and we can
    evaluate whether a more complicated solution is necessary in the future.
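    
    A hedged Scala sketch of the affected query shape (made-up data, assuming a running `SparkSession` named `spark`); after this change the optimized plan keeps the limit above the join:
    
    ```scala
    val left  = spark.range(0, 100).toDF("id")
    val right = spark.range(50, 150).toDF("id")
    
    // A LIMIT above a FULL OUTER JOIN: unmatched rows from either side may survive
    // the limit, so the limit must not be pushed below the join.
    val q = left.join(right, Seq("id"), "full_outer").limit(10)
    q.explain(true)   // check where GlobalLimit/LocalLimit sit relative to the join
    ```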
    
    Ran org.apache.spark.sql.* tests. Altered full outer join tests in
    LimitPushdownSuite.
    
    Author: Henry Robinson <henry@cloudera.com>
    
    Closes #19647 from henryr/spark-22211.
    
    (cherry picked from commit 6c66266)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    Henry Robinson authored and gatorsmile committed Nov 5, 2017
    Commit: 5e38373
  2. [SPARK-22429][STREAMING] Streaming checkpointing code does not retry …

    …after failure
    
    ## What changes were proposed in this pull request?
    
    SPARK-14930/SPARK-13693 put in a change to set the fs object to null after a failure; however, the retry loop did not include initialization. This moves the fs initialization inside the retry while loop to aid recoverability.
    
    ## How was this patch tested?
    
    Passes all existing unit tests.
    
    Author: Tristan Stevens <tristan@cloudera.com>
    
    Closes #19645 from tmgstevens/SPARK-22429.
    
    (cherry picked from commit fe258a7)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Tristan Stevens authored and srowen committed Nov 5, 2017
    Commit: e35c53a

Commits on Nov 6, 2017

  1. [SPARK-22315][SPARKR] Warn if SparkR package version doesn't match Sp…

    …arkContext
    
    ## What changes were proposed in this pull request?
    
    This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have an R package downloaded from CRAN and are using that to connect to an existing Spark cluster.
    
    This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.)
    
    ## How was this patch tested?
    
    Manually by changing the `DESCRIPTION` file
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #19624 from shivaram/sparkr-version-check.
    
    (cherry picked from commit 65a8bf6)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    shivaram committed Nov 6, 2017
    Commit: 2695b92

Commits on Nov 7, 2017

  1. [SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame w…

    …ith timestamp
    
    Currently, when a pandas.DataFrame containing a timestamp of type 'datetime64[ns]' is converted to a Spark DataFrame with `createDataFrame`, the values are interpreted as LongType. This fix checks for a timestamp type and converts it to microseconds, which allows Spark to read the values as TimestampType.

    Added a unit test to verify that the Spark schema is as expected for TimestampType and DateType when created from pandas.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #19646 from BryanCutler/pyspark-non-arrow-createDataFrame-ts-fix-SPARK-22417.
    
    (cherry picked from commit 1d34104)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    BryanCutler authored and cloud-fan committed Nov 7, 2017
    Commit: 161ba18

Commits on Nov 8, 2017

  1. [SPARK-22327][SPARKR][TEST][BACKPORT-2.2] check for version warning

    ## What changes were proposed in this pull request?
    
    Will need to port this to branch-~~1.6~~, -2.0, -2.1, -2.2
    
    ## How was this patch tested?
    
    manually
    Jenkins, AppVeyor
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #19619 from felixcheung/rcranversioncheck22.
    felixcheung authored and Felix Cheung committed Nov 8, 2017
    Commit: 5c9035b
  2. [SPARK-22281][SPARKR] Handle R method breaking signature changes

    ## What changes were proposed in this pull request?
    
    This is to fix the code for the latest R changes in R-devel, when running CRAN check
    ```
    checking for code/documentation mismatches ... WARNING
    Codoc mismatches from documentation object 'attach':
    attach
    Code: function(what, pos = 2L, name = deparse(substitute(what),
    backtick = FALSE), warn.conflicts = TRUE)
    Docs: function(what, pos = 2L, name = deparse(substitute(what)),
    warn.conflicts = TRUE)
    Mismatches in argument default values:
    Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what))
    
    Codoc mismatches from documentation object 'glm':
    glm
    Code: function(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
    NULL, ...)
    Docs: function(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, contrasts = NULL, ...)
    Argument names in code not in docs:
    singular.ok
    Mismatches in argument names:
    Position: 16 Code: singular.ok Docs: contrasts
    Position: 17 Code: contrasts Docs: ...
    ```
    
    With attach, it's pulling in the function definition from base::attach. We need to disable that, but we would still need a function signature for roxygen2 to build with.

    With glm it's pulling in the function definition (i.e. "usage") from the stats::glm function. Since this is "compiled in" when we build the source package into the .Rd file, when it changes at runtime or in the CRAN check it won't match the latest signature. The solution is not to pull in from stats::glm, since there isn't much value in doing that (we don't actually use most of the parameters, and the ones we do use are explicitly documented).

    Also, for attach we are changing the code to call it dynamically.
    
    ## How was this patch tested?
    
    Manually.
    - [x] check documentation output - yes
    - [x] check help `?attach` `?glm` - yes
    - [x] check on other platforms, r-hub, on r-devel etc..
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #19557 from felixcheung/rattachglmdocerror.
    
    (cherry picked from commit 2ca5aae)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Nov 8, 2017
    Commit: 73a2ca0

Commits on Nov 9, 2017

  1. [SPARK-22211][SQL][FOLLOWUP] Fix bad merge for tests

    ## What changes were proposed in this pull request?
    
    The merge of SPARK-22211 to branch-2.2 dropped a couple of important lines that made sure the tests that compared plans did so with both plans having been analyzed. Fix by reintroducing the correct analysis statements.
    
    ## How was this patch tested?
    
    Re-ran LimitPushdownSuite. All tests passed.
    
    Author: Henry Robinson <henry@apache.org>
    
    Closes #19701 from henryr/branch-2.2.
    henryr authored and gatorsmile committed Nov 9, 2017
    Commit: efaf73f
  2. [SPARK-22417][PYTHON][FOLLOWUP][BRANCH-2.2] Fix for createDataFrame f…

    …rom pandas.DataFrame with timestamp
    
    ## What changes were proposed in this pull request?
    
    This is a follow-up of #19646 for branch-2.2.
    The original PR breaks branch-2.2 because the cherry-picked patch doesn't include some code that exists in master.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #19704 from ueshin/issues/SPARK-22417_2.2/fup1.
    ueshin committed Nov 9, 2017
    Commit: 0e97c8e

Commits on Nov 10, 2017

  1. [SPARK-22403][SS] Add optional checkpointLocation argument to Structu…

    …redKafkaWordCount example
    
    ## What changes were proposed in this pull request?
    
    When run in YARN cluster mode, the StructuredKafkaWordCount example fails because Spark tries to create a temporary checkpoint location in a subdirectory of the path given by java.io.tmpdir, and YARN sets java.io.tmpdir to a path in the local filesystem that usually does not correspond to an existing path in the distributed filesystem.
    Add an optional checkpointLocation argument to the StructuredKafkaWordCount example so that users can specify the checkpoint location and avoid this issue.
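    
    A hedged sketch of the idea with a generic source and a placeholder path (not the example's actual code), assuming a running `SparkSession` named `spark`:
    
    ```scala
    val stream = spark.readStream.format("rate").load()   // stand-in for the Kafka source
    
    val query = stream.writeStream
      .format("console")
      // Explicit, user-supplied checkpoint location instead of a java.io.tmpdir-based default.
      .option("checkpointLocation", "hdfs:///user/example/checkpoints/wordcount")
      .start()
    ```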
    
    ## How was this patch tested?
    
    Built and ran the example manually on YARN client and cluster mode.
    
    Author: Wing Yew Poon <wypoon@cloudera.com>
    
    Closes #19703 from wypoon/SPARK-22403.
    
    (cherry picked from commit 11c4021)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    wypoon authored and zsxwing committed Nov 10, 2017
    Commit: ede0e1a
  2. [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by MesosClusterD…

    …ispatcher
    
    ## What changes were proposed in this pull request?
    
    Allow JVM max heap size to be controlled for MesosClusterDispatcher via SPARK_DAEMON_MEMORY environment variable.
    
    ## How was this patch tested?
    
    Tested on local Mesos cluster
    
    Author: Paul Mackles <pmackles@adobe.com>
    
    Closes #19515 from pmackles/SPARK-22287.
    
    (cherry picked from commit f5fe63f)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Paul Mackles authored and Marcelo Vanzin committed Nov 10, 2017
    Commit: 1b70c66
  3. [SPARK-22472][SQL] add null check for top-level primitive values

    ## What changes were proposed in this pull request?
    
    One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do runtime null checks automatically.

    For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values.

    However, the null checking is missing for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into a Scala `Int`. If column `a` has null values, we will get some weird results.
    ```
    scala> val ds = spark.read.parquet(...).as[Int]
    
    scala> ds.show()
    +----+
    |v   |
    +----+
    |null|
    |1   |
    +----+
    
    scala> ds.collect
    res0: Array[Long] = Array(0, 1)
    
    scala> ds.map(_ * 2).show
    +-----+
    |value|
    +-----+
    |-2   |
    |2    |
    +-----+
    ```
    
    This is because internally Spark uses special default values for primitive types, but never expects users to see or operate on these default values directly.

    This PR adds a null check for top-level primitive values.
    
    ## How was this patch tested?
    
    new test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19707 from cloud-fan/bug.
    
    (cherry picked from commit 0025dde)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    
    # Conflicts:
    #	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
    #	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
    cloud-fan authored and gatorsmile committed Nov 10, 2017
    Commit: 7551524
  4. [SPARK-22344][SPARKR] clean up install dir if running test as source …

    …package
    
    ## What changes were proposed in this pull request?
    
    remove spark if spark downloaded & installed
    
    ## How was this patch tested?
    
    manually by building package
    Jenkins, AppVeyor
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #19657 from felixcheung/rinstalldir.
    
    (cherry picked from commit b70aa9e)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Nov 10, 2017
    Commit: 0568f28
  5. [SPARK-22243][DSTREAM] spark.yarn.jars should reload from config when…

    … checkpoint recovery
    
    ## What changes were proposed in this pull request?
    The previous [PR](#19469) was deleted by mistake.
    The solution is straightforward:
    add "spark.yarn.jars" to propertiesToReload so this property is reloaded from the config.
    
    ## How was this patch tested?
    
    manual tests
    
    Author: ZouChenjun <zouchenjun@youzan.com>
    
    Closes #19637 from ChenjunZou/checkpoint-yarn-jars.
    ZouChenjun authored and zsxwing committed Nov 10, 2017
    Commit: eb49c32
  6. [SPARK-22294][DEPLOY] Reset spark.driver.bindAddress when starting a …

    …Checkpoint
    
    ## What changes were proposed in this pull request?
    
    It seems that recovering from a checkpoint can replace the old driver and executor IP addresses, as the workload can now be taking place in a different cluster configuration. It follows that the bindAddress for the master may also have changed. Thus we should not keep the old one; it should instead be added to the list of properties to reset and recreate from the new environment.
    
    ## How was this patch tested?
    
    This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus was how I first encountered the bug too), but it is not a code-path related to the scheduler and this may have slipped through when merging SPARK-4563.
    
    Author: Santiago Saavedra <ssaavedra@openshine.com>
    
    Closes #19427 from ssaavedra/fix-checkpointing-master.
    
    (cherry picked from commit 5ebdcd1)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    ssaavedra authored and zsxwing committed Nov 10, 2017
    Commit: 371be22
  7. [SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating…

    … hash for nested structs
    
    ## What changes were proposed in this pull request?
    
    This PR avoids generating a huge method for calculating a murmur3 hash for nested structs. It splits the huge method (e.g. `apply_4`) into multiple smaller methods.
    
    Sample program
    ```
      val structOfString = new StructType().add("str", StringType)
      var inner = new StructType()
      for (_ <- 0 until 800) {
        inner = inner.add("structOfString", structOfString)
      }
      var schema = new StructType()
      for (_ <- 0 until 50) {
        schema = schema.add("structOfStructOfStrings", inner)
      }
      GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42)))
    ```
    
    Without this PR
    ```
    /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
    /* 006 */
    /* 007 */   private Object[] references;
    /* 008 */   private InternalRow mutableRow;
    /* 009 */   private int value;
    /* 010 */   private int value_0;
    ...
    /* 034 */   public java.lang.Object apply(java.lang.Object _i) {
    /* 035 */     InternalRow i = (InternalRow) _i;
    /* 036 */
    /* 037 */
    /* 038 */
    /* 039 */     value = 42;
    /* 040 */     apply_0(i);
    /* 041 */     apply_1(i);
    /* 042 */     apply_2(i);
    /* 043 */     apply_3(i);
    /* 044 */     apply_4(i);
    /* 045 */     nestedClassInstance.apply_5(i);
    ...
    /* 089 */     nestedClassInstance8.apply_49(i);
    /* 090 */     value_0 = value;
    /* 091 */
    /* 092 */     // copy all the results into MutableRow
    /* 093 */     mutableRow.setInt(0, value_0);
    /* 094 */     return mutableRow;
    /* 095 */   }
    /* 096 */
    /* 097 */
    /* 098 */   private void apply_4(InternalRow i) {
    /* 099 */
    /* 100 */     boolean isNull5 = i.isNullAt(4);
    /* 101 */     InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800));
    /* 102 */     if (!isNull5) {
    /* 103 */
    /* 104 */       if (!value5.isNullAt(0)) {
    /* 105 */
    /* 106 */         final InternalRow element6400 = value5.getStruct(0, 1);
    /* 107 */
    /* 108 */         if (!element6400.isNullAt(0)) {
    /* 109 */
    /* 110 */           final UTF8String element6401 = element6400.getUTF8String(0);
    /* 111 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value);
    /* 112 */
    /* 113 */         }
    /* 114 */
    /* 115 */
    /* 116 */       }
    /* 117 */
    /* 118 */
    /* 119 */       if (!value5.isNullAt(1)) {
    /* 120 */
    /* 121 */         final InternalRow element6402 = value5.getStruct(1, 1);
    /* 122 */
    /* 123 */         if (!element6402.isNullAt(0)) {
    /* 124 */
    /* 125 */           final UTF8String element6403 = element6402.getUTF8String(0);
    /* 126 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value);
    /* 127 */
    /* 128 */         }
    /* 128 */         }
    /* 129 */
    /* 130 */
    /* 131 */       }
    /* 132 */
    /* 133 */
    /* 134 */       if (!value5.isNullAt(2)) {
    /* 135 */
    /* 136 */         final InternalRow element6404 = value5.getStruct(2, 1);
    /* 137 */
    /* 138 */         if (!element6404.isNullAt(0)) {
    /* 139 */
    /* 140 */           final UTF8String element6405 = element6404.getUTF8String(0);
    /* 141 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value);
    /* 142 */
    /* 143 */         }
    /* 144 */
    /* 145 */
    /* 146 */       }
    /* 147 */
    ...
    /* 12074 */       if (!value5.isNullAt(798)) {
    /* 12075 */
    /* 12076 */         final InternalRow element7996 = value5.getStruct(798, 1);
    /* 12077 */
    /* 12078 */         if (!element7996.isNullAt(0)) {
    /* 12079 */
    /* 12080 */           final UTF8String element7997 = element7996.getUTF8String(0);
    /* 12083 */         }
    /* 12084 */
    /* 12085 */
    /* 12086 */       }
    /* 12087 */
    /* 12088 */
    /* 12089 */       if (!value5.isNullAt(799)) {
    /* 12090 */
    /* 12091 */         final InternalRow element7998 = value5.getStruct(799, 1);
    /* 12092 */
    /* 12093 */         if (!element7998.isNullAt(0)) {
    /* 12094 */
    /* 12095 */           final UTF8String element7999 = element7998.getUTF8String(0);
    /* 12096 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value);
    /* 12097 */
    /* 12098 */         }
    /* 12099 */
    /* 12100 */
    /* 12101 */       }
    /* 12102 */
    /* 12103 */     }
    /* 12104 */
    /* 12105 */   }
    /* 12106 */
    /* 12106 */
    /* 12107 */
    /* 12108 */   private void apply_1(InternalRow i) {
    ...
    ```
    
    With this PR
    ```
    /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
    /* 006 */
    /* 007 */   private Object[] references;
    /* 008 */   private InternalRow mutableRow;
    /* 009 */   private int value;
    /* 010 */   private int value_0;
    /* 011 */
    ...
    /* 034 */   public java.lang.Object apply(java.lang.Object _i) {
    /* 035 */     InternalRow i = (InternalRow) _i;
    /* 036 */
    /* 037 */
    /* 038 */
    /* 039 */     value = 42;
    /* 040 */     nestedClassInstance11.apply50_0(i);
    /* 041 */     nestedClassInstance11.apply50_1(i);
    ...
    /* 088 */     nestedClassInstance11.apply50_48(i);
    /* 089 */     nestedClassInstance11.apply50_49(i);
    /* 090 */     value_0 = value;
    /* 091 */
    /* 092 */     // copy all the results into MutableRow
    /* 093 */     mutableRow.setInt(0, value_0);
    /* 094 */     return mutableRow;
    /* 095 */   }
    /* 096 */
    ...
    /* 37717 */   private void apply4_0(InternalRow value5, InternalRow i) {
    /* 37718 */
    /* 37719 */     if (!value5.isNullAt(0)) {
    /* 37720 */
    /* 37721 */       final InternalRow element6400 = value5.getStruct(0, 1);
    /* 37722 */
    /* 37723 */       if (!element6400.isNullAt(0)) {
    /* 37724 */
    /* 37725 */         final UTF8String element6401 = element6400.getUTF8String(0);
    /* 37726 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value);
    /* 37727 */
    /* 37728 */       }
    /* 37729 */
    /* 37730 */
    /* 37731 */     }
    /* 37732 */
    /* 37733 */     if (!value5.isNullAt(1)) {
    /* 37734 */
    /* 37735 */       final InternalRow element6402 = value5.getStruct(1, 1);
    /* 37736 */
    /* 37737 */       if (!element6402.isNullAt(0)) {
    /* 37738 */
    /* 37739 */         final UTF8String element6403 = element6402.getUTF8String(0);
    /* 37740 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value);
    /* 37741 */
    /* 37742 */       }
    /* 37743 */
    /* 37744 */
    /* 37745 */     }
    /* 37746 */
    /* 37747 */     if (!value5.isNullAt(2)) {
    /* 37748 */
    /* 37749 */       final InternalRow element6404 = value5.getStruct(2, 1);
    /* 37750 */
    /* 37751 */       if (!element6404.isNullAt(0)) {
    /* 37752 */
    /* 37753 */         final UTF8String element6405 = element6404.getUTF8String(0);
    /* 37754 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value);
    /* 37755 */
    /* 37756 */       }
    /* 37757 */
    /* 37758 */
    /* 37759 */     }
    /* 37760 */
    /* 37761 */   }
    ...
    /* 218470 */
    /* 218471 */     private void apply50_4(InternalRow i) {
    /* 218472 */
    /* 218473 */       boolean isNull5 = i.isNullAt(4);
    /* 218474 */       InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800));
    /* 218475 */       if (!isNull5) {
    /* 218476 */         apply4_0(value5, i);
    /* 218477 */         apply4_1(value5, i);
    /* 218478 */         apply4_2(value5, i);
    ...
    /* 218742 */         nestedClassInstance.apply4_266(value5, i);
    /* 218743 */       }
    /* 218744 */
    /* 218745 */     }
    ```
    
    ## How was this patch tested?
    
    Added new test to `HashExpressionsSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19563 from kiszk/SPARK-22284.
    
    (cherry picked from commit f2da738)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 10, 2017
    Commit: 6b4ec22
  8. [SPARK-19644][SQL] Clean up Scala reflection garbage after creating E…

    …ncoder (branch-2.2)
    
    ## What changes were proposed in this pull request?
    
    Backport #19687 to branch-2.2. The major difference is `cleanUpReflectionObjects` is protected by `ScalaReflectionLock.synchronized` in this PR for Scala 2.10.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes #19718 from zsxwing/SPARK-19644-2.2.
    zsxwing committed Nov 10, 2017
    Commit: 8b7f72e
  9. [SPARK-21667][STREAMING] ConsoleSink should not fail streaming query …

    …with checkpointLocation option
    
    ## What changes were proposed in this pull request?
    Fix to allow recovery on console , avoid checkpoint exception
    
    ## How was this patch tested?
    existing tests
    manual tests [ Replicating error and seeing no checkpoint error after fix]
    
    Author: Rekha Joshi <rekhajoshm@gmail.com>
    Author: rjoshi2 <rekhajoshm@gmail.com>
    
    Closes #19407 from rekhajoshm/SPARK-21667.
    
    (cherry picked from commit 808e886)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    rekhajoshm authored and zsxwing committed Nov 10, 2017
    Commit: 4ef0bef

Commits on Nov 12, 2017

  1. [SPARK-19606][MESOS] Support constraints in spark-dispatcher

    A discussed in SPARK-19606, the addition of a new config property named "spark.mesos.constraints.driver" for constraining drivers running on a Mesos cluster
    
    Corresponding unit test added also tested locally on a Mesos cluster
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Paul Mackles <pmackles@adobe.com>
    
    Closes #19543 from pmackles/SPARK-19606.
    
    (cherry picked from commit b3f9dbf)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    Paul Mackles authored and Felix Cheung committed Nov 12, 2017
    Commit: f6ee3d9
  2. [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR

    This PR changes `AND`/`OR` code generation to place the condition and the expressions' generated code into separate methods if their size could be large. When a method is newly generated, variables for `isNull` and `value` are declared as instance variables to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method.
    
    This PR resolves two cases (a rough repro sketch follows the list):
    
    * large code size of the left expression
    * large code size of the right expression
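    
    A rough repro sketch (made up, not taken from the PR), assuming a running `SparkSession` named `spark`:
    
    ```scala
    import org.apache.spark.sql.functions._
    
    val df = spark.range(0, 100).toDF("id")
    
    // A very wide OR chain; the generated condition code is now split across smaller methods.
    val wideOr = (0 until 500).map(i => col("id") === lit(i)).reduce(_ || _)
    df.filter(wideOr).count()
    ```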
    
    Added a new test case into `CodeGenerationSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #18972 from kiszk/SPARK-21720.
    
    (cherry picked from commit 9bf696d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 12, 2017
    Commit: 114dc42
  3. [SPARK-22488][BACKPORT-2.2][SQL] Fix the view resolution issue in the…

    … SparkSession internal table() API
    
    ## What changes were proposed in this pull request?
    
    The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands and by public and internal APIs.

    Users might get a strange error caused by view resolution when the default database is different.
    ```
    Table or view not found: t1; line 1 pos 14
    org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
    	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    ```
    
    This PR fixes it by enforcing the use of `ResolveRelations` to resolve the table.
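    
    A hedged walk-through of the failure mode (database, table, and view names are invented; assumes a running `SparkSession` named `spark`):
    
    ```scala
    spark.sql("CREATE DATABASE IF NOT EXISTS db1")
    spark.sql("USE db1")
    spark.sql("CREATE TABLE IF NOT EXISTS t1 (id INT) USING parquet")
    spark.sql("CREATE VIEW IF NOT EXISTS v1 AS SELECT * FROM t1")  // the view body names t1 without a database
    
    spark.sql("USE default")
    // Callers of the internal table() API previously bypassed ResolveRelations, so t1
    // inside the view definition could be looked up against the wrong current database.
    spark.table("db1.v1").show()
    ```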
    
    ## How was this patch tested?
    Added a test case and modified the existing test cases
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19723 from gatorsmile/backport22488.
    gatorsmile authored and cloud-fan committed Nov 12, 2017
    Commit: 00cb9d0
  4. [SPARK-22464][BACKPORT-2.2][SQL] No pushdown for Hive metastore parti…

    …tion predicates containing null-safe equality
    
    ## What changes were proposed in this pull request?
    `<=>` is not supported by Hive metastore partition predicate pushdown. We should not push it down to the Hive metastore when it is used in partition predicates.
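    
    A short sketch of the query shape involved (table and value are made up; assumes a Hive-enabled `SparkSession` named `spark`):
    
    ```scala
    spark.sql("CREATE TABLE IF NOT EXISTS pt (id INT) PARTITIONED BY (p STRING) STORED AS PARQUET")
    
    // The null-safe equality on the partition column must be evaluated by Spark itself,
    // not handed to the Hive metastore as a partition-pruning filter.
    spark.sql("SELECT * FROM pt WHERE p <=> 'a'").explain(true)
    ```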
    
    ## How was this patch tested?
    Added a test case
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19724 from gatorsmile/backportSPARK-22464.
    gatorsmile authored and cloud-fan committed Nov 12, 2017
    Commit: 95981fa
  5. [SPARK-19606][BUILD][BACKPORT-2.2][MESOS] fix mesos break

    ## What changes were proposed in this pull request?
    
    Fix break from cherry pick
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes #19732 from felixcheung/fixmesosdriverconstraint.
    felixcheung authored and Felix Cheung committed Nov 12, 2017
    Commit: 2a04cfa
  6. [SPARK-21694][R][ML] Reduce max iterations in Linear SVM test in R to…

    … speed up AppVeyor build
    
    This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows.
    
    The root cause appears to be that it triggers roughly 2,500 jobs with the default 100 max iterations. In Linux, `daemon.R` is forked, but on Windows another process is launched, which is extremely slow.
    
    So, given my observation, there are many processes (not forked) run on Windows, which explains the difference in elapsed time.
    
    After reducing the max iterations to 10, the total number of jobs in this single test drops to roughly 550.
    
    After reducing the max iterations to 5, the total number of jobs in this single test drops to roughly 360.
    
    Manually tested the elapsed times.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19722 from HyukjinKwon/SPARK-21693-test.
    
    (cherry picked from commit 3d90b2c)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    HyukjinKwon authored and Felix Cheung committed Nov 12, 2017
    Commit: 8acd02f

Commits on Nov 13, 2017

  1. [SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct…

    … field names for special characters
    
    ## What changes were proposed in this pull request?
    
    For a class with field name of special characters, e.g.:
    ```scala
    case class MyType(`field.1`: String, `field 2`: String)
    ```
    
    Although we can manipulate DataFrame/Dataset, the field names are encoded:
    ```scala
    scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
    df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]
    scala> df.as[MyType].collect
    res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))
    ```
    
    It causes resolving problem when we try to convert the data with non-encoded field names:
    ```scala
    spark.read.json(path).as[MyType]
    ...
    [info]   org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1];
    [info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    ...
    ```
    
    We should use decoded field name in Dataset schema.
    
    ## How was this patch tested?
    
    Added tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19734 from viirya/SPARK-22442-2.2.
    viirya authored and Felix Cheung committed Nov 13, 2017
    Commit: f736377
  2. [SPARK-22442][SQL][BRANCH-2.2][FOLLOWUP] ScalaReflection should produ…

    …ce correct field names for special characters
    
    ## What changes were proposed in this pull request?
    
    `val TermName: TermNameExtractor` is new in scala 2.11. For 2.10, we should use deprecated `newTermName`.
    
    ## How was this patch tested?
    
    Build locally with scala 2.10.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19736 from viirya/SPARK-22442-2.2-followup.
    viirya authored and cloud-fan committed Nov 13, 2017
    Commit: 2f6dece
  3. [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream

    ## What changes were proposed in this pull request?
    
    Small fix. Using bufferedInputStream for dataDeserializeStream.
    
    ## How was this patch tested?
    
    Existing UT.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes #19735 from ConeyLiu/smallfix.
    
    (cherry picked from commit 176ae4d)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    ConeyLiu authored and srowen committed Nov 13, 2017
    Commit: c68b4c5
  4. Preparing Spark release v2.2.1-rc1

    Felix Cheung committed Nov 13, 2017
    Commit: 41116ab
  5. Preparing development version 2.2.2-SNAPSHOT

    Felix Cheung committed Nov 13, 2017
    Commit: af0b185
  6. [SPARK-22471][SQL] SQLListener consumes much memory causing OutOfMemo…

    …ryError
    
    ## What changes were proposed in this pull request?
    
    This PR addresses the issue [SPARK-22471](https://issues.apache.org/jira/browse/SPARK-22471). The modified version of `SQLListener` respects the setting `spark.ui.retainedStages` and keeps the number of the tracked stages within the specified limit. The hash map `_stageIdToStageMetrics` does not outgrow the limit, hence overall memory consumption does not grow with time anymore.
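    
    For reference, a hedged example of setting the limit the patched listener respects (the value 500 is arbitrary):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Cap the number of stages retained by the UI/SQL listeners for long-running applications.
    val spark = SparkSession.builder()
      .appName("retained-stages-example")
      .config("spark.ui.retainedStages", "500")
      .getOrCreate()
    ```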
    
    A 2.2-compatible fix. Maybe incompatible with 2.3 due to #19681.
    
    ## How was this patch tested?
    
    A new unit test covers this fix - see `SQLListenerMemorySuite.scala`.
    
    Author: Arseniy Tashoyan <tashoyan@gmail.com>
    
    Closes #19711 from tashoyan/SPARK-22471-branch-2.2.
    tashoyan authored and Marcelo Vanzin committed Nov 13, 2017
    Commit: d905e85
  7. [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in re…

    …lease-build.sh
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing from the path, to fix the nightly snapshot Jenkins jobs. Please refer to #19359 (comment):
    
    > Looks like some of the snapshot builds are having lsof issues:
    >
    > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
    >
    >https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console
    >
    >spark-build/dev/create-release/release-build.sh: line 344: lsof: command not found
    >usage: kill [ -s signal | -p ] [ -a ] pid ...
    >kill -l [ signal ]
    
    To my knowledge, the full path of `lsof` is required for non-root users on a few OSes.
    
    ## How was this patch tested?
    
    Manually tested as below:
    
    ```bash
    #!/usr/bin/env bash
    
    LSOF=lsof
    if ! hash $LSOF 2>/dev/null; then
      echo "a"
      LSOF=/usr/sbin/lsof
    fi
    
    $LSOF -P | grep "a"
    ```
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19695 from HyukjinKwon/SPARK-22377.
    
    (cherry picked from commit c8b7f97)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    HyukjinKwon committed Nov 13, 2017
    Commit: 3ea6fd0

Commits on Nov 14, 2017

  1. [SPARK-22511][BUILD] Update maven central repo address

    Use repo.maven.apache.org repo address; use latest ASF parent POM version 18
    
    Existing tests; no functional change
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19742 from srowen/SPARK-22511.
    
    (cherry picked from commit b009722)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Nov 14, 2017
    Commit: 210f292

Commits on Nov 15, 2017

  1. [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder

    ## What changes were proposed in this pull request?
    
    In the PySpark API documentation, [SparkSession.builder](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) is not documented and only shows the default value description.
    ```
    SparkSession.builder = <pyspark.sql.session.Builder object ...
    ```
    
    This PR adds the doc.
    
    ![screen](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png)
    
    The following is the diff of the generated result.
    
    ```
    $ diff old.html new.html
    95a96,101
    > <dl class="attribute">
    > <dt id="pyspark.sql.SparkSession.builder">
    > <code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
    > <dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances</p>
    > </dd></dl>
    >
    212,216d217
    < <dt id="pyspark.sql.SparkSession.builder">
    < <code class="descname">builder</code><em class="property"> = &lt;pyspark.sql.session.SparkSession.Builder object&gt;</em><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
    < <dd></dd></dl>
    <
    < <dl class="attribute">
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    ```
    cd python/docs
    make html
    open _build/html/pyspark.sql.html
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #19726 from dongjoon-hyun/SPARK-22490.
    
    (cherry picked from commit aa88b8d)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    dongjoon-hyun authored and gatorsmile committed Nov 15, 2017
    Commit: 3cefdde

Commits on Nov 16, 2017

  1. [SPARK-22469][SQL] Accuracy problem in comparison with string and num…

    …eric
    
    This fixes a problem caused by #15880:
    `select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive.`
    When comparing a string with a numeric value, cast both to double, as Hive does.
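    
    For example (assuming a running `SparkSession` named `spark`):
    
    ```scala
    // Returned NULL before the fix; with both sides cast to double it returns true, matching Hive.
    spark.sql("SELECT '1.5' > 0.5").show()
    ```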
    
    Author: liutang123 <liutang123@yeah.net>
    
    Closes #19692 from liutang123/SPARK-22469.
    
    (cherry picked from commit bc0848b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    liutang123 authored and cloud-fan committed Nov 16, 2017
    Commit: 3ae187b
  2. [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from SaveintoDataS…

    …ourceCommand.simpleString
    
    ## What changes were proposed in this pull request?
    
    Do not include JDBC properties that may contain credentials when logging a logical plan that contains `SaveIntoDataSourceCommand`.
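    
    A hedged sketch of the kind of write whose logged plan used to include these options (the URL, table, user, and password are placeholders; a reachable database and JDBC driver would be needed to actually run it):
    
    ```scala
    val df = spark.range(0, 10).toDF("id")
    
    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com/metrics")
      .option("dbtable", "events")
      .option("user", "reporting")
      .option("password", "not-a-real-password")  // must not appear in SaveIntoDataSourceCommand.simpleString
      .save()
    ```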
    
    ## How was this patch tested?
    new tests
    
    Author: osatici <osatici@palantir.com>
    
    Closes #19761 from onursatici/os/redact-jdbc-creds-2.2.
    onursatici authored and cloud-fan committed Nov 16, 2017
    Commit: b17ba35
  3. [SPARK-22499][SQL] Fix 64KB JVM bytecode limit problem with least and…

    … greatest
    
    ## What changes were proposed in this pull request?
    
    This PR changes `least` and `greatest` code generation to place the generated code for the argument expressions into separate methods if their size could be large.
    This PR resolves two cases:
    
    * `least` with a lot of arguments
    * `greatest` with a lot of arguments
    
    ## How was this patch tested?
    
    Added a new test case into `ArithmeticExpressionsSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19729 from kiszk/SPARK-22499.
    
    (cherry picked from commit ed885e7)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 16, 2017
    Commit: 17ba7b9
  4. [SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and Atleast…

    …NNonNulls
    
    ## What changes were proposed in this pull request?
    
    Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions.
    This PR splits their expressions in order to avoid the issue.
    
    ## How was this patch tested?
    
    Added UTs
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    Author: Marco Gaido <mgaido@hortonworks.com>
    
    Closes #19720 from mgaido91/SPARK-22494.
    
    (cherry picked from commit 4e7f07e)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    mgaido91 authored and cloud-fan committed Nov 16, 2017
    Commit: 52b05b6
  5. [SPARK-22501][SQL] Fix 64KB JVM bytecode limit problem with in

    ## What changes were proposed in this pull request?
    
    This PR changes `In` code generation to place the generated code for the argument expressions into separate methods if their size could be large.
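    
    A rough repro sketch (made up, not from the PR), assuming a running `SparkSession` named `spark`:
    
    ```scala
    import org.apache.spark.sql.functions.col
    
    val df = spark.range(0, 100).toDF("id")
    
    // A few thousand literals in the IN list; the generated comparisons are now split
    // into several smaller methods instead of one oversized one.
    df.filter(col("id").isin((0 until 3000).map(_.toLong): _*)).count()
    ```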
    
    ## How was this patch tested?
    
    Added new test cases into `PredicateSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19733 from kiszk/SPARK-22501.
    
    (cherry picked from commit 7f2e62e)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 16, 2017
    Commit: 0b51fd3
  6. [SPARK-22535][PYSPARK] Sleep before killing the python worker in Pyth…

    …Runner.MonitorThread (branch-2.2)
    
    ## What changes were proposed in this pull request?
    
    Backport #19762 to 2.2
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes #19768 from zsxwing/SPARK-22535-2.2.
    zsxwing committed Nov 16, 2017
    Commit: be68f86

Commits on Nov 17, 2017

  1. [SPARK-22540][SQL] Ensure HighlyCompressedMapStatus calculates correc…

    …t avgSize
    
    ## What changes were proposed in this pull request?
    
    Ensure HighlyCompressedMapStatus calculates correct avgSize
    
    ## How was this patch tested?
    
    New unit test added.
    
    Author: yucai <yucai.yu@intel.com>
    
    Closes #19765 from yucai/avgsize.
    
    (cherry picked from commit d00b55d)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    yucai authored and srowen committed Nov 17, 2017
    Commit: ef7ccc1
  2. [SPARK-22538][ML] SQLTransformer should not unpersist possibly cached…

    … input dataset
    
    ## What changes were proposed in this pull request?
    
    `SQLTransformer.transform` unpersists the input dataset when dropping its temporary view. We should not change the input dataset's cache status.
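    
    A minimal sketch of the expected behavior (statement and column names are invented; assumes a running `SparkSession` named `spark`):
    
    ```scala
    import org.apache.spark.ml.feature.SQLTransformer
    
    val df = spark.range(0, 10).toDF("id").cache()
    df.count()                                       // materialize the cache
    
    val st = new SQLTransformer().setStatement("SELECT id, id * 2 AS doubled FROM __THIS__")
    st.transform(df).count()
    
    // The input should stay cached after transform(); before the fix it could be unpersisted.
    println(df.storageLevel)
    ```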
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19772 from viirya/SPARK-22538.
    
    (cherry picked from commit fccb337)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Nov 17, 2017
    Commit: 3bc37e5
  3. [SPARK-22544][SS] FileStreamSource should use its own hadoop conf to …

    …call globPathIfNecessary
    
    ## What changes were proposed in this pull request?
    
    Pass the FileSystem created using the correct Hadoop conf into `globPathIfNecessary` so that it can pick up user's hadoop configurations, such as credentials.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes #19771 from zsxwing/fix-file-stream-conf.
    
    (cherry picked from commit bf0c0ae)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    zsxwing committed Nov 17, 2017
    Commit: 53a6076

Commits on Nov 18, 2017

  1. [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat

    ## What changes were proposed in this pull request?
    
    This PR changes `concat` code generation to place the generated code for the argument expressions into separate methods if their size could be large.
    This PR resolves the case of `concat` with a lot of arguments.
    
    ## How was this patch tested?
    
    Added new test cases into `StringExpressionsSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19728 from kiszk/SPARK-22498.
    
    (cherry picked from commit d54bfec)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 18, 2017
    Configuration menu
    Copy the full SHA
    710d618 View commit details
    Browse the repository at this point in the history
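    For context, a toy model of the splitting idea shared by this and the related 64KB fixes below (the real implementation relies on Catalyst's `CodegenContext.splitExpressions`; the shape here is illustrative only):

    ```scala
    // Group per-argument code snippets into chunks and emit each chunk as its own method,
    // so no single generated Java method exceeds the 64KB bytecode limit.
    def splitIntoMethods(snippets: Seq[String], maxPerMethod: Int = 100): String = {
      val methods = snippets.grouped(maxPerMethod).zipWithIndex.map { case (chunk, i) =>
        s"""private void eval_$i(InternalRow i) {
           |  ${chunk.mkString("\n  ")}
           |}""".stripMargin
      }.toSeq
      val calls = methods.indices.map(i => s"eval_$i(i);").mkString("\n")
      (methods :+ calls).mkString("\n\n")
    }
    ```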

Commits on Nov 21, 2017

  1. [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws

    ## What changes were proposed in this pull request?
    
    This PR changes `concat_ws` code generation to place the generated code for each argument expression into separate methods when the combined size could be large.
    This resolves the case of `concat_ws` with many arguments.
    
    ## How was this patch tested?
    
    Added new test cases into `StringExpressionsSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19777 from kiszk/SPARK-22549.
    
    (cherry picked from commit 41c6f36)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 21, 2017
    Configuration menu
    Copy the full SHA
    ca02575 View commit details
    Browse the repository at this point in the history
  2. [SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateU…

    …nsafeRowJoiner.create()
    
    ## What changes were proposed in this pull request?
    
    This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated statements that manipulate the bitmap and offsets into separate methods when their combined size could be large.
    
    ## How was this patch tested?
    
    Added a new test case into `GenerateUnsafeRowJoinerSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19737 from kiszk/SPARK-22508.
    
    (cherry picked from commit c957714)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 21, 2017
    Configuration menu
    Copy the full SHA
    23eb4d7 View commit details
    Browse the repository at this point in the history
  3. [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt

    This PR changes `elt` code generation to place the generated code for each argument expression into separate methods when the combined size could be large.
    This resolves the case of `elt` with many arguments.
    
    Added new test cases into `StringExpressionsSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19778 from kiszk/SPARK-22550.
    
    (cherry picked from commit 9bdff0b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 21, 2017
    Configuration menu
    Copy the full SHA
    94f9227 View commit details
    Browse the repository at this point in the history
  4. [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast

    This PR changes `cast` code generation to place the generated code for each field of a struct into separate methods when the combined size could be large.
    
    Added new test cases into `CastSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19730 from kiszk/SPARK-22500.
    
    (cherry picked from commit ac10171)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 21, 2017
    Configuration menu
    Copy the full SHA
    11a599b View commit details
    Browse the repository at this point in the history

Commits on Nov 22, 2017

  1. [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDB…

    …C data source
    
    ## What changes were proposed in this pull request?
    
    Let's say I have the nested AND expression shown below, and p2 cannot be pushed down:
    
    (p1 AND p2) OR p3
    
    In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](#10362) for Parquet. When we have an AND nested below another expression, we should push either both legs or nothing (see the toy sketch after this entry).
    
    Note that:
    - The current Spark code always splits conjunctive predicates before it determines whether a predicate can be pushed down or not.
    - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3, so there won't be a nested AND expression.
    - The current Spark code logic for OR is OK: it pushes either both legs or nothing.
    
    The same translation method is also called by Data Source V2.
    
    ## How was this patch tested?
    
    Added new unit test cases to JDBCSuite
    
    gatorsmile
    
    Author: Jia Li <jiali@us.ibm.com>
    
    Closes #19776 from jliwork/spark-22548.
    
    (cherry picked from commit 881c5c8)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    jliwork authored and gatorsmile committed Nov 22, 2017
    Configuration menu
    Copy the full SHA
    df9228b View commit details
    Browse the repository at this point in the history
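    A toy model of the translation rule (types and names here are illustrative, not Spark's `DataSourceStrategy` API): an AND must translate both legs or nothing, otherwise one leg is silently dropped.

    ```scala
    sealed trait Pred
    case class Leaf(sql: String, pushable: Boolean) extends Pred
    case class AndP(left: Pred, right: Pred) extends Pred
    case class OrP(left: Pred, right: Pred) extends Pred

    // Translate a predicate into a pushable SQL fragment, or None if it cannot be pushed.
    def translate(p: Pred): Option[String] = p match {
      case Leaf(sql, pushable) => if (pushable) Some(sql) else None
      case AndP(l, r) =>
        // Correct behaviour: both legs or nothing. The buggy code returned the left leg alone.
        for (lt <- translate(l); rt <- translate(r)) yield s"($lt AND $rt)"
      case OrP(l, r) =>
        for (lt <- translate(l); rt <- translate(r)) yield s"($lt OR $rt)"
    }

    // (p1 AND p2) OR p3, where p2 cannot be pushed: nothing may be pushed at all.
    assert(translate(OrP(AndP(Leaf("p1", true), Leaf("p2", false)), Leaf("p3", true))).isEmpty)
    ```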
  2. [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR 19779 to bra…

    …nch-2.2 - Support writing to Hive table which uses Avro schema url 'avro.schema.url'
    
    ## What changes were proposed in this pull request?
    
    > Backport #19779 to branch-2.2
    
    SPARK-19580 Support for avro.schema.url while writing to hive table
    SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
    SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url
    
    Support writing to Hive table which uses Avro schema url 'avro.schema.url'
    For ex:
    create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
    
    create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
    
    insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException
    
    WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
    java.lang.NullPointerException
    at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
    ## Changes proposed in this fix
    Currently a 'null' value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object.
    ## How was this patch tested?
    Added new test case in VersionsSuite
    
    Author: vinodkc <vinod.kc.in@gmail.com>
    
    Closes #19795 from vinodkc/br_Fix_SPARK-17920_branch-2.2.
    vinodkc authored and gatorsmile committed Nov 22, 2017
    Configuration menu
    Copy the full SHA
    b17f406 View commit details
    Browse the repository at this point in the history

Commits on Nov 24, 2017

  1. [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 to branch-2.2

    ## What changes were proposed in this pull request?
    
    A followup of  #19795 , to simplify the file creation.
    
    ## How was this patch tested?
    
    Only a test case is updated.
    
    Author: vinodkc <vinod.kc.in@gmail.com>
    
    Closes #19809 from vinodkc/br_FollowupSPARK-17920_branch-2.2.
    vinodkc authored and cloud-fan committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    f8e73d0 View commit details
    Browse the repository at this point in the history
  2. [SPARK-22591][SQL] GenerateOrdering shouldn't change CodegenContext.I…

    …NPUT_ROW
    
    ## What changes were proposed in this pull request?
    
    While playing with codegen when developing another PR, I found that the value of `CodegenContext.INPUT_ROW` is not reliable. Under whole-stage codegen, it is assigned null first and then suddenly changed to `i`.
    
    The reason is that `GenerateOrdering` changes `CodegenContext.INPUT_ROW` but doesn't restore it afterwards (see the save/restore sketch after this entry).
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19800 from viirya/SPARK-22591.
    
    (cherry picked from commit 62a826f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    f4c457a View commit details
    Browse the repository at this point in the history
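    A minimal sketch of the save/restore discipline, using a stand-in context class (since `CodegenContext` is Catalyst-internal):

    ```scala
    // Stand-in for CodegenContext: only the mutable INPUT_ROW field matters here.
    class Ctx { var INPUT_ROW: String = null }

    // Any generator that temporarily rebinds INPUT_ROW should restore the previous value,
    // so callers (e.g. whole-stage codegen) keep seeing the binding they set up.
    def withInputRow[T](ctx: Ctx, row: String)(body: => T): T = {
      val saved = ctx.INPUT_ROW
      ctx.INPUT_ROW = row        // e.g. "i" while generating the comparison code
      try body finally ctx.INPUT_ROW = saved
    }
    ```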
  3. [SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for st…

    …ruct should not generate codes beyond 64KB
    
    This PR reduces the number of fields in the test case of `CastSuite` to fix an issue that is pointed out [here](#19800 (comment)).
    
    ```
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    	at org.codehaus.janino.UnitCompiler.findClass(UnitCompiler.java:10971)
    	at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:7607)
    	at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5758)
    	at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5732)
    	at org.codehaus.janino.UnitCompiler.access$13200(UnitCompiler.java:206)
    	at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5668)
    	at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5660)
    	at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3356)
    	at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5660)
    	at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2892)
    	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2764)
    	at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
    	at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
    	at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
    	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
    	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
    	at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
    	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
    	at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
    	at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
    	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
    	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
    	at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
    	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
    	at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
    	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
    ...
    ```
    
    Used existing test case
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19806 from kiszk/SPARK-22595.
    
    (cherry picked from commit 554adc7)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    ad57141 View commit details
    Browse the repository at this point in the history
  4. [SPARK-22495] Fix setup of SPARK_HOME variable on Windows

    ## What changes were proposed in this pull request?
    
    This is a cherry pick of the original PR 19370 onto branch-2.2 as suggested in #19370 (comment).
    
    This fixes how `SPARK_HOME` is resolved on Windows. While the previous version worked with the built release download, the set of directories changed slightly for the PySpark `pip` or `conda` install. This had been reflected in the Linux scripts in `bin` but not in the Windows `cmd` files.
    
    The first fix improves how the `jars` directory is found, as this was stopping the Windows version of the `pip/conda` install from working; JARs were not found during Session/Context setup.
    
    The second fix adds a `find-spark-home.cmd` script which, like the Linux version, uses the `find_spark_home.py` script to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to `cmd` scripting limitations. If the environment variable is already set, `find_spark_home.py` is not run. The process can fail if Python is not installed, but this path is mostly taken when PySpark is installed via `pip/conda`, so some Python should be present on the system.
    
    ## How was this patch tested?
    
    Tested on local installation.
    
    Author: Jakub Nowacki <j.s.nowacki@gmail.com>
    
    Closes #19807 from jsnowacki/fix_spark_cmds_2.
    jsnowacki authored and Felix Cheung committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    b606cc2 View commit details
    Browse the repository at this point in the history
  5. fix typo

    Felix Cheung committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    c3b5df2 View commit details
    Browse the repository at this point in the history
  6. Preparing Spark release v2.2.1-rc2

    Felix Cheung committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    e30e269 View commit details
    Browse the repository at this point in the history
  7. Preparing development version 2.2.2-SNAPSHOT

    Felix Cheung committed Nov 24, 2017
    Configuration menu
    Copy the full SHA
    455cea6 View commit details
    Browse the repository at this point in the history

Commits on Nov 26, 2017

  1. [SPARK-22607][BUILD] Set large stack size consistently for tests to a…

    …void StackOverflowError
    
    Set `-ea` and `-Xss4m` consistently for tests, to fix in particular:
    
    ```
    OrderingSuite:
    ...
    - GenerateOrdering with ShortType
    *** RUN ABORTED ***
    java.lang.StackOverflowError:
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
    ...
    ```
    
    Existing tests. Manually verified that it resolves the StackOverflowError this change intends to fix.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19820 from srowen/SPARK-22607.
    
    (cherry picked from commit fba63c1)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Nov 26, 2017
    Configuration menu
    Copy the full SHA
    2cd4898 View commit details
    Browse the repository at this point in the history

Commits on Nov 27, 2017

  1. [SPARK-22603][SQL] Fix 64KB JVM bytecode limit problem with FormatString

    ## What changes were proposed in this pull request?
    
    This PR changes `FormatString` code generation to place the generated code for each argument expression into separate methods when the combined size could be large.
    It passes the variable arguments using an `Object` array.
    
    ## How was this patch tested?
    
    Added new test cases into `StringExpressionsSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19817 from kiszk/SPARK-22603.
    
    (cherry picked from commit 2dbe275)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    kiszk authored and cloud-fan committed Nov 27, 2017
    Configuration menu
    Copy the full SHA
    eef72d3 View commit details
    Browse the repository at this point in the history

Commits on Nov 29, 2017

  1. [SPARK-22637][SQL] Only refresh a logical plan once.

    ## What changes were proposed in this pull request?
    `CatalogImpl.refreshTable` uses `foreach(..)` to refresh all tables in a view. This traverses all nodes in the subtree and calls `LogicalPlan.refresh()` on these nodes. However, `LogicalPlan.refresh()` also refreshes its children; as a result, refreshing a large view can be quite expensive.
    
    This PR just calls `LogicalPlan.refresh()` on the top node (see the toy illustration after this entry).
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #19837 from hvanhovell/SPARK-22637.
    
    (cherry picked from commit 475a29f)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    hvanhovell authored and gatorsmile committed Nov 29, 2017
    Configuration menu
    Copy the full SHA
    38a0532 View commit details
    Browse the repository at this point in the history
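    A toy illustration of why the old approach did redundant work (names are illustrative, not the Catalyst API): `refresh()` already recurses, so calling it from every node re-refreshes each subtree many times.

    ```scala
    case class Node(name: String, children: Seq[Node]) {
      def refresh(): Unit = {
        println(s"refreshing $name")   // stand-in for invalidating cached file listings etc.
        children.foreach(_.refresh())  // refresh() already walks the children
      }
      def foreachNode(f: Node => Unit): Unit = { f(this); children.foreach(_.foreachNode(f)) }
    }

    val view = Node("view", Seq(Node("t1", Nil), Node("t2", Nil)))
    view.foreachNode(_.refresh())  // old behaviour: each node is refreshed multiple times
    view.refresh()                 // new behaviour: a single call from the top is enough
    ```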

Commits on Nov 30, 2017

  1. [SPARK-22654][TESTS] Retry Spark tarball download if failed in HiveEx…

    …ternalCatalogVersionsSuite
    
    ## What changes were proposed in this pull request?
    
    Adds a simple loop that retries downloading the Spark tarball from a different mirror if a download fails (see the sketch after this entry).
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19851 from srowen/SPARK-22654.
    
    (cherry picked from commit 6eb203f)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    srowen authored and HyukjinKwon committed Nov 30, 2017
    Configuration menu
    Copy the full SHA
    d7b1474 View commit details
    Browse the repository at this point in the history
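    A sketch of the retry idea, assuming made-up mirror URLs and shelling out to `wget` via `scala.sys.process` (the suite's actual download mechanics may differ):

    ```scala
    import java.io.File
    import scala.sys.process._
    import scala.util.Try

    // Try each mirror in turn and stop at the first successful download (exit code 0).
    def downloadWithRetry(mirrors: Seq[String], tarball: String, destDir: File): Boolean =
      mirrors.exists { base =>
        val url = s"$base/$tarball"
        println(s"Attempting to download $url")
        Try(Seq("wget", "-q", url, "-P", destDir.getAbsolutePath).!).getOrElse(1) == 0
      }

    // Example (hypothetical mirrors):
    // downloadWithRetry(Seq("https://mirror-a.example.org", "https://archive.example.org"),
    //                   "spark-2.2.0-bin-hadoop2.7.tgz", new File("/tmp"))
    ```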

Commits on Dec 1, 2017

  1. [SPARK-22373] Bump Janino dependency version to fix thread safety issue…

    … with Janino when compiling generated code.
    
    ## What changes were proposed in this pull request?
    
    Bump the Janino dependency version to fix a thread-safety issue when compiling generated code.
    
    ## How was this patch tested?
    
    Check https://issues.apache.org/jira/browse/SPARK-22373 for details.
    Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally.
    Verified that changing Janino dependency version resolved this issue.
    
    Author: Min Shen <mshen@linkedin.com>
    
    Closes #19839 from Victsm/SPARK-22373.
    
    (cherry picked from commit 7da1f57)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Victsm authored and srowen committed Dec 1, 2017
    Configuration menu
    Copy the full SHA
    0121ebc View commit details
    Browse the repository at this point in the history
  2. [SPARK-22653] executorAddress registered in CoarseGrainedSchedulerBac…

    https://issues.apache.org/jira/browse/SPARK-22653
    `executorRef.address` can be null; pass the `executorAddress` computed a few lines above the fix, which already accounts for it being null.
    
    Manually tested this patch. You can reproduce the issue by running a simple spark-shell in yarn client mode with dynamic allocation and requesting some executors up front. Let those executors idle timeout. Get a heap dump. Without this fix, you will see that addressToExecutorId still contains the ids; with the fix, addressToExecutorId is properly cleaned up.
    
    Author: Thomas Graves <tgraves@oath.com>
    
    Closes #19850 from tgravescs/SPARK-22653.
    
    (cherry picked from commit dc36542)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    tgravescs authored and cloud-fan committed Dec 1, 2017
    Configuration menu
    Copy the full SHA
    af8a692 View commit details
    Browse the repository at this point in the history
  3. [SPARK-22601][SQL] Data load is getting displayed successful on provi…

    …ding non existing nonlocal file path
    
    ## What changes were proposed in this pull request?
    When a user tries to load data from a non-existent HDFS path, the system does not validate it and the load command succeeds, which is misleading.
    There is already validation for a non-existent local file path; this PR adds the same validation for a non-existent HDFS path (see the sketch after this entry).
    ## How was this patch tested?
    A unit test has been added to verify the issue; snapshots were also added after verification on a Spark YARN cluster.
    
    Author: sujith71955 <sujithchacko.2010@gmail.com>
    
    Closes #19823 from sujith71955/master_LoadComand_Issue.
    
    (cherry picked from commit 16adaf6)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    sujith71955 authored and gatorsmile committed Dec 1, 2017
    Configuration menu
    Copy the full SHA
    ba00bd9 View commit details
    Browse the repository at this point in the history
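    A sketch of the validation idea for non-local paths (an illustrative helper, not the exact `LOAD DATA` command code):

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Fail fast when the input path does not exist, mirroring the existing local-path check.
    def assertLoadPathExists(pathString: String, hadoopConf: Configuration): Unit = {
      val path = new Path(pathString)
      val fs: FileSystem = path.getFileSystem(hadoopConf)
      require(fs.exists(path), s"LOAD DATA input path does not exist: $pathString")
    }
    ```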
  4. [SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files…

    … containing special characters
    
    ## What changes were proposed in this pull request?
    
    SPARK-22146 fixed the FileNotFoundException issue only for the `inferSchema` method, i.e. only for schema inference, but it didn't fix the problem when actually reading the data, so nearly the same exception happens when someone tries to use the data. This PR fixes the problem there as well.
    
    ## How was this patch tested?
    
    enhanced UT
    
    Author: Marco Gaido <mgaido@hortonworks.com>
    
    Closes #19844 from mgaido91/SPARK-22635.
    mgaido91 authored and HyukjinKwon committed Dec 1, 2017
    Configuration menu
    Copy the full SHA
    f3f8c87 View commit details
    Browse the repository at this point in the history

Commits on Dec 5, 2017

  1. [SPARK-22162][BRANCH-2.2] Executors and the driver should use consist…

    …ent JobIDs in the RDD commit protocol
    
    I have modified SparkHadoopMapReduceWriter so that executors and the driver always use consistent JobIds during the Hadoop commit. Before SPARK-18191, Spark always used the rddId; it just incorrectly named the variable stageId. After SPARK-18191, it used the rddId as the jobId on the driver's side and the stageId as the jobId on the executors' side. With this change, executors and the driver consistently use the rddId as the jobId. Also with this change, during the Hadoop commit protocol Spark uses the actual stageId to check whether a stage can be committed, unlike before, when it used the executors' jobId for this check.
    In addition to the existing unit tests, a test has been added to check whether executors and the driver are using the same JobId. The test failed before this change and passed after applying this fix.
    
    Author: Reza Safi <rezasafi@cloudera.com>
    
    Closes #19886 from rezasafi/stagerdd22.
    Reza Safi authored and Marcelo Vanzin committed Dec 5, 2017
    Configuration menu
    Copy the full SHA
    5b63000 View commit details
    Browse the repository at this point in the history

Commits on Dec 6, 2017

  1. [SPARK-22686][SQL] DROP TABLE IF EXISTS should not show AnalysisExcep…

    …tion
    
    ## What changes were proposed in this pull request?
    
    While fixing the view resolution issue in [SPARK-22488](#19713), a regression was introduced in the `2.2.1` and `master` branches, as shown below. This PR fixes it.
    
    ```scala
    scala> spark.version
    res2: String = 2.2.1
    
    scala> sql("DROP TABLE IF EXISTS t").show
    17/12/04 21:01:06 WARN DropTableCommand: org.apache.spark.sql.AnalysisException:
    Table or view not found: t;
    org.apache.spark.sql.AnalysisException: Table or view not found: t;
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #19888 from dongjoon-hyun/SPARK-22686.
    
    (cherry picked from commit 82183f7)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dongjoon-hyun authored and cloud-fan committed Dec 6, 2017
    Configuration menu
    Copy the full SHA
    7fd6d53 View commit details
    Browse the repository at this point in the history

Commits on Dec 7, 2017

  1. [SPARK-22688][SQL] Upgrade Janino version to 3.0.8

    This PR upgrades the Janino version to 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix that reduces the number of constant pool entries by using the 'sipush' Java bytecode.
    
    * SIPUSH bytecode is not used for short integer constant [#33](janino-compiler/janino#33).
    
    Please see detail in [this discussion thread](#19518 (comment)).
    
    Existing tests
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19890 from kiszk/SPARK-22688.
    
    (cherry picked from commit 8ae004b)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    kiszk authored and srowen committed Dec 7, 2017
    Configuration menu
    Copy the full SHA
    2084675 View commit details
    Browse the repository at this point in the history
  2. [SPARK-22688][SQL][HOTFIX] Upgrade Janino version to 3.0.8

    ## What changes were proposed in this pull request?
    
    Hotfix inadvertent change to xmlbuilder dep when updating Janino.
    See backport of #19890
    
    ## How was this patch tested?
    
    N/A
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19922 from srowen/SPARK-22688.2.
    srowen committed Dec 7, 2017
    Configuration menu
    Copy the full SHA
    9e2d96d View commit details
    Browse the repository at this point in the history

Commits on Dec 12, 2017

  1. [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coe…

    …fficients bound)
    
    ## What changes were proposed in this pull request?
    jira: https://issues.apache.org/jira/browse/SPARK-22289
    
    add JSON encoding/decoding for Param[Matrix].
    
    The issue was reported by Nic Eggert when saving an LR model with LowerBoundsOnCoefficients.
    As I see it, there are two ways to resolve this:
    1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel.
    2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector.
    
    After some discussion in jira, we prefer the fix to support Matrix as a valid Param type, for simplicity and convenience for other classes.
    
    Note that in the implementation, I added a "class" field in the JSON object to match different JSON converters when loading, which is for preciseness and future extension.
    
    ## How was this patch tested?
    
    new unit test to cover the LR case and JsonMatrixConverter
    
    Author: Yuhao Yang <yuhao.yang@intel.com>
    
    Closes #19525 from hhbyyh/lrsave.
    
    (cherry picked from commit 10c27a6)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    hhbyyh authored and yanboliang committed Dec 12, 2017
    Configuration menu
    Copy the full SHA
    00cdb38 View commit details
    Browse the repository at this point in the history
  2. [SPARK-22574][MESOS][SUBMIT] Check submission request parameters

    ## What changes were proposed in this pull request?
    
    It solves the problem where submitting a malformed CreateSubmissionRequest to the Spark Dispatcher left the Dispatcher in a bad state and made it inactive as a Mesos framework.
    
    https://issues.apache.org/jira/browse/SPARK-22574
    
    ## How was this patch tested?
    
    All Spark tests passed successfully.
    
    It was tested by sending a malformed request (without appArgs) before and after the change. The fix is simple: check whether the value is null before it is accessed (see the sketch after this entry).
    
    This was before the change, leaving the dispatcher inactive:
    
    ```
    Exception in thread "Thread-22" java.lang.NullPointerException
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
    ```
    
    And after:
    
    ```
      "message" : "Malformed request: org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation of message CreateSubmissionRequest failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\torg.spark_project.jetty.server.Server.handle(Server.java:524)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)\n\torg.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\torg.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\torg.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tjava.lang.Thread.run(Thread.java:745)"
    ```
    
    Author: German Schiavon <germanschiavon@gmail.com>
    
    Closes #19793 from Gschiavon/fix-submission-request.
    
    (cherry picked from commit 7a51e71)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Gschiavon authored and Marcelo Vanzin committed Dec 12, 2017
    Configuration menu
    Copy the full SHA
    728a45e View commit details
    Browse the repository at this point in the history
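    A sketch of the kind of null check added (the request type is modelled here, not Spark's actual REST protocol class):

    ```scala
    // Stand-in for the submission request received by the dispatcher's REST endpoint.
    case class SubmissionRequest(appResource: String,
                                 appArgs: Array[String],
                                 sparkProperties: Map[String, String])

    // Validate up front and reject malformed requests, instead of letting a null field
    // surface later as a NullPointerException inside the scheduling loop.
    def validate(req: SubmissionRequest): Unit = {
      require(req.appResource != null, "Malformed request: missing appResource")
      require(req.appArgs != null, "Malformed request: missing appArgs")
      require(req.sparkProperties != null, "Malformed request: missing sparkProperties")
    }
    ```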
  3. Configuration menu
    Copy the full SHA
    0230515 View commit details
    Browse the repository at this point in the history

Commits on Dec 13, 2017

  1. [SPARK-22574][MESOS][SUBMIT] Check submission request parameters

    ## What changes were proposed in this pull request?
    
    PR closed with all the comments -> #19793
    
    It solves the problem where submitting a malformed CreateSubmissionRequest to the Spark Dispatcher left the Dispatcher in a bad state and made it inactive as a Mesos framework.
    
    https://issues.apache.org/jira/browse/SPARK-22574
    
    ## How was this patch tested?
    
    All Spark tests passed successfully.
    
    It was tested by sending a malformed request (without appArgs) before and after the change. The fix is simple: check whether the value is null before it is accessed.
    
    This was before the change, leaving the dispatcher inactive:
    
    ```
    Exception in thread "Thread-22" java.lang.NullPointerException
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
    	at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
    ```
    
    And after:
    
    ```
      "message" : "Malformed request: org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation of message CreateSubmissionRequest failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\torg.spark_project.jetty.server.Server.handle(Server.java:524)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)\n\torg.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\torg.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\torg.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tjava.lang.Thread.run(Thread.java:745)"
    ```
    
    Author: German Schiavon <germanschiavon@gmail.com>
    
    Closes #19966 from Gschiavon/fix-submission-request.
    
    (cherry picked from commit 0bdb4e5)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Gschiavon authored and Marcelo Vanzin committed Dec 13, 2017
    Configuration menu
    Copy the full SHA
    b4f4be3 View commit details
    Browse the repository at this point in the history

Commits on Dec 17, 2017

  1. [SPARK-22817][R] Use fixed testthat version for SparkR tests in AppVeyor

    ## What changes were proposed in this pull request?
    
    `testthat` 2.0.0 was released and AppVeyor has started to use it instead of 1.0.2. As a result, R tests started failing in AppVeyor. See - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1967-master
    
    ```
    Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
      object 'run_tests' not found
    Calls: ::: -> get
    ```
    
    This seems to be because we rely on the internal `testthat:::run_tests` here:
    
    https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75
    
    https://github.com/apache/spark/blob/dc4c351837879dab26ad8fb471dc51c06832a9e4/R/pkg/tests/run-all.R#L49-L52
    
    However, it seems it was removed in 2.0.0. I tried a few other exposed APIs like `test_dir` but failed to find a good compatible fix.
    
    It seems we had better pin the `testthat` version first to make the build pass.
    
    ## How was this patch tested?
    
    Manually tested and AppVeyor tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #20003 from HyukjinKwon/SPARK-22817.
    
    (cherry picked from commit c2aeddf)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    HyukjinKwon committed Dec 17, 2017
    Configuration menu
    Copy the full SHA
    1e4cca0 View commit details
    Browse the repository at this point in the history

Commits on Dec 22, 2017

  1. [SPARK-22862] Docs on lazy elimination of columns missing from an enc…

    …oder
    
    This behavior has confused some users, so let's clarify it.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #20048 from marmbrus/datasetAsDocs.
    
    (cherry picked from commit 8df1da3)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    marmbrus authored and gatorsmile committed Dec 22, 2017
    Configuration menu
    Copy the full SHA
    1cf3e3a View commit details
    Browse the repository at this point in the history

Commits on Dec 23, 2017

  1. [SPARK-20694][EXAMPLES] Update SQLDataSourceExample.scala

    ## What changes were proposed in this pull request?
    
    Create the table using the right DataFrame: `peopleDF` -> `usersDF`.
    
    peopleDF:
    +----+-------+
    | age|   name|
    +----+-------+
    usersDF:
    +------+--------------+----------------+
    |  name|favorite_color|favorite_numbers|
    +------+--------------+----------------+
    
    ## How was this patch tested?
    
    Manually tested.
    
    Author: CNRui <13266776177@163.com>
    
    Closes #20052 from CNRui/patch-2.
    
    (cherry picked from commit ea2642e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    CNRui authored and srowen committed Dec 23, 2017
    Configuration menu
    Copy the full SHA
    7a97943 View commit details
    Browse the repository at this point in the history
  2. [SPARK-22889][SPARKR] Set overwrite=T when install SparkR in tests

    ## What changes were proposed in this pull request?
    
    Since all CRAN checks go through the same machine, if an older partial download or partial install of Spark is left behind, the tests fail. This PR overwrites the install files when running tests. This shouldn't affect Jenkins, as `SPARK_HOME` is set when running Jenkins tests.
    
    ## How was this patch tested?
    
    Test manually by running `R CMD check --as-cran`
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #20060 from shivaram/sparkr-overwrite-cran.
    
    (cherry picked from commit 1219d7a)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    shivaram authored and Felix Cheung committed Dec 23, 2017
    Configuration menu
    Copy the full SHA
    41f705a View commit details
    Browse the repository at this point in the history

Commits on Jan 8, 2018

  1. [SPARK-22983] Don't push filters beneath aggregates with empty groupi…

    …ng expressions
    
    ## What changes were proposed in this pull request?
    
    The following SQL query should return zero rows, but in Spark it actually returns one row:
    
    ```
    SELECT 1 from (
      SELECT 1 AS z,
      MIN(a.x)
      FROM (select 1 as x) a
      WHERE false
    ) b
    where b.z != b.z
    ```
    
    The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.
    
    In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there.
    
    This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions (a toy sketch of this guard follows this entry). Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.
    
    ## How was this patch tested?
    
    New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.
    
    (cherry picked from commit 2c73d2a)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    JoshRosen authored and gatorsmile committed Jan 8, 2018
    Configuration menu
    Copy the full SHA
    7c30ae3 View commit details
    Browse the repository at this point in the history
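    A toy sketch of the guard this patch adds to the pushdown decision (the Catalyst rule itself is internal; names here are illustrative):

    ```scala
    // Model of an Aggregate node: grouping columns plus aggregate output columns.
    case class Agg(groupingCols: Set[String], aggregateCols: Set[String])

    // Push a Filter below an Aggregate only when there are grouping expressions and the
    // filter condition references grouping columns only.
    def canPushFilterThroughAggregate(agg: Agg, conditionRefs: Set[String]): Boolean =
      agg.groupingCols.nonEmpty && conditionRefs.subsetOf(agg.groupingCols)

    // Global aggregate (no GROUP BY): the filter acts like HAVING over a single row, so never push.
    assert(!canPushFilterThroughAggregate(Agg(Set.empty, Set("min(x)")), Set("z")))
    // Grouped aggregate filtered on a grouping column: pushing is fine.
    assert(canPushFilterThroughAggregate(Agg(Set("z"), Set("min(x)")), Set("z")))
    ```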

Commits on Jan 9, 2018

  1. [SPARK-22984] Fix incorrect bitmap copying and offset adjustment in G…

    …enerateUnsafeRowJoiner
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a longstanding correctness bug in `GenerateUnsafeRowJoiner`. This class was introduced in #7821 (July 2015 / Spark 1.5.0+) and is used to combine pairs of UnsafeRows in TungstenAggregationIterator, CartesianProductExec, and AppendColumns.
    
    ### Bugs fixed by this patch
    
    1. **Incorrect combining of null-tracking bitmaps**: when concatenating two UnsafeRows, the implementation is supposed to "Concatenate the two bitsets together into a single one, taking padding into account". If one row has no columns then it has a bitset size of 0, but the code was incorrectly assuming that if the left row had a non-zero number of fields then the right row would also have at least one field, so it was copying invalid bytes and treating them as part of the bitset. I'm not sure whether this bug was also present in the original implementation or whether it was introduced in #7892 (which fixed another bug in this code).
    2. **Incorrect updating of data offsets for null variable-length fields**: after updating the bitsets and copying fixed-length and variable-length data, we need to perform adjustments to the offsets pointing to the start of variable-length fields' data. The existing code was _conditionally_ adding a fixed offset to correct for the new length of the combined row, but it is unsafe to do this if the variable-length field has a null value: we always represent nulls by storing `0` in the fixed-length slot, but this code was incorrectly incrementing those values. This bug has been present since the original version of `GenerateUnsafeRowJoiner`.
    
    ### Why this bug remained latent for so long
    
    The PR which introduced `GenerateUnsafeRowJoiner` features several randomized tests, including tests of the cases where one side of the join has no fields and where string-valued fields are null. However, the existing assertions were too weak to uncover this bug:
    
    - If a null field has a non-zero value in its fixed-length data slot then this will not cause problems for field accesses because the null-tracking bitmap should still be correct and we will not try to use the incorrect offset for anything.
    - If the null tracking bitmap is corrupted by joining against a row with no fields then the corruption occurs in field numbers past the actual field numbers contained in the row. Thus valid `isNullAt()` calls will not read the incorrectly-set bits.
    
    The existing `GenerateUnsafeRowJoinerSuite` tests only exercised `.get()` and `isNullAt()`, but didn't actually check the UnsafeRows for bit-for-bit equality, preventing these bugs from failing assertions. It turns out that there was even a [GenerateUnsafeRowJoinerBitsetSuite](https://github.com/apache/spark/blob/03377d2522776267a07b7d6ae9bddf79a4e0f516/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala) but it looks like it also didn't catch this problem because it only tested the bitsets in an end-to-end fashion by accessing them through the `UnsafeRow` interface instead of actually comparing the bitsets' bytes.
    
    ### Impact of these bugs
    
    - This bug will cause `equals()` and `hashCode()` to be incorrect for these rows, which will be problematic in case`GenerateUnsafeRowJoiner`'s results are used as join or grouping keys.
    - Chained / repeated invocations of `GenerateUnsafeRowJoiner` may result in reads from invalid null bitmap positions causing fields to incorrectly become NULL (see the end-to-end example below).
      - It looks like this generally only happens in `CartesianProductExec`, which our query optimizer often avoids executing (usually we try to plan a `BroadcastNestedLoopJoin` instead).
    
    ### End-to-end test case demonstrating the problem
    
    The following query demonstrates how this bug may result in incorrect query results:
    
    ```sql
    set spark.sql.autoBroadcastJoinThreshold=-1; -- Needed to trigger CartesianProductExec
    
    create table a as select * from values 1;
    create table b as select * from values 2;
    
    SELECT
      t3.col1,
      t1.col1
    FROM a t1
    CROSS JOIN b t2
    CROSS JOIN b t3
    ```
    
    This should return `(2, 1)` but instead was returning `(null, 1)`.
    
    Column pruning ends up trimming off all columns from `t2`, so when `t2` joins with another table this triggers the bitmap-copying bug. This incorrect bitmap is subsequently copied again when performing the final join, causing the final output to have an incorrectly-set null bit for the first field.
    
    ## How was this patch tested?
    
    Strengthened the assertions in existing tests in GenerateUnsafeRowJoinerSuite. Also verified that the end-to-end test case which uncovered this now passes.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #20181 from JoshRosen/SPARK-22984-fix-generate-unsaferow-joiner-bitmap-bugs.
    
    (cherry picked from commit f20131d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    JoshRosen authored and cloud-fan committed Jan 9, 2018
    Configuration menu
    Copy the full SHA
    24f1f2a View commit details
    Browse the repository at this point in the history

Commits on Jan 10, 2018

  1. [SPARK-22972] Couldn't find corresponding Hive SerDe for data source …

    …provider org.apache.spark.sql.hive.orc
    
    ## What changes were proposed in this pull request?
    
    Fix the warning: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.
    This PR is for branch-2.2 and is cherry-picked from 8032cf8
    
    The old PR is #20165
    
    ## How was this patch tested?
    
     Please see test("SPARK-22972: hive orc source")
    
    Author: xubo245 <601450868@qq.com>
    
    Closes #20195 from xubo245/HiveSerDeForBranch2.2.
    xubo245 authored and gatorsmile committed Jan 10, 2018
    Configuration menu
    Copy the full SHA
    0d943d9 View commit details
    Browse the repository at this point in the history

Commits on Jan 11, 2018

  1. [SPARK-23001][SQL] Fix NullPointerException when DESC a database with…

    … NULL description
    
    ## What changes were proposed in this pull request?
    When a user's DB description is NULL, users might hit a `NullPointerException`. This PR fixes the issue.
    
    ## How was this patch tested?
    Added test cases
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #20215 from gatorsmile/SPARK-23001.
    
    (cherry picked from commit 87c98de)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jan 11, 2018
    Configuration menu
    Copy the full SHA
    acab4e7 View commit details
    Browse the repository at this point in the history

Commits on Jan 12, 2018

  1. [SPARK-22982] Remove unsafe asynchronous close() call from FileDownlo…

    …adChannel
    
    ## What changes were proposed in this pull request?
    
    This patch fixes a severe asynchronous IO bug in Spark's Netty-based file transfer code. At a high-level, the problem is that an unsafe asynchronous `close()` of a pipe's source channel creates a race condition where file transfer code closes a file descriptor then attempts to read from it. If the closed file descriptor's number has been reused by an `open()` call then this invalid read may cause unrelated file operations to return incorrect results. **One manifestation of this problem is incorrect query results.**
    
    For a high-level overview of how file download works, take a look at the control flow in `NettyRpcEnv.openChannel()`: this code creates a pipe to buffer results, then submits an asynchronous stream request to a lower-level TransportClient. The callback passes received data to the sink end of the pipe. The source end of the pipe is passed back to the caller of `openChannel()`. Thus `openChannel()` returns immediately and callers interact with the returned pipe source channel.
    
    Because the underlying stream request is asynchronous, errors may occur after `openChannel()` has returned and after that method's caller has started to `read()` from the returned channel. For example, if a client requests an invalid stream from a remote server then the "stream does not exist" error may not be received from the remote server until after `openChannel()` has returned. In order to be able to propagate the "stream does not exist" error to the file-fetching application thread, this code wraps the pipe's source channel in a special `FileDownloadChannel` which adds a `setError(t: Throwable)` method, then calls this `setError()` method in the FileDownloadCallback's `onFailure` method.
    
    It is possible for `FileDownloadChannel`'s `read()` and `setError()` methods to be called concurrently from different threads: the `setError()` method is called from within the Netty RPC system's stream callback handlers, while the `read()` methods are called from higher-level application code performing remote stream reads.
    
    The problem lies in `setError()`: the existing code closed the wrapped pipe source channel. Because `read()` and `setError()` occur in different threads, this means it is possible for one thread to be calling `source.read()` while another asynchronously calls `source.close()`. Java's IO libraries do not guarantee that this will be safe and, in fact, it's possible for these operations to interleave in such a way that a lower-level `read()` system call occurs right after a `close()` call. In the best-case, this fails as a read of a closed file descriptor; in the worst-case, the file descriptor number has been re-used by an intervening `open()` operation and the read corrupts the result of an unrelated file IO operation being performed by a different thread.
    
    The solution here is to remove the `stream.close()` call in `onError()`: the thread that is performing the `read()` calls is responsible for closing the stream in a `finally` block, so there's no need to close it here. If that thread is blocked in a `read()` then it will become unblocked when the sink end of the pipe is closed in `FileDownloadCallback.onFailure()`.
    
    After making this change, we also need to refine the `read()` method to always check for a `setError()` result, even if the underlying channel `read()` call has succeeded (see the simplified sketch after this entry).
    
    This patch also makes a slight cleanup to a dodgy-looking `catch e: Exception` block to use a safer `try-finally` error handling idiom.
    
    This bug was introduced in SPARK-11956 / #9941 and is present in Spark 1.6.0+.
    
    ## How was this patch tested?
    
    This fix was tested manually against a workload which non-deterministically hit this bug.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #20179 from JoshRosen/SPARK-22982-fix-unsafe-async-io-in-file-download-channel.
    
    (cherry picked from commit edf0a48)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    JoshRosen authored and srowen committed Jan 12, 2018
    Configuration menu
    Copy the full SHA
    20eea20 View commit details
    Browse the repository at this point in the history
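    A simplified sketch of the resulting read path (a stand-in class, not the actual `NettyRpcEnv` code): the asynchronous callback only records the error; the reader thread is the one that closes the channel, and it checks for a recorded error after every read.

    ```scala
    import java.nio.ByteBuffer
    import java.nio.channels.ReadableByteChannel

    class DownloadChannelSketch(source: ReadableByteChannel) extends ReadableByteChannel {
      @volatile private var error: Throwable = null

      // Called from the RPC callback thread on failure; note: no source.close() here.
      def setError(t: Throwable): Unit = { error = t }

      override def read(dst: ByteBuffer): Int = {
        val n =
          try source.read(dst)
          catch { case e: Exception => if (error != null) throw error else throw e }
        // Even if the read succeeded, surface an error reported concurrently by the callback.
        if (error != null) throw error
        n
      }

      override def isOpen: Boolean = source.isOpen
      override def close(): Unit = source.close()  // only the reading thread calls this, in a finally
    }
    ```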
  2. [SPARK-22975][SS] MetricsReporter should not throw exception when the…

    …re was no progress reported
    
    ## What changes were proposed in this pull request?
    
    `MetricsReporter` assumes that there has been some progress for the query, i.e. that `lastProgress` is not null. If this is not true, as can happen under particular conditions, a `NullPointerException` can be thrown.
    
    The PR checks whether there is a `lastProgress` and, if not, returns a default value for the metrics (see the sketch after this entry).
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #20189 from mgaido91/SPARK-22975.
    
    (cherry picked from commit 5427739)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    mgaido91 authored and zsxwing committed Jan 12, 2018
    Configuration menu
    Copy the full SHA
    105ae86 View commit details
    Browse the repository at this point in the history
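    A minimal sketch of the null-safe gauge value (the gauge registration itself is simplified away; the helper name is illustrative):

    ```scala
    import org.apache.spark.sql.streaming.StreamingQuery

    // Fall back to a default when the query has not reported any progress yet,
    // instead of dereferencing a null lastProgress.
    def inputRowsPerSecond(query: StreamingQuery): Double =
      Option(query.lastProgress).map(_.inputRowsPerSecond).getOrElse(0.0)
    ```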

Commits on Jan 14, 2018

  1. [SPARK-23038][TEST] Update docker/spark-test (JDK/OS)

    ## What changes were proposed in this pull request?
    
    This PR aims to update the following in `docker/spark-test`.
    
    - JDK7 -> JDK8
    Spark 2.2+ supports JDK8 only.
    
    - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial)
    The end of life of `precise` was April 28, 2017.
    
    ## How was this patch tested?
    
    Manual.
    
    * Master
    ```
    $ cd external/docker
    $ ./build
    $ export SPARK_HOME=...
    $ docker run -v $SPARK_HOME:/opt/spark spark-test-master
    CONTAINER_IP=172.17.0.3
    ...
    18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and started at http://172.17.0.3:8080
    18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
    18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
    18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
    ```
    
    * Slave
    ```
    $ docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.3:7077
    CONTAINER_IP=172.17.0.4
    ...
    18/01/11 06:51:54 INFO Worker: Successfully registered with master spark://172.17.0.3:7077
    ```
    
    After slave starts, master will show
    ```
    18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4 cores, 1024.0 MB RAM
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #20230 from dongjoon-hyun/SPARK-23038.
    
    (cherry picked from commit 7a3d0aa)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    dongjoon-hyun authored and Felix Cheung committed Jan 14, 2018
    Configuration menu
    Copy the full SHA
    7022ef8 View commit details
    Browse the repository at this point in the history

Commits on Jan 17, 2018

  1. [SPARK-23095][SQL] Decorrelation of scalar subquery fails with java.u…

    …til.NoSuchElementException
    
    ## What changes were proposed in this pull request?
    The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception (`java.util.NoSuchElementException`).
    ``` SQL
    SELECT t1a
    FROM   t1
    WHERE  t1a = (SELECT   count(*)
                  FROM     t2
                  WHERE    t2c = t1c
                  HAVING   count(*) >= 1)
    ```
    ``` SQL
    key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
    java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
            at scala.collection.MapLike$class.default(MapLike.scala:228)
            at scala.collection.AbstractMap.default(Map.scala:59)
            at scala.collection.MapLike$class.apply(MapLike.scala:141)
            at scala.collection.AbstractMap.apply(Map.scala:59)
            at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
            at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
            at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)
    ```
    
    In this case, after statically evaluating the HAVING clause `count(*) >= 1`
    against the binding of the aggregation result on empty input, we determine
    that this query will not hit the count bug. We should simply have
    `evalSubqueryOnZeroTups` return an empty value (a repro sketch follows
    after this entry).
    
    ## How was this patch tested?
    A new test was added in the Subquery bucket.
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #20283 from dilipbiswal/scalar-count-defect.
    
    (cherry picked from commit 0c2ba42)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    dilipbiswal authored and gatorsmile committed Jan 17, 2018
    d09eecc
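    A hedged Scala repro sketch of the failing query above; the contents and column layout of `t1` and `t2` are assumed for illustration and are not taken from the PR:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("SPARK-23095-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    
    // Illustrative tables; only the referenced columns t1a, t1c and t2c matter here.
    Seq((1, 1), (2, 2), (3, 3)).toDF("t1a", "t1c").createOrReplaceTempView("t1")
    Seq((1, 1), (1, 2)).toDF("t2a", "t2c").createOrReplaceTempView("t2")
    
    // Before the fix, decorrelating this scalar subquery failed with
    // java.util.NoSuchElementException during optimization.
    spark.sql(
      """
        |SELECT t1a
        |FROM   t1
        |WHERE  t1a = (SELECT   count(*)
        |              FROM     t2
        |              WHERE    t2c = t1c
        |              HAVING   count(*) >= 1)
      """.stripMargin).show()
    ```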

Commits on Jan 19, 2018

  1. [DOCS] change to dataset for java code in structured-streaming-kafka-…

    …integration document
    
    ## What changes were proposed in this pull request?
    
    In the latest structured-streaming-kafka-integration document, the Java code example for Kafka integration uses `DataFrame<Row>`; it should be `Dataset<Row>`, since the Java API has no `DataFrame` type (a sketch follows after this entry).
    
    ## How was this patch tested?
    
    A manual test was performed against the updated Java example code with Spark 2.2.1 and Kafka 1.0.
    
    Author: brandonJY <brandonJY@users.noreply.github.com>
    
    Closes #20312 from brandonJY/patch-2.
    
    (cherry picked from commit 6121e91)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    brandonJY authored and srowen committed Jan 19, 2018
    0e58fee
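    For reference, a hedged Scala sketch of the Kafka source read that the documentation change refers to; the broker address and topic are placeholders. In Scala the result is a `DataFrame`, which is just an alias for `Dataset[Row]`, whereas the Java API exposes the same result directly as `Dataset<Row>`:
    
    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    
    val spark = SparkSession.builder().appName("kafka-read-sketch").getOrCreate()
    
    // DataFrame here is Dataset[Row]; the equivalent Java code declares Dataset<Row>.
    val df: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .load()
    
    val keyValue = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    ```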

Commits on Jan 31, 2018

  1. [SPARK-23281][SQL] Query produces results in incorrect order when a c…

    …omposite order by clause refers to both original columns and aliases
    
    ## What changes were proposed in this pull request?
    Here is the test snippet.
    ```scala
    scala> Seq[(Integer, Integer)](
         |         (1, 1),
         |         (1, 3),
         |         (2, 3),
         |         (3, 3),
         |         (4, null),
         |         (5, null)
         |       ).toDF("key", "value").createOrReplaceTempView("src")
    
    scala> sql(
         |         """
         |           |SELECT MAX(value) as value, key as col2
         |           |FROM src
         |           |GROUP BY key
         |           |ORDER BY value desc, key
         |         """.stripMargin).show
    +-----+----+
    |value|col2|
    +-----+----+
    |    3|   3|
    |    3|   2|
    |    3|   1|
    | null|   5|
    | null|   4|
    +-----+----+
    ```
    Here is the explain output:
    
    ```
    == Parsed Logical Plan ==
    'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true
    +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10]
       +- 'UnresolvedRelation `src`
    
    == Analyzed Logical Plan ==
    value: int, col2: int
    Project [value#9, col2#10]
    +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true
       +- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10]
          +- SubqueryAlias src
             +- Project [_1#2 AS key#5, _2#3 AS value#6]
                +- LocalRelation [_1#2, _2#3]
    ```
    The sort direction is being wrongly changed from ASC to DESC while resolving `Sort` in
    `resolveAggregateFunctions`.
    
    The above test case models TPCDS-Q71, so Q71 is affected by the same issue (a sketch of the expected ordering follows after this entry).
    
    ## How was this patch tested?
    A few tests are added in SQLQuerySuite.
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #20453 from dilipbiswal/local_spark.
    dilipbiswal authored and gatorsmile committed Jan 31, 2018
    5273cc7
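    A hedged sketch of how the fix can be sanity-checked against the snippet above (this is not the PR's test); it reuses the `src` temporary view created earlier. With correct resolution, ties on `value` are broken by `key` ascending:
    
    ```scala
    // Re-run the query from the description against the src temp view.
    val rows = spark.sql(
      """
        |SELECT MAX(value) AS value, key AS col2
        |FROM src
        |GROUP BY key
        |ORDER BY value DESC, key
      """.stripMargin).collect()
    
    // Expected order after the fix: (3,1), (3,2), (3,3), then the NULL rows.
    rows.foreach(println)
    ```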
  2. cb73ecd

Commits on Feb 9, 2018

  1. [SPARK-23358][CORE] When the number of partitions is greater than 2^2…

    …8, it will result in an error result
    
    ## What changes were proposed in this pull request?
    In `checkIndexAndDataFile`, `blocks` is an `Int`; when it is greater than 2^28, `blocks * 8` overflows and yields an incorrect result (a sketch follows after this entry).
    In fact, `blocks` is the number of partitions.
    
    ## How was this patch tested?
    Manual test
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes #20544 from 10110346/overflow.
    
    (cherry picked from commit f77270b)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    10110346 authored and srowen committed Feb 9, 2018
    f65e653
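    A minimal Scala sketch of the overflow described above (illustrative, not the Spark source); widening to `Long` before multiplying avoids the wraparound:
    
    ```scala
    // With Int arithmetic, blocks * 8 wraps around once the product exceeds Int.MaxValue.
    val blocks: Int = (1 << 28) + 1          // 268,435,457 partitions
    val overflowed: Int = blocks * 8         // wraps to a negative value
    val widened: Long = blocks.toLong * 8L   // correct byte count
    
    println(s"overflowed = $overflowed, widened = $widened")
    ```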

Commits on Feb 10, 2018

  1. [SPARK-23186][SQL][BRANCH-2.2] Initialize DriverManager first before …

    …loading JDBC Drivers
    
    ## What changes were proposed in this pull request?
    
    Since some JDBC drivers have class-initialization code that calls `DriverManager`, we need to initialize `DriverManager` first in order to avoid potential executor-side **deadlock** situations like the following (or [STORM-2527](https://issues.apache.org/jira/browse/STORM-2527)); a hedged sketch of this ordering follows after this entry.
    
    ```
    Thread 9587: (state = BLOCKED)
     - sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, java.lang.Object[]) bci=0 (Compiled frame; information may be imprecise)
     - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=85, line=62 (Compiled frame)
     - sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=5, line=45 (Compiled frame)
     - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79, line=423 (Compiled frame)
     - java.lang.Class.newInstance() bci=138, line=442 (Compiled frame)
     - java.util.ServiceLoader$LazyIterator.nextService() bci=119, line=380 (Interpreted frame)
     - java.util.ServiceLoader$LazyIterator.next() bci=11, line=404 (Interpreted frame)
     - java.util.ServiceLoader$1.next() bci=37, line=480 (Interpreted frame)
     - java.sql.DriverManager$2.run() bci=21, line=603 (Interpreted frame)
     - java.sql.DriverManager$2.run() bci=1, line=583 (Interpreted frame)
     - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) bci=0 (Compiled frame)
     - java.sql.DriverManager.loadInitialDrivers() bci=27, line=583 (Interpreted frame)
     - java.sql.DriverManager.<clinit>() bci=32, line=101 (Interpreted frame)
     - org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String, java.lang.Integer, java.lang.String, java.util.Properties) bci=12, line=98 (Interpreted frame)
     - org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration, java.util.Properties) bci=22, line=57 (Interpreted frame)
     - org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext, org.apache.hadoop.conf.Configuration) bci=61, line=116 (Interpreted frame)
     - org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) bci=10, line=71 (Interpreted frame)
     - org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD, org.apache.spark.Partition, org.apache.spark.TaskContext) bci=233, line=156 (Interpreted frame)
    
    Thread 9170: (state = BLOCKED)
     - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() bci=35, line=125 (Interpreted frame)
     - sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, java.lang.Object[]) bci=0 (Compiled frame)
     - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=85, line=62 (Compiled frame)
     - sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=5, line=45 (Compiled frame)
     - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79, line=423 (Compiled frame)
     - java.lang.Class.newInstance() bci=138, line=442 (Compiled frame)
     - org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String) bci=89, line=46 (Interpreted frame)
     - org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() bci=7, line=53 (Interpreted frame)
     - org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() bci=1, line=52 (Interpreted frame)
     - org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, org.apache.spark.Partition, org.apache.spark.TaskContext) bci=81, line=347 (Interpreted frame)
     - org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition, org.apache.spark.TaskContext) bci=7, line=339 (Interpreted frame)
    ```
    
    ## How was this patch tested?
    
    N/A
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #20563 from dongjoon-hyun/SPARK-23186-2.
    dongjoon-hyun authored and cloud-fan committed Feb 10, 2018
    1b4c6ab
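    A hedged Scala sketch of the initialization-order idea described in this commit (illustrative only, not the patched `DriverRegistry`): touching `DriverManager` first forces its static initializer, which scans `ServiceLoader` for drivers, to complete before any JDBC driver's own class initializer can call back into it from another executor thread.
    
    ```scala
    object DriverRegistrySketch {
      // Force java.sql.DriverManager's <clinit> to run eagerly, before any
      // driver class (e.g. PhoenixDriver) is loaded and tries to register itself.
      java.sql.DriverManager.getDrivers()
    
      def register(className: String): Unit = {
        // Loading and instantiating the driver class runs its static initializer,
        // which may itself call DriverManager; by now DriverManager is fully initialized.
        Class.forName(className).getDeclaredConstructor().newInstance()
      }
    }
    ```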