
Master #8767 (Closed)

wants to merge 633 commits into from
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Aug 24, 2015

  1. [SPARK-9791] [PACKAGE] Change private class to private class to prevent unnecessary classes from showing up in the docs
    
    In addition, some random cleanup of import ordering
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #8387 from tdas/SPARK-9791 and squashes the following commits:
    
    67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs
    tdas committed Aug 24, 2015
    Commit: 7478c8b
  2. [SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions

    This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`.
    
    rxin
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #8378 from brkyvz/update-sql-docs.
    brkyvz authored and rxin committed Aug 24, 2015
    Commit: 9ce0c7a
  3. [SPARK-10144] [UI] Actually show peak execution memory by default

    The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8345 from andrewor14/show-memory-default.
    Andrew Or authored and yhuai committed Aug 24, 2015
    Commit: 662bb96
  4. [SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more test cases
    
    This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases.
    
    Hit two bugs, SPARK-10177 and HIVE-11625, while working on this, added test cases for them and marked as ignored for now. SPARK-10177 will be addressed in a separate PR.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests.
    liancheng authored and davies committed Aug 24, 2015
    Commit: a2f4cdc
  5. [SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package?
    
    Move `test.org.apache.spark.sql.hive` package tests to the apparently intended `org.apache.spark.sql.hive`, as they don't intend to test behavior from outside org.apache.spark.*
    
    Alternate take, per discussion at #8051
    I think this is what vanzin and I had in mind but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8307 from srowen/SPARK-9758.
    srowen committed Aug 24, 2015
    Commit: cb2d2e1
  6. [SPARK-10061] [DOC] ML ensemble docs

    User guide for spark.ml GBTs and Random Forests.
    The examples are copied from the decision tree guide and modified to run.
    
    I caught some issues I had somehow missed in the tree guide as well.
    
    I have run all examples, including Java ones.  (Of course, I thought I had previously as well...)
    
    CC: mengxr manishamde yanboliang
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #8369 from jkbradley/ml-ensemble-docs.
    jkbradley authored and mengxr committed Aug 24, 2015
    Commit: 13db11c
  7. [SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter
    
    This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE.
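
    A minimal sketch of the kind of guard involved, with a hypothetical stand-in for Spark's internal `Decimal` type (not the actual patch):

    ```scala
    // Stand-in for Spark's internal Decimal type; illustration only.
    case class Decimal(underlying: java.math.BigDecimal) {
      def toJavaBigDecimal: java.math.BigDecimal = underlying
    }

    // The converter must tolerate null: previously a null Decimal triggered an NPE.
    def decimalToScala(d: Decimal): java.math.BigDecimal =
      if (d == null) null else d.toJavaBigDecimal
    ```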
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #8401 from JoshRosen/SPARK-10190.
    JoshRosen authored and rxin committed Aug 24, 2015
    Commit: d7b4c09

Commits on Aug 25, 2015

  1. [SPARK-10165] [SQL] Await child resolution in ResolveFunctions

    Currently, we eagerly attempt to resolve functions, even before their children are resolved.  However, this is not valid in cases where we need to know the types of the input arguments (i.e. when resolving Hive UDFs).
    
    As a fix, this PR delays function resolution until the function's children are resolved.  This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses).  Specifically, we can no longer assume that these misplaced functions will be resolved, which previously allowed us to differentiate aggregate functions from normal functions.  To compensate for this change, we now attempt to resolve these unresolved expressions in the context of the aggregate operator, before checking to see if any aggregate expressions are present.
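
    A hedged sketch of the resolution guard (hypothetical names; the real rule is `ResolveFunctions` in the Catalyst Analyzer):

    ```scala
    trait Expr { def resolved: Boolean }
    case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr {
      def resolved = false
    }

    // Only look a function up once all of its children are resolved, so that
    // the argument types are known (needed e.g. when binding Hive UDFs).
    def resolveFunction(e: Expr, lookup: (String, Seq[Expr]) => Expr): Expr = e match {
      case u @ UnresolvedFunction(_, args) if !args.forall(_.resolved) => u
      case UnresolvedFunction(name, args) => lookup(name, args)
      case other => other
    }
    ```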
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #8371 from marmbrus/hiveUDFResolution.
    marmbrus committed Aug 25, 2015
    Commit: 2bf338c
  2. [SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release

    cc: shivaram
    
    ## Summary
    
    - Modify the `rdname` of expression functions, e.g. `ascii`: `rdname functions` => `rdname ascii`
    - Replace the dynamic function definitions with static ones for the sake of their documentation.
    
    ## Generated PDF File
    https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing
    
    ## JIRA
    [[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10118)
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #8386 from yu-iskw/SPARK-10118.
    yu-iskw authored and shivaram committed Aug 25, 2015
    Commit: 6511bf5
  3. [SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of Products
    
     * Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter
     * Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any `RDD[Product]`, not just case classes
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8406 from feynmanliang/sql-doc-fixes.
    Feynman Liang authored and rxin committed Aug 25, 2015
    Commit: 642c43c
  4. [SPARK-10121] [SQL] Thrift server always use the latest class loader provided by the conf of executionHive's state
    
    https://issues.apache.org/jira/browse/SPARK-10121
    
    Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader.
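
    A hedged sketch of the general idea (hypothetical helper; the real change wires executionHive's session-state class loader through to the session handler):

    ```scala
    // Run a session's work with the latest loader installed, so jars added by
    // another thread are visible to this one.
    def withLatestClassLoader[T](latest: ClassLoader)(body: => T): T = {
      val original = Thread.currentThread().getContextClassLoader
      Thread.currentThread().setContextClassLoader(latest)
      try body
      finally Thread.currentThread().setContextClassLoader(original)
    }
    ```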
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8368 from yhuai/SPARK-10121.
    yhuai authored and liancheng committed Aug 25, 2015
    Commit: a0c0aae
  5. [SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables
    
    In `HiveComparisionTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results.  To aid debugging this patch improves the harness to also print these query plans and their results.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #8388 from marmbrus/generatedTables.
    marmbrus authored and rxin committed Aug 25, 2015
    Commit: 5175ca0
  6. [SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with default maxRatePerPartition setting of 0
    
    Author: cody koeninger <cody@koeninger.org>
    
    Closes #8413 from koeninger/backpressure-testing-master.
    koeninger authored and tdas committed Aug 25, 2015
    Commit: d9c25de
  7. [SPARK-10137] [STREAMING] Avoid to restart receivers if scheduleReceivers returns balanced results
    
    This PR fixes the following cases for `ReceiverSchedulingPolicy`.
    
    1) Assume there are 4 executors: host1, host2, host3, host4, and 5 receivers: r1, r2, r3, r4, r5. Then `ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 -> host2, r3 -> host3, r4 -> host4, r5 -> host1).
    Let's assume r1 starts first on `host1` as `scheduleReceivers` suggested, and tries to register with ReceiverTracker. But the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will return (host2, host3, host4) according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 -> 0.5, host4 -> 0.5), so ReceiverTracker will reject `r1`. This is unexpected since r1 is starting exactly where `scheduleReceivers` suggested.
    
    This case can be fixed by ignoring the information of the receiver that is rescheduling in `receiverTrackingInfoMap`.
    
    2) Assume there are 3 executors (host1, host2, host3), each with 3 cores, and 3 receivers: r1, r2, r3. Assume r1 is running on host1. Now r2 is restarting; the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will always return (host1, host2, host3). So it's possible that r2 will be scheduled to host1 by TaskScheduler. r3 is similar. Then at last, it's possible that there are 3 receivers running on host1, while host2 and host3 are idle.
    
    This issue can be fixed by returning only the executors that have the minimum weight, rather than returning at least 3 executors.
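
    A minimal sketch of the min-weight selection described in case 2 (hypothetical names, not the actual patch):

    ```scala
    // Offer only the executors carrying the minimum scheduling weight.
    def candidateExecutors(weights: Map[String, Double]): Seq[String] =
      if (weights.isEmpty) Seq.empty
      else {
        val minWeight = weights.values.min
        weights.filter { case (_, w) => w == minWeight }.keys.toSeq
      }

    // With r1 already running on host1:
    // candidateExecutors(Map("host1" -> 1.0, "host2" -> 0.5, "host3" -> 0.5))
    // == Seq("host2", "host3")
    ```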
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8340 from zsxwing/fix-receiver-scheduling.
    zsxwing authored and tdas committed Aug 25, 2015
    Commit: f023aa2
  8. [SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON.

    https://issues.apache.org/jira/browse/SPARK-10196
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8408 from yhuai/DecimalJsonSPARK-10196.
    yhuai authored and davies committed Aug 25, 2015
    Commit: df7041d
  9. [SPARK-10136] [SQL] A more robust fix for SPARK-10136

    PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause.  The real problem can be rather tricky to explain, and requires the audience to be pretty familiar with the parquet-format spec, especially the details of `LIST` backwards-compatibility rules.  Let me try to give an explanation here.
    
    The structure of the problematic Parquet schema generated by parquet-avro is something like this:
    
    ```
    message m {
      <repetition> group f (LIST) {         // Level 1
        repeated group array (LIST) {       // Level 2
          repeated <primitive-type> array;  // Level 3
        }
      }
    }
    ```
    
    (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)
    
    This structure consists of two nested legacy 2-level `LIST`-like structures:
    
    1. The repeated group type at level 2 is the element type of the outer array defined at level 1
    
       This group should map to a `CatalystArrayConverter.ElementConverter` when building converters.
    
    2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2
    
       This group should also map to a `CatalystArrayConverter.ElementConverter`.
    
    The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1.  Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, so a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.
    
    According to  parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group.  PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix.  (I didn't realize this when authoring #8341 though.)
    
    As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:
    
    > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
    
    (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)
    
    This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2].  This PR delivers a more robust fix by adding this rule in the latter method.
    
    Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].
    
    [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
    [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
    [3]: https://issues.apache.org/jira/browse/PARQUET-364
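
    A hedged sketch of that rule, using a simplified stand-in for parquet's type model (the real check inspects `parquet.schema.Type` inside `CatalystRowConverter.isElementType`):

    ```scala
    // Simplified stand-in for a parquet type: a name, its fields, and whether
    // it is a primitive.
    case class ParquetType(name: String, fields: Seq[ParquetType] = Nil,
                           isPrimitive: Boolean = false)

    // Is the repeated type itself the list's element (legacy 2-level layout),
    // or merely the wrapper group of a standard 3-level layout?
    def isElementType(repeated: ParquetType, parentName: String): Boolean =
      if (repeated.isPrimitive) true                        // unannotated repeated primitive
      else if (repeated.fields.size > 1) true               // struct element
      else if (repeated.name == "array") true               // parquet-avro legacy rule
      else if (repeated.name == parentName + "_tuple") true // parquet-thrift legacy rule
      else false                                            // standard 3-level wrapper
    ```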
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8361 from liancheng/spark-10136/proper-version.
    liancheng committed Aug 25, 2015
    Commit: bf03fe6
  10. [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns
    
    This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.
    
    I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.
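
    A minimal sketch of the arity check this rule performs (hypothetical names, not the actual analyzer code):

    ```scala
    case class Plan(output: Seq[String])

    // A set operation is only valid when both children produce the same
    // number of columns.
    def checkSetOperation(op: String, left: Plan, right: Plan): Unit =
      if (left.output.size != right.output.size) {
        throw new IllegalArgumentException(
          s"$op can only be performed on tables with the same number of columns, " +
            s"but got ${left.output.size} and ${right.output.size}")
      }
    ```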
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #7631 from JoshRosen/SPARK-9293.
    JoshRosen authored and marmbrus committed Aug 25, 2015
    Commit: 82268f0
  11. [SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs

    cc: shivaram
    
    ## Summary
    
    - Add name tags to each method in DataFrame.R and column.R
    - Replace `rdname column` with `rdname {each_func}`. i.e. alias method : `rdname column` =>  `rdname alias`
    
    ## Generated PDF File
    https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing
    
    ## JIRA
    [[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214)
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #8414 from yu-iskw/SPARK-10214.
    yu-iskw authored and shivaram committed Aug 25, 2015
    Commit: d4549fe
  12. [SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided

    Follow up to #7047
    
    pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used.
    
    CC trystanleftwich
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8338 from srowen/SPARK-6196.
    srowen committed Aug 25, 2015
    Commit: 57b960b
  13. [SPARK-10210] [STREAMING] Filter out non-existent blocks before creating BlockRDD
    
    When the write ahead log is not enabled, a recovered streaming driver still tries to run jobs using pre-failure block ids, and fails because those blocks no longer exist in memory (and cannot be recovered, as the receiver WAL is not enabled).
    
    This occurs because the driver-side WAL of ReceivedBlockTracker recovers the past block information, and ReceiverInputDStream creates BlockRDDs even if those blocks do not exist.
    
    The solution in this PR is to filter out block ids that do not exist before creating the BlockRDD. In addition, it adds unit tests to verify other logic in ReceiverInputDStream.
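
    A minimal sketch of the filtering step (hypothetical names; the real code asks the BlockManager master whether each block still exists):

    ```scala
    // Keep only the recovered block ids that are still present somewhere in
    // the cluster before building the BlockRDD.
    def validBlockIds(recovered: Seq[String], exists: String => Boolean): Seq[String] =
      recovered.filter(exists)
    ```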
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #8405 from tdas/SPARK-10210.
    tdas committed Aug 25, 2015
    Commit: 1fc3758
  14. [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive

    We misunderstood the Julian days and nanoseconds-of-the-day in parquet (as TimestampType) from Hive/Impala: they overlap, so they can't be added together directly.
    
    In order to avoid confusing rounding when doing the conversion, we use `2440588` as the Julian Day of the epoch of the unix timestamp (which should really be 2440587.5).
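
    A sketch of the resulting arithmetic, assuming microsecond precision and the `2440588` constant named above:

    ```scala
    // Parquet INT96 timestamps carry (julianDay, nanosOfDay); the two parts
    // share the day boundary, so they are combined rather than naively added.
    val JulianDayOfEpoch = 2440588L // Julian Day of 1970-01-01
    val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000

    def fromJulianDay(julianDay: Long, nanosOfDay: Long): Long =
      (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000

    // Noon on the epoch day:
    // fromJulianDay(2440588L, 12L * 3600 * 1000 * 1000 * 1000) == 43200000000L
    // microseconds after 1970-01-01 00:00 UTC
    ```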
    
    Author: Davies Liu <davies@databricks.com>
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8400 from davies/timestamp_parquet.
    Davies Liu authored and liancheng committed Aug 25, 2015
    Commit: 2f493f7
  15. [SPARK-10195] [SQL] Data sources Filter should not expose internal types

    Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties.
    
    This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0.  To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.
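
    A hedged sketch of the boundary conversion (the `UTF8StringLike` stand-in is hypothetical; the real patch routes values through `CatalystTypeConverters`):

    ```scala
    // Internal string representation that must never leak to third parties.
    case class UTF8StringLike(bytes: Array[Byte])

    // Convert internal values to public Scala/Java types before embedding
    // them in data source filters such as EqualTo("name", value).
    def toPublicValue(v: Any): Any = v match {
      case s: UTF8StringLike => new String(s.bytes, "UTF-8")
      case other             => other // primitives etc. are already public
    }
    ```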
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
    JoshRosen authored and rxin committed Aug 25, 2015
    Commit: 7bc9a8c
  16. [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors).
    
    https://issues.apache.org/jira/browse/SPARK-10197
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8407 from yhuai/ORCSPARK-10197.
    yhuai authored and liancheng committed Aug 25, 2015
    Commit: 0e6368f
  17. [DOC] add missing parameters in SparkContext.scala for scala doc

    Author: Zhang, Liye <liye.zhang@intel.com>
    
    Closes #8412 from liyezhang556520/minorDoc.
    liyezhang556520 authored and srowen committed Aug 25, 2015
    Commit: 5c14890
  18. Fixed a typo in DAGScheduler.

    Author: ehnalis <zoltan.zvara@gmail.com>
    
    Closes #8308 from ehnalis/master.
    zzvara authored and srowen committed Aug 25, 2015
    Commit: 7f1e507
  19. [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
    
    Replace `JavaConversions` implicits with `JavaConverters`
    
    Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8033 from srowen/SPARK-9613.
    srowen committed Aug 25, 2015
    Commit: 69c9c17
  20. [SPARK-10198] [SQL] Turn off partition verification by default

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #8404 from marmbrus/turnOffPartitionVerification.
    marmbrus committed Aug 25, 2015
    Commit: 5c08c86
  21. [SPARK-8531] [ML] Update ML user guide for MinMaxScaler

    jira: https://issues.apache.org/jira/browse/SPARK-8531
    
    Update ML user guide for MinMaxScaler
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com>
    
    Closes #7211 from hhbyyh/minmaxdoc.
    hhbyyh authored and jkbradley committed Aug 25, 2015
    Commit: b37f0cc
  22. [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration

    See [discussion](#8254 (comment))
    
    CC jkbradley
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8422 from feynmanliang/SPARK-10230.
    Feynman Liang authored and jkbradley committed Aug 25, 2015
    Commit: 881208a
  23. [SPARK-10231] [MLLIB] update @Since annotation for mllib.classification

    Update `Since` annotation in `mllib.classification`:
    
    1. add version to classes, objects, constructors, and public variables declared in constructors
    2. correct some versions
    3. remove `Since` on `toString`
    
    MechCoder dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:
    
    b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
    mengxr authored and DB Tsai committed Aug 25, 2015
    Commit: 16a2be1
  24. [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.

    This PR:
    1. supports transferring arbitrary nested arrays from JVM to R side in SerDe;
    2. based on 1, the collect() implementation is improved. Now it can support collecting data of complex types from a DataFrame.
    
    Author: Sun Rui <rui.sun@intel.com>
    
    Closes #8276 from sun-rui/SPARK-10048.
    Sun Rui authored and shivaram committed Aug 25, 2015
    Commit: 71a138c
  25. [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias

    * Adds doc for the alias of runMiniBatchSGD documenting the default value for convergenceTol
    * Cleans up a note in code
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8425 from feynmanliang/SPARK-9800.
    Feynman Liang authored and jkbradley committed Aug 25, 2015
    Commit: c0e9ff1
  26. [SPARK-10237] [MLLIB] update since versions in mllib.fpm

    Same as #8421 but for `mllib.fpm`.
    
    cc feynmanliang
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8429 from mengxr/SPARK-10237.
    mengxr committed Aug 25, 2015
    Commit: c619c75
  27. [SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value
    
    Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8424 from feynmanliang/SPARK-9797.
    Feynman Liang authored and jkbradley committed Aug 25, 2015
    Commit: 9205907
  28. [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util
    
    Same as #8421 but for `mllib.pmml` and `mllib.util`.
    
    cc dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:
    
    a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
    mengxr authored and DB Tsai committed Aug 25, 2015
    Commit: 00ae4be
  29. [SPARK-10245] [SQL] Fix decimal literals with precision < scale

    In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision be no smaller than the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal.
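
    The inference can be checked directly in a few lines:

    ```scala
    val d = BigDecimal("0.001")
    val precision = d.precision // 1
    val scale = d.scale         // 3
    // DecimalType requires precision >= scale, so infer
    // DecimalType(math.max(precision, scale), scale), i.e. DecimalType(3, 3).
    ```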
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8428 from davies/smaller_decimal.
    Davies Liu authored and yhuai committed Aug 25, 2015
    Commit: ec89bd8
  30. [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive)

    Follow the rule in Hive for decimal division; see https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113
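
    A sketch of the rule in the linked Hive source, as far as I can read it (treat the exact formula as an assumption; Hive also caps the result at its maximum supported precision):

    ```scala
    // For d1 / d2 with types DECIMAL(p1, s1) and DECIMAL(p2, s2):
    def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
      val scale = math.max(6, s1 + p2 + 1)
      val precision = p1 - s1 + s2 + scale
      (precision, scale)
    }
    ```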
    
    cc chenghao-intel
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8415 from davies/decimal_div2.
    Davies Liu authored and yhuai committed Aug 25, 2015
    Commit: 7467b52

Commits on Aug 26, 2015

  1. [SPARK-9888] [MLLIB] User guide for new LDA features

     * Adds two new sections to LDA's user guide; one for each optimizer/model
     * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
     * Cleans up a TODO and sets a default parameter in LDA code
    
    jkbradley hhbyyh
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8254 from feynmanliang/SPARK-9888.
    Feynman Liang authored and jkbradley committed Aug 26, 2015
    Commit: 125205c
  2. [SPARK-10233] [MLLIB] update since version in mllib.evaluation

    Same as #8421 but for `mllib.evaluation`.
    
    cc avulanov
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8423 from mengxr/SPARK-10233.
    mengxr committed Aug 26, 2015
    Commit: 8668ead
  3. [SPARK-10238] [MLLIB] update since versions in mllib.linalg

    Same as #8421 but for `mllib.linalg`.
    
    cc dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:
    
    b38437e [Xiangrui Meng] update since versions in mllib.linalg
    mengxr authored and DB Tsai committed Aug 26, 2015
    Commit: ab431f8
  4. [SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random and mllib.stat
    
    The same as #8421 but for `mllib.stat` and `mllib.random`.
    
    cc feynmanliang
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8439 from mengxr/SPARK-10242.
    mengxr committed Aug 26, 2015
    Commit: c3a5484
  5. [SPARK-10234] [MLLIB] update since version in mllib.clustering

    Same as #8421 but for `mllib.clustering`.
    
    cc feynmanliang yu-iskw
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8435 from mengxr/SPARK-10234.
    mengxr committed Aug 26, 2015
    Commit: d703372
  6. [SPARK-10243] [MLLIB] update since versions in mllib.tree

    Same as #8421 but for `mllib.tree`.
    
    cc jkbradley
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8442 from mengxr/SPARK-10236.
    mengxr committed Aug 26, 2015
    Commit: fb7e12f
  7. [SPARK-10235] [MLLIB] update since versions in mllib.regression

    Same as #8421 but for `mllib.regression`.
    
    cc freeman-lab dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8426 from mengxr/SPARK-10235 and squashes the following commits:
    
    6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
    mengxr authored and DB Tsai committed Aug 26, 2015
    Commit: 4657fa1
  8. [SPARK-10236] [MLLIB] update since versions in mllib.feature

    Same as #8421 but for `mllib.feature`.
    
    cc dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits:
    
    0e8d658 [Xiangrui Meng] remove unnecessary comment
    ad70b03 [Xiangrui Meng] update since versions in mllib.feature
    mengxr authored and DB Tsai committed Aug 26, 2015
    Commit: 321d775
  9. [SPARK-9316] [SPARKR] Add support for filtering using [ (synonym for filter / select)
    
    Add support for
    ```
       df[df$name == "Smith", c(1,2)]
       df[df$age %in% c(19, 30), 1:2]
    ```
    
    shivaram
    
    Author: felixcheung <felixcheung_m@hotmail.com>
    
    Closes #8394 from felixcheung/rsubset.
    felixcheung authored and shivaram committed Aug 26, 2015
    Commit: 75d4773
  10. Closes #8443

    rxin committed Aug 26, 2015
    Commit: bb16405
  11. [SPARK-9665] [MLLIB] audit MLlib API annotations

    I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs.
    
    cc jkbradley
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8452 from mengxr/SPARK-9665.
    mengxr authored and jkbradley committed Aug 26, 2015
    Commit: 6519fd0
  12. HOTFIX: Increase PRB timeout

    pwendell committed Aug 26, 2015
    Commit: de7209c
  13. [SPARK-10241] [MLLIB] update since versions in mllib.recommendation

    Same as #8421 but for `mllib.recommendation`.
    
    cc srowen coderxiang
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8432 from mengxr/SPARK-10241.
    mengxr committed Aug 26, 2015
    Commit: 086d468
  14. [SPARK-10305] [SQL] fix create DataFrame from Python class

    cc jkbradley
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8470 from davies/fix_create_df.
    Davies Liu authored and davies committed Aug 26, 2015
    Commit: d41d6c4

Commits on Aug 27, 2015

  1. [SPARK-10308] [SPARKR] Add %in% to the exported namespace

    I also checked all the other functions defined in column.R, functions.R and DataFrame.R and everything else looked fine.
    
    cc yu-iskw
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #8473 from shivaram/in-namespace.
    shivaram committed Aug 27, 2015
    Commit: ad7f0f1
  2. [MINOR] [SPARKR] Fix some validation problems in SparkR

    Getting rid of some validation problems in SparkR
    #7883
    
    cc shivaram
    
    ```
    inst/tests/test_Serde.R:26:1: style: Trailing whitespace is superfluous.
    
    ^~
    inst/tests/test_Serde.R:34:1: style: Trailing whitespace is superfluous.
    
    ^~
    inst/tests/test_Serde.R:37:38: style: Trailing whitespace is superfluous.
      expect_equal(class(x), "character")
                                         ^~
    inst/tests/test_Serde.R:50:1: style: Trailing whitespace is superfluous.
    
    ^~
    inst/tests/test_Serde.R:55:1: style: Trailing whitespace is superfluous.
    
    ^~
    inst/tests/test_Serde.R:60:1: style: Trailing whitespace is superfluous.
    
    ^~
    inst/tests/test_sparkSQL.R:611:1: style: Trailing whitespace is superfluous.
    
    ^~
    R/DataFrame.R:664:1: style: Trailing whitespace is superfluous.
    
    ^~~~~~~~~~~~~~
    R/DataFrame.R:670:55: style: Trailing whitespace is superfluous.
                    df <- data.frame(row.names = 1 : nrow)
                                                          ^~~~~~~~~~~~~~~~
    R/DataFrame.R:672:1: style: Trailing whitespace is superfluous.
    
    ^~~~~~~~~~~~~~
    R/DataFrame.R:686:49: style: Trailing whitespace is superfluous.
                        df[[names[colIndex]]] <- vec
                                                    ^~~~~~~~~~~~~~~~~~
    ```
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #8474 from yu-iskw/minor-fix-sparkr.
    yu-iskw authored and shivaram committed Aug 27, 2015
    Commit: 773ca03
  3. [SPARK-9424] [SQL] Parquet programming guide updates for 1.5

    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8467 from liancheng/spark-9424/parquet-docs-for-1.5.
    liancheng authored and marmbrus committed Aug 27, 2015
    Commit: 0fac144
  4. [SPARK-9964] [PYSPARK] [SQL] PySpark DataFrameReader accept RDD of String for JSON
    
    PySpark DataFrameReader should accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
    If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
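
    For reference, a usage sketch of the Scala-side behavior this mirrors (against the 1.5-era API):

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    def jsonFromRdd(sc: SparkContext, sqlContext: SQLContext) = {
      val rdd = sc.parallelize(Seq("""{"name": "Alice", "age": 30}"""))
      sqlContext.read.json(rdd) // DataFrame inferred from an RDD of JSON strings
    }
    ```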
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8444 from yanboliang/spark-9964.
    yanboliang authored and rxin committed Aug 27, 2015
    Commit: ce97834
  5. [SPARK-10219] [SPARKR] Fix varargsToEnv and add test case

    cc sun-rui davies
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #8475 from shivaram/varargs-fix.
    shivaram committed Aug 27, 2015
    Commit: e936cf8
  6. [SPARK-10251] [CORE] some common types are not registered for Kryo Serialization by default
    
    Author: Ram Sriharsha <rsriharsha@hw11853.local>
    
    Closes #8465 from harsha2010/SPARK-10251.
    Ram Sriharsha authored and rxin committed Aug 27, 2015
    Commit: de02782
  7. [DOCS] [STREAMING] [KAFKA] Fix typo in exactly once semantics

    Fix typo in the exactly-once semantics
    [Semantics of output operations] link
    
    Author: Moussa Taifi <moutai10@gmail.com>
    
    Closes #8468 from moutai/patch-3.
    moutai authored and srowen committed Aug 27, 2015
    Commit: 9625d13
  8. [SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTests
    
    * Replaces `com.google.common` dependencies with `java.util.Arrays`
    * Small clean up in `JavaNormalizerSuite`
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8445 from feynmanliang/SPARK-10254.
    Feynman Liang authored and srowen committed Aug 27, 2015
    Commit: 1650f6f
  9. [SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTests
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8446 from feynmanliang/SPARK-10255.
    Feynman Liang authored and srowen committed Aug 27, 2015
    Commit: 75d6230
  10. [SPARK-10256] [ML] Removes guava dependency from spark.ml.classification JavaTests
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8447 from feynmanliang/SPARK-10256.
    Feynman Liang authored and srowen committed Aug 27, 2015
    Commit: 1a446f7
  11. [SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11

    Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7, which made the build fail with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles: BUILD SUCCESS in both cases.
    
    Build for 2.10:
    
        ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install
    
    and 2.11:
    
        ./dev/change-scala-version.sh 2.11
        ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
    jaceklaskowski authored and srowen committed Aug 27, 2015
    Commit: b02e818
  12. [SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java tests

    * Replaces instances of `Lists.newArrayList` with `Arrays.asList`
    * Replaces `com.google.collections.Strings` with `commons.lang.StringUtils`
    * Uses the `List` interface in place of `ArrayList` implementations
    
    This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8451 from feynmanliang/SPARK-10257.
    Feynman Liang authored and srowen committed Aug 27, 2015
    Commit: e1f4de4
  13. [SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data
    
    `GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache.
    
    The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: the input dataset gets evaluated twice, at line 270 when fitting `StandardScaler` for the first time, and again when running the optimizer. So it might be worth restoring the removed warning.
    
    Another possible solution is to disable caching entirely & restore the removed warning. I don't really know which approach is better.
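
    A hedged sketch of the first approach (hypothetical helper, not the actual patch): cache for the duration of fitting, then unpersist, so no cached RDD is left behind:

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def withCached[T, R](data: RDD[T])(fit: RDD[T] => R): R = {
      val alreadyCached = data.getStorageLevel != StorageLevel.NONE
      if (!alreadyCached) data.persist(StorageLevel.MEMORY_AND_DISK)
      try fit(data)
      finally if (!alreadyCached) data.unpersist()
    }
    ```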
    
    Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
    
    Closes #8395 from SlavikBaranov/SPARK-10182.
    SlavikBaranov authored and srowen committed Aug 27, 2015
    Commit: fdd466b
  14. [SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #8441 from marmbrus/documentation.
    marmbrus committed Aug 27, 2015
    Commit: dc86a22
  15. [SPARK-10315] remove document on spark.akka.failure-detector.threshold

    https://issues.apache.org/jira/browse/SPARK-10315
    
    This parameter is no longer used, and there is a mistake in the current document: it should be 'akka.remote.watch-failure-detector.threshold'.
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #8483 from CodingCat/SPARK_10315.
    CodingCat authored and srowen committed Aug 27, 2015
    Commit: 84baa5e
  16. [SPARK-9901] User guide for RowMatrix Tall-and-skinny QR

    jira: https://issues.apache.org/jira/browse/SPARK-9901
    
    The jira covers only the document update. I can further provide example code for QR (like the ones for SVD and PCA) in a separate PR.
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #8462 from hhbyyh/qrDoc.
    hhbyyh authored and mengxr committed Aug 27, 2015
    Commit: 6185cdd
  17. [SPARK-9906] [ML] User guide for LogisticRegressionSummary

    User guide for LogisticRegression summaries
    
    Author: MechCoder <manojkumarsivaraj334@gmail.com>
    Author: Manoj Kumar <mks542@nyu.edu>
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8197 from MechCoder/log_summary_user_guide.
    MechCoder authored and jkbradley committed Aug 27, 2015
    Commit: c94ecdf
  18. [SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java compatibility test
    
    * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine
    * Cleans up scaladocs for public methods
    * Adds test for Java compatibility
    * Follow up Python user guide code example is tracked by SPARK-10249
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8436 from feynmanliang/SPARK-10230.
    Feynman Liang authored and mengxr committed Aug 27, 2015
    Commit: 5bfe9e1
  19. [SPARK-10287] [SQL] Fixes JSONRelation refreshing on read path

    https://issues.apache.org/jira/browse/SPARK-10287
    
    After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet).
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8469 from yhuai/jsonRefresh.
    yhuai committed Aug 27, 2015
    Commit: b3dd569
  20. [SPARK-10321] sizeInBytes in HadoopFsRelation

    Having sizeInBytes in HadoopFsRelation to enable broadcast join.
    
    cc marmbrus
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8490 from davies/sizeInByte.
    Davies Liu authored and marmbrus committed Aug 27, 2015
    Commit: 54cda0d

Commits on Aug 28, 2015

  1. [SPARK-8505] [SPARKR] Add settings to kick lint-r from `./dev/run-test.py`
    
    JoshRosen we'd like to check the SparkR source code with the `dev/lint-r` script on Jenkins. I tried to incorporate the script into `dev/run-test.py`. Could you review it when you have time?
    
    shivaram I modified `dev/lint-r` and `dev/lint-r.R` to install the lintr package into a local directory (`R/lib/`) and to exit with a lint status. Could you review it?
    
    - [[SPARK-8505] Add settings to kick `lint-r` from `./dev/run-test.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8505)
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #7883 from yu-iskw/SPARK-8505.
    yu-iskw authored and shivaram committed Aug 28, 2015
    Commit: 1f90c5e
  2. [SPARK-9911] [DOC] [ML] Update Userguide for Evaluator

    I added a small note about the different types of evaluator and the metrics used.
    
    Author: MechCoder <manojkumarsivaraj334@gmail.com>
    
    Closes #8304 from MechCoder/multiclass_evaluator.
    MechCoder authored and mengxr committed Aug 28, 2015
    Commit: 30734d4
  3. [SPARK-9905] [ML] [DOC] Adds LinearRegressionSummary user guide

    * Adds user guide for `LinearRegressionSummary`
    * Fixes unresolved issues in  #8197
    
    CC jkbradley mengxr
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8491 from feynmanliang/SPARK-9905.
    Feynman Liang authored and mengxr committed Aug 28, 2015
    Commit: af0e124
  4. [SPARK-SQL] [MINOR] Fixes some typos in HiveContext

    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8481 from liancheng/hive-context-typo.
    liancheng authored and rxin committed Aug 28, 2015
    Commit: 89b9434
  5. [SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model
    
    * Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
    * JavaEvaluator delegates isLargerBetter() to underlying Scala object.
    * Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax (see the sketch below).
    * Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).
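
    A minimal sketch of the selection logic (hypothetical names, not the PySpark patch itself):

    ```scala
    // Pick argmax of the metric when larger is better (e.g. R-squared),
    // argmin otherwise (e.g. RMSE).
    def bestIndex(metrics: Seq[Double], isLargerBetter: Boolean): Int =
      if (isLargerBetter) metrics.indexOf(metrics.max)
      else metrics.indexOf(metrics.min)

    // bestIndex(Seq(1.2, 0.8, 1.0), isLargerBetter = false) == 1 // RMSE-style
    // bestIndex(Seq(0.4, 0.9, 0.7), isLargerBetter = true)  == 1 // R-squared-style
    ```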
    
    (This contribution is my original work and I license the work to the project under Spark's open source license.)
    
    Author: noelsmith <mail@noelsmith.com>
    
    Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
    noel-smith authored and jkbradley committed Aug 28, 2015
    Commit: 7583681
  6. [SPARK-10328] [SPARKR] Fix generic for na.omit

    S3 function is at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #8495 from shivaram/na-omit-fix.
    shivaram committed Aug 28, 2015
    Commit: 2f99c37
  7. [SPARK-10260] [ML] Add @Since annotation to ml.clustering

    ### JIRA
    [[SPARK-10260] Add Since annotation to ml.clustering - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10260)
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #8455 from yu-iskw/SPARK-10260.
    yu-iskw authored and mengxr committed Aug 28, 2015
    Commit: 4eeda8d
  8. [SPARK-10295] [CORE] Dynamic allocation in Mesos does not release when RDDs are cached
    
    Remove obsolete warning about dynamic allocation not working with cached RDDs
    
    See discussion in https://issues.apache.org/jira/browse/SPARK-10295
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8489 from srowen/SPARK-10295.
    srowen committed Aug 28, 2015
    Commit: cc39803
  9. Fix DynamodDB/DynamoDB typo in Kinesis Integration doc

    Fix DynamodDB/DynamoDB typo in Kinesis Integration doc
    
    Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
    
    Closes #8501 from yosssi/patch-1.
    yosssi authored and srowen committed Aug 28, 2015
    Commit: 18294cd
  10. typo in comment

    Author: Dharmesh Kakadia <dharmeshkakadia@users.noreply.github.com>
    
    Closes #8497 from dharmeshkakadia/patch-2.
    dharmeshkakadia authored and srowen committed Aug 28, 2015
    Commit: 71a077f
  11. [YARN] [MINOR] Avoid hard code port number in YarnShuffleService test

    The current port number is fixed to the default (7337) in the test, which can introduce port-contention exceptions; it is better to use a random port in the unit test.
    
    squito, it seems you're the author of this unit test; mind taking a look at this fix? Thanks a lot.
    
    ```
    [info] - executor state kept across NM restart *** FAILED *** (597 milliseconds)
    [info]   org.apache.hadoop.service.ServiceStateException: java.net.BindException: Address already in use
    [info]   at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    [info]   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    [info]   at org.apache.spark.network.yarn.YarnShuffleServiceSuite$$anonfun$1.apply$mcV$sp(YarnShuffleServiceSuite.scala:72)
    [info]   at org.apache.spark.network.yarn.YarnShuffleServiceSuite$$anonfun$1.apply(YarnShuffleServiceSuite.scala:70)
    [info]   at org.apache.spark.network.yarn.YarnShuffleServiceSuite$$anonfun$1.apply(YarnShuffleServiceSuite.scala:70)
    [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
    [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
    ...
    ```
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #8502 from jerryshao/avoid-hardcode-port.
    jerryshao authored and squito committed Aug 28, 2015
    Commit: 1502a0f
  12. [SPARK-9890] [DOC] [ML] User guide for CountVectorizer

    jira: https://issues.apache.org/jira/browse/SPARK-9890
    
    document with Scala and java examples
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #8487 from hhbyyh/cvDoc.
    hhbyyh authored and mengxr committed Aug 28, 2015
    Commit: e2a8430
  13. [SPARK-8952] [SPARKR] - Wrap normalizePath calls with suppressWarnings

    This is based on davies' comment on SPARK-8952, which suggests only calling normalizePath() when the path starts with '~'.
    
    Author: Luciano Resende <lresende@apache.org>
    
    Closes #8343 from lresende/SPARK-8952.
    lresende authored and shivaram committed Aug 28, 2015
    Commit: 499e8e1
  14. [SPARK-10325] Override hashCode() for public Row

    This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`.
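
    A hedged sketch of the restored contract (simplified; the real fix ports the 1.4.x value-based hashCode):

    ```scala
    trait SimpleRow {
      def values: Seq[Any]
      override def equals(o: Any): Boolean = o match {
        case r: SimpleRow => values == r.values
        case _            => false
      }
      // Equal rows must produce equal hash codes.
      override def hashCode: Int = values.foldLeft(37) { (h, v) =>
        41 * h + (if (v == null) 0 else v.hashCode)
      }
    }
    ```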
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits:
    
    51ffea1 [Josh Rosen] Override hashCode() for public Row.
    JoshRosen authored and marmbrus committed Aug 28, 2015
    Commit: d3f87dc
  15. [SPARK-9284] [TESTS] Allow all tests to run without an assembly.

    This change aims at speeding up the dev cycle a little bit, by making
    sure that all tests behave the same w.r.t. where the code to be tested
    is loaded from. Namely, that means that tests don't rely on the assembly
    anymore, rather loading all needed classes from the build directories.
    
    The main change is to make sure all build directories (classes and test-classes)
    are added to the classpath of child processes when running tests.
    
    YarnClusterSuite required some custom code since the executors are run
    differently (i.e. not through the launcher library, like standalone and
    Mesos do).
    
    I also found a couple of tests that could leak a SparkContext on failure,
    and added code to handle those.
    
    With this patch, it's possible to run the following command from a clean
    source directory and have all tests pass:
    
      mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #7629 from vanzin/SPARK-9284.
    Marcelo Vanzin committed Aug 28, 2015
    Commit: c53c902
  16. [SPARK-10336][example] fix not being able to set intercept in LR example

    `fitIntercept` is a command-line option but is never set in the main program.
    
    dbtsai
    
    Author: Shuo Xiang <sxiang@pinterest.com>
    
    Closes #8510 from coderxiang/intercept and squashes the following commits:
    
    57c9b7d [Shuo Xiang] fix not being able to set intercept in LR example
    Shuo Xiang authored and DB Tsai committed Aug 28, 2015
    Commit: 4572321
  17. [SPARK-9671] [MLLIB] re-org user guide and add migration guide

    This PR updates the MLlib user guide and adds migration guide for 1.4->1.5.
    
    * merge migration guide for `spark.mllib` and `spark.ml` packages
    * remove dependency section from `spark.ml` guide
    * move the paragraph about `spark.mllib` and `spark.ml` to the top and recommend `spark.ml`
    * move Sam's talk to footnote to make the section focus on dependencies
    
    Minor changes to code examples and other wording will be in a separate PR.
    
    jkbradley srowen feynmanliang
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8498 from mengxr/SPARK-9671.
    mengxr committed Aug 28, 2015
    Commit: 88032ec
  18. [SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContain

    After this PR, In/InSet/ArrayContain will return null if the value is null, instead of false. They will also return null when there is no match but the set/array contains a null.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8492 from davies/fix_in.
    Davies Liu authored and davies committed Aug 28, 2015
    Commit: bb7f352
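
    The resulting behavior follows SQL's three-valued logic; roughly, sketched against a 1.5-style `sqlContext` (assuming the dialect accepts `SELECT` without `FROM`):

    ```scala
    sqlContext.sql("SELECT 1 IN (1, 2, null)").show()  // true: a match is definite
    sqlContext.sql("SELECT 3 IN (1, 2, null)").show()  // null: 3 might equal the unknown value
    sqlContext.sql("SELECT null IN (1, 2)").show()     // null, no longer false
    ```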

Commits on Aug 29, 2015

  1. [SPARK-9803] [SPARKR] Add subset and transform + tests

    Add subset and transform
    Also reorganize `[` & `[[` to subset instead of select
    
    Note: transform is very similar to mutate. Spark doesn't seem to replace an existing column of the same name in mutate (i.e. `mutate(df, age = df$age + 2)` returns a DataFrame with two columns named 'age'), so transform doesn't do that for now either.
    Though it is clearly stated that transform should replace a column with a matching name (should I open a JIRA for mutate/transform?)
    
    Author: felixcheung <felixcheung_m@hotmail.com>
    
    Closes #8503 from felixcheung/rsubset_transform.
    felixcheung authored and shivaram committed Aug 29, 2015
    Commit: 2a4e00c
  2. [SPARK-9910] [ML] User guide for train validation split

    Author: martinzapletal <zapletal-martin@email.cz>
    
    Closes #8377 from zapletal-martin/SPARK-9910.
    zapletal-martin authored and mengxr committed Aug 29, 2015
    Commit: e8ea5ba
  3. [SPARK-10350] [DOC] [SQL] Removed duplicated option description from …

    …SQL guide
    
    Author: GuoQiang Li <witgo@qq.com>
    
    Closes #8520 from witgo/SPARK-10350.
    witgo authored and marmbrus committed Aug 29, 2015
    Commit: 5369be8
  4. [SPARK-10289] [SQL] A direct write API for testing Parquet

    This PR introduces a direct write API for testing Parquet. It's a DSL-flavored version of the [`writeDirect` method] [1] that comes with the parquet-avro testing code. With this API, it's much easier to construct arbitrary Parquet structures. It's especially useful when adding regression tests for various compatibility corner cases.
    
    Sample usage of this API can be found in the new test case added in `ParquetThriftCompatibilitySuite`.
    
    [1]: https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8454 from liancheng/spark-10289/parquet-testing-direct-write-api.
    liancheng authored and marmbrus committed Aug 29, 2015
    Commit: 24ffa85
  5. [SPARK-10344] [SQL] Add tests for extraStrategies

    Actually using this API requires access to a lot of classes that we might make private by accident.  I've added some tests to prevent this.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #8516 from marmbrus/extraStrategiesTests.
    marmbrus authored and yhuai committed Aug 29, 2015
    Commit: 5c3d16a
  6. [SPARK-10226] [SQL] Fix exclamation mark issue in SparkSQL

    When I tested the latest version of Spark with an exclamation mark, I got some errors. I then stepped back through Spark versions and found that commit id "a2409d1c8e8ddec04b529ac6f6a12b5993f0eeda" introduced the bug. With the jline version changing from 0.9.94 to 2.12 in that commit, the exclamation mark is treated as a special character in ConsoleReader.
    
    Author: wangwei <wangwei82@huawei.com>
    
    Closes #8420 from small-wang/jline-SPARK-10226.
    small-wang authored and marmbrus committed Aug 29, 2015
    Commit: 277148b
  7. [SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection metho…

    …ds in more places
    
    SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.
    JoshRosen authored and marmbrus committed Aug 29, 2015
    Commit: 6a6f3c9
  8. [SPARK-10339] [SPARK-10334] [SPARK-10301] [SQL] Partitioned table sca…

    …n can OOM driver and throw a better error message when users need to enable parquet schema merging
    
    This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and can take down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables.
    
    https://issues.apache.org/jira/browse/SPARK-10339
    https://issues.apache.org/jira/browse/SPARK-10334
    
    Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let users know what to do.
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8515 from yhuai/partitionedTableScan.
    yhuai authored and marmbrus committed Aug 29, 2015
    Commit: 097a7e3

Commits on Aug 30, 2015

  1. [SPARK-9986] [SPARK-9991] [SPARK-9993] [SQL] Create a simple test fra…

    …mework for local operators
    
    This PR includes the following changes:
    - Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
    - Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8464 from zsxwing/local-execution.
    zsxwing authored and rxin committed Aug 30, 2015
    Commit: 13f5f8e
  2. [SPARK-10348] [MLLIB] updates ml-guide

    * replace `ML Dataset` by `DataFrame` to unify the abstraction
    * ML algorithms -> pipeline components to describe the main concept
    * remove Scala API doc links from the main guide
    * `Section Title` -> `Section title` to be consistent with other section titles in the MLlib guide
    * modified lines break at 100 chars or periods
    
    jkbradley feynmanliang
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8517 from mengxr/SPARK-10348.
    mengxr committed Aug 30, 2015
    Commit: 905fbe4
  3. [SPARK-10331] [MLLIB] Update example code in ml-guide

    * The example code was added in 1.2, before `createDataFrame`. This PR switches to `createDataFrame`. Java code still uses JavaBean.
    * assume `sqlContext` is available
    * fix some minor issues from previous code review
    
    jkbradley srowen feynmanliang
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8518 from mengxr/SPARK-10331.
    mengxr committed Aug 30, 2015
    Commit: ca69fc8
  4. [SPARK-10184] [CORE] Optimization for bounds determination in RangePa…

    …rtitioner
    
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-10184
    
    Change `cumWeight > target` to `cumWeight >= target` in the `RangePartitioner.determineBounds` method to make the output partitions more balanced.
    
    Author: ihainan <ihainan72@gmail.com>
    
    Closes #8397 from ihainan/opt_for_rangepartitioner.
    ihainan authored and srowen committed Aug 30, 2015
    Commit: 1bfd934
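
    A simplified sketch of the bounds loop (not Spark's full implementation, which also skips duplicate keys): with `>=`, a candidate whose cumulative weight exactly hits the target closes the current partition instead of spilling over into the next one.

    ```scala
    def determineBounds(ordered: Seq[(Int, Double)], partitions: Int): Seq[Int] = {
      val step = ordered.map(_._2).sum / partitions
      val bounds = scala.collection.mutable.ArrayBuffer.empty[Int]
      var cumWeight = 0.0
      var target = step
      for ((key, weight) <- ordered if bounds.size < partitions - 1) {
        cumWeight += weight
        if (cumWeight >= target) {  // was: cumWeight > target
          bounds += key
          target += step
        }
      }
      bounds
    }
    ```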
  5. [SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some …

    …subset of matrix multiplications
    
    mengxr jkbradley rxin
    
    It would be great if this fix made it into RC3!
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #8525 from brkyvz/blas-scaling.
    brkyvz authored and mengxr committed Aug 30, 2015
    Commit: 8d2ab75
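
    For context, the BLAS contract is C := alpha * A * B + beta * C. A toy gemm for dense, column-major matrices showing why beta == 0.0 must overwrite C outright: scaling the old C by zero is not equivalent when C holds NaN or uninitialized values.

    ```scala
    def gemm(alpha: Double, a: Array[Double], b: Array[Double],
             beta: Double, c: Array[Double], m: Int, k: Int, n: Int): Unit = {
      for (j <- 0 until n; i <- 0 until m) {
        var sum = 0.0
        for (l <- 0 until k) sum += a(i + l * m) * b(l + j * k)
        val idx = i + j * m
        // beta == 0.0 means "ignore the old C", not "scale the old C by zero".
        c(idx) = if (beta == 0.0) alpha * sum else alpha * sum + beta * c(idx)
      }
    }
    ```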

Commits on Aug 31, 2015

  1. SPARK-9545, SPARK-9547: Use Maven in PRB if title contains "[test-mav…

    …en]"
    
    This is just some small glue code to actually make use of the
    AMPLAB_JENKINS_BUILD_TOOL switch. As far as I can tell, we actually
    don't currently use the Maven support in the tool even though it exists.
    This patch switches to Maven when the PR title contains "test-maven".
    
    There are a few small other pieces of cleanup in the patch as well.
    
    Author: Patrick Wendell <patrick@databricks.com>
    
    Closes #7878 from pwendell/maven-tests.
    pwendell committed Aug 31, 2015
    Commit: 35e896a
  2. [SPARK-10351] [SQL] Fixes UTF8String.fromAddress to handle off-heap m…

    …emory
    
    CC rxin marmbrus
    
    Author: Feynman Liang <fliang@databricks.com>
    
    Closes #8523 from feynmanliang/SPARK-10351.
    Feynman Liang authored and rxin committed Aug 31, 2015
    Commit: 8694c3a
  3. [SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| i…

    …nitialization
    
    * do not cache first cost RDD
    * change following cost RDD cache level to MEMORY_AND_DISK
    * remove Vector wrapper to save an object per instance
    
    Further improvements will be addressed in SPARK-10329
    
    cc: yu-iskw HuJiayin
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8526 from mengxr/SPARK-10354.
    mengxr committed Aug 31, 2015
    Commit: f0f563a
  4. [SPARK-8730] Fixes - Deser objects containing a primitive class attri…

    …bute
    
    Author: EugenCepoi <cepoi.eugen@gmail.com>
    
    Closes #7122 from EugenCepoi/master.
    EugenCepoi authored and squito committed Aug 31, 2015
    Commit: 72f6dbf
  5. [SPARK-10369] [STREAMING] Don't remove ReceiverTrackingInfo when dere…

    …gistering a receiver, since we may reuse it later
    
    `deregisterReceiver` should not remove `ReceiverTrackingInfo`. Otherwise, it will throw `java.util.NoSuchElementException: key not found` when restarting it.
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8538 from zsxwing/SPARK-10369.
    zsxwing authored and tdas committed Aug 31, 2015
    Commit: 4a5fe09
  6. [SPARK-10170] [SQL] Add DB2 JDBC dialect support.

    DataFrame writes to a DB2 database fail because, by default, the JDBC data source implementation generates a table schema with data types DB2 does not support: TEXT for String and BIT(1) for Boolean.
    
    This patch registers a DB2 JDBC dialect that maps String and Boolean to valid DB2 data types.
    
    Author: sureshthalamati <suresh.thalamati@gmail.com>
    
    Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
    sureshthalamati authored and rxin committed Aug 31, 2015
    Commit: a2d5c72
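
    Registering such a dialect looks roughly like this (the exact type mappings here are illustrative; see the PR for the ones actually chosen):

    ```scala
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    object DB2Dialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")
      // Map the types DB2 rejects to ones it supports.
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case StringType  => Some(JdbcType("CLOB", java.sql.Types.CLOB))
        case BooleanType => Some(JdbcType("CHAR(1)", java.sql.Types.CHAR))
        case _           => None
      }
    }

    JdbcDialects.registerDialect(DB2Dialect)
    ```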
  7. [SPARK-9954] [MLLIB] use first 128 nonzeros to compute Vector.hashCode

    This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8182 from mengxr/SPARK-9954.
    mengxr committed Aug 31, 2015
    Commit: 23e39cc
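
    A toy version of the idea (not Spark's exact mixing function): fold the index and value of at most the first 128 nonzero entries into the hash, so long sparse vectors hash cheaply while still spreading across buckets.

    ```scala
    def vectorHash(values: Array[Double]): Int = {
      var h = 7
      var i = 0
      var nnz = 0
      while (i < values.length && nnz < 128) {
        val v = values(i)
        if (v != 0.0) {
          h = 31 * h + i
          h = 31 * h + java.lang.Double.valueOf(v).hashCode()
          nnz += 1
        }
        i += 1
      }
      h
    }
    ```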
  8. [SPARK-8472] [ML] [PySpark] Python API for DCT

    Add Python API for ml.feature.DCT.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8485 from yanboliang/spark-8472.
    yanboliang authored and mengxr committed Aug 31, 2015
    Commit: 5b3245d
  9. [SPARK-10341] [SQL] fix memory starving in unsafe SMJ

    In SMJ, the first ExternalSorter could consume all the memory before spilling, so the second cannot even acquire its first page.
    
    Until we have a better memory allocator, SMJ should call prepare() before calling compute() on any of its children.
    
    cc rxin JoshRosen
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8511 from davies/smj_memory.
    Davies Liu authored and rxin committed Aug 31, 2015
    Commit: 540bdee
  10. [SPARK-10349] [ML] OneVsRest use 'when ... otherwise' not UDF to gene…

    …rate new label at binary reduction
    
    Currently OneVsRest uses a UDF to generate the new binary label during training.
    Since [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise```, which is more efficient.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8519 from yanboliang/spark-10349.
    yanboliang authored and mengxr committed Aug 31, 2015
    Commit: fe16fd0
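
    The change in spirit, sketched with an assumed `sqlContext` and illustrative column names: the binary label becomes a native Catalyst expression instead of an opaque UDF, so it benefits from expression-level optimizations.

    ```scala
    import org.apache.spark.sql.functions.{col, lit, when}

    val df = sqlContext.createDataFrame(Seq((0.0, 1.5), (3.0, 2.5))).toDF("label", "feature")
    val idx = 3.0
    // Before: udf { label: Double => if (label == idx) 1.0 else 0.0 }
    val binarized = df.withColumn("binaryLabel",
      when(col("label") === lit(idx), 1.0).otherwise(0.0))
    ```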
  11. [SPARK-10355] [ML] [PySpark] Add Python API for SQLTransformer

    Add Python API for SQLTransformer
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8527 from yanboliang/spark-10355.
    yanboliang authored and mengxr committed Aug 31, 2015
    Commit: 52ea399

Commits on Sep 1, 2015

  1. [SPARK-10378][SQL][Test] Remove HashJoinCompatibilitySuite.

    They don't bring much value since we now have better unit test coverage for hash joins. This will also help reduce the test time.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #8542 from rxin/SPARK-10378.
    rxin committed Sep 1, 2015
    Commit: d65656c
  2. [SPARK-10301] [SQL] Fixes schema merging for nested structs

    This PR can be quite challenging to review.  I'm trying to give a detailed description of the problem as well as its solution here.
    
    When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning.  This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns.  However, this translation can be fairly complicated because of several reasons:
    
    1.  Requested schema must conform to the real schema of the physical file to be read.
    
        This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema.  Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231.
    
    1.  Support for schema merging.
    
        A single Parquet dataset may consist of multiple physical Parquet files that come with different but compatible schemas.  This means we may request a column path that doesn't exist in a physical Parquet file.  All requested column paths can be nested.  For example, for a Parquet file schema
    
        ```
        message root {
          required group f0 {
            required group f00 {
              required int32 f000;
              required binary f001 (UTF8);
            }
          }
        }
        ```
    
        we may request for column paths defined in the following schema:
    
        ```
        message root {
          required group f0 {
            required group f00 {
              required binary f001 (UTF8);
              required float f002;
            }
          }
    
          optional double f1;
        }
        ```
    
        Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`.
    
        The good news is that Parquet handles non-existing column paths properly and always returns null for them.
    
    1.  The map from `StructType` to `MessageType` is a one-to-many map.
    
        This is the most unfortunate part.
    
        Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors".  For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:
    
        ```
        message m0 {
          repeated int32 f;
        }
        ```
    
        while parquet-avro generates another version:
    
        ```
        message m1 {
          required group f (LIST) {
            repeated int32 array;
          }
        }
        ```
    
        and parquet-thrift spills this:
    
        ```
        message m2 {
          required group f (LIST) {
            repeated int32 f_tuple;
          }
        }
        ```
    
        All of them can be mapped to the following _unique_ Catalyst schema:
    
        ```
        StructType(
          StructField(
            "f",
            ArrayType(IntegerType, containsNull = false),
            nullable = false))
        ```
    
        This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases.  To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.
    
    In earlier Spark versions, we didn't try to fix this issue properly.  Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but it is broken in many other cases.  Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones.  This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005].  In PR #8228, I tried to avoid the hard part of the problem and made a minimal change in `CatalystRowConverter` to fix SPARK-10005.  However, when taking SPARK-10301 into consideration, continuing to hack `CatalystRowConverter` doesn't seem to be a good idea.  So this PR is an attempt to fix the problem in a proper way.
    
    For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`:
    
    For a leaf column path `c` in `cs`:
    
    - if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`;
    - otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`;
    - no other column paths should exist in `ps'`.
    
    Then comes the most tedious part:
    
    > Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`?
    
    Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in parquet-format spec.  They are:
    
    1.  the standard structure of nested types, and
    1.  cases defined in all backwards-compatibility rules for `LIST` and `MAP`.
    
    The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form.  Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively.  The column path selection algorithm is implemented in `clipParquetGroupFields()`.
    
    With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`.  Another benefit is that we can now also read Parquet datasets that consist of files with different physical Parquet schemas but share the same logical schema, for example, files generated by different Parquet libraries.  This situation is illustrated by [this test case] [test-case].
    
    [spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301
    [spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005
    [test-case]: liancheng@38644d8#diff-a9b98e28ce3ae30641829dffd1173be2R26
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
    liancheng committed Sep 1, 2015
    Commit: 391e6be
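
    A toy rendition of the column-path selection described above, with plain case classes standing in for the Parquet/Catalyst types (all names here are illustrative, not Spark's API):

    ```scala
    sealed trait PType { def name: String }
    case class PLeaf(name: String) extends PType
    case class PGroup(name: String, fields: List[PType]) extends PType

    // For each requested field, reuse the matching physical column when the file
    // has one, recursing into groups; otherwise keep the requested (synthesized)
    // column, which Parquet will simply read as null.
    def clipFields(physical: List[PType], requested: List[PType]): List[PType] =
      requested.map { r =>
        (physical.find(_.name == r.name), r) match {
          case (Some(pg: PGroup), rg: PGroup) => PGroup(pg.name, clipFields(pg.fields, rg.fields))
          case (Some(p), _)                   => p
          case (None, _)                      => r
        }
      }
    ```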
  3. [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover

    Add a python API for the Stop Words Remover.
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
    holdenk authored and mengxr committed Sep 1, 2015
    Commit: e6e483c
  4. [SPARK-10398] [DOCS] Migrate Spark download page to use new lua mirro…

    …ring scripts
    
    Migrate Apache download closer.cgi refs to new closer.lua
    
    This is the bit of the change that affects the project docs; I'm implementing the changes to the Apache site separately.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8557 from srowen/SPARK-10398.
    srowen committed Sep 1, 2015
    Commit: 3f63bd6
  5. [SPARK-4223] [CORE] Support * in acls.

    SPARK-4223.
    
    Currently we support setting view and modify acls but you have to specify a list of users. It would be nice to support * meaning all users have access.
    
    Manual tests verify that "*" works for any user in:
    a. Spark UI: view and kill stage. Done.
    b. Spark history server. Done.
    c. YARN application killing. Done.
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes #8398 from zhuoliu/4223.
    zhuol authored and rxin committed Sep 1, 2015
    Commit: ec01280
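
    The wildcard check itself is simple; a sketch of the shape it takes (simplified, not SecurityManager's actual code):

    ```scala
    // With spark.ui.view.acls=*, every authenticated user passes this check.
    def checkUIViewPermissions(user: String, viewAcls: Set[String]): Boolean =
      viewAcls.contains("*") || viewAcls.contains(user)
    ```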
  6. [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe f…

    …ilter function
    
    This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
    The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
    * Timezone information of this datetime is ignored
    * The datetime is assumed to be in the local timezone, which depends on the OS timezone setting
    
    Fix includes both code change and regression test. Problem reproduction code on master:
    ```python
    import pytz
    from datetime import datetime
    from pyspark.sql import *
    from pyspark.sql.types import *
    sqc = SQLContext(sc)
    df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
    
    m1 = pytz.timezone('UTC')
    m2 = pytz.timezone('Etc/GMT+3')
    
    df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
    df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
    ```
    It gives the same timestamp ignoring time zone:
    ```
    >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
    Filter (dt#0 > 946713600000000)
     Scan PhysicalRDD[dt#0]
    
    >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
    Filter (dt#0 > 946713600000000)
     Scan PhysicalRDD[dt#0]
    ```
    After the fix:
    ```
    >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
    Filter (dt#0 > 946684800000000)
     Scan PhysicalRDD[dt#0]
    
    >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
    Filter (dt#0 > 946695600000000)
     Scan PhysicalRDD[dt#0]
    ```
    PR [8536](#8536) was accidentally closed by me when I dropped the repo
    
    Author: 0x0FFF <programmerag@gmail.com>
    
    Closes #8555 from 0x0FFF/SPARK-10162.
    0x0FFF authored and davies committed Sep 1, 2015
    Commit: bf550a4
  7. [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection

    This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
    The problem is that for the "start of epoch" date (01 Jan 1970) the PySpark DateType class returns 0 instead of a `datetime.date`, due to the implementation of its return statement
    
    Issue reproduction on master:
    ```
    >>> from pyspark.sql.types import *
    >>> a = DateType()
    >>> a.fromInternal(0)
    0
    >>> a.fromInternal(1)
    datetime.date(1970, 1, 2)
    ```
    
    Author: 0x0FFF <programmerag@gmail.com>
    
    Closes #8556 from 0x0FFF/SPARK-10392.
    0x0FFF authored and davies committed Sep 1, 2015
    Commit: 00d9af5

Commits on Sep 2, 2015

  1. [SPARK-7336] [HISTORYSERVER] Fix bug that applications status incorre…

    …ct on JobHistory UI.
    
    Author: ArcherShao <shaochuan@huawei.com>
    
    Closes #5886 from ArcherShao/SPARK-7336.
    ArcherShao authored and Marcelo Vanzin committed Sep 2, 2015
    Commit: c3b881a
  2. [SPARK-10034] [SQL] add regression test for Sort on Aggregate

    Before #8371, there was a bug with `Sort` on `Aggregate`: we couldn't use aggregate expressions named `_aggOrdering`, nor more than one ordering expression containing aggregate functions. The reason for this bug is that the aggregate expression in `SortOrder` never gets resolved; we alias it with `_aggOrdering` and call `toAttribute`, which gives us an `UnresolvedAttribute`. So we are actually referencing the aggregate expression by name, not by exprId as we thought. And if there is already an aggregate expression named `_aggOrdering`, or more than one ordering expression contains aggregate functions, we end up with conflicting names and can't search by name.
    
    However, after #8371 was merged, the `SortOrder`s are guaranteed to be resolved and we always reference aggregate expressions by exprId. The bug doesn't exist anymore, and this PR adds regression tests for it.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8231 from cloud-fan/sort-agg.
    cloud-fan authored and marmbrus committed Sep 2, 2015
    Commit: 56c4c17
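
    The shape of query the regression tests guard, sketched with `sqlContext` and a hypothetical table `t`: ordering by aggregate expressions that do not appear in the select list, including more than one of them.

    ```scala
    sqlContext.sql(
      """SELECT key, MAX(value)
        |FROM t
        |GROUP BY key
        |ORDER BY MAX(value), AVG(value)
      """.stripMargin)
    ```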
  3. [SPARK-10389] [SQL] support order by non-attribute grouping expressio…

    …n on Aggregate
    
    For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8548 from cloud-fan/support-order-by-non-attribute.
    cloud-fan authored and marmbrus committed Sep 2, 2015
    Commit: fc48307
  4. [SPARK-10004] [SHUFFLE] Perform auth checks when clients read shuffle…

    … data.
    
    To correctly isolate applications, when requests to read shuffle data
    arrive at the shuffle service, proper authorization checks need to
    be performed. This change makes sure that only the application that
    created the shuffle data can read from it.
    
    Such checks are only enabled when "spark.authenticate" is enabled,
    otherwise there's no secure way to make sure that the client is really
    who it says it is.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #8218 from vanzin/SPARK-10004.
    Marcelo Vanzin committed Sep 2, 2015
    Commit: 2da3a9e
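
    The essence of the check (a simplified sketch, not the shuffle service's actual code): the application ID the client authenticated as must match the application whose blocks it requests.

    ```scala
    def checkAccess(authenticatedAppId: String, requestedAppId: String): Unit = {
      if (authenticatedAppId != requestedAppId) {
        throw new SecurityException(
          s"Client for $authenticatedAppId may not read blocks of $requestedAppId")
      }
    }
    ```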
  5. [SPARK-10417] [SQL] Iterating through Column results in infinite loop

    The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. It has `__getitem__` to handle the case where the column holds a list or dict, so that certain elements can be accessed in the DF API. The ability to iterate over it is just a side effect that may confuse people getting familiar with Spark DataFrames (since one might iterate this way over a Pandas DF, for instance)
    
    Issue reproduction:
    ```
    df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
    for i in df["name"]: print i
    ```
    
    Author: 0x0FFF <programmerag@gmail.com>
    
    Closes #8574 from 0x0FFF/SPARK-10417.
    0x0FFF authored and davies committed Sep 2, 2015
    Commit: 6cd98c1

Commits on Sep 3, 2015

  1. [SPARK-10422] [SQL] String column in InMemoryColumnarCache needs to o…

    …verride clone method
    
    https://issues.apache.org/jira/browse/SPARK-10422
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8578 from yhuai/SPARK-10422.
    yhuai authored and davies committed Sep 3, 2015
    Commit: 03f3e91
  2. [SPARK-9723] [ML] params getordefault should throw more useful error

    Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
    holdenk authored and mengxr committed Sep 3, 2015
    Commit: 44948a2
  3. [SPARK-5945] Spark should not retry a stage infinitely on a FetchFail…

    …edException
    
    The ```Stage``` class now tracks whether there were a sufficient number of consecutive failures of that stage to trigger an abort.
    
    To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive stage failures for one stage. We still allow more than 4 consecutive stage failures if there is an intervening successful attempt for the stage, so that in very long-lived applications, where a stage may get reused many times, we don't abort the job after failures that have been recovered from successfully.
    
    I've added test cases to exercise the most obvious scenarios.
    
    Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
    
    Closes #5636 from ilganeli/SPARK-5945.
    Ilya Ganelin authored and Andrew Or committed Sep 3, 2015
    Commit: 4bd85d0
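
    The gist of the bookkeeping (illustrative, not the `Stage` class verbatim): a per-stage counter that resets on a successful attempt and signals an abort after 4 consecutive failures.

    ```scala
    class StageFailureTracker(maxConsecutiveFailures: Int = 4) {
      private var consecutiveFailures = 0
      def recordSuccessfulAttempt(): Unit = consecutiveFailures = 0
      // Returns true when the job should be aborted.
      def recordFailedAttempt(): Boolean = {
        consecutiveFailures += 1
        consecutiveFailures >= maxConsecutiveFailures
      }
    }
    ```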
  4. [SPARK-8707] RDD#toDebugString fails if any cached RDD has invalid pa…

    …rtitions
    
    Added numPartitions(evaluate: Boolean) to RDD. With `evaluate=true` the method is the same as `partitions.length`. With `evaluate=false`, it inspects checked-out or already-evaluated partitions in the RDD to get the number of partitions, and returns -1 otherwise. RDDInfo.partitionNum calls numPartitions only when it's accessed.
    
    Author: navis.ryu <navis@apache.org>
    
    Closes #7127 from navis/SPARK-8707.
    navis authored and Andrew Or committed Sep 3, 2015
    Commit: 0985d2c
  5. Removed code duplication in ShuffleBlockFetcherIterator

    Added fetchUpToMaxBytes() to prevent having to update both code blocks when a change is made.
    
    Author: Evan Racah <ejracah@gmail.com>
    
    Closes #8514 from eracah/master.
    eracah authored and Andrew Or committed Sep 3, 2015
    Commit: f6c447f
  6. [SPARK-10247] [CORE] improve readability of a test case in DAGSchedul…

    …erSuite
    
    This is pretty minor, just trying to improve the readability of `DAGSchedulerSuite`, I figure every bit helps.  Before whenever I read this test, I never knew what "should work" and "should be ignored" really meant -- this adds some asserts & updates comments to make it more clear.  Also some reformatting per a suggestion from markhamstra on #7699
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes #8434 from squito/SPARK-10247.
    squito authored and Andrew Or committed Sep 3, 2015
    Commit: 3ddb9b3
  7. [SPARK-10379] preserve first page in UnsafeShuffleExternalSorter

    Author: Davies Liu <davies@databricks.com>
    
    Closes #8543 from davies/preserve_page.
    Davies Liu authored and Andrew Or committed Sep 3, 2015
    Commit: 62b4690
  8. [SPARK-10411] [SQL] Move visualization above explain output and hide …

    …explain by default
    
    New screenshots after this fix:
    
    <img width="627" alt="s1" src="https://cloud.githubusercontent.com/assets/1000778/9625782/4b2dba36-518b-11e5-9104-c713ff026e3d.png">
    
    Default:
    <img width="462" alt="s2" src="https://cloud.githubusercontent.com/assets/1000778/9625817/92366e50-518b-11e5-9981-cdfb774d66b8.png">
    
    After clicking `+details`:
    <img width="377" alt="s3" src="https://cloud.githubusercontent.com/assets/1000778/9625784/4ba24342-518b-11e5-8522-846a16a95d44.png">
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8570 from zsxwing/SPARK-10411.
    zsxwing authored and Andrew Or committed Sep 3, 2015
    Commit: 0349b5b
  9. [SPARK-10332] [CORE] Fix yarn spark executor validation

    From Jira:
    Running spark-submit with yarn with number-executors equal to 0 when not using dynamic allocation should error out.
    In spark 1.5.0 it continues and ends up hanging.
    yarn.ClientArguments still has the check so something else must have changed.
    spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 0 ....
    spark 1.4.1 errors with:
    java.lang.IllegalArgumentException:
    Number of executors was 0, but must be at least 1
    (or 0 if dynamic executor allocation is enabled).
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8580 from holdenk/SPARK-10332-spark-submit-to-yarn-executors-0-message.
    holdenk authored and srowen committed Sep 3, 2015
    Commit: 67580f1
  10. [SPARK-9596] [SQL] treat hadoop classes as shared one in IsolatedClie…

    …ntLoader
    
    https://issues.apache.org/jira/browse/SPARK-9596
    
    Author: WangTaoTheTonic <wangtao111@huawei.com>
    
    Closes #7931 from WangTaoTheTonic/SPARK-9596.
    WangTaoTheTonic authored and marmbrus committed Sep 3, 2015
    Commit: 3abc0d5
  11. [SPARK-8951] [SPARKR] support Unicode characters in collect()

    Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
    I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.
    
    Author: CHOIJAEHONG <redrock07@naver.com>
    
    Closes #7494 from CHOIJAEHONG1/SPARK-8951.
    CHOIJAEHONG authored and shivaram committed Sep 3, 2015
    Commit: af0e312
  12. [SPARK-10432] spark.port.maxRetries documentation is unclear

    Author: Tom Graves <tgraves@yahoo-inc.com>
    
    Closes #8585 from tgravescs/SPARK-10432.
    Tom Graves authored and Andrew Or committed Sep 3, 2015
    Commit: 49aff7b
  13. [SPARK-10431] [CORE] Fix intermittent test failure. Wait for event qu…

    …eue to be clear
    
    Author: robbins <robbins@uk.ibm.com>
    
    Closes #8582 from robbinspg/InputOutputMetricsSuite.
    robbins authored and Andrew Or committed Sep 3, 2015
    Commit: d911c68
  14. [SPARK-9869] [STREAMING] Wait for all event notifications before asse…

    …rting results
    
    Author: robbins <robbins@uk.ibm.com>
    
    Closes #8589 from robbinspg/InputStreamSuite-fix.
    robbins authored and Andrew Or committed Sep 3, 2015
    Commit: 754f853
  15. [SPARK-9672] [MESOS] Don’t include SPARK_ENV_LOADED when passing env …

    …vars
    
    This contribution is my original work and I license the work to the project under the project's open source license.
    
    Author: Pat Shields <yeoldefortran@gmail.com>
    
    Closes #7979 from pashields/env-loading-on-driver.
    pashields authored and Andrew Or committed Sep 3, 2015
    Commit: e62f4a4
  16. [SPARK-10430] [CORE] Added hashCode methods in AccumulableInfo and RD…

    …DOperationScope
    
    Author: Vinod K C <vinod.kc@huawei.com>
    
    Closes #8581 from vinodkc/fix_RDDOperationScope_Hashcode.
    Vinod K C authored and Andrew Or committed Sep 3, 2015
    Commit: 11ef32c
  17. [SPARK-9591] [CORE] Job may fail for exception during getting remote …

    …block
    
    [SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)
    When getting a broadcast variable, we can fetch the block from several locations, but currently, connecting to a lost block manager (for example, one idle long enough to be removed by the driver when dynamic resource allocation is used) causes the task to fail, and in the worst case the job fails.
    
    Author: jeanlyn <jeanlyn92@gmail.com>
    
    Closes #7927 from jeanlyn/catch_exception.
    jeanlyn authored and Andrew Or committed Sep 3, 2015
    Commit: db4c130
  18. [SPARK-10435] Spark submit should fail fast for Mesos cluster mode wi…

    …th R
    
    It's not supported yet so we should error with a clear message.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8590 from andrewor14/mesos-cluster-r-guard.
    Andrew Or committed Sep 3, 2015
    Commit: 08b0750
  19. [SPARK-10421] [BUILD] Exclude curator artifacts from tachyon dependen…

    …cies.
    
    This avoids them being mistakenly pulled instead of the newer ones that
    Spark actually uses. Spark only depends on these artifacts transitively,
    so sometimes maven just decides to pick tachyon's version of the
    dependency for whatever reason.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #8577 from vanzin/SPARK-10421.
    Marcelo Vanzin committed Sep 3, 2015
    Commit: 208fbca

Commits on Sep 4, 2015

  1. [SPARK-10003] Improve readability of DAGScheduler

    Note: this is not intended to be in Spark 1.5!
    
    This patch rewrites some code in the `DAGScheduler` to make it more readable. In particular
    - there were blocks of code that are unnecessary and removed for simplicity
    - there were abstractions that are unnecessary and made the code hard to navigate
    - other minor changes
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8217 from andrewor14/dag-scheduler-readability and squashes the following commits:
    
    57abca3 [Andrew Or] Move comment back into if case
    574fb1e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-scheduler-readability
    64a9ed2 [Andrew Or] Remove unnecessary code + minor code rewrites
    Andrew Or authored and kayousterhout committed Sep 4, 2015
    Commit: cf42138
  2. [MINOR] Minor style fix in SparkR

    `dev/lintr-r` passes on my machine now
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #8601 from shivaram/sparkr-style-fix.
    shivaram committed Sep 4, 2015
    Commit: 143e521
  3. MAINTENANCE: Automated closing of pull requests.

    This commit exists to close the following pull requests on Github:
    
    Closes #1890 (requested by andrewor14, JoshRosen)
    Closes #3558 (requested by JoshRosen, marmbrus)
    Closes #3890 (requested by marmbrus)
    Closes #3895 (requested by andrewor14, marmbrus)
    Closes #4055 (requested by andrewor14)
    Closes #4105 (requested by andrewor14)
    Closes #4812 (requested by marmbrus)
    Closes #5109 (requested by andrewor14)
    Closes #5178 (requested by andrewor14)
    Closes #5298 (requested by marmbrus)
    Closes #5393 (requested by marmbrus)
    Closes #5449 (requested by andrewor14)
    Closes #5468 (requested by marmbrus)
    Closes #5715 (requested by marmbrus)
    Closes #6192 (requested by marmbrus)
    Closes #6319 (requested by marmbrus)
    Closes #6326 (requested by marmbrus)
    Closes #6349 (requested by marmbrus)
    Closes #6380 (requested by andrewor14)
    Closes #6554 (requested by marmbrus)
    Closes #6696 (requested by marmbrus)
    Closes #6868 (requested by marmbrus)
    Closes #6951 (requested by marmbrus)
    Closes #7129 (requested by marmbrus)
    Closes #7188 (requested by marmbrus)
    Closes #7358 (requested by marmbrus)
    Closes #7379 (requested by marmbrus)
    Closes #7628 (requested by marmbrus)
    Closes #7715 (requested by marmbrus)
    Closes #7782 (requested by marmbrus)
    Closes #7914 (requested by andrewor14)
    Closes #8051 (requested by andrewor14)
    Closes #8269 (requested by andrewor14)
    Closes #8448 (requested by andrewor14)
    Closes #8576 (requested by andrewor14)
    marmbrus committed Sep 4, 2015
    Commit: 804a012
  4. [SPARK-10176] [SQL] Show partially analyzed plans when checkAnswer fa…

    …ils to analyze
    
    This PR takes over #8389.
    
    This PR improves `checkAnswer` to print the partially analyzed plan in addition to the user friendly error message, in order to aid debugging failing tests.
    
    In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests. Depending on the trait, we refer to the current context as `sqlContext`, `_sqlContext`, `ctx` or `hiveContext`, with access modifiers `public`, `protected` and `private` depending on the defining class.
    
    I propose we refactor as follows:
    
    1. All tests should only refer to a `protected sqlContext` when testing general features, and `protected hiveContext` when it is a method that only exists on a `HiveContext`.
    2. All tests should only import `testImplicits._` (i.e., don't import `TestHive.implicits._`)
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8584 from cloud-fan/cleanupTests.
    cloud-fan authored and Andrew Or committed Sep 4, 2015
    Commit: c3c0e43
  5. [SPARK-10450] [SQL] Minor improvements to readability / style / typos…

    … etc.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8603 from andrewor14/minor-sql-changes.
    Andrew Or committed Sep 4, 2015
    Commit: 3339e6f
  6. [SPARK-9669] [MESOS] Support PySpark on Mesos cluster mode.

    Support running pyspark with cluster mode on Mesos!
    This doesn't upload any scripts, so running against a remote Mesos cluster requires the user to specify the script via an accessible URI.
    
    Author: Timothy Chen <tnachen@gmail.com>
    
    Closes #8349 from tnachen/mesos_python.
    tnachen authored and Andrew Or committed Sep 4, 2015
    Commit: b087d23
  7. [SPARK-10454] [SPARK CORE] wait for empty event queue

    Author: robbins <robbins@uk.ibm.com>
    
    Closes #8605 from robbinspg/DAGSchedulerSuite-fix.
    robbins authored and Andrew Or committed Sep 4, 2015
    Commit: 2e1c175
  8. [SPARK-10311] [STREAMING] Reload appId and attemptId when app starts …

    …with checkpoint file in cluster mode
    
    Author: xutingjun <xutingjun@huawei.com>
    
    Closes #8477 from XuTingjun/streaming-attempt.
    XuTingjun authored and tdas committed Sep 4, 2015
    Commit: eafe372

Commits on Sep 5, 2015

  1. [SPARK-10402] [DOCS] [ML] Add defaults to the scaladoc for params in ml/

    We should make sure the scaladoc for params includes their default values throughout the models in ml/.
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
    holdenk authored and jkbradley committed Sep 5, 2015
    Commit: 22eab70
  2. [SPARK-9925] [SQL] [TESTS] Set SQLConf.SHUFFLE_PARTITIONS.key correct…

    …ly for tests
    
    This PR fixes the failed test and the conflict for #8155
    
    https://issues.apache.org/jira/browse/SPARK-9925
    
    Closes #8155
    
    Author: Yin Huai <yhuai@databricks.com>
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8602 from davies/shuffle_partitions.
    yhuai authored and Andrew Or committed Sep 5, 2015
    Commit: 47058ca
  3. [HOTFIX] [SQL] Fixes compilation error

    Jenkins master builders are currently broken by a merge conflict between PR #8584 and PR #8155.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8614 from liancheng/hotfix/fix-pr-8155-8584-conflict.
    liancheng authored and rxin committed Sep 5, 2015
    Commit: 6c75194
  4. [SPARK-10440] [STREAMING] [DOCS] Update python API stuff in the progr…

    …amming guides and python docs
    
    - Fixed information around Python API tags in streaming programming guides
    - Added missing stuff in python docs
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #8595 from tdas/SPARK-10440.
    tdas authored and rxin committed Sep 5, 2015
    Commit: 7a4f326
  5. [SPARK-10434] [SQL] Fixes Parquet schema of arrays that may contain null

    To keep full compatibility of Parquet write path with Spark 1.4, we should rename the innermost field name of arrays that may contain null from "array_element" to "array".
    
    Please refer to [SPARK-10434] [1] for more details.
    
    [1]: https://issues.apache.org/jira/browse/SPARK-10434
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8586 from liancheng/spark-10434/fix-parquet-array-type.
    liancheng committed Sep 5, 2015
    Commit: bca8c07
  6. [SPARK-10013] [ML] [JAVA] [TEST] remove java assert from java unit tests

    From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests.
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
    holdenk authored and rxin committed Sep 5, 2015
    Commit: 871764c

Commits on Sep 7, 2015

  1. [SPARK-9767] Remove ConnectionManager.

    We introduced the Netty network module for shuffle in Spark 1.2 and have turned it on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, ConnectionManager will have been off by default for a year. It's time to remove it.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #8161 from rxin/SPARK-9767.
    rxin committed Sep 7, 2015
    Commit: 5ffe752

Commits on Sep 8, 2015

  1. [DOC] Added R to the list of languages with "high-level API" support …

    …in the main README.
    
    Author: Stephen Hopper <shopper@shopper-osx.local>
    
    Closes #8646 from enragedginger/master.
    Stephen Hopper authored and srowen committed Sep 8, 2015
    Commit: 9d8e838
  2. Docs small fixes

    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #8629 from jaceklaskowski/docs-fixes.
    jaceklaskowski authored and srowen committed Sep 8, 2015
    Commit: 6ceed85
  3. [SPARK-9170] [SQL] Use OrcStructInspector to be case preserving when …

    …writing ORC files
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-9170
    
    `StandardStructObjectInspector` will implicitly lowercase column names, but the ORC format doesn't have such a requirement. In fact, there is an `OrcStructInspector` specific to the ORC format. We should use it when serializing rows to ORC files, so that case is preserved when writing them.
    
    Author: Liang-Chi Hsieh <viirya@appier.com>
    
    Closes #7520 from viirya/use_orcstruct.
    viirya authored and liancheng committed Sep 8, 2015
    Commit: 990c9f7
  4. [SPARK-10480] [ML] Fix ML.LinearRegressionModel.copy()

    This PR fixes two model ```copy()``` related issues:
    [SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480)
    ```ML.LinearRegressionModel.copy()``` ignored the ```extra``` argument, so it did not take effect when users set this parameter.
    [SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479)
    ```ML.LogisticRegressionModel.copy()``` should copy the model summary if available.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8641 from yanboliang/linear-regression-copy.
    yanboliang authored and mengxr committed Sep 8, 2015
    Commit: 5b2192e
  5. [SPARK-10316] [SQL] respect nondeterministic expressions in PhysicalO…

    …peration
    
    We did a lot of special handling for non-deterministic expressions in `Optimizer`. However, `PhysicalOperation` just collects all Projects and Filters and messes up that handling. We should respect the operator order imposed by non-deterministic expressions in `PhysicalOperation`.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8486 from cloud-fan/fix.
    cloud-fan authored and marmbrus committed Sep 8, 2015
    Commit: 5fd5795
  6. [SPARK-10470] [ML] ml.IsotonicRegressionModel.copy should set parent

    A copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set the parent.
    This fixes it and adds a test case.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8637 from yanboliang/spark-10470.
    yanboliang authored and mengxr committed Sep 8, 2015
    Commit: f7b55db
  7. [SPARK-10441] [SQL] Save data correctly to json.

    https://issues.apache.org/jira/browse/SPARK-10441
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8597 from yhuai/timestampJson.
    yhuai authored and marmbrus committed Sep 8, 2015
    Commit: 7a9dcbc
  8. [SPARK-10468] [ MLLIB ] Verify schema before Dataframe select API call

    Loader.checkSchema was called to verify the schema after dataframe.select(...).
    Schema verification should be done before dataframe.select(...)
    
    Author: Vinod K C <vinod.kc@huawei.com>
    
    Closes #8636 from vinodkc/fix_GaussianMixtureModel_load_verification.
    Vinod K C authored and mengxr committed Sep 8, 2015
    SHA: e6f8d36
  9. [SPARK-10492] [STREAMING] [DOCUMENTATION] Update Streaming documentat…

    …ion about rate limiting and backpressure
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #8656 from tdas/SPARK-10492 and squashes the following commits:
    
    986cdd6 [Tathagata Das] Added information on backpressure
    tdas committed Sep 8, 2015
    SHA: 52b24a6
  10. [SPARK-10327] [SQL] Cache Table is not working while subquery has ali…

    …as in its project list
    
    ```scala
        import org.apache.spark.sql.hive.execution.HiveTableScan
        sql("select key, value, key + 1 from src").registerTempTable("abc")
        cacheTable("abc")
    
        val sparkPlan = sql(
          """select a.key, b.key, c.key from
            |abc a join abc b on a.key=b.key
            |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan
    
        assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3) // failed
        assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // failed
    ```
    
    The actual plan is:
    
    ```
    == Parsed Logical Plan ==
    'Project [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)]
     'Join Inner, Some(('a.key = 'c.key))
      'Join Inner, Some(('a.key = 'b.key))
       'UnresolvedRelation [abc], Some(a)
       'UnresolvedRelation [abc], Some(b)
      'UnresolvedRelation [abc], Some(c)
    
    == Analyzed Logical Plan ==
    key: int, key: int, key: int
    Project [key#14,key#61,key#66]
     Join Inner, Some((key#14 = key#66))
      Join Inner, Some((key#14 = key#61))
       Subquery a
        Subquery abc
         Project [key#14,value#15,(key#14 + 1) AS _c2#16]
          MetastoreRelation default, src, None
       Subquery b
        Subquery abc
         Project [key#61,value#62,(key#61 + 1) AS _c2#58]
          MetastoreRelation default, src, None
      Subquery c
       Subquery abc
        Project [key#66,value#67,(key#66 + 1) AS _c2#63]
         MetastoreRelation default, src, None
    
    == Optimized Logical Plan ==
    Project [key#14,key#61,key#66]
     Join Inner, Some((key#14 = key#66))
      Project [key#14,key#61]
       Join Inner, Some((key#14 = key#61))
        Project [key#14]
         InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc)
        Project [key#61]
         MetastoreRelation default, src, None
      Project [key#66]
       MetastoreRelation default, src, None
    
    == Physical Plan ==
    TungstenProject [key#14,key#61,key#66]
     BroadcastHashJoin [key#14], [key#66], BuildRight
      TungstenProject [key#14,key#61]
       BroadcastHashJoin [key#14], [key#61], BuildRight
        ConvertToUnsafe
         InMemoryColumnarTableScan [key#14], (InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc))
        ConvertToUnsafe
         HiveTableScan [key#61], (MetastoreRelation default, src, None)
      ConvertToUnsafe
       HiveTableScan [key#66], (MetastoreRelation default, src, None)
    ```
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #8494 from chenghao-intel/weird_cache.
    chenghao-intel authored and marmbrus committed Sep 8, 2015
    SHA: d637a66
  11. [HOTFIX] Fix build break caused by #8494

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #8659 from marmbrus/testBuildBreak.
    marmbrus committed Sep 8, 2015
    SHA: 2143d59

Commits on Sep 9, 2015

  1. [RELEASE] Add more contributors & only show names in release notes.

    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #8660 from rxin/contrib.
    rxin committed Sep 9, 2015
    SHA: ae74c3f
  2. [SPARK-10071] [STREAMING] Output a warning when writing QueueInputDSt…

    …ream and throw a better exception when reading QueueInputDStream
    
    Output a warning when serializing QueueInputDStream rather than throwing an exception, to allow unit tests to use it. Moreover, this PR also throws a better exception when deserializing QueueInputDStream, so that users can find the problem easily. The previous exception is hard to understand: https://issues.apache.org/jira/browse/SPARK-8553
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8624 from zsxwing/SPARK-10071 and squashes the following commits:
    
    847cfa8 [zsxwing] Output a warning when writing QueueInputDStream and throw a better exception when reading QueueInputDStream
    zsxwing authored and tdas committed Sep 9, 2015
    SHA: 820913f
  3. [SPARK-9834] [MLLIB] implement weighted least squares via normal equa…

    …tion
    
    The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet.

    There are a couple of TODOs that can be addressed in future PRs:
    * consolidate summary statistics aggregators
    * move `dspr` to `BLAS`
    * etc

    It would be nice to have this merged first because it blocks a couple of other features.
    
    dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8588 from mengxr/SPARK-9834.
    mengxr committed Sep 9, 2015
    SHA: 52fe32f
  4. [SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGenerator

    Add WeibullGenerator for RandomDataGenerator.
    #8611 needs WeibullGenerator to generate random data based on the Weibull distribution.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8622 from yanboliang/spark-10464.
    yanboliang authored and mengxr committed Sep 9, 2015
    SHA: a157348
  5. [SPARK-10373] [PYSPARK] move @since into pyspark from sql

    cc mengxr
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8657 from davies/move_since.
    Davies Liu authored and mengxr committed Sep 9, 2015
    SHA: 3a11e50
  6. [SPARK-10094] Pyspark ML Feature transformers marked as experimental

    Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
    
    Author: noelsmith <mail@noelsmith.com>
    
    Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
    noel-smith authored and mengxr committed Sep 9, 2015
    SHA: 0e2f216
  7. [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark

    Adds IndexToString to PySpark.
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
    holdenk authored and jkbradley committed Sep 9, 2015
    SHA: 2f6fd52
  8. [SPARK-10249] [ML] [DOC] Add Python Code Example to StopWordsRemover …

    …User Guide
    
    jira: https://issues.apache.org/jira/browse/SPARK-10249
    
    update user guide since python support added.
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #8620 from hhbyyh/swPyDocExample.
    hhbyyh authored and mengxr committed Sep 9, 2015
    SHA: 91a577d
  9. [SPARK-10227] fatal warnings with sbt on Scala 2.11

    The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary.
    But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations.

    The remainder are some potential bugs and deprecated syntax.
    
    Author: Luc Bourlier <luc.bourlier@typesafe.com>
    
    Closes #8433 from skyluc/issue/sbt-2.11.
    Luc Bourlier authored and srowen committed Sep 9, 2015
    SHA: c1bc4f4
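
    A toy pair of parameters illustrating the distinction (names are invented for illustration; under 2.11's fatal warnings, the annotation on the never-used parameter is the kind of thing that triggers the failure):

    ```scala
    // `p` is never used outside the constructor, so no field is generated and
    // scalac 2.11 warns that @transient has no valid target; `q` is used in a
    // method, so it becomes a real (and here genuinely transient) field.
    class Example(@transient p: Array[Byte], @transient q: Array[Byte])
        extends Serializable {
      def qSize: Int = q.length
    }
    ```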
  10. [SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSV…

    …M data
    
    It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API.
    
    Two options are implemented.
    * `numFeatures`: Specify the dimension of the features vector
    * `featuresType`: Specify the type of output vector. `sparse` is the default.
    
    Author: lewuathe <lewuathe@me.com>
    
    Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits:
    
    986999d [lewuathe] Change unit test phrase
    11d513f [lewuathe] Fix some reviews
    21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
    9ce63c7 [lewuathe] Rewrite service loader file
    1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
    ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
    0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
    4f40891 [lewuathe] Improve test suites
    5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
    8660d0e [lewuathe] Fix Java unit test
    b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
    2c12894 [lewuathe] Remove unnecessary tag
    7d693c2 [lewuathe] Resolv conflict
    62010af [lewuathe] Merge branch 'master' into SPARK-10117
    a97ee97 [lewuathe] Fix some points
    aef9564 [lewuathe] Fix
    70ee4dd [lewuathe] Add Java test
    3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
    40d3027 [lewuathe] Add Java test
    7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
    99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
    Lewuathe authored and mengxr committed Sep 9, 2015
    SHA: 2ddeb63
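
    A hypothetical usage sketch of the new data source (option names follow this commit message and may differ in later versions; the path is a placeholder from the Spark source tree):

    ```scala
    // Read LIBSVM data through the DataFrame reader API.
    val df = sqlContext.read
      .format("libsvm")
      .option("numFeatures", "780")
      .option("featuresType", "sparse") // the default
      .load("data/mllib/sample_libsvm_data.txt")

    df.select("label", "features").show(5)
    ```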
  11. [SPARK-10481] [YARN] SPARK_PREPEND_CLASSES make spark-yarn related ja…

    …r could n…
    
    Throw a more readable exception. Please help review. Thanks
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #8649 from zjffdu/SPARK-10481.
    zjffdu authored and Marcelo Vanzin committed Sep 9, 2015
    SHA: c0052d8
  12. [SPARK-10461] [SQL] make sure input.primitive is always variable na…

    …me not code at `GenerateUnsafeProjection`
    
    When we generate unsafe code inside `createCodeForXXX`, we always assign `input.primitive` to a temp variable in case `input.primitive` is expression code.

    This PR does some refactoring to make sure `input.primitive` is always a variable name, plus some other typo and style fixes.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8613 from cloud-fan/minor.
    cloud-fan authored and davies committed Sep 9, 2015
    SHA: 71da163
  13. [SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoin

    This PR is based on #8383 , thanks to viirya
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-9730
    
    This patch adds the Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner.
    
    Closes #8383
    
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8579 from davies/smj_fullouter.
    viirya authored and davies committed Sep 9, 2015
    SHA: 45de518
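
    For reference, a spark-shell-style sketch of a full outer join that can now be planned as a sort-merge join. The DataFrames are placeholders, and `spark.sql.planner.sortMergeJoin` is assumed to be the 1.5-era flag preferring sort-merge over hash join:

    ```scala
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")

    val left  = sqlContext.range(0, 10).toDF("k")
    val right = sqlContext.range(5, 15).toDF("k")

    // "outer" requests a full outer join; unmatched rows on either side come
    // back with nulls on the missing side.
    left.join(right, left("k") === right("k"), "outer").show()
    ```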

Commits on Sep 10, 2015

  1. [SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer

    Add Python API for ml.feature.VectorSlicer.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8102 from yanboliang/SPARK-9772.
    yanboliang authored and jkbradley committed Sep 10, 2015
    SHA: 56a0fe5
  2. [MINOR] [MLLIB] [ML] [DOC] fixed typo: label for negative result shou…

    …ld be 0.0 (original: 1.0)
    
    Small typo in the example for `LabeledPoint` in the MLlib docs.
    
    Author: Sean Paradiso <seanparadiso@gmail.com>
    
    Closes #8680 from sparadiso/docs_mllib_smalltypo.
    sparadiso authored and mengxr committed Sep 10, 2015
    SHA: 1dc7548
  3. [SPARK-10497] [BUILD] [TRIVIAL] Handle both locations for JIRAError w…

    …ith python-jira
    
    The location of JIRAError has moved between old and new versions of the python-jira package.
    Longer term it probably makes sense to pin to specific versions (as mentioned in https://issues.apache.org/jira/browse/SPARK-10498), but for now this makes the release tools work with both new and old versions of python-jira.
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8661 from holdenk/SPARK-10497-release-utils-does-not-work-with-new-jira-python.
    holdenk authored and srowen committed Sep 10, 2015
    SHA: 48817cc
  4. [SPARK-10065] [SQL] avoid the extra copy when generate unsafe array

    The reason for this extra copy is that we iterate the array twice: once to calculate the elements' data size and once to copy the elements to the array buffer.

    A simple solution is to follow `createCodeForStruct`: we can dynamically grow the buffer when needed and thus don't need to know the data size ahead of time.

    This PR also includes some typo and style fixes, and some minor refactoring to make sure `input.primitive` is always a variable name, not code, when generating unsafe code.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8496 from cloud-fan/avoid-copy.
    cloud-fan authored and davies committed Sep 10, 2015
    SHA: 4f1daa1
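
    A toy illustration of the "grow on demand" idea described above: append without a sizing pass, doubling the backing array when it runs out of room. This is illustrative only, not the actual generated code.

    ```scala
    import java.util.Arrays

    final class GrowableBuffer(initialSize: Int = 64) {
      private var buf = new Array[Byte](initialSize)
      private var cursor = 0

      private def ensureCapacity(extra: Int): Unit =
        if (cursor + extra > buf.length)
          buf = Arrays.copyOf(buf, math.max(buf.length * 2, cursor + extra))

      def write(bytes: Array[Byte]): Unit = {
        ensureCapacity(bytes.length)          // grow only when needed
        System.arraycopy(bytes, 0, buf, cursor, bytes.length)
        cursor += bytes.length
      }

      def result(): Array[Byte] = Arrays.copyOf(buf, cursor)
    }
    ```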
  5. [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimiz…

    …er rule
    
    Use these in the optimizer as well:
    
                A and (not(A) or B) => A and B
                not(A and B) => not(A) or not(B)
                not(A or B) => not(A) and not(B)
    
    Author: Yash Datta <Yash.Datta@guavus.com>
    
    Closes #5700 from saucam/bool_simp.
    Yash Datta authored and marmbrus committed Sep 10, 2015
    SHA: f892d92
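
    To illustrate the three rules above, here is a toy expression ADT with one recursive simplification pass. This mirrors the idea only; the real rule operates on Catalyst expressions:

    ```scala
    sealed trait Expr
    case class Var(name: String) extends Expr
    case class And(l: Expr, r: Expr) extends Expr
    case class Or(l: Expr, r: Expr) extends Expr
    case class Not(e: Expr) extends Expr

    def simplify(e: Expr): Expr = e match {
      case And(a, Or(Not(b), c)) if a == b => simplify(And(a, c)) // A and (not(A) or B) => A and B
      case Not(And(a, b)) => Or(simplify(Not(a)), simplify(Not(b))) // De Morgan
      case Not(Or(a, b))  => And(simplify(Not(a)), simplify(Not(b))) // De Morgan
      case And(a, b) => And(simplify(a), simplify(b))
      case Or(a, b)  => Or(simplify(a), simplify(b))
      case Not(a)    => Not(simplify(a))
      case v: Var    => v
    }

    // Example: A && (!A || B) simplifies to A && B.
    simplify(And(Var("A"), Or(Not(Var("A")), Var("B")))) // And(Var(A), Var(B))
    ```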
  6. [SPARK-10301] [SPARK-10428] [SQL] Addresses comments of PR #8583 and #…

    …8509 for master
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8670 from liancheng/spark-10301/address-pr-comments.
    liancheng authored and davies committed Sep 10, 2015
    SHA: 49da38e
  7. [SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill

    Data spill with UnsafeRow causes an assert failure.
    
    ```
    java.lang.AssertionError: assertion failed
    	at scala.Predef$.assert(Predef.scala:165)
    	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
    	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
    	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
    	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
    	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
    	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    	at org.apache.spark.scheduler.Task.run(Task.scala:88)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    ```
    
    To reproduce that with code (thanks andrewor14):
    ```scala
    bin/spark-shell --master local
      --conf spark.shuffle.memoryFraction=0.005
      --conf spark.shuffle.sort.bypassMergeThreshold=0
    
    sc.parallelize(1 to 2 * 1000 * 1000, 10)
      .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
    ```
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #8635 from chenghao-intel/unsafe_spill.
    chenghao-intel authored and Andrew Or committed Sep 10, 2015
    SHA: e048111
  8. [SPARK-10469] [DOC] Try and document the three options

    From JIRA:
    Add documentation for tungsten-sort.
    From the mailing list: "I saw a new "spark.shuffle.manager=tungsten-sort" implemented in
    https://issues.apache.org/jira/browse/SPARK-7081, but its corresponding description can't be found in
    http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html
    (currently there are only the two options 'sort' and 'hash')."
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8638 from holdenk/SPARK-10469-document-tungsten-sort.
    holdenk authored and Andrew Or committed Sep 10, 2015
    SHA: a76bde9
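
    A minimal sketch of how the three shuffle managers being documented are selected, via `spark.shuffle.manager` on a `SparkConf` ("sort" was the default in 1.5; app name and master are placeholders):

    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("shuffle-manager-demo")
      .set("spark.shuffle.manager", "tungsten-sort") // or "sort" / "hash"
    ```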
  9. [SPARK-8167] Make tasks that fail from YARN preemption not fail job

    The architecture is that, in YARN mode, if the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster is aware that the executor died because of preemption, all tasks associated with that executor are not marked as failed. The executor
    is still removed from the driver's list of available executors, however.
    
    There are a few open questions:
    1. Should standalone mode have a similar "get executor loss reason" as well? I localized this change as much as possible to affect only YARN, but there could be a valid case to differentiate executor losses in standalone mode as well.
    2. I make a pretty strong assumption in YarnAllocator that getExecutorLossReason(executorId) will only be called once per executor id; I do this so that I can remove the metadata from the in-memory map to avoid object accumulation. It's not clear if I'm being overly zealous to save space, however.
    
    cc vanzin specifically for review because it collided with some earlier YARN scheduling work.
    cc JoshRosen because it's similar to output commit coordination we did in the past
    cc andrewor14 for our discussion on how to get executor exit codes and loss reasons
    
    Author: mcheah <mcheah@palantir.com>
    
    Closes #8007 from mccheah/feature/preemption-handling.
    mccheah authored and Andrew Or committed Sep 10, 2015
    SHA: af3bc59
  10. [SPARK-6350] [MESOS] Fine-grained mode scheduler respects mesosExecut…

    …or.cores
    
    This is a regression introduced in #4960; this commit fixes it and adds a test.
    
    tnachen andrewor14 please review, this should be an easy one.
    
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
    dragos authored and Andrew Or committed Sep 10, 2015
    SHA: f0562e8
  11. [SPARK-10514] [MESOS] waiting for min no of total cores acquired by S…

    …park by implementing the sufficientResourcesRegistered method
    
    The spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos coarse-grained mode.

    If the parameter is not specified, the default value of 0 will be set for spark.scheduler.minRegisteredResourcesRatio in the base class and this method will always return true.

    There are no existing tests for YARN mode either, hence no test was added for that.
    
    Author: Akash Mishra <akash.mishra20@gmail.com>
    
    Closes #8672 from SleepyThread/master.
    SleepyThread authored and Andrew Or committed Sep 10, 2015
    SHA: a5ef2d0
  12. [SPARK-9990] [SQL] Create local hash join operator

    This PR includes the following changes:
    - Add SQLConf to LocalNode
    - Add HashJoinNode
    - Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join.
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8535 from zsxwing/SPARK-9990.
    zsxwing authored and Andrew Or committed Sep 10, 2015
    SHA: d88abb7
  13. [SPARK-10049] [SPARKR] Support collecting data of ArraryType in DataF…

    …rame.
    
    This PR:
    1.  Enhances reflection in RBackend, automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side.

    2.  Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame
    after collection is observed to be of Scala Seq type.

    3.  Supports ArrayType in createDataFrame().
    
    Author: Sun Rui <rui.sun@intel.com>
    
    Closes #8458 from sun-rui/SPARK-10049.
    Sun Rui authored and shivaram committed Sep 10, 2015
    SHA: 45e3be5
  14. [SPARK-10443] [SQL] Refactor SortMergeOuterJoin to reduce duplication

    `LeftOutputIterator` and `RightOutputIterator` are symmetrically identical and can share a lot of code. If someone makes a change in one but forgets to do the same thing in the other, we'll end up with inconsistent behavior. This patch also adds inline comments to clarify the intention of the code.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8596 from andrewor14/smoj-cleanup.
    Andrew Or authored and davies committed Sep 10, 2015
    SHA: 3db7255
  15. Add 1.5 to master branch EC2 scripts

    This change brings it on par with `branch-1.5` (and the 1.5.0 release).
    
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    
    Closes #8704 from shivaram/ec2-1.5-update.
    shivaram committed Sep 10, 2015
    SHA: 4204757
  16. [SPARK-7544] [SQL] [PySpark] pyspark.sql.types.Row implements __getit…

    …em__
    
    pyspark.sql.types.Row implements ```__getitem__```
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8333 from yanboliang/spark-7544.
    yanboliang authored and davies committed Sep 10, 2015
    SHA: 89562a1

Commits on Sep 11, 2015

  1. [SPARK-9043] Serialize key, value and combiner classes in ShuffleDepe…

    …ndency
    
    ShuffleManager implementations are currently not given type information for
    the key, value and combiner classes. Serialization of shuffle objects relies
    on objects being JavaSerializable, with methods defined for reading/writing
    the object or, alternatively, serialization via Kryo which uses reflection.
    
    Serialization systems like Avro, Thrift and Protobuf generate classes with
    zero argument constructors and explicit schema information
    (e.g. IndexedRecords in Avro have get, put and getSchema methods).
    
    By serializing the key, value and combiner class names in ShuffleDependency,
    shuffle implementations will have access to schema information when
    registerShuffle() is called.
    
    Author: Matt Massie <massie@cs.berkeley.edu>
    
    Closes #7403 from massie/shuffle-classtags.
    massie authored and rxin committed Sep 11, 2015
    SHA: 0eabea8
  2. [SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInt…

    …erval between Scala and Python API.
    
    "checkpointInterval" is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them.
    ```
    member of DecisionTreeParams <-> Scala API
    shared param for all ML Transformer/Estimator <-> Python API
    ```
    Proposal:
    "checkpointInterval" is also used by ALS, so we make it shared params at Scala.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8528 from yanboliang/spark-10023.
    yanboliang authored and mengxr committed Sep 11, 2015
    SHA: 339a527
  3. [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.fe…

    …ature
    
    Missing methods of ml.feature are listed here:
    ```StringIndexer``` lacks the parameter ```handleInvalid```.
    ```StringIndexerModel``` lacks the method ```labels```.
    ```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8313 from yanboliang/spark-10027.
    yanboliang authored and mengxr committed Sep 11, 2015
    SHA: a140dd7
  4. [SPARK-10472] [SQL] Fixes DataType.typeName for UDT

    Before this fix, `MyDenseVectorUDT.typeName` gives `mydensevecto`, which is not desirable.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8640 from liancheng/spark-10472/udt-type-name.
    liancheng committed Sep 11, 2015
    SHA: e1d7f64
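
    The mangled name hints at how `typeName` was derived. A plausible reconstruction of the old derivation, inferred from the observed output rather than quoted from the source:

    ```scala
    // Old behaviour (reconstructed): typeName dropped the last four characters
    // of the class name, assuming a "Type" suffix, which mangles UDT names.
    "MyDenseVectorUDT".stripSuffix("$").dropRight(4).toLowerCase // "mydensevecto"

    // The fix gives UserDefinedType a sensible typeName, e.g. stripping only a
    // trailing "UDT":
    "MyDenseVectorUDT".stripSuffix("$").stripSuffix("UDT").toLowerCase // "mydensevector"
    ```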
  5. [SPARK-10556] Remove explicit Scala version for sbt project build files

    Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against.
    
    Note that this only applies to the project build files (items in project/); it is distinct from the version of Scala we target for the actual Spark compilation.
    
    Author: Ahir Reddy <ahirreddy@gmail.com>
    
    Closes #8709 from ahirreddy/sbt-scala-version-fix.
    ahirreddy authored and srowen committed Sep 11, 2015
    SHA: 9bbe33f
  6. [SPARK-10518] [DOCS] Update code examples in spark.ml user guide to u…

    …se LIBSVM data source instead of MLUtils
    
    I fixed the example code in spark.ml to use the LIBSVM data source instead of MLUtils.
    
    Author: y-shimizu <y.shimizu0429@gmail.com>
    
    Closes #8697 from y-shimizu/SPARK-10518.
    y-shimizu authored and mengxr committed Sep 11, 2015
    SHA: c268ca4
  7. [SPARK-10026] [ML] [PySpark] Implement some common Params for regress…

    …ion in PySpark
    
    LinearRegression and LogisticRegression lack some Params in Python, and some Params are not shared classes, which means we need to write them for each class. These kinds of Params are listed here:
    ```scala
    HasElasticNetParam
    HasFitIntercept
    HasStandardization
    HasThresholds
    ```
    Here we implement them in shared params on the Python side and make the LinearRegression/LogisticRegression parameters match the Scala ones.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8508 from yanboliang/spark-10026.
    yanboliang authored and mengxr committed Sep 11, 2015
    SHA: b656e61
  8. [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronCl…

    …assifier
    
    Add Python API for ```MultilayerPerceptronClassifier```.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8067 from yanboliang/SPARK-9773.
    yanboliang authored and mengxr committed Sep 11, 2015
    SHA: b01b262
  9. [SPARK-10537] [ML] document LIBSVM source options in public API doc a…

    …nd some minor improvements
    
    We should document options in the public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation in the package doc. However, since there would then be no public class under `source.libsvm`, the Java package doc doesn't show up in the generated HTML file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc on `DefaultSource` instead. There are several minor updates in this PR:
    
    1. Do `vectorType == "sparse"` only once.
    2. Update `hashCode` and `equals`.
    3. Remove inherited doc.
    4. Delete temp dir in `afterAll`.
    
    Lewuathe
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #8699 from mengxr/SPARK-10537.
    mengxr committed Sep 11, 2015
    SHA: 960d2d0
  10. [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and Meta…

    …dataUtils
    
    Changes:
    * Make the Scala doc for StringIndexerInverse clearer. Also remove the Scala doc from transformSchema, so that the doc is inherited.
    * MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore
    
    CC: holdenk mengxr
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #8679 from jkbradley/doc-fixes-1.5.
    jkbradley authored and mengxr committed Sep 11, 2015
    SHA: 2e3a280
  11. [SPARK-10540] [SQL] Ignore HadoopFsRelationTest's "test all data type…

    …s" if it is too flaky
    
    If HadoopFsRelationTest's "test all data types" is too flaky, we can disable it for now.
    
    https://issues.apache.org/jira/browse/SPARK-10540
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #8705 from yhuai/SPARK-10540-ignore.
    yhuai committed Sep 11, 2015
    SHA: 6ce0886
  12. [SPARK-8530] [ML] add python API for MinMaxScaler

    jira: https://issues.apache.org/jira/browse/SPARK-8530
    
    add python API for MinMaxScaler
    jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #7150 from hhbyyh/pythonMinMax.
    hhbyyh authored and jkbradley committed Sep 11, 2015
    SHA: 5f46444
  13. [SPARK-10546] Check partitionId's range in ExternalSorter#spill()

    See this thread for background:
    http://search-hadoop.com/m/q3RTt0rWvIkHAE81
    
    We should check the range of the partition Id and provide a meaningful message through an exception.

    Alternatively, we can use abs() and modulo to force the partition Id into a legitimate range. However, the expectation is that the user should correct the logic error in their code.
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #8703 from tedyu/master.
    tedyu authored and srowen committed Sep 11, 2015
    SHA: b231ab8
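
    A minimal sketch of the guard described above (the method and message are illustrative, not the exact ExternalSorter internals):

    ```scala
    def checkPartitionId(partitionId: Int, numPartitions: Int): Unit =
      require(partitionId >= 0 && partitionId < numPartitions,
        s"partition Id: $partitionId should be in the range [0, $numPartitions)")
    ```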
  14. [PYTHON] Fixed typo in exception message

    Just fixing a typo in the exception message raised when attempting to pickle SparkContext.
    
    Author: Icaro Medeiros <icaro.medeiros@gmail.com>
    
    Closes #8724 from icaromedeiros/master.
    icaromedeiros authored and srowen committed Sep 11, 2015
    SHA: c373866
  15. [SPARK-10442] [SQL] fix string to boolean cast

    When we cast a string to boolean in Hive, it returns `true` if the length of the string is > 0, and Spark SQL follows this behavior.
    
    However, this behavior is very different from other SQL systems:
    
    1. [presto](https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L89-L118) will return `true` for 't' 'true' '1', `false` for 'f' 'false' '0', throw exception for others.
    2. [redshift](http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
    3. [postgresql](http://www.postgresql.org/docs/devel/static/datatype-boolean.html) will return `true` for 't' 'true' 'y' 'yes' 'on' '1', `false` for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
    4. [vertica](https://my.vertica.com/docs/5.0/HTML/Master/2983.htm) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
    5. [impala](http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_boolean.html) throw exception when try to cast string to boolean.
    6. mysql, oracle, sqlserver don't have boolean type
    
    Whether we should change the cast behavior to match other SQL systems is not decided yet; this PR is a test to see how many compatibility tests would fail if we changed it.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8698 from cloud-fan/string2boolean.
    cloud-fan authored and yhuai committed Sep 11, 2015
    SHA: d5d6473
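
    A toy version of the stricter cast surveyed above: accept a small set of tokens and signal "invalid" via None (systems like Presto or Impala would throw instead). This is a sketch of the semantics, not the Catalyst Cast implementation:

    ```scala
    def stringToBoolean(s: String): Option[Boolean] =
      s.trim.toLowerCase match {
        case "t" | "true"  | "1" => Some(true)
        case "f" | "false" | "0" => Some(false)
        case _                   => None
      }
    ```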
  16. [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimiz…

    …er rule. Incorporate review comments
    
    Adding changes suggested by cloud-fan  in #5700
    
    cc marmbrus
    
    Author: Yash Datta <Yash.Datta@guavus.com>
    
    Closes #8716 from saucam/bool_simp.
    Yash Datta authored and marmbrus committed Sep 11, 2015
    SHA: 1eede3b
  17. [SPARK-9992] [SPARK-9994] [SPARK-9998] [SQL] Implement the local TopK…

    …, sample and intersect operators
    
    This PR is in conflict with #8535. I will update this one when #8535 gets merged.
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8573 from zsxwing/more-local-operators.
    zsxwing authored and Andrew Or committed Sep 11, 2015
    SHA: e626ac5
  18. [SPARK-9990] [SQL] Local hash join follow-ups

    1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method
    2. Add tests for this
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8708 from andrewor14/local-hash-join-follow-up.
    Andrew Or committed Sep 11, 2015
    SHA: c2af42b
  19. [SPARK-10564] ThreadingSuite: assertion failures in threads don't fai…

    …l the test
    
    This commit ensures that if an assertion fails within a thread, it ultimately fails the test. Otherwise we could end up masking real bugs by not propagating assertion failures properly.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8723 from andrewor14/fix-threading-suite.
    Andrew Or committed Sep 11, 2015
    SHA: d74c6a1
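
    An illustrative sketch of the pattern such a fix needs: a failure raised on a spawned thread does not fail the enclosing test by itself, so it must be captured and rethrown on the main test thread.

    ```scala
    import java.util.concurrent.atomic.AtomicReference

    val failure = new AtomicReference[Throwable]()
    val t = new Thread {
      override def run(): Unit =
        try assert(1 + 1 == 3, "assertion inside a thread")
        catch { case e: Throwable => failure.set(e) } // capture, don't swallow
    }
    t.start()
    t.join()
    Option(failure.get()).foreach(e => throw e) // rethrow to fail the test
    ```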
  20. [SPARK-9014] [SQL] Allow Python spark API to use built-in exponential…

    … operator
    
    This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014)
    Added functionality: the `Column` object in Python now supports the exponential operator `**`
    Example:
    ```
    from pyspark.sql import *
    df = sqlContext.createDataFrame([Row(a=2)])
    df.select(3**df.a,df.a**3,df.a**df.a).collect()
    ```
    Outputs:
    ```
    [Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
    ```
    
    Author: 0x0FFF <programmerag@gmail.com>
    
    Closes #8658 from 0x0FFF/SPARK-9014.
    0x0FFF authored and davies committed Sep 11, 2015
    SHA: c34fc19

Commits on Sep 12, 2015

  1. [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling m…

    …asks important error information
    
    When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user.
    
    Manual testing shows the exception is chained properly, and the test suite still looks fine as well.
    
    This contribution is my original work and I license the work to the project under the project's open source license.
    
    Author: Daniel Imfeld <daniel@danielimfeld.com>
    
    Closes #8725 from dimfeld/dimfeld-patch-1.
    dimfeld authored and srowen committed Sep 12, 2015
    SHA: 6d83678
  2. [SPARK-10554] [CORE] Fix NPE with ShutdownHook

    https://issues.apache.org/jira/browse/SPARK-10554
    
    Fixes NPE when ShutdownHook tries to cleanup temporary folders
    
    Author: Nithin Asokan <Nithin.Asokan@Cerner.com>
    
    Closes #8720 from nasokan/SPARK-10554.
    Nithin Asokan authored and srowen committed Sep 12, 2015
    SHA: 8285e3b
  3. [SPARK-10547] [TEST] Streamline / improve style of Java API tests

    Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8706 from srowen/SPARK-10547.
    srowen committed Sep 12, 2015
    SHA: 22730ad
  4. [SPARK-6548] Adding stddev to DataFrame functions

    Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
    
    Author: JihongMa <linlin200605@gmail.com>
    Author: Jihong MA <linlin200605@gmail.com>
    Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
    Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
    
    Closes #6297 from JihongMA/SPARK-SQL.
    JihongMA authored and davies committed Sep 12, 2015
    SHA: f4a2280
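
    For intuition, a Welford-style one-pass update is the kind of online algorithm the commit describes; merging two partial `Moments` is what makes it parallelizable. Toy code, not the actual Catalyst aggregate:

    ```scala
    final case class Moments(n: Long, mean: Double, m2: Double) {
      def add(x: Double): Moments = {
        val n1 = n + 1
        val delta = x - mean
        val mean1 = mean + delta / n1
        Moments(n1, mean1, m2 + delta * (x - mean1))
      }
      def sampleStddev: Double = if (n > 1) math.sqrt(m2 / (n - 1)) else Double.NaN
    }

    val xs = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    xs.foldLeft(Moments(0L, 0.0, 0.0))(_ add _).sampleStddev // ~2.138
    ```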
  5. [SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil J…

    …obContext methods
    
    This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #8521 from JoshRosen/SPARK-10330-part2.
    JoshRosen committed Sep 12, 2015
    SHA: b3a7480

Commits on Sep 13, 2015

  1. [SPARK-10222] [GRAPHX] [DOCS] More thoroughly deprecate Bagel in favo…

    …r of GraphX
    
    Finish deprecating Bagel; remove reference to nonexistent example
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8731 from srowen/SPARK-10222.
    srowen committed Sep 13, 2015
    SHA: 1dc614b

Commits on Sep 14, 2015

  1. [SPARK-9720] [ML] Identifiable types need UID in toString methods

    A few Identifiable types did override their toString method but without using the parent implementation. As a consequence, the uid was no longer present in the toString result, even though including it is the default behaviour.

    This patch is a quick fix. The question of enforcement is still open.

    No tests have been written to verify the toString method behaviour. That would take long because all types should be tested, not only those which have a regression now.

    It is possible to enforce the condition using the compiler by making the toString method final, but that would introduce unwanted potential API-breaking changes (see jira).
    
    Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
    
    Closes #8062 from BertrandDechoux/SPARK-9720.
    BertrandDechoux authored and srowen committed Sep 14, 2015
    SHA: d815654
  2. [SPARK-9899] [SQL] log warning for direct output committer with specu…

    …lation enabled
    
    This is a follow-up of #8317.
    
    When speculation is enabled, there may be multiple tasks writing to the same path. Generally it's OK, as we write to a temporary directory first and only one task can commit the temporary directory to the target path.

    However, when we use a direct output committer, tasks write data to the target path directly, without a temporary directory. This causes problems like corrupted data. Please see the [PR comment](#8191 (comment)) for more details.

    Unfortunately, we don't have a simple flag to tell whether an output committer will write to a temporary directory or not, so for safety we have to disable any customized output committer when `speculation` is true.
    
    Author: Wenchen Fan <cloud0fan@outlook.com>
    
    Closes #8687 from cloud-fan/direct-committer.
    cloud-fan authored and yhuai committed Sep 14, 2015
    SHA: 32407bf
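
    A sketch of the guard this describes, with assumed names (the real code lives in the SQL write path and uses Spark's logging, not stderr):

    ```scala
    // If speculation is on and a custom committer is configured, warn and
    // fall back to the default committer to avoid corrupted output.
    def resolveCommitter(speculationEnabled: Boolean,
                         userCommitterClass: Option[String]): Option[String] =
      if (speculationEnabled && userCommitterClass.isDefined) {
        Console.err.println("WARN: custom output committer disabled because " +
          "speculation is enabled; falling back to the default committer.")
        None // caller uses the default FileOutputCommitter
      } else {
        userCommitterClass
      }
    ```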
  3. [SPARK-10584] [DOC] [SQL] Documentation about spark.sql.hive.metastor…

    …e.version is wrong.
    
    The default value of the Hive metastore version is 1.2.1, but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1.
    Also, we cannot get the default value via `sqlContext.getConf("spark.sql.hive.metastore.version")`.
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    
    Closes #8739 from sarutak/SPARK-10584.
    sarutak authored and yhuai committed Sep 14, 2015
    SHA: cf2821e
  4. [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol pa…

    …rameter in Python
    
    [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8457 from yanboliang/spark-10194.
    yanboliang authored and mengxr committed Sep 14, 2015
    SHA: ce6f3f1
  5. [SPARK-10573] [ML] IndexToString output schema should be StringType

    Fixes a bug where the IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata.
    
    Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
    
    Closes #8751 from pnpritchard/SPARK-10573.
    pnpritchard authored and mengxr committed Sep 14, 2015
    SHA: 8a634e9
  6. [SPARK-10522] [SQL] Nanoseconds of Timestamp in Parquet should be pos…

    …itive
    
    Otherwise Hive can't read it back correctly.

    Thanks vanzin for reporting this.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #8674 from davies/positive_nano.
    Davies Liu authored and davies committed Sep 14, 2015
    SHA: 7e32387
  7. [SPARK-6981] [SQL] Factor out SparkPlanner and QueryExecution from SQ…

    …LContext
    
    Alternative to PR #6122; in this case the refactored-out classes are replaced by inner classes with the same name, for backwards binary compatibility:
    
       * process in a lighter-weight, backwards-compatible way
    
    Author: Edoardo Vacchi <uncommonnonsense@gmail.com>
    
    Closes #6356 from evacchi/sqlctx-refactoring-lite.
    evacchi authored and marmbrus committed Sep 14, 2015
    SHA: 64f0415
  8. [SPARK-9996] [SPARK-9997] [SQL] Add local expand and NestedLoopJoin o…

    …perators
    
    This PR is in conflict with #8535 and #8573. Will update this one when they are merged.
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #8642 from zsxwing/expand-nest-join.
    zsxwing authored and Andrew Or committed Sep 14, 2015
    SHA: 217e496
  9. [SPARK-10594] [YARN] Remove reference to --num-executors, add --prope…

    …rties-file
    
    `ApplicationMaster` no longer has the `--num-executors` flag, and it has an undocumented `--properties-file` configuration option.
    
    cc srowen
    
    Author: Erick Tryzelaar <erick.tryzelaar@gmail.com>
    
    Closes #8754 from erickt/master.
    erickt authored and Andrew Or committed Sep 14, 2015
    SHA: 16b6d18
  10. [SPARK-10576] [BUILD] Move .java files out of src/main/scala

    Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala)
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #8736 from srowen/SPARK-10576.
    srowen authored and Andrew Or committed Sep 14, 2015
    SHA: 4e2242b
  11. [SPARK-10549] scala 2.11 spark on yarn with security - Repl doesn't work

    Make this lazy so that it can set the yarn mode before creating the securityManager.
    
    Author: Tom Graves <tgraves@yahoo-inc.com>
    Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
    
    Closes #8719 from tgravescs/SPARK-10549.
    Tom Graves authored and Andrew Or committed Sep 14, 2015
    SHA: ffbbc2c
  12. [SPARK-10543] [CORE] Peak Execution Memory Quantile should be Per-tas…

    …k Basis
    
    Read `PEAK_EXECUTION_MEMORY` using `update` to get the per-task partial value instead of the cumulative value.
    
    I tested with this workload:
    
    ```scala
    val size = 1000
    val repetitions = 10
    val data = sc.parallelize(1 to size, 5).map(x => (util.Random.nextInt(size / repetitions),util.Random.nextDouble)).toDF("key", "value")
    val res = data.toDF.groupBy("key").agg(sum("value")).count
    ```
    
    Before:
    ![image](https://cloud.githubusercontent.com/assets/4317392/9828197/07dd6874-58b8-11e5-9bd9-6ba927c38b26.png)
    
    After:
    ![image](https://cloud.githubusercontent.com/assets/4317392/9828151/a5ddff30-58b7-11e5-8d31-eda5dc4eae79.png)
    
    Tasks view:
    ![image](https://cloud.githubusercontent.com/assets/4317392/9828199/17dc2b84-58b8-11e5-92a8-be89ce4d29d1.png)
    
    cc andrewor14 I appreciate if you can give feedback on this since I think you introduced display of this metric.
    
    Author: Forest Fang <forest.fang@outlook.com>
    
    Closes #8726 from saurfang/stagepage.
    saurfang authored and Andrew Or committed Sep 14, 2015
    SHA: fd1e8cd
  13. [SPARK-10564] ThreadingSuite: assertion failures in threads don't fai…

    …l the test (round 2)
    
    This is a follow-up patch to #8723. I missed one case there.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #8727 from andrewor14/fix-threading-suite.
    Andrew Or committed Sep 14, 2015
    SHA: 7b6c856

Commits on Sep 15, 2015

  1. [SPARK-9851] Support submitting map stages individually in DAGScheduler

    This patch adds support for submitting map stages in a DAG individually so that we can make downstream decisions after seeing statistics about their output, as part of SPARK-9850. I also added more comments to many of the key classes in DAGScheduler. By itself, the patch is not super useful except maybe to switch between a shuffle and broadcast join, but with the other subtasks of SPARK-9850 we'll be able to do more interesting decisions.
    
    The main entry point is SparkContext.submitMapStage, which lets you run a map stage and see stats about the map output sizes. Other stats could also be collected through accumulators. See AdaptiveSchedulingSuite for a short example.
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes #8180 from mateiz/spark-9851.
    mateiz committed Sep 15, 2015
    SHA: 1a09552
  2. [SPARK-10542] [PYSPARK] fix serialize namedtuple

    Author: Davies Liu <davies@databricks.com>
    
    Closes #8707 from davies/fix_namedtuple.
    Davies Liu authored and davies committed Sep 15, 2015
    SHA: 5520418
  3. [SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector impl…

    …ement __eq__ and __hash__ correctly
    
    The PySpark DenseVector and SparseVector ```__eq__``` methods should use semantic equality, and a DenseVector should be comparable with a SparseVector.
    Implement the PySpark DenseVector and SparseVector ```__hash__``` methods based on the first 16 entries. That will make PySpark Vector objects usable in collections.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #8166 from yanboliang/spark-9793.
    yanboliang authored and mengxr committed Sep 15, 2015
    SHA: 4ae4d54
  4. [SPARK-10273] Add @since annotation to pyspark.mllib.feature

    Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
    
    Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
    
    Author: noelsmith <mail@noelsmith.com>
    
    Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
    noel-smith authored and mengxr committed Sep 15, 2015
    SHA: 610971e
  5. [SPARK-10275] [MLLIB] Add @since annotation to pyspark.mllib.random

    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes #8666 from yu-iskw/SPARK-10275.
    yu-iskw authored and mengxr committed Sep 15, 2015
    SHA: a224935
  6. Small fixes to docs

    Links now work properly + consistent use of *Spark standalone cluster* (Spark uppercase + the rest lowercase -- as seems agreed elsewhere in the docs).
    
    Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
    
    Closes #8759 from jaceklaskowski/docs-submitting-apps.
    Jacek Laskowski authored and rxin committed Sep 15, 2015
    SHA: 833be73
  7. [SPARK-10598] [DOCS]

    Comments preceding the toMessage method state: "The edge partition is encoded in the lower 30 bytes of the Int, and the position is encoded in the upper 2 bytes of the Int." References to bytes should be changed to bits.

    This contribution is my original work and I license the work to the Spark project under its open source license.
    
    Author: Robin East <robin.east@xense.co.uk>
    
    Closes #8756 from insidedctm/master.
    insidedctm authored and rxin committed Sep 15, 2015
    SHA: 6503c4b
  8. Update version to 1.6.0-SNAPSHOT.

    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #8350 from rxin/1.6.
    rxin committed Sep 15, 2015
    SHA: 09b7e7c
  9. [SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS

    jira: https://issues.apache.org/jira/browse/SPARK-10491
    
    We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
    
    Let me know if a new UT is needed.
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #8663 from hhbyyh/movedspr.
    hhbyyh authored and mengxr committed Sep 15, 2015
    SHA: c35fdcb
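
    What dspr computes, in toy form: A := alpha * x * x^T + A, with the symmetric matrix A stored in packed upper-triangular, column-major form (entry (i, j), i <= j, lives at index i + j * (j + 1) / 2). A sketch of the semantics only, not the actual BLAS kernel or its sparse-vector variant:

    ```scala
    def dspr(alpha: Double, x: Array[Double], packedA: Array[Double]): Unit = {
      val n = x.length
      var idx = 0
      var j = 0
      while (j < n) {          // walk column j: entries (0, j) .. (j, j)
        var i = 0
        while (i <= j) {
          packedA(idx) += alpha * x(i) * x(j)
          idx += 1
          i += 1
        }
        j += 1
      }
    }
    ```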
  10. [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.

    This change does two things:
    
    - tag a few tests and adds the mechanism in the build to be able to disable those tags,
      both in maven and sbt, for both junit and scalatest suites.
    - add some logic to run-tests.py to disable some tags depending on what files have
      changed; that's used to disable expensive tests when a module hasn't explicitly
      been changed, to speed up testing for changes that don't directly affect those
      modules.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #8437 from vanzin/test-tags.
    Marcelo Vanzin committed Sep 15, 2015
    SHA: 8abef21
  11. [PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mll…

    …ib.random
    
    Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
    
    Author: noelsmith <mail@noelsmith.com>
    
    Closes #8773 from noel-smith/mllib-random-versionadded-fix.
    noel-smith authored and mengxr committed Sep 15, 2015
    SHA: 7ca30b5
  12. Closes #8738

    Closes #8767
    Closes #2491
    Closes #6795
    Closes #2096
    Closes #7722
    mengxr committed Sep 15, 2015
    SHA: 0d9ab01