update #1

Merged
merged 170 commits into from
Aug 6, 2014

Commits on Jul 28, 2014

  1. [SPARK-2523] [SQL] Hadoop table scan bug fixing

    In HiveTableScan.scala, an ObjectInspector was created for every partition-based record, which can cause a ClassCastException if the object inspectors are not identical between the table and its partitions.
    
    This is a follow-up to:
    #1408
    #1390
    
    I ran a micro-benchmark locally with 15,000,000 records in total and got the results below:
    
    With This Patch  |  Partition-Based Table  |  Non-Partition-Based Table
    ------------ | ------------- | -------------
    No  |  1927 ms  |  1885 ms
    Yes  | 1541 ms  |  1524 ms
    
    The results show that this patch also improves performance.
    
    PS: the benchmark code is attached below (thanks liancheng).
    ```
    package org.apache.spark.sql.hive
    
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.sql._
    
    object HiveTableScanPrepare extends App {
      case class Record(key: String, value: String)
    
      val sparkContext = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
    
      val hiveContext = new LocalHiveContext(sparkContext)
    
      val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))
    
      import hiveContext._
    
      hql("SHOW TABLES")
      hql("DROP TABLE if exists part_scan_test")
      hql("DROP TABLE if exists scan_test")
      hql("DROP TABLE if exists records")
      rdd.registerAsTable("records")
    
      hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
      hql("""CREATE TABLE scan_test (key STRING, value STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
    
      for (part1 <- 2000 until 2001) {
        for (part2 <- 1 to 5) {
          hql(s"""from records
                     | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
                     | select key, value
                   """.stripMargin)
          hql(s"""from records
                     | insert into table scan_test select key, value
                   """.stripMargin)
        }
      }
    }
    
    object HiveTableScanTest extends App {
      val sparkContext = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
    
      val hiveContext = new LocalHiveContext(sparkContext)
    
      import hiveContext._
    
      hql("SHOW TABLES")
      val part_scan_test = hql("select key, value from part_scan_test")
      val scan_test = hql("select key, value from scan_test")
    
      val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
      val r_scan_test = (0 to 5).map(i => benchmark(scan_test))
      println("Scanning Partition-Based Table")
      r_part_scan_test.foreach(printResult)
      println("Scanning Non-Partition-Based Table")
      r_scan_test.foreach(printResult)
    
      def printResult(result: (Long, Long)) {
        println(s"Duration: ${result._1} ms Result: ${result._2}")
      }
    
      def benchmark(srdd: SchemaRDD) = {
        val begin = System.currentTimeMillis()
        val result = srdd.count()
        val end = System.currentTimeMillis()
        ((end - begin), result)
      }
    }
    ```
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #1439 from chenghao-intel/hadoop_table_scan and squashes the following commits:
    
    888968f [Cheng Hao] Fix issues in code style
    27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
    40a24a7 [Cheng Hao] Add Unit Test
    chenghao-intel authored and marmbrus committed Jul 28, 2014
    Commit 2b8d89e
  2. [SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests
    
    Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.
    
    Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.
    
    That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.
    
    Based on discussion in the community, we have implemented two different APIs: one for relative tolerance and one for absolute tolerance. It makes sense for test writers to know which one they need depending on their circumstances.
    
    Developers also need to specify the eps explicitly; there is no default value, since a default would sometimes cause confusion.
    
    When comparing against zero using relative tolerance, an exception will be raised to warn users that such a comparison is meaningless.
    
    For relative tolerance, users can now write
    
        assert(23.1 ~== 23.52 relTol 0.02)
        assert(23.1 ~== 22.74 relTol 0.02)
        assert(23.1 ~= 23.52 relTol 0.02)
        assert(23.1 ~= 22.74 relTol 0.02)
        assert(!(23.1 !~= 23.52 relTol 0.02))
        assert(!(23.1 !~= 22.74 relTol 0.02))
    
        // This will throw exception with the following message.
        // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
        assert(23.1 !~== 23.52 relTol 0.02)
    
        // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
        assert(23.1 ~== 22.34 relTol 0.02)
    
    For absolute error,
    
        assert(17.8 ~== 17.99 absTol 0.2)
        assert(17.8 ~== 17.61 absTol 0.2)
        assert(17.8 ~= 17.99 absTol 0.2)
        assert(17.8 ~= 17.61 absTol 0.2)
        assert(!(17.8 !~= 17.99 absTol 0.2))
        assert(!(17.8 !~= 17.61 absTol 0.2))
    
        // This will throw exception with the following message.
        // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
        assert(17.8 !~== 17.99 absTol 0.2)
    
        // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
        assert(17.8 ~== 17.59 absTol 0.2)
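    
    A minimal, hypothetical sketch of how such a tolerance DSL could be built with an implicit class is shown below. It is illustrative only, not the actual MLlib TestingUtils code, and the method names (`relTolEquals`, `absTolEquals`) are stand-ins for the `~==` / `!~==` operators above.
    
    ```scala
    object ApproxEquality {
      // Illustrative helper, not the MLlib implementation.
      implicit class DoubleWithAlmostEquals(x: Double) {
        /** Relative-tolerance check; comparing against zero is rejected as meaningless. */
        def relTolEquals(y: Double, eps: Double): Boolean = {
          require(x != 0.0 && y != 0.0,
            "Comparing against 0 using relative tolerance is meaningless")
          math.abs(x - y) / math.max(math.abs(x), math.abs(y)) < eps
        }
        /** Absolute-tolerance check. */
        def absTolEquals(y: Double, eps: Double): Boolean = math.abs(x - y) < eps
      }
    }
    
    // After `import ApproxEquality._`:
    //   assert(23.1.relTolEquals(23.52, 0.02))
    //   assert(17.8.absTolEquals(17.99, 0.2))
    ```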
    
    Authors:
      DB Tsai <dbtsai@alpinenow.com>
      Marek Kolodziej <marek@alpinenow.com>
    
    Author: DB Tsai <dbtsai@alpinenow.com>
    
    Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:
    
    8c7cbcc [DB Tsai] Alpine Data Labs
    DB Tsai authored and mengxr committed Jul 28, 2014
    Commit 255b56f
  3. [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
    
    JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
    
    Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. As a result, every pull request that doesn't touch SQL code also executes the test suites defined in `hive-thriftserver`, and those tests fail because the related .class files are not included in the assembly jar.
    
    In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:
    
    629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
    ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
    liancheng authored and marmbrus committed Jul 28, 2014
    Commit a7a9d14
  4. Use commons-lang3 in SignalLogger rather than commons-lang

    Spark only transitively depends on the latter, based on the Hadoop version.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes #1621 from aarondav/lang3 and squashes the following commits:
    
    93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than commons-lang
    aarondav authored and rxin committed Jul 28, 2014
    Commit 39ab87b

Commits on Jul 29, 2014

  1. Excess judgment

    Author: Yadong Qi <qiyadong2010@gmail.com>
    
    Closes #1629 from watermen/bug-fix2 and squashes the following commits:
    
    59b7237 [Yadong Qi] Update HiveQl.scala
    watermen authored and rxin committed Jul 29, 2014
    Commit 16ef4d1
  2. [SPARK-2580] [PySpark] keep silent in worker if JVM close the socket

    During rdd.take(n), the JVM will close the socket once it has received enough data; the Python worker should keep silent in this case.
    
    At the same time, the worker should not print the traceback to stderr if it has sent the traceback to the JVM successfully.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1625 from davies/error and squashes the following commits:
    
    4fbcc6d [Davies Liu] disable log4j during testing when exception is expected.
    cc14202 [Davies Liu] keep silent in worker if JVM close the socket
    davies authored and JoshRosen committed Jul 29, 2014
    Commit ccd5ab5
  3. [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle

    Fix the problem of pickling operator.itemgetter with multiple indices.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1627 from davies/itemgetter and squashes the following commits:
    
    aabd7fa [Davies Liu] fix pickle itemgetter with cloudpickle
    davies authored and JoshRosen committed Jul 29, 2014
    Commit 92ef026
  4. [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.

    The pull request includes two changes:
    
    1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information since an Ordering can be reversed. This is similar to Java's Comparator interface. Rarely does an API accept both a Comparator and a SortOrder.
    
    2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1631 from rxin/sortOrder and squashes the following commits:
    
    c9d37e1 [Reynold Xin] [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
    rxin committed Jul 29, 2014
    Commit 96ba04b
  5. [SPARK-2174][MLLIB] treeReduce and treeAggregate

    In `reduce` and `aggregate`, the driver node spends time linear in the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.
    
    SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some communication overhead, because the driver still needs to coordinate the data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.
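    
    To make the shape of the computation concrete, here is a rough, hypothetical sketch of a tree-style reduce built only on public RDD APIs (a local per-partition reduce followed by repeated coalesce-and-reduce levels). It is not the actual `treeReduce` implementation; the helper name and the fan-in calculation are assumptions.
    
    ```scala
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag
    
    def sketchTreeReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T, depth: Int = 2): T = {
      require(depth >= 1, "depth must be at least 1")
      // Reduce each partition locally first, so every partition contributes a single element.
      var reduced: RDD[T] = rdd.mapPartitions(it => it.reduceOption(f).iterator)
      var numPartitions = reduced.partitions.length
      // Shrink the partition count level by level, roughly n -> sqrt(n) -> 1 when depth = 2.
      val fanIn = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
      while (numPartitions > 1) {
        numPartitions = math.max(numPartitions / fanIn, 1)
        reduced = reduced
          .coalesce(numPartitions, shuffle = true)
          .mapPartitions(it => it.reduceOption(f).iterator)
      }
      reduced.collect().reduce(f)
    }
    ```
    
    With this shape, the driver only combines the handful of values surviving the last level instead of one value per original partition.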
    
    I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on 32-node m3.2xlarge cluster.
    
    code:
    
    ~~~
    import breeze.linalg._
    import org.apache.log4j._
    
    Logger.getRootLogger.setLevel(Level.OFF)
    
    for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
      val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
      var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
      start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
    }
    ~~~
    
    out:
    
    | n | treeReduce(_ + _, 2) (seconds) | reduce(_ + _) (seconds) |
    |---|---------------------|-----------|
    | 10 | 0.215538731 | 0.204206899 |
    | 100 | 0.278405907 | 0.205732582 |
    | 1000 | 0.208972182 | 0.214298272 |
    | 10000 | 0.194792071 | 0.349353687 |
    | 100000 | 0.347683285 | 6.086671892 |
    | 1000000 | 2.589350682 | 66.572906702 |
    
    CC: @pwendell
    
    This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or put them as separate methods. The concern is that users may use `reduce` and `aggregate` as collect, where having multiple stages doesn't reduce the data size. However, in this case, `collect` is more appropriate.
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1110 from mengxr/tree and squashes the following commits:
    
    c6cd267 [Xiangrui Meng] make depth default to 2
    b04b96a [Xiangrui Meng] address comments
    9bcc5d3 [Xiangrui Meng] add depth for readability
    7495681 [Xiangrui Meng] fix compile error
    142a857 [Xiangrui Meng] merge master
    d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
    8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
    be6a88a [Xiangrui Meng] use treeAggregate in mllib
    0f94490 [Xiangrui Meng] add docs
    eb71c33 [Xiangrui Meng] add treeReduce
    fe42a5e [Xiangrui Meng] add treeAggregate
    mengxr authored and rxin committed Jul 29, 2014
    Commit 20424da
  6. Minor indentation and comment typo fixes.

    Author: Aaron Staple <astaple@gmail.com>
    
    Closes #1630 from staple/minor and squashes the following commits:
    
    6f295a2 [Aaron Staple] Fix typos in comment about ExprId.
    8566467 [Aaron Staple] Fix off by one column indentation in SqlParser.
    staple authored and rxin committed Jul 29, 2014
    Commit fc4d057
  7. [STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current push model
    
    Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the receiver fails, it currently has to be restarted on the same node to be able to receive data.
    
    This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new DStream that is also included in this commit. This model ensures that data can be pulled into Spark from Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on multiple threads for better performance.
    
    Author: Hari Shreedharan <harishreedharan@gmail.com>
    Author: Hari Shreedharan <hshreedharan@apache.org>
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: harishreedharan <hshreedharan@cloudera.com>
    
    Closes #807 from harishreedharan/master and squashes the following commits:
    
    e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
    96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
    5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
    981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
    a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    1f47364 [Hari Shreedharan] Minor fixes.
    73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
    65b76b4 [Hari Shreedharan] Fixing the unit test.
    e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
    f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
    799509f [Hari Shreedharan] Fix a compile issue.
    3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
    10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
    1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
    8c00289 [Hari Shreedharan] More debug messages
    393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
    120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
    9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
    8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
    86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    205034d [Hari Shreedharan] Merging master in
    4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
    bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
    0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
    3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
    70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
    d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
    e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
    9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
    c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
    0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    harishreedharan authored and tdas committed Jul 29, 2014
    Commit 800ecff
  8. [SQL]change some test lists

    1. there's no `hook_context.q`, but there is a `hook_context_cs.q`, in the query folder
    2. there's no `compute_stats_table.q` in the query folder
    3. there's no `having1.q` in the query folder
    4. `udf_E` and `udf_PI` appear twice in the white list
    
    Author: Daoyuan <daoyuan.wang@intel.com>
    
    Closes #1634 from adrian-wang/testcases and squashes the following commits:
    
    d7482ce [Daoyuan] change some test lists
    adrian-wang authored and marmbrus committed Jul 29, 2014
    Commit 0c5c6a6
  9. [SPARK-2730][SQL] When retrieving a value from a Map, GetItem evaluates key twice
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2730
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1637 from yhuai/SPARK-2730 and squashes the following commits:
    
    1a9f24e [Yin Huai] Remove unnecessary key evaluation.
    yhuai authored and marmbrus committed Jul 29, 2014
    Commit e364348
  10. [SPARK-2674] [SQL] [PySpark] support datetime type for SchemaRDD

    Datetime and time objects in Python are converted into java.util.Calendar after serialization, and then into java.sql.Timestamp during inferSchema().
    
    In javaToPython(), a Timestamp is converted into a Calendar and then into a Python datetime after pickling.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1601 from davies/date and squashes the following commits:
    
    f0599b0 [Davies Liu] remove tests for sets and tuple in sql, fix list of list
    c9d607a [Davies Liu] convert datetype for runtime
    709d40d [Davies Liu] remove brackets
    96db384 [Davies Liu] support datetime type for SchemaRDD
    davies authored and marmbrus committed Jul 29, 2014
    Commit f0d880e
  11. [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
    
    Implemented stratified sampling that guarantees exact sample size using ScaSRS, with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.
    
    Author: Doris Xin <doris.s.xin@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1025 from dorx/stratified and squashes the following commits:
    
    245439e [Doris Xin] moved minSamplingRate to getUpperBound
    eaf5771 [Doris Xin] bug fixes.
    17a381b [Doris Xin] fixed a merge issue and a failed unit
    ea7d27f [Doris Xin] merge master
    b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java
    b3013a4 [Xiangrui Meng] move math3 back to test scope
    eecee5f [Doris Xin] Merge branch 'master' into stratified
    f4c21f3 [Doris Xin] Reviewer comments
    a10e68d [Doris Xin] style fix
    a2bf756 [Doris Xin] Merge branch 'master' into stratified
    680b677 [Doris Xin] use mapPartitionWithIndex instead
    9884a9f [Doris Xin] style fix
    bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
    ee9d260 [Doris Xin] addressed reviewer comments
    6b5b10b [Doris Xin] Merge branch 'master' into stratified
    254e03c [Doris Xin] minor fixes and Java API.
    4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
    bd9dc6e [Doris Xin] unit bug and style violation fixed
    1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
    944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
    0214a76 [Doris Xin] cleanUp
    90d94c0 [Doris Xin] merge master
    9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
    7327611 [Doris Xin] merge master
    50581fc [Doris Xin] added a TODO for logging in python
    46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
    7e1a481 [Doris Xin] changed the permission on SamplingUtil
    1d413ce [Doris Xin] fixed checkstyle issues
    9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
    e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
    7cab53a [Doris Xin] fixed import bug in rdd.py
    ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
    1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
    dorx authored and mengxr committed Jul 29, 2014
    Commit dc96536
  12. [SPARK-2393][SQL] Cost estimation optimization framework for Catalyst logical plans & sample usage.
    
    The idea is that every Catalyst logical plan holds a Statistics object, which provides useful estimates of various statistics. See the implementation in `MetastoreRelation` for an example.
    
    This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).
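    
    As a purely illustrative sketch of that planning decision (not the actual planner code; the `Statistics` shape and the threshold parameter follow the description above but are assumptions):
    
    ```scala
    // Hypothetical, simplified version of the size-based join selection described above.
    case class Statistics(sizeInBytes: BigInt)
    
    def chooseEquiJoin(left: Statistics, right: Statistics, broadcastThreshold: BigInt): String =
      if (right.sizeInBytes <= broadcastThreshold) "BroadcastHashJoin (build right)"
      else if (left.sizeInBytes <= broadcastThreshold) "BroadcastHashJoin (build left)"
      else "ShuffleHashJoin"
    ```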
    
    Finally, there are a couple minor accompanying changes including:
    - Remove the not-in-use `BaseRelation`.
    - Make SparkLogicalPlan take a `SQLContext` in the second param list.
    
    Author: Zongheng Yang <zongheng.y@gmail.com>
    
    Closes #1238 from concretevitamin/estimates and squashes the following commits:
    
    329071d [Zongheng Yang] Address review comments; turn config name from string to field in SQLConf.
    8663e84 [Zongheng Yang] Use BigInt for stat; for logical leaves, by default throw an exception.
    2f2fb89 [Zongheng Yang] Fix statistics for SparkLogicalPlan.
    9951305 [Zongheng Yang] Remove childrenStats.
    16fc60a [Zongheng Yang] Avoid calling statistics on plans if auto join conversion is disabled.
    8bd2816 [Zongheng Yang] Add a note on performance of statistics.
    6e594b8 [Zongheng Yang] Get size info from metastore for MetastoreRelation.
    01b7a3e [Zongheng Yang] Update scaladoc for a field and move it to @param section.
    549061c [Zongheng Yang] Remove numTuples in Statistics for now.
    729a8e2 [Zongheng Yang] Update docs to be more explicit.
    573e644 [Zongheng Yang] Remove singleton SQLConf and move back `settings` to the trait.
    2d99eb5 [Zongheng Yang] {Cleanup, use synchronized in, enrich} StatisticsSuite.
    ca5b825 [Zongheng Yang] Inject SQLContext into SparkLogicalPlan, removing SQLConf mixin from it.
    43d38a6 [Zongheng Yang] Revert optimization for BroadcastNestedLoopJoin (this fixes tests).
    0ef9e5b [Zongheng Yang] Use multiplication instead of sum for default estimates.
    4ef0d26 [Zongheng Yang] Make Statistics a case class.
    3ba8f3e [Zongheng Yang] Add comment.
    e5bcf5b [Zongheng Yang] Fix optimization conditions & update scala docs to explain.
    7d9216a [Zongheng Yang] Apply estimation to planning ShuffleHashJoin & BroadcastNestedLoopJoin.
    73cde01 [Zongheng Yang] Move SQLConf back. Assign default sizeInBytes to SparkLogicalPlan.
    73412be [Zongheng Yang] Move SQLConf to Catalyst & add default val for sizeInBytes.
    7a60ab7 [Zongheng Yang] s/Estimates/Statistics, s/cardinality/numTuples.
    de3ae13 [Zongheng Yang] Add parquetAfter() properly in test.
    dcff9bd [Zongheng Yang] Cleanups.
    84301a4 [Zongheng Yang] Refactors.
    5bf5586 [Zongheng Yang] Typo.
    56a8e6e [Zongheng Yang] Prototype impl of estimations for Catalyst logical plans.
    concretevitamin authored and marmbrus committed Jul 29, 2014
    Commit c7db274

Commits on Jul 30, 2014

  1. MAINTENANCE: Automated closing of pull requests.

    This commit exists to close the following pull requests on Github:
    
    Closes #740 (close requested by 'rxin')
    Closes #647 (close requested by 'rxin')
    Closes #1383 (close requested by 'rxin')
    Closes #1485 (close requested by 'pwendell')
    Closes #693 (close requested by 'rxin')
    Closes #478 (close requested by 'JoshRosen')
    pwendell committed Jul 30, 2014
    Commit 2c35666
  2. [SPARK-2716][SQL] Don't check resolved for having filters.

    For queries like `... HAVING COUNT(*) > 9` the expression is always resolved since it contains no attributes.  This was causing us to avoid doing the Having clause aggregation rewrite.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1640 from marmbrus/havingNoRef and squashes the following commits:
    
    92d3901 [Michael Armbrust] Don't check resolved for having filters.
    marmbrus committed Jul 30, 2014
    Commit 39b8193
  3. [SPARK-2631][SQL] Use SQLConf to configure in-memory columnar caching

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1638 from marmbrus/cachedConfig and squashes the following commits:
    
    2362082 [Michael Armbrust] Use SQLConf to configure in-memory columnar caching
    marmbrus committed Jul 30, 2014
    Commit 86534d0
  4. [SPARK-2305] [PySpark] Update Py4J to version 0.8.2.1

    Author: Josh Rosen <joshrosen@apache.org>
    
    Closes #1626 from JoshRosen/SPARK-2305 and squashes the following commits:
    
    03fb283 [Josh Rosen] Update Py4J to version 0.8.2.1.
    JoshRosen authored and mateiz committed Jul 30, 2014
    Commit 22649b6
  5. [SPARK-2054][SQL] Code Generation for Expression Evaluation

    Adds a new method for evaluating expressions using code that is generated through Scala reflection.  This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.
    
    Evaluation can be done in several specialized ways:
     - *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row.  This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
     - *Ordering* - Compares two rows based on a list of `SortOrder` expressions
     - *Condition* - Returns `true` or `false` given an input row.
    
    For each of the above operations there is both a Generated and Interpreted version.  When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class.  Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.
    
    This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code.  Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`.  This functionality will be extended in a future PR.
    
    This PR also performs several clean ups that simplified the implementation:
     - The notion of `Binding` all expressions in a tree automatically before query execution has been removed.  Instead it is the responsibility of an operator to provide the input schema when creating one of the specialized evaluators defined above.  In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`.  There are a few reasons for this change:  First, there were many operators where it just didn't work before.  For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken.  Specifically, we have had a few bugs where partitioning breaks because of the binding.
     - A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner.  Before this was done ad-hoc for the nodes that needed this.  However, this required a lot of boilerplate as one had to always remember to make it `transient` and also had to modify the `otherCopyArgs`.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #993 from marmbrus/newCodeGen and squashes the following commits:
    
    96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
    f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
    67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
    4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
    41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
    de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
    fed3634 [Michael Armbrust] Inspectors are not serializable.
    ef8d42b [Michael Armbrust] comments
    533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
    3cd773e [Michael Armbrust] Allow codegen for Generate.
    64b2ee1 [Michael Armbrust] Implement copy
    3587460 [Michael Armbrust] Drop unused string builder function.
    9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
    1a61293 [Michael Armbrust] Address review comments.
    0672e8a [Michael Armbrust] Address comments.
    1ec2d6e [Michael Armbrust] Address comments
    033abc6 [Michael Armbrust] off by default
    4771fab [Michael Armbrust] Docs, more test coverage.
    d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
    d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
    be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
    bc88ecd [Michael Armbrust] Style
    6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
    4220f1e [Michael Armbrust] Better config, docs, etc.
    ca6cc6b [Michael Armbrust] WIP
    9d67d85 [Michael Armbrust] Fix hive planner
    fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
    e742640 [Michael Armbrust] Remove unneeded changes and code.
    675e679 [Michael Armbrust] Upgrade paradise.
    0093376 [Michael Armbrust] Comment / indenting cleanup.
    d81f998 [Michael Armbrust] include schema for binding.
    0e889e8 [Michael Armbrust] Use typeOf instead tq
    f623ffd [Michael Armbrust] Quiet logging from test suite.
    efad14f [Michael Armbrust] Remove some half finished functions.
    92e74a4 [Michael Armbrust] add overrides
    a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
    marmbrus committed Jul 30, 2014
    Commit 8446746
  6. [SPARK-2568] RangePartitioner should run only one job if data is balanced
    
    As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort).
    
    `RangePartitioner` should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.
    
    The downside is that we need to collect more from each partition in the first pass. An alternative solution is caching the intermediate result and deciding whether to fetch the data afterwards.
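    
    A hypothetical sketch of that single pass is shown below: each partition returns an exact count plus a small reservoir sample in one job. Names and structure are illustrative, not the actual RangePartitioner internals.
    
    ```scala
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag
    import scala.util.Random
    
    def sketchPartitions[K: ClassTag](rdd: RDD[K], sampleSizePerPartition: Int)
        : Array[(Int, Long, Array[K])] = {
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        val reservoir = new Array[K](sampleSizePerPartition)
        val rng = new Random(idx)
        var count = 0L
        iter.foreach { item =>
          if (count < sampleSizePerPartition) {
            reservoir(count.toInt) = item                      // fill the reservoir first
          } else {
            val pos = (rng.nextDouble() * (count + 1)).toLong  // classic reservoir replacement
            if (pos < sampleSizePerPartition) reservoir(pos.toInt) = item
          }
          count += 1
        }
        val sample = reservoir.take(math.min(count, sampleSizePerPartition.toLong).toInt)
        Iterator((idx, count, sample))
      }.collect()
    }
    ```
    
    Partitions whose counts turn out much larger than average can then be re-sampled in a follow-up pass, as described above.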
    
    Author: Xiangrui Meng <meng@databricks.com>
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1562 from mengxr/range-partitioner and squashes the following commits:
    
    6cc2551 [Xiangrui Meng] change foreach to for
    eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner
    eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl
    c436d30 [Xiangrui Meng] fix binary metrics unit tests
    db58a55 [Xiangrui Meng] add unit tests
    a6e35d6 [Xiangrui Meng] minor update
    60be09e [Xiangrui Meng] remove importance sampler
    9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data
    cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
    06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
    17bcbf3 [Reynold Xin] Added seed.
    badf20d [Reynold Xin] Renamed the method.
    6940010 [Reynold Xin] Reservoir sampling implementation.
    mengxr authored and rxin committed Jul 30, 2014
    Commit 2e6efca
  7. [SQL] Handle null values in debug()

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1646 from marmbrus/nullDebug and squashes the following commits:
    
    49050a8 [Michael Armbrust] Handle null values in debug()
    marmbrus committed Jul 30, 2014
    Commit 077f633
  8. [SPARK-2260] Fix standalone-cluster mode, which was broken

    The main thing was that spark configs were not propagated to the driver, and so applications that do not specify `master` or `appName` automatically failed. This PR fixes that and a couple of miscellaneous things that are related.
    
    One thing that may or may not be an issue is that the jars must be available on the driver node. In `standalone-cluster` mode, this effectively means these jars must be available on all the worker machines, since the driver is launched on one of them. The semantics here are not the same as `yarn-cluster` mode,  where all the relevant jars are uploaded to a distributed cache automatically and shipped to the containers. This is probably not a concern, but still worth a mention.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1538 from andrewor14/standalone-cluster and squashes the following commits:
    
    8c11a0d [Andrew Or] Clean up imports / comments (minor)
    2678d13 [Andrew Or] Handle extraJavaOpts properly
    7660547 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-cluster
    6f64a9b [Andrew Or] Revert changes in YARN
    2f2908b [Andrew Or] Fix tests
    ed01491 [Andrew Or] Don't go overboard with escaping
    8e105e1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-cluster
    b890949 [Andrew Or] Abstract usages of converting spark opts to java opts
    79f63a3 [Andrew Or] Move sparkProps into javaOpts
    78752f8 [Andrew Or] Fix tests
    5a9c6c7 [Andrew Or] Fix line too long
    c141a00 [Andrew Or] Don't display "unknown app" on driver log pages
    d7e2728 [Andrew Or] Avoid deprecation warning in standalone Client
    6ceb14f [Andrew Or] Allow relevant configs to propagate to standalone Driver
    7f854bc [Andrew Or] Fix test
    855256e [Andrew Or] Fix standalone-cluster mode
    fd9da51 [Andrew Or] Formatting changes (minor)
    andrewor14 authored and pwendell committed Jul 30, 2014
    Commit 4ce92cc
  9. [SPARK-2179][SQL] Public API for DataTypes and Schema

    The current PR contains the following changes:
    * Expose `DataType`s in the sql package (internal details are private to sql).
    * Users can create Rows.
    * Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
    * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
    * `ScalaReflection.typeOfObject` provides a way to infer the Catalyst data type based on an object. Also, we can compose `typeOfObject` with some custom logics to form a new function to infer the data type (for different use cases).
    * `JsonRDD` has been refactored to use changes introduced by this PR.
    * Add a field `containsNull` to `ArrayType`. So, we can explicitly mark if an `ArrayType` can contain null values. The default value of `containsNull` is `false`.
    
    New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
    [sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).
    
    An example of using `applySchema` is shown below.
    ```scala
    import org.apache.spark.sql._
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
    val schema =
      StructType(
        StructField("name", StringType, false) ::
        StructField("age", IntegerType, true) :: Nil)
    
    val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
    val peopleSchemaRDD = sqlContext.applySchema(people, schema)
    peopleSchemaRDD.printSchema
    // root
    // |-- name: string (nullable = false)
    // |-- age: integer (nullable = true)
    
    peopleSchemaRDD.registerAsTable("people")
    sqlContext.sql("select name from people").collect.foreach(println)
    ```
    
    I will add new contents to the SQL programming guide later.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2179
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:
    
    1d45977 [Yin Huai] Clean up.
    a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    c712fbf [Yin Huai] Converts types of values based on defined schema.
    4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    e5f8df5 [Yin Huai] Scaladoc.
    122d1e7 [Yin Huai] Address comments.
    03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    2476ed0 [Yin Huai] Minor updates.
    ab71f21 [Yin Huai] Format.
    fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    bd40a33 [Yin Huai] Address comments.
    991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
    1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
    3edb3ae [Yin Huai] Python doc.
    692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    1d93395 [Yin Huai] Python APIs.
    246da96 [Yin Huai] Add java data type APIs to javadoc index.
    1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    d48fc7b [Yin Huai] Minor updates.
    33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    b9f3071 [Yin Huai] Java API for applySchema.
    1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
    624765c [Yin Huai] Tests for applySchema.
    aa92e84 [Yin Huai] Update data type tests.
    8da1a17 [Yin Huai] Add Row.fromSeq.
    9c99bc0 [Yin Huai] Several minor updates.
    1d9c13a [Yin Huai] Update applySchema API.
    85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    e495e4e [Yin Huai] More comments.
    42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
    68525a2 [Yin Huai] Update JSON unit test.
    3209108 [Yin Huai] Add unit tests.
    dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
    9168b83 [Yin Huai] Update comments.
    fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
    949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
    7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
    43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
    0266761 [Yin Huai] Format
    03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
    3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
    16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
    yhuai authored and marmbrus committed Jul 30, 2014
    Commit 7003c16
  10. SPARK-2543: Allow user to set maximum Kryo buffer size

    Author: Koert Kuipers <koert@tresata.com>
    
    Closes #735 from koertkuipers/feat-kryo-max-buffersize and squashes the following commits:
    
    15f6d81 [Koert Kuipers] change default for spark.kryoserializer.buffer.max.mb to 64mb and add some documentation
    1bcc22c [Koert Kuipers] Merge branch 'master' into feat-kryo-max-buffersize
    0c9f8eb [Koert Kuipers] make default for kryo max buffer size 16MB
    143ec4d [Koert Kuipers] test resizable buffer in kryo Output
    0732445 [Koert Kuipers] support setting maxCapacity to something different than capacity in kryo Output
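    
    As a usage sketch (based only on the property name mentioned in the squashed commits above; the value chosen here is illustrative):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Hypothetical usage: raise the maximum Kryo buffer size via the property referenced above.
    val conf = new SparkConf()
      .set("spark.kryoserializer.buffer.max.mb", "128")
    ```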
    koertkuipers authored and pwendell committed Jul 30, 2014
    Commit 7c5fc28
  11. SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to Math.exp, Math.log
    
    In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However, the correct answer is very near `p`. This is why `Math.log1p` exists.
    
    Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.
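    
    A tiny illustration (not from the patch) of the difference for an extreme argument:
    
    ```scala
    val p = 1e-18
    println(math.log(1.0 + p))  // 0.0, because 1.0 + p == 1.0 in double precision
    println(math.log1p(p))      // ~1.0e-18, the nearly exact answer
    
    val m = 1e-18
    println(math.exp(m) - 1.0)  // 0.0, for the same reason
    println(math.expm1(m))      // ~1.0e-18
    ```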
    
    While the errors occur only for very small arguments, such arguments are entirely possible given these expressions' use in machine learning algorithms.
    
    Also note the related PR for Python: #1652
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1659 from srowen/SPARK-2748 and squashes the following commits:
    
    c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
    srowen authored and mengxr committed Jul 30, 2014
    Commit ee07541
  12. [SPARK-2521] Broadcast RDD object (instead of sending it along with every task)
    
    This is a resubmission of #1452. It was reverted because it broke the build.
    
    Currently (as of Spark 1.0.1), Spark sends the RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send the RDD object multiple times to the executors. This is especially bad when a closure references a very large variable. The current design led to users having to explicitly broadcast large variables.
    
    The patch uses broadcast to send RDD objects and closures to executors, and uses Akka to send only a reference to the broadcast RDD/closure along with the partition-specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses a Hadoop input, because the JobConf is large.
    
    The user-facing impact of the change include:
    
    1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations
    2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput.
    
    In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently).
    
    A simple way to test this:
    ```scala
    val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
    sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
    ```
    
    Numbers on 3 r3.8xlarge instances on EC2
    ```
    master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
    with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
    ```
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1498 from rxin/broadcast-task and squashes the following commits:
    
    f7364db [Reynold Xin] Code review feedback.
    f8535dc [Reynold Xin] Fixed the style violation.
    252238d [Reynold Xin] Serialize the final task closure as well as ShuffleDependency in taskBinary.
    111007d [Reynold Xin] Fix broadcast tests.
    797c247 [Reynold Xin] Properly send SparkListenerStageSubmitted and SparkListenerStageCompleted.
    bab1d8b [Reynold Xin] Check for NotSerializableException in submitMissingTasks.
    cf38450 [Reynold Xin] Use TorrentBroadcastFactory.
    991c002 [Reynold Xin] Use HttpBroadcast.
    de779f8 [Reynold Xin] Fix TaskContextSuite.
    cc152fc [Reynold Xin] Don't cache the RDD broadcast variable.
    d256b45 [Reynold Xin] Fixed unit test failures. One more to go.
    cae0af3 [Reynold Xin] [SPARK-2521] Broadcast RDD object (instead of sending it along with every task).
    rxin committed Jul 30, 2014
    Commit 774142f
  13. [SPARK-2747] git diff --dirstat can miss sql changes and not run Hive tests
    
    dev/run-tests uses "git diff --dirstat master" to check whether sql has changed. However, --dirstat won't show sql if the sql change is negligible (e.g., a 1k LOC change in core and only a 1 LOC change in hive).
    
    We should use "git diff --name-only master" instead.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1656 from rxin/hiveTest and squashes the following commits:
    
    f5eab9f [Reynold Xin] [SPARK-2747] git diff --dirstat can miss sql changes and not run Hive tests.
    rxin committed Jul 30, 2014
    Commit 3bc3f18
  14. Avoid numerical instability

    This basically avoids computing 1 - 1; for example:
    
    ```python
    >>> from math import exp
    >>> margin = -40
    >>> 1 - 1 / (1 + exp(margin))
    0.0
    >>> exp(margin) / (1 + exp(margin))
    4.248354255291589e-18
    >>>
    ```
    
    Author: Naftali Harris <naftaliharris@gmail.com>
    
    Closes #1652 from naftaliharris/patch-2 and squashes the following commits:
    
    0d55a9f [Naftali Harris] Avoid numerical instability
    naftaliharris authored and mengxr committed Jul 30, 2014
    Commit e3d85b7
  15. [SPARK-2544][MLLIB] Improve ALS algorithm resource usage

    Author: GuoQiang Li <witgo@qq.com>
    Author: witgo <witgo@qq.com>
    
    Closes #929 from witgo/improve_als and squashes the following commits:
    
    ea25033 [GuoQiang Li] checkpoint products 3,6,9 ...
    154dccf [GuoQiang Li] checkpoint products only
    c5779ff [witgo] Improve ALS algorithm resource usage
    witgo authored and mengxr committed Jul 30, 2014
    Commit fc47bb6
  16. [SPARK-2746] Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1655 from rxin/SBT_MAVEN_PROFILES and squashes the following commits:
    
    b268c4b [Reynold Xin] [SPARK-2746] Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user.
    rxin committed Jul 30, 2014
    Commit ff511ba
  17. Wrap FWDIR in quotes.

    rxin committed Jul 30, 2014
    Commit f2eb84f
  18. Commit 95cf203
  19. More wrapping FWDIR in quotes.

    rxin committed Jul 30, 2014
    Commit 0feb349
  20. [SQL] Fix compiling of catalyst docs.

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1653 from marmbrus/fixDocs and squashes the following commits:
    
    0aa1feb [Michael Armbrust] Fix compiling of catalyst docs.
    marmbrus committed Jul 30, 2014
    Commit 2248891
  21. Commit 437dc8c
  22. [SPARK-2024] Add saveAsSequenceFile to PySpark

    JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
    
    This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
    
    * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
    
    * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
    
    * No out-of-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
    
    * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
    
    cc MLnick mateiz ahirreddy pwendell
    
    Author: Kan Zhang <kzhang@apache.org>
    
    Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
    
    c01e3ef [Kan Zhang] [SPARK-2024] code formatting
    6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
    d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
    57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
    75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
    0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
    9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
    0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
    7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
    kanzhang authored and JoshRosen committed Jul 30, 2014
    Commit 94d1f46
  23. Commit 7c7ce54
  24. Commit 1097327
  25. Commit 2f4b170
  26. SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
    
    The Maven-based builds in the build matrix have been failing for a few days:
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark/
    
    On inspection, it looks like the Spark SQL Java tests don't compile:
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
    
    I confirmed it by repeating the command vs master:
    
    `mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`
    
    The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on `com.novocode:junit-interface`
    
    Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1660 from srowen/SPARK-2749 and squashes the following commits:
    
    858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
    9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
    srowen authored and rxin committed Jul 30, 2014
    Commit 6ab96a6

Commits on Jul 31, 2014

  1. SPARK-2741 - Publish version of spark assembly which does not contain Hive

    Provide a version of the Spark tarball which does not package Hive. This is meant for Hive + Spark users.
    
    Author: Brock Noland <brock@apache.org>
    
    Closes #1667 from brockn/master and squashes the following commits:
    
    5beafb2 [Brock Noland] SPARK-2741 - Publish version of spark assembly which does not contain Hive
    Brock Noland authored and pwendell committed Jul 31, 2014
    Commit 2ac37db
  2. [SPARK-2734][SQL] Remove tables from cache when DROP TABLE is run.

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1650 from marmbrus/dropCached and squashes the following commits:
    
    e6ab80b [Michael Armbrust] Support if exists.
    83426c6 [Michael Armbrust] Remove tables from cache when DROP TABLE is run.
    marmbrus committed Jul 31, 2014
    Commit 88a519d
  3. SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets

    Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1663 from srowen/SPARK-2341 and squashes the following commits:
    
    8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
    18a8c8e [Sean Owen] Updates from review
    83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
    srowen authored and mengxr committed Jul 31, 2014
    Commit e9b275b
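
    For reference, a minimal loading sketch (`sc` is an existing SparkContext; the path is illustrative). With this change the label is always parsed as a double, so the same call also covers regression targets:

    ```scala
    import org.apache.spark.mllib.util.MLUtils

    // Load a LibSVM-format file; labels come back as doubles (0.0/1.0 class labels
    // or a continuous regression target), so no multiclass flag is needed.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    ```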
  4. Update DecisionTreeRunner.scala

    Author: strat0sphere <stratos.dimopoulos@gmail.com>
    
    Closes #1676 from strat0sphere/patch-1 and squashes the following commits:
    
    044d2fa [strat0sphere] Update DecisionTreeRunner.scala
    strat0sphere authored and mengxr committed Jul 31, 2014
    Commit da50176
  5. SPARK-2045 Sort-based shuffle

    This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)
    
    The main TODOs still left are:
    - [x] enabling ExternalSorter to merge across spilled files
      - [x] with an Ordering
      - [x] without an Ordering, using the keys' hash codes
    - [x] adding more tests (e.g. a version of our shuffle suite that runs on this)
    - [x] rebasing on top of the size-tracking refactoring in #1165 when that is merged
    - [x] disabling spilling if spark.shuffle.spill is set to false
    
    Despite this though, this seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create much fewer files for the OS to keep track of). So I'm posting it to get some early feedback.
    
    After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`.
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes #1499 from mateiz/sort-based-shuffle and squashes the following commits:
    
    bd841f9 [Matei Zaharia] Various review comments
    d1c137f [Matei Zaharia] Various review comments
    a611159 [Matei Zaharia] Compile fixes due to rebase
    62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s.
    f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic)
    9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase
    0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle
    eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD
    fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams
    a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test
    03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle
    3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter
    ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer
    44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes
    5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data:
    5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition)
    e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it)
    c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty
    de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark]
    4988d16 [Matei Zaharia] tweak
    c1b7572 [Matei Zaharia] Small optimization
    ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions
    ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering
    4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given
    e1f84be [Matei Zaharia] Fix disk block manager test
    5a40a1c [Matei Zaharia] More tests
    614f1b4 [Matei Zaharia] Add spill metrics to map tasks
    cc52caf [Matei Zaharia] Add more error handling and tests for error cases
    bbf359d [Matei Zaharia] More work
    3a56341 [Matei Zaharia] More partial work towards sort-based shuffle
    7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle
    b615476 [Matei Zaharia] Scaffolding for sort-based shuffle
    mateiz authored and rxin committed Jul 31, 2014
    Commit e966284
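
    As a rough illustration of the core idea (a toy sketch, not the ExternalSorter code itself): map output is tagged with its target partition ID and sorted by that ID, so each map task can write a single file with contiguous runs per reducer.

    ```scala
    // Toy sketch: bucket records by partition ID, then sort by that ID so all
    // records destined for the same reducer end up contiguous in one output file.
    val numPartitions = 4
    val records = Seq("apple" -> 1, "pear" -> 2, "kiwi" -> 3, "plum" -> 4)
    val byPartition = records.map { case (k, v) => (math.abs(k.hashCode) % numPartitions, (k, v)) }
    val sorted = byPartition.sortBy(_._1)   // contiguous runs per partition ID
    ```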
  6. [SPARK-2758] UnionRDD's UnionPartition should not reference parent RDDs

    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1675 from rxin/unionrdd and squashes the following commits:
    
    941d316 [Reynold Xin] Clear RDDs for checkpointing.
    c9f05f2 [Reynold Xin] [SPARK-2758] UnionRDD's UnionPartition should not reference parent RDDs
    rxin committed Jul 31, 2014
    Commit 894d48f
  7. Required AM memory is "amMem", not "args.amMemory"

    Without this change, "ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster" is reported even though 1024 is clearly below 1048, because the check compares against "args.amMemory" instead of the total "amMem". This fixes the comparison.
    
    Author: derek ma <maji3@asiainfo-linkage.com>
    
    Closes #1494 from maji2014/master and squashes the following commits:
    
    b0f6640 [derek ma] Required AM memory is "amMem", not "args.amMemory"
    maji2014 authored and pwendell committed Jul 31, 2014
    Commit 118c1c4
  8. [SPARK-2340] Resolve event logging and History Server paths properly

    We resolve relative paths to the local `file:/` system for `--jars` and `--files` in spark submit (#853). We should do the same for the history server.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1280 from andrewor14/hist-serv-fix and squashes the following commits:
    
    13ff406 [Andrew Or] Merge branch 'master' of github.com:apache/spark into hist-serv-fix
    b393e17 [Andrew Or] Strip trailing "/" from logging directory
    622a471 [Andrew Or] Fix test in EventLoggingListenerSuite
    0e20f71 [Andrew Or] Shift responsibility of resolving paths up one level
    b037c0c [Andrew Or] Use resolved paths for everything in history server
    c7e36ee [Andrew Or] Resolve paths for event logging too
    40e3933 [Andrew Or] Resolve history server file paths
    andrewor14 authored and pwendell committed Jul 31, 2014
    Commit a7c305b
  9. [SPARK-2737] Add retag() method for changing RDDs' ClassTags.

    The Java API's use of fake ClassTags doesn't seem to cause any problems for Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect() on a Scala RDD with an incorrect ClassTag, this causes ClassCastExceptions when we try to allocate an array of the wrong type (for example, see SPARK-2197).
    
    There are a few possible fixes here. An API-breaking fix would be to completely remove the fake ClassTags and require Java API users to pass java.lang.Class instances to all parallelize() calls and add returnClass fields to all Function implementations. This would be extremely verbose.
    
    Instead, this patch adds internal APIs to "repair" a Scala RDD with an incorrect ClassTag by wrapping it and overriding its ClassTag. This should be okay for cases where the Scala code that calls collect() knows what type of array should be allocated, which is the case in the MLlib wrappers.
    
    Author: Josh Rosen <joshrosen@apache.org>
    
    Closes #1639 from JoshRosen/SPARK-2737 and squashes the following commits:
    
    572b4c8 [Josh Rosen] Replace newRDD[T] with mapPartitions().
    469d941 [Josh Rosen] Preserve partitioner in retag().
    af78816 [Josh Rosen] Allow retag() to get classTag implicitly.
    d1d54e6 [Josh Rosen] [SPARK-2737] Add retag() method for changing RDDs' ClassTags.
    JoshRosen committed Jul 31, 2014
    Commit 4fb2593
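
    A minimal sketch of the wrapping idea described above, written as a standalone helper rather than the internal RDD method:

    ```scala
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Re-wrap an RDD so downstream code sees the given ClassTag; an identity
    // mapPartitions keeps the data and partitioning while fixing the element tag.
    def retagged[T](rdd: RDD[T])(implicit ct: ClassTag[T]): RDD[T] =
      rdd.mapPartitions(iter => iter, preservesPartitioning = true)(ct)
    ```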
  10. [SPARK-2497] Included checks for module symbols too.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes #1463 from ScrapCodes/SPARK-2497/mima-exclude-all and squashes the following commits:
    
    72077b1 [Prashant Sharma] Check separately for module symbols.
    cd96192 [Prashant Sharma] SPARK-2497 Produce "member excludes" irrespective of the fact that class itself is excluded or not.
    ScrapCodes authored and pwendell committed Jul 31, 2014
    Commit 5a110da
  11. automatically set master according to spark.master in `spark-defaults.conf`
    
    automatically set master according to `spark.master` in `spark-defaults.conf`
    
    Author: CrazyJvm <crazyjvm@gmail.com>
    
    Closes #1644 from CrazyJvm/standalone-guide and squashes the following commits:
    
    bb12b95 [CrazyJvm] automatically set master according to `spark.master` in `spark-defaults.conf`
    CrazyJvm authored and pwendell committed Jul 31, 2014
    Commit 669e3f0
  12. [SPARK-2762] SparkILoop leaks memory in multi-repl configurations

    This pull request is a small refactor so that a partial function (hence a closure) is not created. Instead, a regular function is used. The behavior of the code is not changed.
    
    Author: Timothy Hunter <timhunter@databricks.com>
    
    Closes #1674 from thunterdb/closure_issue and squashes the following commits:
    
    e1e664d [Timothy Hunter] simplify closure
    thunterdb authored and mateiz committed Jul 31, 2014
    Commit 92ca910
  13. [SPARK-2743][SQL] Resolve original attributes in ParquetTableScan

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1647 from marmbrus/parquetCase and squashes the following commits:
    
    a1799b7 [Michael Armbrust] move comment
    2a2a68b [Michael Armbrust] Merge remote-tracking branch 'apache/master' into parquetCase
    bb35d5b [Michael Armbrust] Fix test case that produced an invalid plan.
    e6870bf [Michael Armbrust] Better error message.
    539a2e1 [Michael Armbrust] Resolve original attributes in ParquetTableScan
    marmbrus committed Jul 31, 2014
    Commit 3072b96
  14. [SPARK-2397][SQL] Deprecate LocalHiveContext

    LocalHiveContext is redundant with HiveContext.  The only difference is it creates `./metastore` instead of `./metastore_db`.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1641 from marmbrus/localHiveContext and squashes the following commits:
    
    e5ec497 [Michael Armbrust] Add deprecation version
    626e056 [Michael Armbrust] Don't remove from imports yet
    905cc5f [Michael Armbrust] Merge remote-tracking branch 'apache/master' into localHiveContext
    1c2727e [Michael Armbrust] Deprecate LocalHiveContext
    marmbrus committed Jul 31, 2014
    Commit 72cfb13
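
    With the deprecation, a plain HiveContext is the suggested replacement; a minimal sketch (master and app name are illustrative):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("HiveExample"))
    // HiveContext keeps its metastore in ./metastore_db rather than ./metastore.
    val hiveContext = new HiveContext(sc)
    hiveContext.hql("SHOW TABLES")
    ```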
  15. SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD

    This allows users to gain access to the InputSplit which backs each partition.
    
    An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes #973 from aarondav/hadoop and squashes the following commits:
    
    9c9112b [Aaron Davidson] Add JavaAPISuite test
    9942cd7 [Aaron Davidson] Add Java API
    1284a3a [Aaron Davidson] SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD
    aarondav authored and mateiz committed Jul 31, 2014
    Commit f193312
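
    A hedged usage sketch of the new API (path and the cast are illustrative; the method lives on HadoopRDD, so the RDD returned by hadoopFile is cast accordingly):

    ```scala
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // Tag every line with the file it came from, using the InputSplit backing each partition.
    val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]
    val linesWithFile = hadoopRdd.mapPartitionsWithInputSplit { (split: InputSplit, iter) =>
      val file = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (file, line.toString) }
    }
    ```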
  16. SPARK-2664. Deal with --conf options in spark-submit that relate to flags
    
    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes #1665 from sryza/sandy-spark-2664 and squashes the following commits:
    
    0518c63 [Sandy Ryza] SPARK-2664. Deal with `--conf` options in spark-submit that relate to flags
    sryza authored and pwendell committed Jul 31, 2014
    Commit f68105d
  17. SPARK-2749 [BUILD] Part 2. Fix a follow-on scalastyle error

    The test compile error is fixed, but the build still fails because of one scalastyle error.
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/lastFailedBuild/hadoop.version=1.0.4,label=centos/console
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1690 from srowen/SPARK-2749 and squashes the following commits:
    
    1c9e7a6 [Sean Owen] Also: fix scalastyle error by wrapping a long line
    srowen authored and pwendell committed Jul 31, 2014
    Commit 4dbabb3
  18. SPARK-2646. log4j initialization not quite compatible with log4j 2.x

    The logging code that handles log4j initialization leads to a stack overflow error when used with log4j 2.x, which has just been released. This occurs even when a downstream project has correctly adjusted its SLF4J bindings, which is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x.
    
    Here is the relevant bit of Logging.scala:
    
    ```
      private def initializeLogging() {
        // If Log4j is being used, but is not initialized, load a default properties file
        val binder = StaticLoggerBinder.getSingleton
        val usingLog4j = binder.getLoggerFactoryClassStr.endsWith("Log4jLoggerFactory")
        val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
        if (!log4jInitialized && usingLog4j) {
          val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
          Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
            case Some(url) =>
              PropertyConfigurator.configure(url)
              log.info(s"Using Spark's default log4j profile: $defaultLogProps")
            case None =>
              System.err.println(s"Spark was unable to load $defaultLogProps")
          }
        }
        Logging.initialized = true
    
        // Force a call into slf4j to initialize it. Avoids this happening from multiple threads
        // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
        log
      }
    ```
    
    The first minor issue is that there is a call to a logger inside this method, which is initializing logging. In this situation, it ends up causing the initialization to be called recursively until the stack overflow. It would be slightly tidier to log this only after Logging.initialized = true. Or not at all. But it's not the root problem, or else, it would not work at all now.
    
    The calls to log4j classes here always reference log4j 1.2 no matter what. For example, there is no getAllAppenders in log4j 2.x. That's fine. Really, "usingLog4j" means "using log4j 1.2" and "log4jInitialized" means "log4j 1.2 is initialized".
    
    usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But, it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop.
    
    This is fixed, I believe, if "usingLog4j" can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However they're in different packages. For example, if the test included "... and begins with org.slf4j", it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j.
    
    Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made.
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1547 from srowen/SPARK-2646 and squashes the following commits:
    
    92a9898 [Sean Owen] System.out -> System.err
    94be4c7 [Sean Owen] Add back log message as System.out, with informational comment
    a7f8876 [Sean Owen] Updates from review
    6f3c1d3 [Sean Owen] Remove log statement in logging initialization, and distinguish log4j 1.2 from 2.0, to avoid stack overflow in initialization
    srowen authored and pwendell committed Jul 31, 2014
    Commit e5749a1
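
    A sketch of the check the description argues for (the exact form is an assumption, not the merged code): treat the binding as log4j 1.2 only if the binder class lives under org.slf4j.

    ```scala
    import org.slf4j.impl.StaticLoggerBinder

    val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
    // The slf4j-log4j12 binder lives under org.slf4j; the log4j 2.x binding lives under
    // org.apache.logging.slf4j, so this stays false there and the 1.2-only init is skipped.
    val usingLog4j12 = binderClass.startsWith("org.slf4j") && binderClass.endsWith("Log4jLoggerFactory")
    ```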
  19. [SPARK-2511][MLLIB] add HashingTF and IDF

    This is roughly the TF-IDF implementation used in the Databricks Cloud Demo: http://databricks.com/cloud/ .
    
    Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1671 from mengxr/tfidf and squashes the following commits:
    
    7d65888 [Xiangrui Meng] use JavaConverters._
    5fe9ec4 [Xiangrui Meng] fix unit test
    6e214ec [Xiangrui Meng] add apache header
    cfd9aed [Xiangrui Meng] add Java-friendly methods move classes to mllib.feature
    3814440 [Xiangrui Meng] add HashingTF and IDF
    mengxr committed Jul 31, 2014
    Commit dc0865b
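
    A short usage sketch of the two transformers (file path and tokenization are illustrative; `sc` is an existing SparkContext):

    ```scala
    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    // Term frequencies via feature hashing, then inverse document frequencies fit on the corpus.
    val documents = sc.textFile("docs.txt").map(_.split(" ").toSeq)
    val tf = new HashingTF().transform(documents)
    tf.cache()
    val tfidf = new IDF().fit(tf).transform(tf)
    ```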
  20. [SPARK-2523] [SQL] Hadoop table scan bug fixing (fix failing Jenkins maven test)
    
    This PR tries to resolve the broken Jenkins maven test issue introduced by #1439. Now, we create a single query test to run both the setup work and the test query.
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1669 from yhuai/SPARK-2523-fixTest and squashes the following commits:
    
    358af1a [Yin Huai] Make partition_based_table_scan_with_different_serde run atomically.
    yhuai authored and marmbrus committed Jul 31, 2014
    Commit 49b3612
  21. Improvements to merge_spark_pr.py

    This commit fixes a couple of issues in the merge_spark_pr.py developer script:
    
    - Allow recovery from failed cherry-picks.
    - Fix detection of pull requests that have already been merged.
    
    Both of these fixes are useful when backporting changes.
    
    Author: Josh Rosen <joshrosen@apache.org>
    
    Closes #1668 from JoshRosen/pr-script-improvements and squashes the following commits:
    
    ff4f33a [Josh Rosen] Default SPARK_HOME to cwd(); detect missing JIRA credentials.
    ed5bc57 [Josh Rosen] Improvements for backporting using merge_spark_pr:
    JoshRosen committed Jul 31, 2014
    Commit e021362
  22. Docs: monitoring, streaming programming guide

    Fix several awkward wordings and grammatical issues in the following
    documents:
    
    *   docs/monitoring.md
    
    *   docs/streaming-programming-guide.md
    
    Author: kballou <kballou@devnulllabs.io>
    
    Closes #1662 from kennyballou/grammar_fixes and squashes the following commits:
    
    e1b8ad6 [kballou] Docs: monitoring, streaming programming guide
    kennyballou authored and JoshRosen committed Jul 31, 2014
    Commit cc82050
  23. SPARK-2740: allow user to specify ascending and numPartitions for sortByKey
    
    It should be more convenient if user can specify ascending and numPartitions when calling sortByKey.
    
    Author: Rui Li <rui.li@intel.com>
    
    Closes #1645 from lirui-intel/spark-2740 and squashes the following commits:
    
    fb5d52e [Rui Li] SPARK-2740: allow user to specify ascending and numPartitions for sortByKey
    Rui Li authored and JoshRosen committed Jul 31, 2014
    Commit 492a195
  24. SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark

    Prior to this change, every PySpark task completion opened a new socket to the accumulator server, passed its updates through, and then quit. I'm not entirely sure why PySpark always sends accumulator updates, but regardless this causes a very rapid buildup of ephemeral TCP connections that remain in the TIME_WAIT state for around a minute before being cleaned up.
    
    Rather than trying to allow these sockets to be cleaned up faster, this patch simply reuses the connection between tasks completions (since they're fed updates in a single-threaded manner by the DAGScheduler anyway).
    
    The only tricky part here was making sure that the AccumulatorServer was able to shutdown in a timely manner (i.e., stop polling for new data), and this was accomplished via minor feats of magic.
    
    I have confirmed that this patch eliminates the buildup of ephemeral sockets due to the accumulator updates. However, I did note that there were still significant sockets being created against the PySpark daemon port, but my machine was not able to create enough sockets fast enough to fail. This may not be the last time we've seen this issue, though.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes #1503 from aarondav/accum and squashes the following commits:
    
    b3e12f7 [Aaron Davidson] SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark
    aarondav authored and JoshRosen committed Jul 31, 2014
    Commit ef4ff00

Commits on Aug 1, 2014

  1. [SPARK-2531 & SPARK-2436] [SQL] Optimize the BuildSide when planning …

    …BroadcastNestedLoopJoin.
    
    This PR resolves the following two tickets:
    
    - [SPARK-2531](https://issues.apache.org/jira/browse/SPARK-2531): BNLJ currently assumes the build side is the right relation. This patch refactors some of its logic to take into account a BuildSide properly.
    - [SPARK-2436](https://issues.apache.org/jira/browse/SPARK-2436): building on top of the above, we simply use the physical size statistics (if available) of both relations, and make the smaller relation the build side in the planner.
    
    Author: Zongheng Yang <zongheng.y@gmail.com>
    
    Closes #1448 from concretevitamin/bnlj-buildSide and squashes the following commits:
    
    1780351 [Zongheng Yang] Use size estimation to decide optimal build side of BNLJ.
    68e6c5b [Zongheng Yang] Consolidate two adjacent pattern matchings.
    96d312a [Zongheng Yang] Use a while loop instead of collection methods chaining.
    4bc525e [Zongheng Yang] Make BroadcastNestedLoopJoin take a BuildSide.
    concretevitamin authored and marmbrus committed Aug 1, 2014
    Commit 8f51491
  2. [SPARK-2724] Python version of RandomRDDGenerators

    A Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take in an arbitrary DistributionGenerator.
    
    `randomRDD.py` is named to avoid collision with the built-in Python `random` package.
    
    Author: Doris Xin <doris.s.xin@gmail.com>
    
    Closes #1628 from dorx/pythonRDD and squashes the following commits:
    
    55c6de8 [Doris Xin] review comments. all python units passed.
    f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
    2d73917 [Doris Xin] fix for linalg.py
    8663e6a [Doris Xin] reverting back to a single python file for random
    f47c481 [Doris Xin] docs update
    687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
    4338f40 [Doris Xin] renamed randomRDD to rand and import as random
    29d205e [Doris Xin] created mllib.random package
    bd2df13 [Doris Xin] typos
    07ddff2 [Doris Xin] units passed.
    23b2ecd [Doris Xin] WIP
    dorx authored and mengxr committed Aug 1, 2014
    Commit d843014
  3. [SPARK-2756] [mllib] Decision tree bug fixes

    (1) Inconsistent aggregate (agg) indexing for unordered features.
    (2) Fixed gain calculations for edge cases.
    (3) Off-by-one error in choosing thresholds for continuous features for small datasets.
    (4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)
    
    Other updates, to help with tests:
    * Updated DecisionTreeRunner to print more info.
    * Added utility functions to DecisionTreeModel: toString, depth, numNodes
    * Improved internal DecisionTree documentation
    
    Bug fix details:
    
    (1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).
    
    * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
    ** featureValue was from arr (so it was a feature value)
    ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
    * The rest of the code indexed agg as (node, feature, binIndex, label).
    * Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.
    
    Unit tests in DecisionTreeSuite
    * Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
    * Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.
    
    (2) Bug fix: calculateGainForSplit (for classification):
    * It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.
    
    Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).
    
    Small updates to documentation and coding style.
    
    (3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
    
    * Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
    * Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
    * Fix: The threshold is set to be the average of 2 consecutive (sorted) examples’ feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold using the average of the feature values of examples i and i+1.
    * Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.
    
    CC: mengxr manishamde  Please let me know if I missed something!
    
    Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
    
    Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:
    
    2b20c61 [Joseph K. Bradley] Small doc and style updates
    dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
    8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
    376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
    59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
    52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
    8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
    2283df8 [Joseph K. Bradley] 2 bug fixes.
    73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
    jkbradley authored and mengxr committed Aug 1, 2014
    Commit b124de5
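
    A tiny numeric illustration of the off-by-one fix described in (3), using assumed feature values:

    ```scala
    // Sorted feature values seen for one continuous feature (assumed example data).
    val sortedValues = Array(1.0, 2.0, 5.0, 9.0)
    // New behaviour: candidate thresholds are midpoints of consecutive sorted values,
    // rather than the individual example values themselves.
    val thresholds = sortedValues.sliding(2).map(p => (p(0) + p(1)) / 2.0).toArray
    // thresholds == Array(1.5, 3.5, 7.0)
    ```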
  4. [SPARK-2779] [SQL] asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map
    
    Since we let users create Rows, it makes sense to accept mutable Maps as values of MapType columns.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2779
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1705 from yhuai/SPARK-2779 and squashes the following commits:
    
    00d72fd [Yin Huai] Use scala.collection.Map.
    yhuai authored and marmbrus committed Aug 1, 2014
    Commit 9632719
  5. SPARK-2766: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6
    
    Author: GuoQiang Li <witgo@qq.com>
    
    Closes #1683 from witgo/SPARK-2766 and squashes the following commits:
    
    d0db00c [GuoQiang Li] ScalaReflectionSuite  throw an llegalArgumentException in JDK 6
    witgo authored and pwendell committed Aug 1, 2014
    Commit 9998efa
  6. [SPARK-2777][MLLIB] change ALS factors storage level to MEMORY_AND_DISK

    Now the factors are persisted in memory only. If they get evicted by later jobs, we might have to restart the computation from the very beginning. A better solution is changing the storage level to `MEMORY_AND_DISK`.
    
    srowen
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1700 from mengxr/als-level and squashes the following commits:
    
    c103d76 [Xiangrui Meng] change ALS factors storage level to MEMORY_AND_DISK
    mengxr committed Aug 1, 2014
    Commit b190083
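
    The change amounts to persisting the factor RDDs with the new level; an illustrative sketch (the RDD here stands in for the ALS factors, which are not exposed like this):

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Keep the data in memory, but spill to disk rather than dropping partitions,
    // so later jobs don't restart the factor computation from scratch.
    val factors = sc.parallelize(Seq((1, Array(0.1, 0.2)), (2, Array(0.3, 0.4))))
    factors.persist(StorageLevel.MEMORY_AND_DISK)
    ```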
  7. [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation

    Before this patch, getRanks computed the wrong rank when numPartitions >= size of the input RDDs. Added unit tests to address this bug.
    
    Author: Doris Xin <doris.s.xin@gmail.com>
    
    Closes #1710 from dorx/correlationBug and squashes the following commits:
    
    733def4 [Doris Xin] bugs and reviewer comments.
    31db920 [Doris Xin] revert unnecessary change
    043ff83 [Doris Xin] bug fix for spearman corner case
    dorx authored and mengxr committed Aug 1, 2014
    Commit c475540
  8. [SPARK-2702][Core] Upgrade Tachyon dependency to 0.5.0

    Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
    
    Closes #1651 from haoyuan/upgrade-tachyon and squashes the following commits:
    
    6f3f98f [Haoyuan Li] upgrade tachyon to 0.5.0
    haoyuan authored and pwendell committed Aug 1, 2014
    Commit 2cdc3e5
  9. SPARK-2632, SPARK-2576. Fixed by only importing what is necessary during class definition.
    
    Without this patch, it imports everything available in the scope.
    
    ```scala
    
    scala> val a = 10l
    val a = 10l
    a: Long = 10
    
    scala> import a._
    import a._
    import a._
    
    scala> case class A(a: Int) // show
    case class A(a: Int) // show
    class $read extends Serializable {
      def <init>() = {
        super.<init>;
        ()
      };
      class $iwC extends Serializable {
        def <init>() = {
          super.<init>;
          ()
        };
        class $iwC extends Serializable {
          def <init>() = {
            super.<init>;
            ()
          };
          import org.apache.spark.SparkContext._;
          class $iwC extends Serializable {
            def <init>() = {
              super.<init>;
              ()
            };
            val $VAL5 = $line5.$read.INSTANCE;
            import $VAL5.$iw.$iw.$iw.$iw.a;
            class $iwC extends Serializable {
              def <init>() = {
                super.<init>;
                ()
              };
              import a._;
              class $iwC extends Serializable {
                def <init>() = {
                  super.<init>;
                  ()
                };
                class $iwC extends Serializable {
                  def <init>() = {
                    super.<init>;
                    ()
                  };
                  case class A extends scala.Product with scala.Serializable {
                    <caseaccessor> <paramaccessor> val a: Int = _;
                    def <init>(a: Int) = {
                      super.<init>;
                      ()
                    }
                  }
                };
                val $iw = new $iwC.<init>
              };
              val $iw = new $iwC.<init>
            };
            val $iw = new $iwC.<init>
          };
          val $iw = new $iwC.<init>
        };
        val $iw = new $iwC.<init>
      };
      val $iw = new $iwC.<init>
    }
    object $read extends scala.AnyRef {
      def <init>() = {
        super.<init>;
        ()
      };
      val INSTANCE = new $read.<init>
    }
    defined class A
    ```
    
    With this patch, it imports only what is necessary.
    
    ```scala
    
    scala> val a = 10l
    val a = 10l
    a: Long = 10
    
    scala> import a._
    import a._
    import a._
    
    scala> case class A(a: Int) // show
    case class A(a: Int) // show
    class $read extends Serializable {
      def <init>() = {
        super.<init>;
        ()
      };
      class $iwC extends Serializable {
        def <init>() = {
          super.<init>;
          ()
        };
        class $iwC extends Serializable {
          def <init>() = {
            super.<init>;
            ()
          };
          case class A extends scala.Product with scala.Serializable {
            <caseaccessor> <paramaccessor> val a: Int = _;
            def <init>(a: Int) = {
              super.<init>;
              ()
            }
          }
        };
        val $iw = new $iwC.<init>
      };
      val $iw = new $iwC.<init>
    }
    object $read extends scala.AnyRef {
      def <init>() = {
        super.<init>;
        ()
      };
      val INSTANCE = new $read.<init>
    }
    defined class A
    
    scala>
    
    ```
    
    This patch also adds a `:fallback` mode; when enabled, it restores the spark-shell's 1.0.0 behaviour.
    
    Author: Prashant Sharma <scrapcodes@gmail.com>
    Author: Yin Huai <huai@cse.ohio-state.edu>
    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes #1635 from ScrapCodes/repl-fix-necessary-imports and squashes the following commits:
    
    b1968d2 [Prashant Sharma] Added toschemaRDD to test case.
    0b712bb [Yin Huai] Add a REPL test to test importing a method.
    02ad8ff [Yin Huai] Add a REPL test for importing SQLContext.createSchemaRDD.
    ed6d0c7 [Prashant Sharma] Added a fallback mode, incase users run into issues while using repl.
    b63d3b2 [Prashant Sharma] SPARK-2632, SPARK-2576. Fixed by only importing what is necessary during class definition.
    ScrapCodes authored and marmbrus committed Aug 1, 2014
    Commit 1499101
  10. SPARK-2738. Remove redundant imports in BlockManagerSuite

    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes #1642 from sryza/sandy-spark-2738 and squashes the following commits:
    
    a923e4e [Sandy Ryza] SPARK-2738. Remove redundant imports in BlockManagerSuite
    sryza authored and pwendell committed Aug 1, 2014
    Commit cb9e7d5
  11. [SPARK-2670] FetchFailedException should be thrown when local fetch has failed
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    
    Closes #1578 from sarutak/SPARK-2670 and squashes the following commits:
    
    85c8938 [Kousuke Saruta] Removed useless results.put for fail fast
    e8713cc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2670
    d353984 [Kousuke Saruta] Refined assertion messages in BlockFetcherIteratorSuite.scala
    03bcb02 [Kousuke Saruta] Merge branch 'SPARK-2670' of github.com:sarutak/spark into SPARK-2670
    5d05855 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2670
    4fca130 [Kousuke Saruta] Added test cases for BasicBlockFetcherIterator
    b7b8250 [Kousuke Saruta] Modified BasicBlockFetchIterator to fail fast when local fetch error has been occurred
    a3a9be1 [Kousuke Saruta] Modified BlockFetcherIterator for SPARK-2670
    460dc01 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2670
    e310c0b [Kousuke Saruta] Modified BlockFetcherIterator to handle local fetch failure as fatch fail
    sarutak authored and mateiz committed Aug 1, 2014
    Commit 8ff4417
  12. SPARK-983. Support external sorting in sortByKey()

    This patch simply uses the ExternalSorter class from sort-based shuffle.
    
    Closes #931 and Closes #1090
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes #1677 from mateiz/spark-983 and squashes the following commits:
    
    96b3fda [Matei Zaharia] SPARK-983. Support external sorting in sortByKey()
    mateiz committed Aug 1, 2014
    Commit 72e3369
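
    From the caller's perspective nothing changes; a plain sortByKey can now spill via ExternalSorter when the data does not fit in memory (sketch with made-up data):

    ```scala
    import scala.util.Random
    import org.apache.spark.SparkContext._

    // Large-ish key/value data; with this patch the sort can spill to disk instead of OOMing.
    val pairs = sc.parallelize(1 to 1000000).map(i => (Random.nextInt(), i))
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 8)
    sorted.take(5)
    ```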
  13. SPARK-2134: Report metrics before application finishes

    Author: Rahul Singhal <rahul.singhal@guavus.com>
    
    Closes #1076 from rahulsinghaliitd/SPARK-2134 and squashes the following commits:
    
    15f18b6 [Rahul Singhal] SPARK-2134: Report metrics before application finishes
    Rahul Singhal authored and mateiz committed Aug 1, 2014
    Commit f1957e1
  14. [Spark 2557] fix LOCAL_N_REGEX in createTaskScheduler and make local-n and local-n-failures consistent
    
    [SPARK-2557](https://issues.apache.org/jira/browse/SPARK-2557)
    
    Author: Ye Xianjin <advancedxy@gmail.com>
    
    Closes #1464 from advancedxy/SPARK-2557 and squashes the following commits:
    
    d844d67 [Ye Xianjin] add local-*-n-failures, bad-local-n, bad-local-n-failures test case
    3bbc668 [Ye Xianjin] fix LOCAL_N_REGEX regular expression and make local_n_failures accept * as all cores on the computer
    advancedxy authored and aarondav committed Aug 1, 2014
    Commit 284771e
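
    For reference, the master-string forms involved (values illustrative); with this fix `*` is accepted in the failures form as well:

    ```scala
    import org.apache.spark.SparkConf

    new SparkConf().setMaster("local[4]")    // local-n: 4 worker threads
    new SparkConf().setMaster("local[*]")    // local-n with * = all cores on the machine
    new SparkConf().setMaster("local[4,2]")  // local-n-failures: 4 threads, up to 2 task failures
    new SparkConf().setMaster("local[*,2]")  // with this fix, * is also accepted here
    ```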
  15. [SPARK-2103][Streaming] Change to ClassTag for KafkaInputDStream and fix reflection issue
    
    This PR updates previous Manifest for KafkaInputDStream's Decoder to ClassTag, also fix the problem addressed in [SPARK-2103](https://issues.apache.org/jira/browse/SPARK-2103).
    
    The previous Java interface could not actually get the type of the Decoder, so using this Manifest to reconstruct the decoder object would hit a reflection exception.

    Also, for the other two Java interfaces, ClassTag[String] is unnecessary because calling the Scala API will pick up the right implicit ClassTag.

    The current Kafka unit test cannot actually verify the interface. I've tested these interfaces in my local and distributed settings.
    
    Author: jerryshao <saisai.shao@intel.com>
    
    Closes #1508 from jerryshao/SPARK-2103 and squashes the following commits:
    
    e90c37b [jerryshao] Add Mima excludes
    7529810 [jerryshao] Change Manifest to ClassTag for KafkaInputDStream's Decoder and fix Decoder construct issue when using Java API
    jerryshao authored and tdas committed Aug 1, 2014
    Commit a32f0fb
  16. SPARK-2768 [MLLIB] Add product, user recommend method to MatrixFactorizationModel
    
    Right now, `MatrixFactorizationModel` can only predict a score for one or more `(user,product)` tuples. As a comment in the file notes, it would be more useful to expose a recommend method, that computes top N scoring products for a user (or vice versa – users for a product).
    
    (This also corrects some long lines in the Java ALS test suite.)
    
    As you can see, it's a little messy to access the class from Java. Should there be a Java-friendly wrapper for it? with a pointer about where that should go, I could add that.
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1687 from srowen/SPARK-2768 and squashes the following commits:
    
    b349675 [Sean Owen] Additional review changes
    c9edb04 [Sean Owen] Updates from code review
    7bc35f9 [Sean Owen] Add recommend methods to MatrixFactorizationModel
    srowen authored and mengxr committed Aug 1, 2014
    Commit 82d209d
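
    A hedged usage sketch of the new methods (method names per the commit; training data and parameters are illustrative):

    ```scala
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Train a small factorization model, then ask for top-N recommendations.
    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0)))
    val model = ALS.train(ratings, 10, 10)
    val topProductsForUser1 = model.recommendProducts(1, 5)   // up to 5 products for user 1
    val topUsersForProduct10 = model.recommendUsers(10, 5)    // up to 5 users for product 10
    ```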
  17. [SPARK-1997] update breeze to version 0.8.1

    `breeze 0.8.1` depends on `scala-logging-slf4j 2.1.1`. The relevant code is in #1369.
    
    Author: witgo <witgo@qq.com>
    
    Closes #940 from witgo/breeze-8.0.1 and squashes the following commits:
    
    65cc65e [witgo] update breeze  to version 0.8.1
    witgo authored and mengxr committed Aug 1, 2014
    Commit 0dacb1a
  18. [HOTFIX] downgrade breeze version to 0.7

    breeze-0.8.1 causes dependency issues, as discussed in #940 .
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1718 from mengxr/revert-breeze and squashes the following commits:
    
    99c4681 [Xiangrui Meng] downgrade breeze version to 0.7
    mengxr committed Aug 1, 2014
    Commit 5328c0a
  19. SPARK-2099. Report progress while task is running.

    This is a sketch of a patch that allows the UI to show metrics for tasks that have not yet completed.  It adds a heartbeat every 2 seconds from the executors to the driver, reporting metrics for all of the executor's tasks.
    
    It still needs unit tests, polish, and cluster testing, but I wanted to put it up to get feedback on the approach.
    
    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes #1056 from sryza/sandy-spark-2099 and squashes the following commits:
    
    93b9fdb [Sandy Ryza] Up heartbeat interval to 10 seconds and other tidying
    132aec7 [Sandy Ryza] Heartbeat and HeartbeatResponse are already Serializable as case classes
    38dffde [Sandy Ryza] Additional review feedback and restore test that was removed in BlockManagerSuite
    51fa396 [Sandy Ryza] Remove hostname race, add better comments about threading, and some stylistic improvements
    3084f10 [Sandy Ryza] Make TaskUIData a case class again
    3bda974 [Sandy Ryza] Stylistic fixes
    0dae734 [Sandy Ryza] SPARK-2099. Report progress while task is running.
    sryza authored and pwendell committed Aug 1, 2014
    Commit 8d338f6
  20. [SPARK-2179][SQL] A minor refactoring of the Java data type APIs (2179 follow-up).
    
    It is a follow-up PR of SPARK-2179 (https://issues.apache.org/jira/browse/SPARK-2179). It makes package names of data type APIs more consistent across languages (Scala: `org.apache.spark.sql`, Java: `org.apache.spark.sql.api.java`, Python: `pyspark.sql`).
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1712 from yhuai/javaDataType and squashes the following commits:
    
    62eb705 [Yin Huai] Move package-info.
    add4bcb [Yin Huai] Make the package names of data type classes consistent across languages by moving all Java data type classes to package sql.api.java.
    yhuai authored and marmbrus committed Aug 1, 2014
    Commit c41fdf0
  21. [SQL][SPARK-2212]Hash Outer Join

    This patch adds support for hash-based outer join. Currently, outer joins of big relations resort to `BroadcastNestedLoopJoin`, which is super slow. This PR creates 2 hash tables for both relations in the same partition, which greatly reduces the table scans.
    
    Here is the testing code that I used:
    ```
    package org.apache.spark.sql.hive
    
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.sql._
    
    case class Record(key: String, value: String)
    
    object JoinTablePrepare extends App {
      import TestHive2._
    
      val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"${i % 828193}", s"val_$i")))
    
      runSqlHive("SHOW TABLES")
      runSqlHive("DROP TABLE if exists a")
      runSqlHive("DROP TABLE if exists b")
      runSqlHive("DROP TABLE if exists result")
      rdd.registerAsTable("records")
    
      runSqlHive("""CREATE TABLE a (key STRING, value STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
      runSqlHive("""CREATE TABLE b (key STRING, value STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
      runSqlHive("""CREATE TABLE result (key STRING, value STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
    
      hql(s"""from records
                 | insert into table a
                 | select key, value
               """.stripMargin)
      hql(s"""from records
                 | insert into table b select key + 100000, value
               """.stripMargin)
    }
    
    object JoinTablePerformanceTest extends App {
      import TestHive2._
    
      hql("SHOW TABLES")
      hql("set spark.sql.shuffle.partitions=20")
    
      val leftOuterJoin = "insert overwrite table result select a.key, b.value from a left outer join b on a.key=b.key"
      val rightOuterJoin = "insert overwrite table result select a.key, b.value from a right outer join b on a.key=b.key"
      val fullOuterJoin = "insert overwrite table result select a.key, b.value from a full outer join b on a.key=b.key"
    
      val results = ("LeftOuterJoin", benchmark(leftOuterJoin)) :: ("LeftOuterJoin", benchmark(leftOuterJoin)) ::
                    ("RightOuterJoin", benchmark(rightOuterJoin)) :: ("RightOuterJoin", benchmark(rightOuterJoin)) ::
                    ("FullOuterJoin", benchmark(fullOuterJoin)) :: ("FullOuterJoin", benchmark(fullOuterJoin)) :: Nil
      val explains = hql(s"explain $leftOuterJoin").collect ++ hql(s"explain $rightOuterJoin").collect ++ hql(s"explain $fullOuterJoin").collect
      println(explains.mkString(",\n"))
      results.foreach { case (prompt, result) => {
          println(s"$prompt: took ${result._1} ms (${result._2} records)")
        }
      }
    
      def benchmark(cmd: String) = {
        val begin = System.currentTimeMillis()
        val result = hql(cmd)
        val end = System.currentTimeMillis()
        val count = hql("select count(1) from result").collect.mkString("")
        ((end - begin), count)
      }
    }
    ```
    And the result as shown below:
    ```
    [Physical execution plan:],
    [InsertIntoHiveTable (MetastoreRelation default, result, None), Map(), true],
    [ Project [key#95,value#98]],
    [  HashOuterJoin [key#95], [key#97], LeftOuter, None],
    [   Exchange (HashPartitioning [key#95], 20)],
    [    HiveTableScan [key#95], (MetastoreRelation default, a, None), None],
    [   Exchange (HashPartitioning [key#97], 20)],
    [    HiveTableScan [key#97,value#98], (MetastoreRelation default, b, None), None],
    [Physical execution plan:],
    [InsertIntoHiveTable (MetastoreRelation default, result, None), Map(), true],
    [ Project [key#102,value#105]],
    [  HashOuterJoin [key#102], [key#104], RightOuter, None],
    [   Exchange (HashPartitioning [key#102], 20)],
    [    HiveTableScan [key#102], (MetastoreRelation default, a, None), None],
    [   Exchange (HashPartitioning [key#104], 20)],
    [    HiveTableScan [key#104,value#105], (MetastoreRelation default, b, None), None],
    [Physical execution plan:],
    [InsertIntoHiveTable (MetastoreRelation default, result, None), Map(), true],
    [ Project [key#109,value#112]],
    [  HashOuterJoin [key#109], [key#111], FullOuter, None],
    [   Exchange (HashPartitioning [key#109], 20)],
    [    HiveTableScan [key#109], (MetastoreRelation default, a, None), None],
    [   Exchange (HashPartitioning [key#111], 20)],
    [    HiveTableScan [key#111,value#112], (MetastoreRelation default, b, None), None]
    LeftOuterJoin: took 16072 ms ([3000000] records)
    LeftOuterJoin: took 14394 ms ([3000000] records)
    RightOuterJoin: took 14802 ms ([3000000] records)
    RightOuterJoin: took 14747 ms ([3000000] records)
    FullOuterJoin: took 17715 ms ([6000000] records)
    FullOuterJoin: took 17629 ms ([6000000] records)
    ```
    
    Without this PR, the benchmark seems to run forever.
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #1147 from chenghao-intel/hash_based_outer_join and squashes the following commits:
    
    65c599e [Cheng Hao] Fix issues with the community comments
    72b1394 [Cheng Hao] Fix bug of stale value in joinedRow
    55baef7 [Cheng Hao] Add HashOuterJoin
    chenghao-intel authored and marmbrus committed Aug 1, 2014
    Full SHA: 4415722
  22. [SPARK-2729] [SQL] Forgot to match Timestamp type in ColumnBuilder

    Just a forgotten match case, found after SPARK-2710: TimestampType can be used by a SchemaRDD generated from a JDBC ResultSet.
    
    Author: chutium <teng.qiu@gmail.com>
    
    Closes #1636 from chutium/SPARK-2729 and squashes the following commits:
    
    71af77a [chutium] [SPARK-2729] [SQL] added Timestamp in NullableColumnAccessorSuite
    39cf9f8 [chutium] [SPARK-2729] add Timestamp Type into ColumnBuilder TestSuite, ref. #1636
    ab6ff97 [chutium] [SPARK-2729] Forgot to match Timestamp type in ColumnBuilder
    chutium authored and marmbrus committed Aug 1, 2014
    Full SHA: 580c701
  23. [SPARK-2767] [SQL] SparkSQL CLI doesn't output error message if query…

    … failed.
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #1686 from chenghao-intel/spark_sql_cli and squashes the following commits:
    
    eb664cc [Cheng Hao] Output detailed failure message in console
    93b0382 [Cheng Hao] Fix Bug of no output in cli if exception thrown internally
    chenghao-intel authored and marmbrus committed Aug 1, 2014
    Full SHA: c0b47ba
  24. [SQL] Documentation: Explain cacheTable command

    add the `cacheTable` specification
    
    Author: CrazyJvm <crazyjvm@gmail.com>
    
    Closes #1681 from CrazyJvm/sql-programming-guide-cache and squashes the following commits:
    
    0a231e0 [CrazyJvm] grammar fixes
    a04020e [CrazyJvm] modify title to Cached tables
    18b6594 [CrazyJvm] fix format
    2cbbf58 [CrazyJvm] add cacheTable guide
    CrazyJvm authored and pwendell committed Aug 1, 2014
    Full SHA: c82fe47
  25. [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited …

    …partitions.
    
    getPreferredLocs traverses a dependency graph of partitions using depth first search.  Given a complex dependency graph, the old implementation may explore a set of paths in the graph that is exponential in the number of nodes.  By maintaining a set of visited nodes the new implementation avoids revisiting nodes, preventing exponential blowup.
    
    Some comment and whitespace cleanups are also included.
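    As a rough sketch of the visited-set idea described above (illustrative only, using toy types rather than the actual DAGScheduler/RDD classes), memoizing visited nodes means each node in the DAG is expanded at most once:

    ```scala
    import scala.collection.mutable

    // Toy graph node; the real code walks RDD partitions and their dependencies.
    case class PartitionNode(id: Int, parents: Seq[PartitionNode])

    def reachable(start: PartitionNode): Set[Int] = {
      val visited = mutable.HashSet[Int]()
      def visit(node: PartitionNode): Unit = {
        // HashSet.add returns false when the node was already seen, so it is skipped.
        if (visited.add(node.id)) {
          node.parents.foreach(visit)
        }
      }
      visit(start)
      visited.toSet
    }
    ```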
    
    Author: Aaron Staple <aaron.staple@gmail.com>
    
    Closes #1362 from staple/SPARK-695 and squashes the following commits:
    
    ecea0f3 [Aaron Staple] address review comments
    751c661 [Aaron Staple] [SPARK-695] Add a unit test.
    5adf326 [Aaron Staple] Replace getPreferredLocsInternal's HashMap argument with a simpler HashSet.
    58e37d0 [Aaron Staple] Replace comment documenting NarrowDependency.
    6751ced [Aaron Staple] Revert "Remove unused variable."
    04c7097 [Aaron Staple] Fix indentation.
    0030884 [Aaron Staple] Remove unused variable.
    33f67c6 [Aaron Staple] Clarify comment.
    4e42b46 [Aaron Staple] Remove apparently incorrect comment describing NarrowDependency.
    65c2d3d [Aaron Staple] [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions.
    staple authored and mateiz committed Aug 1, 2014
    Full SHA: eb5bdca
  26. [SPARK-2490] Change recursive visiting on RDD dependencies to iterati…

    …ve approach
    
    When performing some transformations on RDDs over many iterations, the dependency chains of the RDDs can become very long. This can easily cause a StackOverflowError when these dependencies are visited recursively in Spark core. For example:
    
        var rdd = sc.makeRDD(Array(1))
        for (i <- 1 to 1000) {
          rdd = rdd.coalesce(1).cache()
          rdd.collect()
        }
    
    This PR changes the recursive visiting of an RDD's dependencies to an iterative approach to avoid StackOverflowError.
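    A minimal sketch of the iterative pattern (illustrative only, with toy types; the real change walks an RDD's dependencies): use an explicit stack instead of the JVM call stack, so a deep lineage cannot overflow it.

    ```scala
    import scala.collection.mutable

    // Toy lineage node; a long chain of these would blow the call stack if visited recursively.
    case class Lineage(id: Int, parents: Seq[Lineage])

    def visitIteratively(root: Lineage)(handle: Lineage => Unit): Unit = {
      val waiting = mutable.Stack(root)
      val seen = mutable.HashSet[Int]()
      while (waiting.nonEmpty) {
        val node = waiting.pop()
        if (seen.add(node.id)) {
          handle(node)
          node.parents.foreach(waiting.push)   // defer parents instead of recursing into them
        }
      }
    }
    ```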
    
    Beyond the recursive visiting, the Java serializer has a known [bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that also causes StackOverflowError when serializing/deserializing a large graph of objects, so this PR only solves part of the problem. Using KryoSerializer instead of the Java serializer might help. However, since KryoSerializer is not currently supported for `spark.closure.serializer`, I cannot test whether KryoSerializer solves the Java serializer's problem completely.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #1418 from viirya/remove_recursive_visit and squashes the following commits:
    
    6b2c615 [Liang-Chi Hsieh] change function name; comply with code style.
    5f072a7 [Liang-Chi Hsieh] add comments to explain Stack usage.
    8742dbb [Liang-Chi Hsieh] comply with code style.
    900538b [Liang-Chi Hsieh] change recursive visiting on rdd's dependencies to iterative approach to avoid stackoverflowerror.
    viirya authored and mateiz committed Aug 1, 2014
    Full SHA: baf9ce1
  27. SPARK-1612: Fix potential resource leaks

    JIRA: https://issues.apache.org/jira/browse/SPARK-1612
    
    Move the "close" statements into a "finally" block.
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes #535 from zsxwing/SPARK-1612 and squashes the following commits:
    
    ae52f50 [zsxwing] Update to follow the code style
    549ba13 [zsxwing] SPARK-1612: Fix potential resource leaks
    zsxwing authored and mateiz committed Aug 1, 2014
    Full SHA: f5d9bea
  28. [SPARK-2379] Fix the bug that streaming's receiver may fall into a de…

    …ad loop
    
    Author: joyyoj <sunshch@gmail.com>
    
    Closes #1694 from joyyoj/SPARK-2379 and squashes the following commits:
    
    d73790d [joyyoj] SPARK-2379 Fix the bug that streaming's receiver may fall into a dead loop
    22e7821 [joyyoj] Merge remote-tracking branch 'apache/master'
    3f4a602 [joyyoj] Merge remote-tracking branch 'remotes/apache/master'
    f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly
    joyyoj authored and tdas committed Aug 1, 2014
    Full SHA: b270309
  29. SPARK-2791: Fix committing, reverting and state tracking in shuffle f…

    …ile consolidation
    
    All changes from this PR are by mridulm and are drawn from his work in #1609. This patch is intended to fix all major issues related to shuffle file consolidation that mridulm found, while minimizing changes to the code, with the hope that it may be more easily merged into 1.1.
    
    This patch is **not** intended as a replacement for #1609, which provides many additional benefits, including fixes to ExternalAppendOnlyMap, improvements to DiskBlockObjectWriter's API, and several new unit tests.
    
    If it is feasible to merge #1609 for the 1.1 deadline, that is a preferable option.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes #1678 from aarondav/consol and squashes the following commits:
    
    53b3f6d [Aaron Davidson] Correct behavior when writing unopened file
    701d045 [Aaron Davidson] Rebase with sort-based shuffle
    9160149 [Aaron Davidson] SPARK-2532: Minimal shuffle consolidation fixes
    aarondav authored and mateiz committed Aug 1, 2014
    Full SHA: 78f2af5
  30. [SPARK-2786][mllib] Python correlations

    Author: Doris Xin <doris.s.xin@gmail.com>
    
    Closes #1713 from dorx/pythonCorrelation and squashes the following commits:
    
    5f1e60c [Doris Xin] reviewer comments.
    46ff6eb [Doris Xin] reviewer comments.
    ad44085 [Doris Xin] style fix
    e69d446 [Doris Xin] fixed missed conflicts.
    eb5bf56 [Doris Xin] merge master
    cc9f725 [Doris Xin] units passed.
    9141a63 [Doris Xin] WIP2
    d199f1f [Doris Xin] Moved correlation names into a public object
    cd163d6 [Doris Xin] WIP
    dorx authored and mengxr committed Aug 1, 2014
    Full SHA: d88e695
  31. [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features

    Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    
    Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"
    
    Bug fix: Modified upper bound discussed above.
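    A schematic of the corrected bound (illustrative only, not the actual DecisionTree code): the sequential search over candidate bins for an ordered categorical feature should stop at the feature's arity, not at math.pow(2, arity - 1) - 1, which is the bound for unordered features.

    ```scala
    // binCategories(i) holds the category assigned to bin i for this ordered categorical feature.
    def sequentialBinSearch(featureValue: Double, binCategories: Array[Double], featureArity: Int): Int = {
      var binIndex = 0
      // Correct upper bound: featureArity (the number of categories of the feature).
      while (binIndex < featureArity && binCategories(binIndex) != featureValue) {
        binIndex += 1
      }
      if (binIndex == featureArity) -1 else binIndex   // -1 signals "no matching bin"
    }
    ```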
    
    Also: Small improvements to coding style in DecisionTree.
    
    CC mengxr manishamde
    
    Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
    
    Closes #1720 from jkbradley/decisiontree-bugfix2 and squashes the following commits:
    
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    jkbradley authored and mengxr committed Aug 1, 2014
    Full SHA: 7058a53

Commits on Aug 2, 2014

  1. [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD

    Convert each Row in a JavaSchemaRDD into an Array[Any] and unpickle them as tuples in Python, then convert them into namedtuples, so users can access fields just like attributes.
    
    This lets nested structures be accessed as objects; it also reduces the size of the serialized data and improves performance.
    
    root
     |-- field1: integer (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: struct (nullable = true)
     |    |-- field4: integer (nullable = true)
     |    |-- field5: array (nullable = true)
     |    |    |-- element: integer (containsNull = false)
     |-- field6: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- field7: string (nullable = true)
    
    Then we can access them by row.field3.field5[0]  or row.field6[5].field7
    
    It also infers the schema in Python, converts Row/dict/namedtuple/objects into tuples before serialization, then calls applySchema in the JVM. During inferSchema(), a dict at the top level of a row becomes a StructType, but any nested dictionary becomes a MapType.
    
    You can use pyspark.sql.Row to convert an unnamed structure into a Row object, so that the RDD's schema can be inferred. For example:
    
    ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))
    
    Or you could use Row to create a class just like namedtuple, for example:
    
    Person = Row("name", "age")
    ctx.inferSchema(rdd.map(lambda x: Person(*x)))
    
    Also, you can call applySchema to apply a schema to an RDD of tuples/lists and turn it into a SchemaRDD. The `schema` should be a StructType; see the API docs for details.
    
    schema = StructType([StructField("name, StringType, True),
                                        StructType("age", IntegerType, True)])
    ctx.applySchema(rdd, schema)
    
    PS: In order to use namedtuple to inferSchema, you should make namedtuple picklable.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1598 from davies/nested and squashes the following commits:
    
    f1d15b6 [Davies Liu] verify schema with the first few rows
    8852aaf [Davies Liu] check type of schema
    abe9e6e [Davies Liu] address comments
    61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
    1e5b801 [Davies Liu] improve cache of classes
    51aa135 [Davies Liu] use Row to infer schema
    e9c0d5c [Davies Liu] remove string typed schema
    353a3f2 [Davies Liu] fix code style
    63de8f8 [Davies Liu] fix typo
    c79ca67 [Davies Liu] fix serialization of nested data
    6b258b5 [Davies Liu] fix pep8
    9d8447c [Davies Liu] apply schema provided by string of names
    f5df97f [Davies Liu] refactor, address comments
    9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
    84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
    0eaaf56 [Davies Liu] fix doc tests
    b3559b4 [Davies Liu] use generated Row instead of namedtuple
    c4ddc30 [Davies Liu] fix conflict between name of fields and variables
    7f6f251 [Davies Liu] address all comments
    d69d397 [Davies Liu] refactor
    2cc2d45 [Davies Liu] refactor
    182fb46 [Davies Liu] refactor
    bc6e9e1 [Davies Liu] switch to new Schema API
    547bf3e [Davies Liu] Merge branch 'master' into nested
    a435b5a [Davies Liu] add docs and code refactor
    2c8debc [Davies Liu] Merge branch 'master' into nested
    644665a [Davies Liu] use tuple and namedtuple for schemardd
    davies authored and marmbrus committed Aug 2, 2014
    Full SHA: 880eabe
  2. [SPARK-2212][SQL] Hash Outer Join (follow-up bug fix).

    We need to carefully set the outputPartitioning of the HashOuterJoin operator. Otherwise, we may not handle nulls correctly.
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1721 from yhuai/SPARK-2212-BugFix and squashes the following commits:
    
    ed5eef7 [Yin Huai] Correctly choosing outputPartitioning for the HashOuterJoin operator.
    yhuai authored and marmbrus committed Aug 2, 2014
    Full SHA: 3822f33
  3. [SPARK-2116] Load spark-defaults.conf from SPARK_CONF_DIR if set

    If SPARK_CONF_DIR environment variable is set, search it for spark-defaults.conf.
    
    Author: Albert Chu <chu11@llnl.gov>
    
    Closes #1059 from chu11/SPARK-2116 and squashes the following commits:
    
    9f3ac94 [Albert Chu] SPARK-2116: If SPARK_CONF_DIR environment variable is set, search it for spark-defaults.conf.
    chu11 authored and mateiz committed Aug 2, 2014
    Full SHA: 0da07da
  4. [SPARK-2800]: Exclude scalastyle-output.xml Apache RAT checks

    Author: GuoQiang Li <witgo@qq.com>
    
    Closes #1729 from witgo/SPARK-2800 and squashes the following commits:
    
    13ca966 [GuoQiang Li] Add scalastyle-output.xml  to .rat-excludes file
    witgo authored and pwendell committed Aug 2, 2014
    Full SHA: a38d3c9
  5. [SPARK-2764] Simplify daemon.py process structure

    Currently, daemon.py forks a pool of numProcessors subprocesses, and those processes fork themselves again to create the actual Python worker processes that handle data.
    
    I think that this extra layer of indirection is unnecessary and adds a lot of complexity.  This commit attempts to remove this middle layer of subprocesses by launching the workers directly from daemon.py.
    
    See mesos/spark#563 for the original PR that added daemon.py, where I raise some issues with the current design.
    
    Author: Josh Rosen <joshrosen@apache.org>
    
    Closes #1680 from JoshRosen/pyspark-daemon and squashes the following commits:
    
    5abbcb9 [Josh Rosen] Replace magic number: 4 -> EINTR
    5495dff [Josh Rosen] Throw IllegalStateException if worker launch fails.
    b79254d [Josh Rosen] Detect failed fork() calls; improve error logging.
    282c2c4 [Josh Rosen] Remove daemon.py exit logging, since it caused problems:
    8554536 [Josh Rosen] Fix daemon’s shutdown(); log shutdown reason.
    4e0fab8 [Josh Rosen] Remove shared-memory exit_flag; don't die on worker death.
    e9892b4 [Josh Rosen] [WIP] [SPARK-2764] Simplify daemon.py process structure.
    JoshRosen authored and aarondav committed Aug 2, 2014
    Full SHA: e8e0fd6
  6. Streaming mllib [SPARK-2438][MLLIB]

    This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.
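    As a quick taste of the API summarized below (a minimal sketch; it assumes an existing StreamingContext `ssc`, a DStream[LabeledPoint] named `trainingStream`, and an example feature count of 3):

    ```scala
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))   // weights must be initialized up front
      .setStepSize(0.1)
      .setNumIterations(50)

    model.trainOn(trainingStream)   // the model is updated as each batch of data arrives

    ssc.start()
    ssc.awaitTermination()
    ```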
    
    __Summary of additions:__
    
    _StreamingLinearAlgorithm_
    - An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.
    
    _StreamingLinearRegressionWithSGD_
    - Class and companion object for running streaming linear regression
    
    _StreamingLinearRegressionTestSuite_
    - Unit tests
    
    _StreamingLinearRegression_
    - Example use case: fitting a model online to data from one stream, and making predictions on other data
    
    __Notes__
    - If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).
    
    Author: Jeremy Freeman <the.freeman.lab@gmail.com>
    Author: freeman <the.freeman.lab@gmail.com>
    
    Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:
    
    775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
    4086fee [Jeremy Freeman] Fixed current weight formatting
    8b95b27 [Jeremy Freeman] Restored broadcasting
    29f27ec [Jeremy Freeman] Formatting
    8711c41 [Jeremy Freeman] Used return to avoid indentation
    777b596 [Jeremy Freeman] Restored treeAggregate
    74cf440 [Jeremy Freeman] Removed static methods
    d28cf9a [Jeremy Freeman] Added usage notes
    c3326e7 [Jeremy Freeman] Improved documentation
    9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
    66eba5e [Jeremy Freeman] Fixed line lengths
    2fe0720 [Jeremy Freeman] Minor cleanup
    7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
    b9b69f6 [Jeremy Freeman] Added setter methods
    c3f8b5a [Jeremy Freeman] Modified logging
    00aafdc [Jeremy Freeman] Add modifiers
    14b801e [Jeremy Freeman] Name changes
    c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
    4b0a5d3 [Jeremy Freeman] Cleaned up tests
    74188d6 [Jeremy Freeman] Eliminate dependency on commons
    50dd237 [Jeremy Freeman] Removed experimental tag
    6bfe1e6 [Jeremy Freeman] Fixed imports
    a2a63ad [freeman] Makes convergence test more robust
    86220bc [freeman] Streaming linear regression unit tests
    fb4683a [freeman] Minor changes for scalastyle consistency
    fd31e03 [freeman] Changed logging behavior
    453974e [freeman] Fixed indentation
    c4b1143 [freeman] Streaming linear regression
    604f4d7 [freeman] Expanded private class to include mllib
    d99aa85 [freeman] Helper methods for streaming MLlib apps
    0898add [freeman] Added dependency on streaming
    freeman-lab authored and mengxr committed Aug 2, 2014
    Full SHA: f6a1899
  7. [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercep…

    …t in pyspark's linear methods.
    
    Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).
    
    Author: Michael Giannakopoulos <miccagiann@gmail.com>
    
    Closes #1624 from miccagiann/new-branch and squashes the following commits:
    
    c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
    8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
    fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
    44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
    8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
    638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
    ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
    b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
    78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
    3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
    miccagiann authored and mengxr committed Aug 2, 2014
    Full SHA: c281189
  8. [SPARK-1580][MLLIB] Estimate ALS communication and computation costs.

    Continue the work from #493.
    
    Closes #493 and Closes #593
    
    Author: Tor Myklebust <tmyklebu@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1731 from mengxr/tmyklebu-alscost and squashes the following commits:
    
    9b56a8b [Xiangrui Meng] updated API and added a simple test
    68a3229 [Xiangrui Meng] merge master
    217bd1d [Tor Myklebust] Documentation and choleskies -> subproblems.
    8cbb718 [Tor Myklebust] Braces get spaces.
    0455cd4 [Tor Myklebust] Parens for collectAsMap.
    2b2febe [Tor Myklebust] Use `makeLinkRDDs` when estimating costs.
    2ab7a5d [Tor Myklebust] Reindent estimateCost's declaration and make it return Seqs.
    8b21e6d [Tor Myklebust] Fix overlong lines.
    8cbebf1 [Tor Myklebust] Rename and clean up the return format of cost estimator.
    6615ed5 [Tor Myklebust] It's more useful to give per-partition estimates.  Do that.
    5530678 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into alscost
    6c31324 [Tor Myklebust] Make it actually build...
    a1184d1 [Tor Myklebust] Mark ALS.evaluatePartitioner DeveloperApi.
    657a71b [Tor Myklebust] Simple-minded estimates of computation and communication costs in ALS.
    dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
    23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
    495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
    674933a [Tor Myklebust] Fix style.
    40edc23 [Tor Myklebust] Fix missing space.
    f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
    5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
    36a0f43 [Tor Myklebust] Make the partitioner private.
    d872b09 [Tor Myklebust] Add negative id ALS test.
    df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
    c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
    c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
    tmyklebu authored and mengxr committed Aug 2, 2014
    Full SHA: e25ec06
  9. [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGener…

    …ator. RandomRDD is now of generic type
    
    The RandomRDDGenerators used to only output RDD[Double].
    Now RandomRDDGenerators.randomRDD can be used to generate a random RDD[T] via a class that extends RandomDataGenerator, by supplying a type T and overriding the nextValue() function.
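    A custom generator might look like the sketch below (the `RandomLabelGenerator` class is hypothetical; it assumes the trait requires `nextValue()`, `setSeed()` and `copy()`):

    ```scala
    import scala.util.Random
    import org.apache.spark.mllib.random.RandomDataGenerator

    // Hypothetical generator producing random class labels instead of Doubles.
    class RandomLabelGenerator extends RandomDataGenerator[String] {
      private val rng = new Random()
      private val labels = Array("spam", "ham")

      override def nextValue(): String = labels(rng.nextInt(labels.length))
      override def setSeed(seed: Long): Unit = rng.setSeed(seed)
      override def copy(): RandomLabelGenerator = new RandomLabelGenerator
    }

    // Then, assuming a SparkContext `sc`, something along the lines of:
    // val labelRDD = RandomRDDGenerators.randomRDD(sc, new RandomLabelGenerator, 1000000L)
    ```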
    
    Author: Burak <brkyvz@gmail.com>
    
    Closes #1732 from brkyvz/SPARK-2801 and squashes the following commits:
    
    c94a694 [Burak] [SPARK-2801][MLlib] Missing ClassTags added
    22d96fe [Burak] [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator, generic types added for RandomRDD instead of Double
    brkyvz authored and mengxr committed Aug 2, 2014
    Full SHA: fda4759
  10. StatCounter on NumPy arrays [PYSPARK][SPARK-2012]

    These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here  (https://issues.apache.org/jira/browse/SPARK-2012).
    
    If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so StatCounter works on arrays when NumPy is available but still works without it.
    
    New unit tests added, along with a check for NumPy in the tests.
    
    Author: Jeremy Freeman <the.freeman.lab@gmail.com>
    
    Closes #1725 from freeman-lab/numpy-max-statcounter and squashes the following commits:
    
    fe973b1 [Jeremy Freeman] Avoid duplicate array import in tests
    7f0e397 [Jeremy Freeman] Refactored check for numpy
    8e764dd [Jeremy Freeman] Explicit numpy imports
    875414c [Jeremy Freeman] Fixed indents
    1c8a832 [Jeremy Freeman] Unit tests for StatCounter with NumPy arrays
    176a127 [Jeremy Freeman] Use numpy arrays in StatCounter
    freeman-lab authored and JoshRosen committed Aug 2, 2014
    Full SHA: 4bc3bb2
  11. [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the…

    … directly sfl4j api
    
    Author: GuoQiang Li <witgo@qq.com>
    
    Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:
    
    66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
    73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
    witgo authored and pwendell committed Aug 2, 2014
    Full SHA: adc8303
  12. Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instea…

    …d of the directly sfl4j api"
    
    This reverts commit adc8303.
    pwendell committed Aug 2, 2014
    Full SHA: dab3796
  13. [SPARK-2316] Avoid O(blocks) operations in listeners

    The existing code in `StorageUtils` is not the most efficient. Every time we want to update an `RDDInfo` we end up iterating through all blocks on all block managers just to discard most of them. The symptoms manifest themselves in the bountiful UI bugs observed in the wild. Many of these bugs are caused by the slow consumption of events in `LiveListenerBus`, which frequently leads to the event queue overflowing and `SparkListenerEvent`s being dropped on the floor. The changes made in this PR avoid this by first filtering out only the blocks relevant to us before computing storage information from them.
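    The core idea, sketched with toy types (illustrative only, not the actual StorageUtils code): restrict the scan to the blocks of the RDD being updated before aggregating, rather than walking every block on every block manager.

    ```scala
    // Toy block record: which RDD (if any) it belongs to, plus its sizes.
    case class BlockRecord(rddId: Option[Int], memSize: Long, diskSize: Long)

    def sizesForRdd(blocks: Seq[BlockRecord], rddId: Int): (Long, Long) = {
      // Filter down to the relevant RDD first; only then aggregate.
      val relevant = blocks.filter(_.rddId.exists(_ == rddId))
      (relevant.map(_.memSize).sum, relevant.map(_.diskSize).sum)
    }
    ```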
    
    It's worth a mention that this corner of the Spark code is also not very well-tested at all. The bulk of the changes in this PR (more than 60%) is actually test cases for the various logic in `StorageUtils.scala` as well as `StorageTab.scala`. These will eventually be extended to cover the various listeners that constitute the `SparkUI`.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1679 from andrewor14/fix-drop-events and squashes the following commits:
    
    f80c1fa [Andrew Or] Rewrite fold and reduceOption as sum
    e132d69 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
    14fa1c3 [Andrew Or] Simplify some code + update a few comments
    a91be46 [Andrew Or] Make ExecutorsPage blazingly fast
    bf6f09b [Andrew Or] Minor changes
    8981de1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
    af19bc0 [Andrew Or] *UsedByRDD -> *UsedByRdd (minor)
    6970bc8 [Andrew Or] Add extensive tests for StorageListener and the new code in StorageUtils
    e080b9e [Andrew Or] Reduce run time of StorageUtils.updateRddInfo to near constant
    2c3ef6a [Andrew Or] Actually filter out only the relevant RDDs
    6fef86a [Andrew Or] Add extensive tests for new code in StorageStatus
    b66b6b0 [Andrew Or] Use more efficient underlying data structures for blocks
    6a7b7c0 [Andrew Or] Avoid chained operations on TraversableLike
    a9ec384 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
    b12fcd7 [Andrew Or] Fix tests + simplify sc.getRDDStorageInfo
    da8e322 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
    8e91921 [Andrew Or] Iterate through a filtered set of blocks when updating RDDInfo
    7b2c4aa [Andrew Or] Rewrite blockLocationsFromStorageStatus + clean up method signatures
    41fa50d [Andrew Or] Add a legacy constructor for StorageStatus
    53af15d [Andrew Or] Refactor StorageStatus + add a bunch of tests
    andrewor14 authored and pwendell committed Aug 2, 2014
    Full SHA: d934801
  14. [SPARK-2454] Do not ship spark home to Workers

    When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.
    
    The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.
    
    This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:
    
    f71f391 [Andrew Or] Revert changes in python
    1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
    188fc5d [Andrew Or] Avoid using spark.home where possible
    09272b7 [Andrew Or] Always use Worker's working directory as spark home
    andrewor14 authored and pwendell committed Aug 2, 2014
    Full SHA: 148af60
  15. [SPARK-1812] sql/catalyst - Provide explicit type information

    For Scala 2.11 compatibility.
    
    Without an explicit type specification, the return type of withNullability
    is inferred to be Attribute, and thus calling at() on the returned object
    fails in these tests:
    
    [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:370: value at is not a
    [ERROR]     val c4_notNull = 'a.boolean.notNull.at(3)
    [ERROR]                                         ^
    [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:371: value at is not a
    [ERROR]     val c5_notNull = 'a.boolean.notNull.at(4)
    [ERROR]                                         ^
    [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:372: value at is not a
    [ERROR]     val c6_notNull = 'a.boolean.notNull.at(5)
    [ERROR]                                         ^
    [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:558: value at is not a
    [ERROR]     val s_notNull = 'a.string.notNull.at(0)
    
    Signed-off-by: Anand Avati <avati@redhat.com>
    
    Author: Anand Avati <avati@redhat.com>
    
    Closes #1709 from avati/SPARK-1812-notnull and squashes the following commits:
    
    0470eb3 [Anand Avati] SPARK-1812: sql/catalyst - Provide explicit type information
    avati authored and marmbrus committed Aug 2, 2014
    Full SHA: 08c095b
  16. HOTFIX: Fixing test error in maven for flume-sink.

    We needed to add an explicit dependency on scalatest since this
    module will not get it from spark core like others do.
    pwendell committed Aug 2, 2014
    Full SHA: 25cad6a
  17. HOTFIX: Fix concurrency issue in FlumePollingStreamSuite.

    This has been failing on master. One possible cause is that the port
    gets contended if multiple test runs happen concurrently and they
    hit this test at the same time. Since this test takes a long time
    (60 seconds), that's very plausible. This patch randomizes the port
    used in this test to avoid contention.
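    The gist, as a sketch (illustrative only; the exact range is not taken from the patch): pick the test port at random from a high range so concurrent runs are unlikely to collide.

    ```scala
    import scala.util.Random

    // Choose a port from the dynamic/private range (49152-65535) instead of a fixed one.
    val testPort = 49152 + Random.nextInt(65535 - 49152)
    ```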
    pwendell committed Aug 2, 2014
    Full SHA: 44460ba
  18. MAINTENANCE: Automated closing of pull requests.

    This commit exists to close the following pull requests on Github:
    
    Closes #706 (close requested by 'pwendell')
    Closes #453 (close requested by 'pwendell')
    Closes #557 (close requested by 'tdas')
    Closes #495 (close requested by 'tdas')
    Closes #1232 (close requested by 'pwendell')
    Closes #82 (close requested by 'pwendell')
    Closes #600 (close requested by 'pwendell')
    Closes #473 (close requested by 'pwendell')
    Closes #351 (close requested by 'pwendell')
    pwendell committed Aug 2, 2014
    Full SHA: 87738bf
  19. [HOTFIX] Do not throw NPE if spark.test.home is not set

    `spark.test.home` was introduced in #1734. This is fine for SBT but fails Maven tests. Either way it shouldn't throw an NPE.
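    Conceptually, the guard looks like this (a sketch, not the actual patch; the fallback values are hypothetical):

    ```scala
    // System.getProperty returns null when the property is missing; wrap it in Option
    // so downstream code never dereferences null.
    val sparkTestHome: String =
      Option(System.getProperty("spark.test.home"))
        .orElse(sys.env.get("SPARK_HOME"))
        .getOrElse(".")
    ```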
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1739 from andrewor14/fix-spark-test-home and squashes the following commits:
    
    ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set
    andrewor14 authored and pwendell committed Aug 2, 2014
    Full SHA: e09e18b
  20. [SPARK-2478] [mllib] DecisionTree Python API

    Added experimental Python API for Decision Trees.
    
    API:
    * class DecisionTreeModel
    ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
    ** numNodes()
    ** depth()
    ** __str__()
    * class DecisionTree
    ** trainClassifier()
    ** trainRegressor()
    ** train()
    
    Examples and testing:
    * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
    * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
    
    Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
    
    CC mengxr manishamde
    
    Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
    
    Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
    
    3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
    6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
    67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
    aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
    fa10ea7 [Joseph K. Bradley] Small style update
    7968692 [Joseph K. Bradley] small braces typo fix
    e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
    db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
    6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    93953f1 [Joseph K. Bradley] Likely done with Python API.
    6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
    188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
    2b20c61 [Joseph K. Bradley] Small doc and style updates
    1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
    8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
    376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
    e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
    52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
    8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
    cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
    8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    2283df8 [Joseph K. Bradley] 2 bug fixes.
    73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
    f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
    jkbradley authored and mengxr committed Aug 2, 2014
    Full SHA: 3f67382
  21. [SQL] Set outputPartitioning of BroadcastHashJoin correctly.

    I think we will not generate a plan that triggers this bug at the moment, but let me explain it...
    
    Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
    ```sql
    SELECT l.key, count(*)
    FROM (SELECT key, count(*) as cnt
          FROM src
          GROUP BY key) l // This is buildPlan
    JOIN r // This is the streamedPlan
    ON (l.cnt = r.value)
    GROUP BY l.key
    ```
    Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning`of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s outputPartitioning may not match the required distribution of the last `GROUP BY` and we fail to group data correctly.
    
    JIRA is being reindexed. I will create a JIRA ticket once it is back online.
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:
    
    96d9cb3 [Yin Huai] Set outputPartitioning correctly.
    yhuai authored and marmbrus committed Aug 2, 2014
    Full SHA: 67bd8e3
  22. [SPARK-1981] Add AWS Kinesis streaming support

    Author: Chris Fregly <chris@fregly.com>
    
    Closes #1434 from cfregly/master and squashes the following commits:
    
    4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
    0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
    691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
    0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
    e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
    d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
    912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
    db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
    338997e [Chris Fregly] improve build docs for kinesis
    828f8ae [Chris Fregly] more cleanup
    e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    cd68c0d [Chris Fregly] fixed typos and backward compatibility
    d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
    cfregly authored and tdas committed Aug 2, 2014
    Full SHA: 91f9504
  23. SPARK-2804: Remove scalalogging-slf4j dependency

    This also Closes #1701.
    
    Author: GuoQiang Li <witgo@qq.com>
    
    Closes #1208 from witgo/SPARK-1470 and squashes the following commits:
    
    422646b [GuoQiang Li] Remove scalalogging-slf4j dependency
    witgo authored and pwendell committed Aug 2, 2014
    Full SHA: 4c47711
  24. [SPARK-2097][SQL] UDF Support

    This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.
    
    Scala:
    ```scala
    registerFunction("strLenScala", (_: String).length)
    sql("SELECT strLenScala('test')")
    ```
    Python:
    ```python
    sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
    sqlCtx.sql("SELECT strLenPython('test')")
    ```
    Java:
    ```java
    sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
      @Override
      public Integer call(String str) throws Exception {
        return str.length();
      }
    }, DataType.IntegerType);
    
    sqlContext.sql("SELECT stringLengthJava('test')");
    ```
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1063 from marmbrus/udfs and squashes the following commits:
    
    9eda0fe [Michael Armbrust] newline
    747c05e [Michael Armbrust] Add some scala UDF tests.
    d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
    005d684 [Michael Armbrust] Fix naming and formatting.
    d14dac8 [Michael Armbrust] Fix last line of autogened java files.
    8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
    40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
    6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
    7a83101 [Michael Armbrust] Drop toString
    795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
    e54fb45 [Michael Armbrust] Docs and tests.
    437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
    01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
    8e6c932 [Michael Armbrust] WIP
    3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
    6237c8d [Michael Armbrust] WIP
    2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
    0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
    marmbrus committed Aug 2, 2014
    Full SHA: 158ad0b
  25. [SPARK-2785][SQL] Remove assertions that throw when users try unsuppo…

    …rted Hive commands.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1742 from marmbrus/asserts and squashes the following commits:
    
    5182d54 [Michael Armbrust] Remove assertions that throw when users try unsupported Hive commands.
    marmbrus committed Aug 2, 2014
    Full SHA: 198df11

Commits on Aug 3, 2014

  1. [SPARK-2729][SQL] Added test case for SPARK-2729

    This is a follow up of #1636.
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes #1738 from liancheng/test-for-spark-2729 and squashes the following commits:
    
    b13692a [Cheng Lian] Added test case for SPARK-2729
    liancheng authored and marmbrus committed Aug 3, 2014
    Full SHA: 866cf1f
  2. [SPARK-2797] [SQL] SchemaRDDs don't support unpersist()

    The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
    
    7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called without the input parameter (blocking) from PySpark.
    yhuai authored and marmbrus committed Aug 3, 2014
    Full SHA: d210022
  3. [SPARK-2739][SQL] Rename registerAsTable to registerTempTable

    There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.
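    For example (a sketch assuming an existing SQLContext `sqlContext` and a SchemaRDD `people` with name and age columns):

    ```scala
    // New, clearer name: the table only lives for the lifetime of this SQLContext.
    people.registerTempTable("people")

    // The old name still compiles, but now emits a deprecation warning.
    // people.registerAsTable("people")

    sqlContext.sql("SELECT name FROM people WHERE age >= 13")
    ```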
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
    
    d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
    4dff086 [Michael Armbrust] Fix .java files too
    89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
    0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
    marmbrus committed Aug 3, 2014
    Full SHA: 1a80437
  4. SPARK-2602 [BUILD] Tests steal focus under Java 6

    As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be resolved for Java 6 with the java.awt.headless system property, which never hurt anyone running a command line app. I tested it and seemed to get rid of focus stealing.
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1747 from srowen/SPARK-2602 and squashes the following commits:
    
    b141018 [Sean Owen] Set java.awt.headless during tests
    srowen authored and pwendell committed Aug 3, 2014
    Full SHA: 33f167d
  5. SPARK-2414 [BUILD] Add LICENSE entry for jquery

    The JIRA concerned removing jquery, and this does not remove jquery. While it is distributed by Spark it should have an accompanying line in LICENSE, very technically, as per http://www.apache.org/dev/licensing-howto.html
    
    Author: Sean Owen <srowen@gmail.com>
    
    Closes #1748 from srowen/SPARK-2414 and squashes the following commits:
    
    2fdb03c [Sean Owen] Add LICENSE entry for jquery
    srowen authored and pwendell committed Aug 3, 2014
    Full SHA: 9cf429a
  6. [Minor] Fixes on top of #1679

    Minor fixes on top of #1679.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:
    
    3b46f5e [Andrew Or] Minor fixes
    andrewor14 authored and pwendell committed Aug 3, 2014
    Full SHA: 3dc55fd
  7. SPARK-2712 - Add a small note to maven doc that mvn package must happ…

    …en ...
    
    Per request by Reynold, adding a small note about the proper sequencing of build then test.
    
    Author: Stephen Boesch <javadba@gmail.com>
    
    Closes #1615 from javadba/docs and squashes the following commits:
    
    6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
    5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test
    javadba authored and pwendell committed Aug 3, 2014
    Full SHA: f8cd143
  8. SPARK-2246: Add user-data option to EC2 scripts

    Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>
    
    Closes #1186 from douglaz/spark_ec2_user_data and squashes the following commits:
    
    94a36f9 [Allan Douglas R. de Oliveira] Added user data option to EC2 script
    Allan Douglas R. de Oliveira authored and pwendell committed Aug 3, 2014
    Full SHA: a0bcbc1
  9. [SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use

    Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
    * Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.
    
    Other improvements to Decision Trees for ease of use with Java:
    * impurity classes: Added instance() methods to help with Java interface.
    * Strategy: Added Java-friendly constructor
    --> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
    
    CC: mengxr
    
    Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
    
    Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
    
    0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
    519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
    f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
    d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
    320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
    13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
    f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    jkbradley authored and mengxr committed Aug 3, 2014
    Full SHA: 2998e38
  10. [SPARK-2784][SQL] Deprecate hql() method in favor of a config option,…

    … 'spark.sql.dialect'
    
    Many users have reported being confused by the distinction between the `sql` and `hql` methods. Specifically, many users think that `sql(...)` cannot be used to read Hive tables. In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing. For SQLContext this must be set to `sql`. In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
    
    The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
    
    **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
    
    For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
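    
    A minimal sketch of switching dialects through the new option (assuming an existing `HiveContext` named `hiveContext`; the table name below is hypothetical):
    
    ```scala
    // Hedged sketch: hiveContext is an existing HiveContext; `src` is a hypothetical Hive table.
    hiveContext.setConf("spark.sql.dialect", "sql")
    hiveContext.sql("SELECT 1")                    // parsed by the plain SQL parser
    
    hiveContext.setConf("spark.sql.dialect", "hiveql")
    hiveContext.sql("SELECT key, value FROM src")  // parsed by the HiveQL parser again
    ```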
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
    
    ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
    20c43f8 [Michael Armbrust] override function instead of just setting the value
    7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
    marmbrus committed Aug 3, 2014
    Commit: 236dfac
  11. [SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native …

    …commands
    
    JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes #1753 from liancheng/spark-2814 and squashes the following commits:
    
    c74a3b2 [Cheng Lian] Fixed SPARK-2814
    liancheng authored and marmbrus committed Aug 3, 2014
    Commit: ac33cbb
  12. [SPARK-2783][SQL] Basic support for analyze in HiveContext

    JIRA: https://issues.apache.org/jira/browse/SPARK-2783
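    
    A minimal usage sketch, assuming the new method is exposed as `HiveContext.analyze(tableName)` and that a Hive table named `src` already exists:
    
    ```scala
    // Hedged sketch: hiveContext is an existing HiveContext; `src` is a hypothetical table.
    hiveContext.analyze("src")  // computes the table size and updates "totalSize" in the metastore
    ```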
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1741 from yhuai/analyzeTable and squashes the following commits:
    
    7bb5f02 [Yin Huai] Use sql instead of hql.
    4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    e3ebcd4 [Yin Huai] Renaming.
    c170f4e [Yin Huai] Do not use getContentSummary.
    62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    db233a6 [Yin Huai] Trying to debug jenkins...
    fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    f0501f3 [Yin Huai] Fix compilation error.
    24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    8918140 [Yin Huai] Wording.
    23df227 [Yin Huai] Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.
    yhuai authored and marmbrus committed Aug 3, 2014
    Commit: e139e2b
  13. [SPARK-1740] [PySpark] kill the python worker

    Kill only the python worker related to cancelled tasks.
    
    The daemon starts a background thread to monitor the open sockets of all workers. If a socket is closed by the JVM, this thread kills the corresponding worker.
    
    When a task is cancelled, the socket to its worker is closed, and the worker is then killed by the daemon.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1643 from davies/kill and squashes the following commits:
    
    8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
    46ca150 [Davies Liu] address comment
    acd751c [Davies Liu] kill the worker when task is canceled
    davies authored and JoshRosen committed Aug 3, 2014
    Commit: 55349f9

Commits on Aug 4, 2014

  1. [SPARK-2810] upgrade to scala-maven-plugin 3.2.0

    Needed for Scala 2.11 compiler-interface
    
    Signed-off-by: Anand Avati <avati@redhat.com>
    
    Author: Anand Avati <avati@redhat.com>
    
    Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the following commits:
    
    9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0
    avati authored and pwendell committed Aug 4, 2014
    Commit: 6ba6c3e
  2. Fix some bugs with spaces in directory name.

    Any time you use the directory name (`FWDIR`) it needs to be surrounded
    in quotes. If you're also using wildcards, you can safely put the quotes
    around just `$FWDIR`.
    
    Author: Sarah Gerweck <sarah.a180@gmail.com>
    
    Closes #1756 from sarahgerweck/folderSpaces and squashes the following commits:
    
    732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
    sarahgerweck authored and pwendell committed Aug 4, 2014
    Commit: 5507dd8
  3. SPARK-2272 [MLlib] Feature scaling which standardizes the range of in…

    …dependent variables or features of data
    
    Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.
    
    In this work, a trait called `VectorTransformer` is defined for generic transformations on vectors. It contains one method to be implemented, `transform`, which applies a transformation to a vector.
    
    There are currently two implementations of `VectorTransformer`, and both can easily be extended with PMML transformation support; a short usage sketch follows the list below.
    
    1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
    
    2) `Normalizer` - Normalizes samples individually to unit L^n norm
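    
    A short usage sketch of both transformers (class names as described above; the input data is made up):
    
    ```scala
    import org.apache.spark.mllib.feature.{Normalizer, StandardScaler}
    import org.apache.spark.mllib.linalg.Vectors
    
    // Hedged sketch: `sc` is an existing SparkContext; the vectors are illustrative.
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))
    
    // StandardScaler is fit on the training set, then applied to each vector.
    val scalerModel = new StandardScaler(withMean = true, withStd = true).fit(data)
    val standardized = data.map(v => scalerModel.transform(v))
    
    // Normalizer needs no fitting; it rescales each sample individually to unit norm.
    val normalizer = new Normalizer()
    val normalized = data.map(v => normalizer.transform(v))
    ```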
    
    Author: DB Tsai <dbtsai@alpinenow.com>
    
    Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
    
    78c15d3 [DB Tsai] Alpine Data Labs
    DB Tsai authored and mengxr committed Aug 4, 2014
    Commit: ae58aea
  4. [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words

    This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
    
    To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
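    
    A minimal usage sketch (assuming the `Word2Vec` / `findSynonyms` API added by this PR, with the default constructor from the follow-up fix; the input path is hypothetical):
    
    ```scala
    import org.apache.spark.mllib.feature.Word2Vec
    
    // Hedged sketch: `sc` is an existing SparkContext; the corpus location is made up.
    val corpus = sc.textFile("hdfs:///data/text8").map(_.split(" ").toSeq)
    
    val model = new Word2Vec().fit(corpus)          // train on an RDD of token sequences
    val synonyms = model.findSynonyms("china", 20)  // the 20 closest words, as (word, similarity)
    
    synonyms.foreach { case (word, similarity) => println(s"$word $similarity") }
    ```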
    
    One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:
    
    taiwan 0.8077646146334014
    korea 0.740913304563621
    japan 0.7240667798885471
    republic 0.7107151279078352
    thailand 0.6953217332072862
    tibet 0.6916782118129544
    mongolia 0.6800858715972612
    macau 0.6794925677480378
    singapore 0.6594048695593799
    manchuria 0.658989931844148
    laos 0.6512978726001666
    nepal 0.6380792327845325
    mainland 0.6365469459587788
    myanmar 0.6358614338840394
    macedonia 0.6322366180313249
    xinjiang 0.6285291551708028
    russia 0.6279951236068411
    india 0.6272874944023487
    shanghai 0.6234544135576999
    macao 0.6220588462925876
    
    The result with 10 partitions and 5 iterations is:
    taiwan 0.8310495079388313
    india 0.7737171315919039
    japan 0.756777901233668
    korea 0.7429767187102452
    indonesia 0.7407557427278356
    pakistan 0.712883426985585
    mainland 0.7053379963140822
    thailand 0.696298191073948
    mongolia 0.693690656871415
    laos 0.6913069680735292
    macau 0.6903427690029617
    republic 0.6766381604813666
    malaysia 0.676460699141784
    singapore 0.6728790997360923
    malaya 0.672345232966194
    manchuria 0.6703732292753156
    macedonia 0.6637955686322028
    myanmar 0.6589462882439646
    kazakhstan 0.657017801081494
    cambodia 0.6542383836451932
    
    Author: Liquan Pei <lpei@gopivotal.com>
    Author: Xiangrui Meng <meng@databricks.com>
    Author: Liquan Pei <liquanpei@gmail.com>
    
    Closes #1719 from Ishiihara/master and squashes the following commits:
    
    2ba9483 [Liquan Pei] minor fix for Word2Vec test
    e248441 [Liquan Pei] minor style change
    26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
    c14da41 [Xiangrui Meng] fix styles
    384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
    e93e726 [Liquan Pei] use treeAggregate instead of aggregate
    1a8fb41 [Liquan Pei] use weighted sum in combOp
    7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
    6bcc8be [Liquan Pei] add multiple iteration support
    720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
    2e92b59 [Liquan Pei] modify according to feedback
    57dc50d [Liquan Pei] code formatting
    e4a04d3 [Liquan Pei] minor fix
    0aafb1b [Liquan Pei] Add comments, minor fixes
    8d6befe [Liquan Pei] initial commit
    Liquan Pei authored and mengxr committed Aug 4, 2014
    Commit: e053c55
  5. [SPARK-1687] [PySpark] pickable namedtuple

    Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs.
    
    PS: pyspark should be imported BEFORE "from collections import namedtuple"
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1623 from davies/namedtuple and squashes the following commits:
    
    045dad8 [Davies Liu] remove unrelated code changes
    4132f32 [Davies Liu] address comment
    55b1c1a [Davies Liu] fix tests
    61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
    98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
    f7b1bde [Davies Liu] add hack for CloudPickleSerializer
    0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
    21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
    93b03b8 [Davies Liu] pickable namedtuple
    davies authored and JoshRosen committed Aug 4, 2014
    Commit: 59f84a9
  6. SPARK-2792. Fix reading too much or too little data from each stream …

    …in ExternalMap / Sorter
    
    All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to the cleanup logic to make sure files are closed.
    
    In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes #1722 from mateiz/spark-2792 and squashes the following commits:
    
    5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
    18fe865 [Matei Zaharia] Update docs on objectStreamReset
    576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
    0374217 [Matei Zaharia] Remove super paranoid code to close file handles
    bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
    0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
    9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
    mateiz committed Aug 4, 2014
    Commit: 8e7d5ba
  7. [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple

    The serializer is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1771 from davies/fix and squashes the following commits:
    
    1a9e336 [Davies Liu] fix unit tests
    davies authored and JoshRosen committed Aug 4, 2014
    Commit: 9fd82db

Commits on Aug 5, 2014

  1. [SPARK-2323] Exception in accumulator update should not crash DAGSche…

    …duler & SparkContext
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1772 from rxin/accumulator-dagscheduler and squashes the following commits:
    
    6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext.
    rxin committed Aug 5, 2014
    Commit: 05bf4e4
  2. SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove()

    Replaces this with an O(1) operation that does not have to shift over
    the whole tail of the array into the gap produced by the element removed.
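    
    A generic sketch of the constant-time removal idiom being described (illustrative helper, not the actual ExternalAppendOnlyMap code):
    
    ```scala
    import scala.collection.mutable.ArrayBuffer
    
    // Instead of buf.remove(i), which shifts every later element one slot to the left,
    // overwrite slot i with the last element and drop the last slot.
    // Note: this does not preserve element order, which is fine for an unordered buffer.
    def removeConstantTime[T](buf: ArrayBuffer[T], i: Int): Unit = {
      buf(i) = buf(buf.length - 1)
      buf.trimEnd(1)  // O(1): no shifting of the tail
    }
    ```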
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
    
    1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse ArrayBuffers
    eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid buffer.remove()
    mateiz committed Aug 5, 2014
    Commit: 066765d
  3. SPARK-2711. Create a ShuffleMemoryManager to track memory for all spi…

    …lling collections
    
    This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before), and also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).
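    
    A simplified, single-shot sketch of the 1 / 2N policy described above (the real manager also waits for memory to be released and tracks frees; all names here are illustrative):
    
    ```scala
    import scala.collection.mutable
    
    class ToyShuffleMemoryManager(maxMemory: Long) {
      private val perThread = mutable.HashMap.empty[Long, Long]
    
      /** Returns the number of bytes granted; 0 tells the caller to spill. */
      def tryToAcquire(numBytes: Long): Long = synchronized {
        val tid = Thread.currentThread().getId
        val curMem = perThread.getOrElse(tid, 0L)
        perThread(tid) = curMem                 // make sure this thread counts as active
        val n = perThread.size                  // number of active threads
        val cap = maxMemory / n                 // each thread may hold at most 1/N of the pool
        val floor = maxMemory / (2L * n)        // ...and is not asked to spill below 1/(2N)
        val grant = math.max(0L, math.min(numBytes, cap - curMem))
        if (grant == 0L && curMem >= floor) {
          0L                                    // already at its fair share: caller should spill
        } else {
          perThread(tid) = curMem + grant
          grant
        }
      }
    }
    ```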
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes #1707 from mateiz/spark-2711 and squashes the following commits:
    
    debf75b [Matei Zaharia] Review comments
    24f28f3 [Matei Zaharia] Small rename
    c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
    315e3a5 [Matei Zaharia] Some review comments
    b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections
    mateiz committed Aug 5, 2014
    Commit: 4fde28c
  4. [SPARK-2857] Correct properties to set Master / Worker ports

    `master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1779 from andrewor14/master-worker-port and squashes the following commits:
    
    8475e95 [Andrew Or] Update docs to reflect changes in configs
    4db3d5d [Andrew Or] Stop using configs that don't actually work
    andrewor14 authored and pwendell committed Aug 5, 2014
    Commit: a646a36
  5. [SPARK-1779] Throw an exception if memory fractions are not between 0…

    … and 1
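    
    A generic sketch of the kind of check being added (illustrative helper; the real validation lives in SparkConf):
    
    ```scala
    // Illustrative only: reject fraction-style settings outside [0, 1].
    def checkFraction(name: String, value: Double): Unit = {
      require(value >= 0.0 && value <= 1.0,
        s"$name should be between 0 and 1 (was '$value').")
    }
    
    checkFraction("spark.storage.memoryFraction", 0.6)    // ok
    // checkFraction("spark.shuffle.memoryFraction", 1.5)  // would throw IllegalArgumentException
    ```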
    
    Author: wangfei <scnbwf@yeah.net>
    Author: wangfei <wangfei1@huawei.com>
    
    Closes #714 from scwf/memoryFraction and squashes the following commits:
    
    6e385b9 [wangfei] Update SparkConf.scala
    da6ee59 [wangfei] add configs
    829a195 [wangfei] add indent
    717c0ca [wangfei] updated to make more concise
    fc45476 [wangfei] validate memoryfraction in sparkconf
    2e79b3d [wangfei] && => ||
    43621bd [wangfei] && => ||
    cf38bcf [wangfei] throw IllegalArgumentException
    14d18ac [wangfei] throw IllegalArgumentException
    dff1f0f [wangfei] Update BlockManager.scala
    764965f [wangfei] Update ExternalAppendOnlyMap.scala
    a59d76b [wangfei] Throw exception when memoryFracton is out of range
    7b899c2 [wangfei] 【SPARK-1779
    wangfei authored and pwendell committed Aug 5, 2014
    Commit: 9862c61
  6. [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.

    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1780 from rxin/kryo-init-size and squashes the following commits:
    
    551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
    rxin committed Aug 5, 2014
    Commit: 184048f
  7. [SPARK-1022][Streaming] Add Kafka real unit test

    This PR is an updated version of #557 that actually tests sending and receiving data through Kafka, and fixes the previous flaky issues.
    
    @tdas, would you mind reviewing this PR? Thanks a lot.
    
    Author: jerryshao <saisai.shao@intel.com>
    
    Closes #1751 from jerryshao/kafka-unit-test and squashes the following commits:
    
    b6a505f [jerryshao] code refactor according to comments
    5222330 [jerryshao] Change JavaKafkaStreamSuite to better test it
    5525f10 [jerryshao] Fix flaky issue of Kafka real unit test
    4559310 [jerryshao] Minor changes for Kafka unit test
    860f649 [jerryshao] Minor style changes, and tests ignored due to flakiness
    796d4ca [jerryshao] Add real Kafka streaming test
    jerryshao authored and tdas committed Aug 5, 2014
    Commit: e87075d
  8. SPARK-1528 - spark on yarn, add support for accessing remote HDFS

    Add a config (spark.yarn.access.namenodes) to allow applications running on YARN to access other secure HDFS clusters. The user just specifies the namenodes of the other clusters; we obtain tokens for them and ship them with the Spark application.
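    
    A hedged example of setting the new option (the namenode addresses below are hypothetical):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // The config key comes from this change; the hostnames are made up.
    val conf = new SparkConf()
      .set("spark.yarn.access.namenodes",
           "hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020")
    ```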
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes #1159 from tgravescs/spark-1528 and squashes the following commits:
    
    ddbcd16 [Thomas Graves] review comments
    0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS
    tgravescs committed Aug 5, 2014
    Commit: 2c0f705
  9. SPARK-1890 and SPARK-1891- add admin and modify acls

    It was easier to combine these two JIRAs since they touch many of the same places. This PR adds the following:
    
    - adds modify acls
    - adds admin acls (list of admins/users that get added to both view and modify acls)
    - modifies the Kill button on the UI to take modify acls into account
    - changes the config name from spark.ui.acls.enable to spark.acls.enable, since I chose poorly with the original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web UI as well as any CLI interfaces (see the configuration sketch after this list).
    - sends view and modify acls information on to YARN so that YARN interfaces can use it (the YARN CLI for killing applications, for example).
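    
    A hedged configuration sketch (`spark.acls.enable` is named above; the admin/modify key names and the user names below are assumptions based on this description):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Assumed key names for the new acls; the user names are made up.
    val conf = new SparkConf()
      .set("spark.acls.enable", "true")       // replaces spark.ui.acls.enable (still honored)
      .set("spark.admin.acls", "alice")       // admins are added to both view and modify acls
      .set("spark.modify.acls", "bob,carol")  // users allowed to, e.g., kill the application
    ```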
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:
    
    8292eb1 [Thomas Graves] review comments
    b92ec89 [Thomas Graves] remove unneeded variable from applistener
    4c765f4 [Thomas Graves] Add in admin acls
    72eb0ac [Thomas Graves] Add modify acls
    tgravescs committed Aug 5, 2014
    Commit: 1c5555a
  10. [SPARK-2860][SQL] Fix coercion of CASE WHEN.

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1785 from marmbrus/caseNull and squashes the following commits:
    
    126006d [Michael Armbrust] better error message
    2fe357f [Michael Armbrust] Fix coercion of CASE WHEN.
    marmbrus committed Aug 5, 2014
    Commit: 6e821e3
  11. [SPARK-2859] Update url of Kryo project in related docs

    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-2859
    
    Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md.
    
    Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
    
    Closes #1782 from gchen/kryo-docs and squashes the following commits:
    
    b62543c [Guancheng (G.C.) Chen] update url of Kryo project
    gchen authored and pwendell committed Aug 5, 2014
    Commit: ac3440f
  12. SPARK-2380: Support displaying accumulator values in the web UI

    This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that can be displayed in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets updated as tasks finish.
    
    Currently in Spark, developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDDs. This provides a potentially nicer alternative: going through the existing accumulator framework (actually `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demos how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful to e.g. decide to short-circuit a Spark stage depending on how things are going.
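    
    A minimal sketch of the feature (assuming the two-argument `sc.accumulator(initialValue, name)` overload added here; the input path is hypothetical):
    
    ```scala
    // Hedged sketch: `sc` is an existing SparkContext; the path is made up.
    val bytesRead = sc.accumulator(0L, "Bytes Read")  // the name is what the web UI displays
    
    sc.textFile("hdfs:///logs/2014-08-05")
      .map { line => bytesRead += line.length.toLong; line }
      .count()
    
    // The driver can also read the running total programmatically,
    // e.g. to short-circuit a later stage.
    println(s"Bytes read so far: ${bytesRead.value}")
    ```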
    
    ![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png)
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes #1309 from pwendell/metrics and squashes the following commits:
    
    8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD
    93fbe0f [Patrick Wendell] Other minor fixes
    cc43f68 [Patrick Wendell] Updating unit tests
    c991b1b [Patrick Wendell] Moving some code into the Accumulators class
    9a9ba3c [Patrick Wendell] More merge fixes
    c5ace9e [Patrick Wendell] More merge conflicts
    1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics
    9860c55 [Patrick Wendell] Potential solution to posting listener events
    0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined
    0ec4ac7 [Patrick Wendell] Java API's
    e95bf69 [Patrick Wendell] Stash
    be97261 [Patrick Wendell] Style fix
    8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class
    64d405f [Patrick Wendell] Adding missing file
    5d8b156 [Patrick Wendell] Changes based on Kay's review.
    9f18bad [Patrick Wendell] Minor style changes and tests
    7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback
    ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics.
    0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
    pwendell committed Aug 5, 2014
    Commit: 74f82c7
  13. SPARK-1680: use configs for specifying environment variables on YARN

    Note that this also documents spark.executorEnv.*, which to me means it's public. If we don't want that, please speak up.
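    
    A hedged example of the config-based approach (the variable names and values below are made up):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // spark.executorEnv.[Name] sets environment variable [Name] for executors launched on YARN.
    val conf = new SparkConf()
      .set("spark.executorEnv.LD_LIBRARY_PATH", "/opt/native/lib")  // hypothetical value
      .set("spark.executorEnv.MY_APP_MODE", "batch")                // hypothetical variable
    ```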
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:
    
    11525df [Thomas Graves] more doc changes
    553bad0 [Thomas Graves] fix documentation
    152bf7c [Thomas Graves] fix docs
    5382326 [Thomas Graves] try fix docs
    32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN
    tgravescs committed Aug 5, 2014
    Commit: 41e0a21
  14. [SPARK-2864][MLLIB] fix random seed in word2vec; move model to local

    It also moves the model to local in order to map `RDD[String]` to `RDD[Vector]`.
    
    Ishiihara
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1790 from mengxr/word2vec-fix and squashes the following commits:
    
    a87146c [Xiangrui Meng] add setters and make a default constructor
    e5c923b [Xiangrui Meng] fix random seed in word2vec; move model to local
    mengxr committed Aug 5, 2014
    Commit: cc491f6
  15. [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.k…

    …b) to 32KB.
    
    This can substantially reduce memory usage during shuffle.
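    
    If a larger buffer works better for a particular workload, the option can still be overridden (a hedged sketch; the value shown is just an example):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Override the new 32KB default, e.g. with a 100KB buffer per shuffle output stream.
    val conf = new SparkConf().set("spark.shuffle.file.buffer.kb", "100")
    ```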
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:
    
    104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
    rxin committed Aug 5, 2014
    Commit: acff9a7
  16. [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercep…

    …t in pyspark's linear methods
    
    Related to Jira Issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC)
    
    Author: Michael Giannakopoulos <miccagiann@gmail.com>
    
    Closes #1775 from miccagiann/linearMethodsReg and squashes the following commits:
    
    cb774c3 [Michael Giannakopoulos] MiniBatchFraction added in related PythonMLLibAPI java stubs.
    81fcbc6 [Michael Giannakopoulos] Fixing a typo-error.
    8ad263e [Michael Giannakopoulos] Adding regularizer type and intercept parameters to LogisticRegressionWithSGD and SVMWithSGD.
    miccagiann authored and mengxr committed Aug 5, 2014
    Commit: 1aad911

Commits on Aug 6, 2014

  1. SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection

    I inquired on the dev mailing list about the motivation for checking the JDBC statement instead of the connection in the close() logic of JdbcRDD. Ted Yu believes there essentially is none - it is a simple cut-and-paste issue. So here is the tiny fix to patch it.
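    
    A generic sketch of the corrected pattern (illustrative helper, not the actual JdbcRDD code): guard each resource's close() with a check on that same resource.
    
    ```scala
    import java.sql.{Connection, Statement}
    
    // Illustrative only: the original bug closed the connection behind a check on the *statement*.
    def closeQuietly(stmt: Statement, conn: Connection): Unit = {
      try { if (stmt != null && !stmt.isClosed) stmt.close() }
      catch { case e: Exception => /* log and ignore */ }
      try { if (conn != null && !conn.isClosed) conn.close() }
      catch { case e: Exception => /* log and ignore */ }
    }
    ```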
    
    Author: Stephen Boesch <javadba>
    Author: Stephen Boesch <javadba@gmail.com>
    
    Closes #1792 from javadba/closejdbc and squashes the following commits:
    
    363be4f [Stephen Boesch] SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection (reformat with braces)
    6518d36 [Stephen Boesch] SPARK-2689 Fix tiny bug in JdbcRdd for closing jdbc connection
    3fb23ed [Stephen Boesch] SPARK-2689 Fix potential leak of connection/PreparedStatement in case of error in JdbcRDD
    095b2c9 [Stephen Boesch] Fix tiny bug (likely copy and paste error) in closing jdbc connection
    Stephen Boesch authored and rxin committed Aug 6, 2014
    Commit: 2643e66
  2. [sql] rename project name in pom.xml of hive-thriftserver module

    The modules spark-hive-thriftserver_2.10 and spark-hive_2.10 are both named "Spark Project Hive" in pom.xml, so rename the spark-hive-thriftserver_2.10 project to "Spark Project Hive Thrift Server".
    
    Author: wangfei <wangfei1@huawei.com>
    
    Closes #1789 from scwf/patch-1 and squashes the following commits:
    
    ca1f5e9 [wangfei] [sql] rename module name of hive-thriftserver
    scwf authored and marmbrus committed Aug 6, 2014
    Commit: d94f599
  3. [SPARK-2650][SQL] Try to partially fix SPARK-2650 by adjusting initia…

    …l buffer size and reducing memory allocation
    
    JIRA issue: [SPARK-2650](https://issues.apache.org/jira/browse/SPARK-2650)
    
    Please refer to [comments](https://issues.apache.org/jira/browse/SPARK-2650?focusedCommentId=14084397&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14084397) of SPARK-2650 for some other details.
    
    This PR adjusts the initial in-memory columnar buffer size to 1MB, same as the default value of Shark's `shark.column.partitionSize.mb` property when running in local mode. Will add Shark style partition size estimation in another PR.
    
    Also, before this PR, `NullableColumnBuilder` copies the whole buffer to add the null positions section, and then `CompressibleColumnBuilder` copies and compresses the buffer again, even if compression is disabled (`PassThrough` compression scheme is used to disable compression). In this PR the first buffer copy is eliminated to reduce memory consumption.
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes #1769 from liancheng/spark-2650 and squashes the following commits:
    
    88a042e [Cheng Lian] Fixed method visibility and removed dead code
    001f2e5 [Cheng Lian] Try fixing SPARK-2650 by adjusting initial buffer size and reducing memory allocation
    liancheng authored and marmbrus committed Aug 6, 2014
    Commit: d0ae3f3
  4. [SPARK-2854][SQL] Finalize _acceptable_types in pyspark.sql

    This PR aims to finalize accepted data value types in Python RDDs provided to Python `applySchema`.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2854
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1793 from yhuai/SPARK-2854 and squashes the following commits:
    
    32f0708 [Yin Huai] LongType only accepts long values.
    c2b23dd [Yin Huai] Do data type conversions based on the specified Spark SQL data type.
    yhuai authored and marmbrus committed Aug 6, 2014
    Commit: 69ec678
  5. [SPARK-2866][SQL] Support attributes in ORDER BY that aren't in SELECT

    Minor refactoring to allow resolution using either a node's input or its output.
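    
    An example of the newly supported shape (a hedged sketch; `people` is a hypothetical registered table and `sqlContext` an existing SQLContext):
    
    ```scala
    // `age` appears only in ORDER BY, not in the SELECT list.
    val sorted = sqlContext.sql("SELECT name FROM people ORDER BY age")
    ```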
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1795 from marmbrus/ordering and squashes the following commits:
    
    237f580 [Michael Armbrust] style
    74d833b [Michael Armbrust] newline
    705d963 [Michael Armbrust] Add a rule for resolving ORDER BY expressions that reference attributes not present in the SELECT clause.
    82cabda [Michael Armbrust] Generalize attribute resolution.
    marmbrus committed Aug 6, 2014
    Commit: 1d70c4f
  6. [SPARK-2806] core - upgrade to json4s-jackson 3.2.10

    Scala 2.11 packages are not available for the current version (3.2.6).
    
    Signed-off-by: Anand Avati <avati@redhat.com>
    
    Author: Anand Avati <avati@redhat.com>
    
    Closes #1702 from avati/SPARK-1812-json4s-jackson-3.2.10 and squashes the following commits:
    
    7be8324 [Anand Avati] SPARK-1812: core - upgrade to json4s 3.2.10
    avati authored and pwendell committed Aug 6, 2014
    Commit: 82624e2
  7. [SQL] Tighten the visibility of various SQLConf methods and renamed s…

    …etter/getters
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes #1794 from rxin/sql-conf and squashes the following commits:
    
    3ac11ef [Reynold Xin] getAllConfs return an immutable Map instead of an Array.
    4b19d6c [Reynold Xin] Tighten the visibility of various SQLConf methods and renamed setter/getters.
    rxin authored and marmbrus committed Aug 6, 2014
    Commit: b70bae4
  8. [SQL] Fix logging warn -> debug

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #1800 from marmbrus/warning and squashes the following commits:
    
    8ea9cf1 [Michael Armbrust] [SQL] Fix logging warn -> debug.
    marmbrus committed Aug 6, 2014
    Commit: 5a826c0
  9. SPARK-2294: fix locality inversion bug in TaskManager

    copied from original JIRA (https://issues.apache.org/jira/browse/SPARK-2294):
    
    If an executor E is free, a task may be speculatively assigned to E when there are other tasks in the job that have not been launched (at all) yet. Similarly, a task without any locality preferences may be assigned to E when there was another NODE_LOCAL task that could have been scheduled.
    This happens because TaskSchedulerImpl calls TaskSetManager.resourceOffer (which in turn calls TaskSetManager.findTask) with increasing locality levels, beginning with PROCESS_LOCAL, followed by NODE_LOCAL, and so on until the highest currently allowed level. Now, suppose NODE_LOCAL is the highest currently allowed locality level. The first time findTask is called, it will be called with max level PROCESS_LOCAL; if it cannot find any PROCESS_LOCAL tasks, it will try to schedule tasks with no locality preferences or speculative tasks. As a result, speculative tasks or tasks with no preferences may be scheduled instead of NODE_LOCAL tasks.
    
    ----
    
    I added an additional parameter, maxLocality, to resourceOffer and findTask, indicating when we should consider tasks without locality preferences.
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #1313 from CodingCat/SPARK-2294 and squashes the following commits:
    
    bf3f13b [CodingCat] rollback some forgotten changes
    89f9bc0 [CodingCat] address matei's comments
    18cae02 [CodingCat] add test case for node-local tasks
    2ba6195 [CodingCat] fix failed test cases
    87dd09e [CodingCat] fix style
    9b9432f [CodingCat] remove hasNodeLocalOnlyTasks
    fdd1573 [CodingCat] fix failed test cases
    941a4fd [CodingCat] see my shocked face..........
    f600085 [CodingCat] remove hasNodeLocalOnlyTasks checking
    0b8a46b [CodingCat] test whether hasNodeLocalOnlyTasks affect the results
    73ceda8 [CodingCat] style fix
    b3a430b [CodingCat] remove fine granularity tracking for node-local only tasks
    f9a2ad8 [CodingCat] simplify the logic in TaskSchedulerImpl
    c8c1de4 [CodingCat] simplify the patch
    be652ed [CodingCat] avoid unnecessary delay when we only have nopref tasks
    dee9e22 [CodingCat] fix locality inversion bug in TaskManager by moving nopref branch
    CodingCat authored and mateiz committed Aug 6, 2014
    Commit: 63bdb1f
  10. [MLlib] Use this.type as return type in k-means' builder pattern

    to ensure that the returned object is the instance itself.
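    
    A generic sketch of the `this.type` builder idiom (illustrative class, not the actual KMeans code):
    
    ```scala
    // Returning this.type keeps the most specific type across chained setter calls,
    // so a subclass's inherited setters still return the subclass.
    class KMeansLikeBuilder {
      private var k: Int = 2
      private var maxIterations: Int = 20
    
      def setK(k: Int): this.type = { this.k = k; this }
      def setMaxIterations(n: Int): this.type = { this.maxIterations = n; this }
    }
    
    // Chained calls keep returning the builder itself:
    val builder = new KMeansLikeBuilder().setK(10).setMaxIterations(50)
    ```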
    
    Author: DB Tsai <dbtsai@alpinenow.com>
    
    Closes #1796 from dbtsai/dbtsai-kmeans and squashes the following commits:
    
    658989e [DB Tsai] Alpine Data Labs
    DB Tsai authored and mengxr committed Aug 6, 2014
    Commit: c7b5201
  11. [SPARK-1022][Streaming][HOTFIX] Fixed zookeeper dependency of Kafka

    #1751 caused maven builds to fail.
    
    ```
    ~/Apache/spark(branch-1.1|✔) ➤ mvn -U -DskipTests clean install
    .
    .
    .
    [error] Apache/spark/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala:36: object NIOServerCnxnFactory is not a member of package org.apache.zookeeper.server
    [error] import org.apache.zookeeper.server.NIOServerCnxnFactory
    [error]        ^
    [error] Apache/spark/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala:199: not found: type NIOServerCnxnFactory
    [error]     val factory = new NIOServerCnxnFactory()
    [error]                       ^
    [error] two errors found
    [error] Compile failed at Aug 5, 2014 1:42:36 PM [0.503s]
    ```
    
    The problem is how SBT and Maven resolve multiple versions of the same library, which in this case is ZooKeeper. Observing and comparing the dependency trees from Maven and SBT showed this. Spark depends on ZK 3.4.5 whereas Apache Kafka transitively depends upon ZK 3.3.4. SBT decides to evict 3.3.4 and use the higher version 3.4.5, but Maven decides to stick to the closest (in the tree) dependency version, 3.3.4. And 3.3.4 does not have NIOServerCnxnFactory.
    
    The solution in this patch excludes zookeeper from the apache-kafka dependency in streaming-kafka module so that it just inherits zookeeper from Spark core.
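    
    For reference, the same exclusion expressed in sbt syntax (a hedged sketch; the actual fix edits the streaming-kafka Maven pom, and the Kafka version shown is illustrative):
    
    ```scala
    // Keep Kafka from pulling in its own (older) ZooKeeper; inherit ZK 3.4.5 from Spark core.
    libraryDependencies += ("org.apache.kafka" %% "kafka" % "0.8.0")
      .exclude("org.apache.zookeeper", "zookeeper")
    ```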
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #1797 from tdas/kafka-zk-fix and squashes the following commits:
    
    94b3931 [Tathagata Das] Fixed zookeeper dependency of Kafka
    tdas authored and pwendell committed Aug 6, 2014
    Commit: ee7f308
  12. [SPARK-2157] Enable tight firewall rules for Spark

    The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.
    
    The list covered here may or may not be the complete set of ports needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.
    
    My spark-env.sh looks like this:
    ```
    export SPARK_MASTER_PORT=6060
    export SPARK_WORKER_PORT=7070
    export SPARK_MASTER_WEBUI_PORT=9090
    export SPARK_WORKER_WEBUI_PORT=9091
    ```
    and my spark-defaults.conf looks like this:
    ```
    spark.master spark://andrews-mbp:6060
    spark.driver.port 5001
    spark.fileserver.port 5011
    spark.broadcast.port 5021
    spark.replClassServer.port 5031
    spark.blockManager.port 5041
    spark.executor.port 5051
    ```
    
    Author: Andrew Or <andrewor14@gmail.com>
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes #1777 from andrewor14/configure-ports and squashes the following commits:
    
    621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
    8a6b820 [Andrew Or] Use a random UI port during tests
    7da0493 [Andrew Or] Fix tests
    523c30e [Andrew Or] Add test for isBindCollision
    b97b02a [Andrew Or] Minor fixes
    c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
    93d359f [Andrew Or] Executors connect to wrong port when collision occurs
    d502e5f [Andrew Or] Handle port collisions when creating Akka systems
    a2dd05c [Andrew Or] Patrick's comment nit
    86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
    1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
    cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
    e837cde [Andrew Or] Remove outdated TODOs
    bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
    de1b207 [Andrew Or] Update docs to reflect new ports
    b565079 [Andrew Or] Add spark.ports.maxRetries
    2551eb2 [Andrew Or] Remove spark.worker.watcher.port
    151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
    9868358 [Andrew Or] Add a few miscellaneous ports
    6016e77 [Andrew Or] Add spark.executor.port
    8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
    4d9e6f3 [Andrew Or] Fix super subtle bug
    3f8e51b [Andrew Or] Correct erroneous docs...
    e111d08 [Andrew Or] Add names for UI services
    470f38c [Andrew Or] Special case non-"Address already in use" exceptions
    1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
    ba32280 [Andrew Or] Minor fixes
    6b550b0 [Andrew Or] Assorted fixes
    73fbe89 [Andrew Or] Move start service logic to Utils
    ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
    038a579 [Andrew Ash] Trust the server start function to report the port the service started on
    7c5bdc4 [Andrew Ash] Fix style issue
    0347aef [Andrew Ash] Unify port fallback logic to a single place
    24a4c32 [Andrew Ash] Remove type on val to match surrounding style
    9e4ad96 [Andrew Ash] Reformat for style checker
    5d84e0e [Andrew Ash] Document new port configuration options
    066dc7a [Andrew Ash] Fix up HttpServer port increments
    cad16da [Andrew Ash] Add fallover increment logic for HttpServer
    c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
    b80d2fd [Andrew Ash] Make Spark's block manager port configurable
    17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
    f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
    49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
    1c0981a [Andrew Ash] Make port in HttpServer configurable
    andrewor14 authored and pwendell committed Aug 6, 2014
    Commit: 09f7e45