Commits on Jan 19, 2018
  1. [BUILD][MINOR] Fix java style check issues

    sameeragarwal committed Jan 19, 2018
    ## What changes were proposed in this pull request?
    
    This patch fixes a few recently introduced java style check errors in master and release branch.
    
    As an aside, given that [java linting currently fails](#10763) on machines with a clean maven cache, it'd be great to find another workaround to [re-enable the java style checks](https://github.com/apache/spark/blob/3a07eff5af601511e97a05e6fea0e3d48f74c4f0/dev/run-tests.py#L577) as part of Spark PRB.
    
    /cc zsxwing JoshRosen srowen for any suggestions
    
    ## How was this patch tested?
    
    Manual Check
    
    Author: Sameer Agarwal <sameerag@apache.org>
    
    Closes #20323 from sameeragarwal/java.
Commits on Jan 17, 2018
  1. [SPARK-23020] Ignore Flaky Test: SparkLauncherSuite.testInProcessLauncher

    sameeragarwal committed Jan 17, 2018
    
    ## What changes were proposed in this pull request?
    
    Temporarily ignoring flaky test `SparkLauncherSuite.testInProcessLauncher` to de-flake the builds. This should be re-enabled when SPARK-23020 is merged.
    
    ## How was this patch tested?
    
    N/A (Test Only Change)
    
    Author: Sameer Agarwal <sameerag@apache.org>
    
    Closes #20291 from sameeragarwal/disable-test-2.
Commits on Jan 16, 2018
  1. [SPARK-23000] Use fully qualified table names in HiveMetastoreCatalogSuite

    sameeragarwal authored and gatorsmile committed Jan 16, 2018
    
    ## What changes were proposed in this pull request?
    
    In another attempt to fix `DataSourceWithHiveMetastoreCatalogSuite`, this patch uses fully qualified table names (`default.t`) in the individual tests.
    
    ## How was this patch tested?
    
    N/A (Test Only Change)
    
    Author: Sameer Agarwal <sameerag@apache.org>
    
    Closes #20273 from sameeragarwal/flaky-test.
Commits on May 2, 2017
  1. [SPARK-20548] Disable ReplSuite.newProductSeqEncoder with REPL defined class

    sameeragarwal authored and hvanhovell committed May 2, 2017
    
    ## What changes were proposed in this pull request?
    
    `newProductSeqEncoder with REPL defined class` in `ReplSuite` has been failing non-deterministically over the last few days (see https://spark-tests.appspot.com/failed-tests). Disabling the test until a fix is in place.
    
    https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/
    
    ## How was this patch tested?
    
    N/A
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #17823 from sameeragarwal/disable-test.
Commits on Apr 26, 2017
  1. [SPARK-18127] Add hooks and extension points to Spark

    sameeragarwal authored and gatorsmile committed Apr 26, 2017
    ## What changes were proposed in this pull request?
    
    This patch adds support for customizing the spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies or even a customized parser.
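
    A minimal sketch of how these extension points might be used (the builder-based injection API follows this patch's design; `MyPushdownRule` is a hypothetical no-op optimizer rule, not part of this change):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical optimizer rule used only for illustration.
    case class MyPushdownRule(spark: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder()
      .master("local[*]")
      .withExtensions { extensions =>
        // Inject the custom rule into the optimizer.
        extensions.injectOptimizerRule(MyPushdownRule)
      }
      .getOrCreate()
    ```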
    
    ## How was this patch tested?
    
    Unit Tests in SparkSessionExtensionSuite
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #17724 from sameeragarwal/session-extensions.
Commits on Apr 25, 2017
  1. [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit

    sameeragarwal authored and cloud-fan committed Apr 25, 2017
    
    ## What changes were proposed in this pull request?
    
    In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits.
    
    To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.
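
    The pruning idea can be sketched as a recursive type check (illustrative only, not the patch's actual code; it assumes Spark's `DataType` hierarchy):

    ```scala
    import org.apache.spark.sql.types._

    // A column can participate in the deterministic sort order only if its
    // type contains no MapType anywhere (top-level or nested).
    def containsMapType(dt: DataType): Boolean = dt match {
      case _: MapType    => true
      case s: StructType => s.fields.exists(f => containsMapType(f.dataType))
      case a: ArrayType  => containsMapType(a.elementType)
      case _             => false
    }
    ```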
    
    ## How was this patch tested?
    
    Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes and nested mapTypes.
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #17751 from sameeragarwal/randomsplit2.
Commits on Mar 22, 2017
  1. [BUILD][MINOR] Fix 2.10 build

    sameeragarwal authored and ueshin committed Mar 22, 2017
    ## What changes were proposed in this pull request?
    
    #17385 breaks the 2.10 sbt/maven builds by hitting an empty-string interpolation bug (https://issues.scala-lang.org/browse/SI-7919).
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/4072/
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/3987/
    
    ## How was this patch tested?
    
    Compiles
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #17391 from sameeragarwal/build-fix.
Commits on Sep 26, 2016
  1. [SPARK-17652] Fix confusing exception message while reserving capacity

    sameeragarwal authored and yhuai committed Sep 26, 2016
    ## What changes were proposed in this pull request?
    
    This minor patch fixes a confusing exception message while reserving additional capacity in the vectorized parquet reader.
    
    ## How was this patch tested?
    
    Existing Unit Tests
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #15225 from sameeragarwal/error-msg.
Commits on Sep 11, 2016
  1. [SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs

    sameeragarwal authored and hvanhovell committed Sep 11, 2016
    
    ## What changes were proposed in this pull request?
    
    This is a trivial patch that catches all `OutOfMemoryError` while building the broadcast hash relation and rethrows it by wrapping it in a nice error message.
    
    ## How was this patch tested?
    
    Existing Tests
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14979 from sameeragarwal/broadcast-join-error.
Commits on Sep 2, 2016
  1. [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error

    sameeragarwal authored and davies committed Sep 2, 2016
    
    ## What changes were proposed in this pull request?
    
    This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.
    
    ## How was this patch tested?
    
    Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14941 from sameeragarwal/parquet-exception-2.
Commits on Aug 26, 2016
  1. [SPARK-17244] Catalyst should not pushdown non-deterministic join conditions

    sameeragarwal authored and yhuai committed Aug 26, 2016
    
    ## What changes were proposed in this pull request?
    
    Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
    
    ## How was this patch tested?
    
    A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14815 from sameeragarwal/constraint-inputfile.
Commits on Aug 25, 2016
  1. [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints

    sameeragarwal authored and rxin committed Aug 25, 2016
    ## What changes were proposed in this pull request?
    
    Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
    
    ## How was this patch tested?
    
    Added a new test in `ConstraintPropagationSuite`.
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14795 from sameeragarwal/deterministic-constraints.
Commits on Jul 28, 2016
  1. [SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError

    sameeragarwal authored and rxin committed Jul 28, 2016
    
    ## What changes were proposed in this pull request?
    
    We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bound by INT.MAX), which may lead to OOMs while reading data. As a short-term fix, this patch intercepts the `OutOfMemoryError` exception and suggests that the user disable the vectorized parquet reader.
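
    The suggested mitigation amounts to a single conf change (config fragment; `spark.sql.parquet.enableVectorizedReader` is the conf that toggles the vectorized reader):

    ```scala
    // Fall back to the non-vectorized parquet reader to avoid the large
    // column-vector allocations made by the vectorized code path.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    ```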
    
    ## How was this patch tested?
    
    Existing Tests
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14387 from sameeragarwal/oom.
Commits on Jul 25, 2016
  1. [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages

    sameeragarwal authored and liancheng committed Jul 25, 2016
    
    ## What changes were proposed in this pull request?
    
    This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page size such that the generated parquet file spans 3 pages (within a single row group), where the first page is dictionary encoded and the remaining two are plain encoded.
    
    ## How was this patch tested?
    
    1. ParquetEncodingSuite
    2. Also manually tested that this test fails without #12279
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14304 from sameeragarwal/hybrid-encoding-test.
Commits on Jul 21, 2016
  1. [SPARK-16334] Maintain single dictionary per row-batch in vectorized parquet reader

    sameeragarwal authored and rxin committed Jul 21, 2016
    
    ## What changes were proposed in this pull request?
    
    As part of the bugfix in #12279, if a row batch consists of both dictionary-encoded and non-dictionary-encoded pages, we explicitly decode the dictionary for the values that are already dictionary encoded. Currently we reset the dictionary while reading every page, which can potentially cause a `java.lang.ArrayIndexOutOfBoundsException` while decoding older pages. This patch fixes the problem by maintaining a single dictionary per row-batch in the vectorized parquet reader.
    
    ## How was this patch tested?
    
    Manual Tests against a number of hand-generated parquet files.
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14225 from sameeragarwal/vectorized.
Commits on Jul 16, 2016
  1. [SPARK-16582][SQL] Explicitly define isNull = false for non-nullable expressions

    sameeragarwal authored and rxin committed Jul 16, 2016
    
    ## What changes were proposed in this pull request?
    
    This patch is just a slightly safer way to fix the issue we encountered in #14168, should this pattern re-occur at other places in the code.
    
    ## How was this patch tested?
    
    Existing tests. Also, I manually tested that it fixes the problem in SPARK-16514 without having the proposed change in #14168.
    
    Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
    
    Closes #14227 from sameeragarwal/codegen.
Commits on Jul 12, 2016
  1. [SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy

    sameeragarwal authored and rxin committed Jul 12, 2016
    
    ## What changes were proposed in this pull request?
    
    This patch fixes a variable namespace collision bug in pmod and partitionBy
    
    ## How was this patch tested?
    
    Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #14144 from sameeragarwal/codegen-bug.
Commits on Jun 24, 2016
  1. [SPARK-16123] Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader

    sameeragarwal authored and hvanhovell committed Jun 24, 2016
    
    ## What changes were proposed in this pull request?
    
    This patch fixes an overflow bug in vectorized parquet reader where both off-heap and on-heap variants of `ColumnVector.reserve()` can unfortunately overflow while reserving additional capacity during reads.
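
    The overflow class being fixed can be illustrated with plain Int arithmetic (a sketch, not Spark's actual `reserve()` code):

    ```scala
    // Doubling a large Int capacity silently wraps negative; passing the
    // result to an array allocation is what surfaces as a
    // NegativeArraySizeException.
    def unsafeGrow(capacity: Int): Int = capacity * 2

    // Widening to Long and clamping avoids the wrap.
    def safeGrow(capacity: Int): Int =
      math.min(capacity.toLong * 2, Int.MaxValue.toLong).toInt

    val large = 1500000000
    val wrapped = unsafeGrow(large)  // negative after 32-bit wraparound
    val clamped = safeGrow(large)    // clamps at Int.MaxValue
    ```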
    
    ## How was this patch tested?
    
    Manual Tests
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13832 from sameeragarwal/negative-array.
Commits on Jun 17, 2016
  1. Remove non-obvious conf settings from TPCDS benchmark

    sameeragarwal authored and hvanhovell committed Jun 17, 2016
    ## What changes were proposed in this pull request?
    
    My fault -- these 2 conf entries are mysteriously hidden inside the benchmark code, which makes it non-obvious to disable whole-stage codegen and/or the vectorized parquet reader.
    
    PS: Didn't attach a JIRA as this change should otherwise be a no-op (both these confs are enabled by default in Spark)
    
    ## How was this patch tested?
    
    N/A
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13726 from sameeragarwal/tpcds-conf.
Commits on Jun 11, 2016
  1. [SPARK-15678] Add support to REFRESH data source paths

    sameeragarwal authored and davies committed Jun 11, 2016
    ## What changes were proposed in this pull request?
    
    Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.
    
    Current behavior:
    ```scala
    val dir = "/tmp/test"
    sqlContext.range(1000).write.mode("overwrite").parquet(dir)
    val df = sqlContext.read.parquet(dir).cache()
    df.count() // outputs 1000
    sqlContext.range(10).write.mode("overwrite").parquet(dir)
    sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
    ```
    
    This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.
    
    Expected behavior:
    ```scala
    val dir = "/tmp/test"
    sqlContext.range(1000).write.mode("overwrite").parquet(dir)
    val df = sqlContext.read.parquet(dir).cache()
    df.count() // outputs 1000
    sqlContext.range(10).write.mode("overwrite").parquet(dir)
    spark.catalog.refreshResource(dir)
    sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
    ```
    
    ## How was this patch tested?
    
    Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13566 from sameeragarwal/refresh-path-2.
Commits on Jun 3, 2016
  1. [SPARK-15745][SQL] Use classloader's getResource() for reading resource files in HiveTests

    sameeragarwal authored and rxin committed Jun 3, 2016
    
    ## What changes were proposed in this pull request?
    
    This is a cleaner approach in general but my motivation behind this change in particular is to be able to run these tests from anywhere without relying on system properties.
    
    ## How was this patch tested?
    
    Test only change
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13489 from sameeragarwal/resourcepath.
Commits on Jun 2, 2016
  1. [SPARK-14752][SQL] Explicitly implement KryoSerialization for LazilyGenerateOrdering

    sameeragarwal authored and rxin committed Jun 2, 2016
    
    ## What changes were proposed in this pull request?
    
    This patch fixes a number of `com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException` exceptions reported in [SPARK-15604], [SPARK-14752] etc. (while executing sparkSQL queries with the kryo serializer) by explicitly implementing `KryoSerialization` for `LazilyGenerateOrdering`.
    
    ## How was this patch tested?
    
    1. Modified `OrderingSuite` so that all tests in the suite also test kryo serialization (for both interpreted and generated ordering).
    2. Manually verified TPC-DS q1.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13466 from sameeragarwal/kryo.
Commits on May 27, 2016
  1. [SPARK-15599][SQL][DOCS] API docs for `createDataset` functions in SparkSession

    sameeragarwal authored and Andrew Or committed May 27, 2016
    
    ## What changes were proposed in this pull request?
    
    Adds API docs and usage examples for the 3 `createDataset` calls in `SparkSession`
    
    ## How was this patch tested?
    
    N/A
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13345 from sameeragarwal/dataset-doc.
Commits on May 26, 2016
  1. [SPARK-8428][SPARK-13850] Fix integer overflows in TimSort

    sameeragarwal authored and rxin committed May 26, 2016
    ## What changes were proposed in this pull request?
    
    This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat.copyRange()` that seem to be the most likely cause behind a number of `TimSort` contract-violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets.
    
    ## How was this patch tested?
    
    Added a test in `ExternalSorterSuite` that instantiates a large array of the form [150000000, 150000001, 150000002, ..., 300000000, 0, 1, 2, ..., 149999999] that triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur.
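
    The overflow pattern behind these contract violations can be shown with two Int offsets whose sum exceeds `Int.MaxValue` (illustrative arithmetic only, not the patched `copyRange()` code):

    ```scala
    val srcPos = 2000000000
    val length = 400000000

    // 32-bit addition wraps past Int.MaxValue, producing a negative offset
    // and corrupting the range bounds that TimSort's merge relies on.
    val wrapped = srcPos + length

    // Widening one operand to Long keeps the true value.
    val widened = srcPos.toLong + length
    ```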
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13336 from sameeragarwal/timsort-bug.
  2. [SPARK-15533][SQL] Deprecate Dataset.explode

    sameeragarwal authored and rxin committed May 26, 2016
    ## What changes were proposed in this pull request?
    
    This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
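
    The documented workarounds look roughly like this (a sketch; `df` is a hypothetical DataFrame with an array column `words`, and `ds` a typed Dataset of a case class with a `words: Seq[String]` field):

    ```scala
    import org.apache.spark.sql.functions.explode

    // Instead of df.explode(...), select the explode function over the column:
    df.select(explode(df("words")).as("word"))

    // ...or flatMap over a typed Dataset:
    ds.flatMap(_.words)
    ```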
    
    ## How was this patch tested?
    
    N/A
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13312 from sameeragarwal/deprecate.
Commits on May 23, 2016
  1. [SPARK-15425][SQL] Disallow cross joins by default

    sameeragarwal authored and rxin committed May 23, 2016
    ## What changes were proposed in this pull request?
    
    In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default) that, unless enabled, results in a `SparkException` if the query contains one or more cartesian products.
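
    Queries that intentionally need a cartesian product can opt back in via the new conf (config fragment using the conf name introduced here):

    ```scala
    // Explicitly allow cartesian products for this session.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    ```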
    
    ## How was this patch tested?
    
    Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13209 from sameeragarwal/disallow-cartesian.
Commits on May 20, 2016
  1. [SPARK-15078] [SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL

    sameeragarwal authored and davies committed May 20, 2016
    ## What changes were proposed in this pull request?
    
    Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL.
    
    ## How was this patch tested?
    
    Benchmark only
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #13188 from sameeragarwal/tpcds-all.
Commits on Apr 29, 2016
  1. [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL

    sameeragarwal authored and rxin committed Apr 29, 2016
    ## What changes were proposed in this pull request?
    
    This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #12771 from sameeragarwal/tpcds-2.
Commits on Apr 26, 2016
  1. [SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision decimals

    sameeragarwal authored and davies committed Apr 26, 2016
    
    ## What changes were proposed in this pull request?
    
    While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that have been verified to show performance improvements on our benchmarks, subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time.
    
    ## How was this patch tested?
    
    Since there is no new change in functionality, existing tests should suffice. Performance tests were done on TPCDS benchmarks.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #12710 from sameeragarwal/vectorized-enable.
  2. [SPARK-14870][SQL][FOLLOW-UP] Move decimalDataWithNulls in DataFrameAggregateSuite

    sameeragarwal authored and rxin committed Apr 26, 2016
    
    ## What changes were proposed in this pull request?
    
    Minor followup to #12651
    
    ## How was this patch tested?
    
    Test-only change
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #12674 from sameeragarwal/tpcds-fix-2.
Commits on Apr 25, 2016
  1. [SPARK-14870] [SQL] Fix NPE in TPCDS q14a

    sameeragarwal authored and davies committed Apr 25, 2016
    ## What changes were proposed in this pull request?
    
    This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns, which causes a null pointer exception while executing TPCDS q14a.
    
    ## How was this patch tested?
    
    1. Added regression test in `DataFrameAggregateSuite`.
    2. Verified that TPCDS q14a works
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #12651 from sameeragarwal/tpcds-fix.
Commits on Apr 22, 2016
  1. [SPARK-14680] [SQL] Support all datatypes to use VectorizedHashmap in TungstenAggregate

    sameeragarwal authored and davies committed Apr 22, 2016
    
    ## What changes were proposed in this pull request?
    
    This PR adds support for all primitive datatypes, decimal types and string types in the VectorizedHashmap during aggregation.
    
    ## How was this patch tested?
    
    Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below).
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #12440 from sameeragarwal/all-datatypes.
Commits on Apr 21, 2016
  1. [SPARK-14774][SQL] Write unscaled values in ColumnVector.putDecimal

    sameeragarwal authored and rxin committed Apr 21, 2016
    ## What changes were proposed in this pull request?
    
    We recently made `ColumnarBatch.row` mutable and added a new `ColumnVector.putDecimal` method to support putting `Decimal` values in the `ColumnarBatch`. This unfortunately introduced a bug wherein we were not updating the vector with the proper unscaled values.
    
    ## How was this patch tested?
    
    This codepath is hit only when the vectorized aggregate hashmap is enabled. #12440 makes sure that a number of regression tests/benchmarks test this bugfix.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #12541 from sameeragarwal/fix-bigdecimal.