[SPARK-33813][SQL][3.0] Fix the issue that JDBC source can't treat MS SQL Server's spatial types #31289

Closed
wants to merge 1,280 commits
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Sep 8, 2020

  1. [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe
    
    ### What changes were proposed in this pull request?
    Before SPARK-31511 was fixed, `BytesToBytesMap` iterator() was not thread-safe and could cause data inaccuracy.
    We need to add a unit test.
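
    A minimal sketch of the kind of concurrent-iteration check such a test performs (illustrative only; the real suite builds the map through a TaskMemoryManager, which is omitted here):
    ```scala
    import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
    import java.util.concurrent.atomic.AtomicLong

    // Hypothetical stand-in for a populated BytesToBytesMap.
    val entries = (0L until 100000L).toVector
    def iterateAll(): Long = entries.iterator.map(_ => 1L).sum

    val threads = 4
    val latch = new CountDownLatch(threads)
    val total = new AtomicLong(0)
    val pool = Executors.newFixedThreadPool(threads)
    (1 to threads).foreach { _ =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          // Each thread gets its own iterator; a thread-safe iterator() must
          // let every thread observe the full set of entries.
          total.addAndGet(iterateAll())
          latch.countDown()
        }
      })
    }
    latch.await(1, TimeUnit.MINUTES)
    pool.shutdown()
    assert(total.get() == threads.toLong * entries.size)
    ```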
    
    ### Why are the changes needed?
    Increase test coverage to ensure that iterator() is thread-safe.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added a unit test.
    
    Closes apache#29669 from cxzl25/SPARK-31511-test.
    
    Authored-by: sychen <sychen@ctrip.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit bd3dc2f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cxzl25 authored and cloud-fan committed Sep 8, 2020
    Commit 4656ee5
  2. [SPARK-32753][SQL][3.0] Only copy tags to node with no tags

    This PR backports apache#29593 to branch-3.0
    
    ### What changes were proposed in this pull request?
    Only copy tags to nodes with no tags when transforming plans.
    
    ### Why are the changes needed?
    cloud-fan [made a good point](apache#29593 (comment)) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE.
    
    ```
    spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
    val df = spark.sql("select id from v1 group by id distribute by id")
    println(df.collect().toArray.mkString(","))
    println(df.queryExecution.executedPlan)
    
    // With AQE
    [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
    AdaptiveSparkPlan(isFinalPlan=true)
    +- CustomShuffleReader local
       +- ShuffleQueryStage 0
          +- Exchange hashpartitioning(id#183L, 10), true
             +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
                +- Union
                   :- *(1) Range (0, 10, step=1, splits=2)
                   +- *(2) Range (0, 10, step=1, splits=2)
    
    // Without AQE
    [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
    *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
    +- Exchange hashpartitioning(id#206L, 10), true
       +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
          +- Union
             :- *(1) Range (0, 10, step=1, splits=2)
             +- *(2) Range (0, 10, step=1, splits=2)
    ```
    
    It's too expensive to detect node removal, so we make a compromise and only copy tags to nodes with no tags.
    
    ### Does this PR introduce any user-facing change?
    Yes. Fix a bug.
    
    ### How was this patch tested?
    Add test.
    
    Closes apache#29665 from manuzhang/spark-32753-3.0.
    
    Authored-by: manuzhang <owenzhang1990@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    manuzhang authored and cloud-fan committed Sep 8, 2020
    Commit 9b39e4b
  3. [SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to fix an issue with LibSVM datasource when both of the following are true:
    * no user specified schema
    * some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc.
    
    The fix is a backport of apache#29670, and it is based on another bug fix for CSV/JSON datasources apache#29659.
    
    ### Why are the changes needed?
    To fix the issue when queries like the following try to read from paths such as `[abc]`:
    ```scala
    spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
    ```
    but would end up hitting an exception:
    ```
    Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
    org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
    	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
    	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
    	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    	at scala.util.Success.$anonfun$map$1(Try.scala:255)
    	at scala.util.Success.map(Try.scala:213)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    Added UT to `LibSVMRelationSuite`.
    
    Closes apache#29675 from MaxGekk/globbing-paths-when-inferring-schema-ml-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Sep 8, 2020
    Commit 8c0b9cb
  4. [SPARK-32638][SQL][3.0] Corrects references when adding aliases in WidenSetOperationTypes
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix a bug where references can be missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example,
    ```
    CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
    SELECT t.v FROM (
      SELECT v FROM t3
      UNION ALL
      SELECT v + v AS v FROM t3
    ) t;
    
    org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
    !Project [v#1]  <------ the reference got missing
    +- SubqueryAlias t
       +- Union
          :- Project [cast(v#1 as decimal(11,0)) AS v#3]
          :  +- Project [v#1]
          :     +- SubqueryAlias t3
          :        +- SubqueryAlias tbl
          :           +- LocalRelation [v#1]
          +- Project [v#2]
             +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
                +- SubqueryAlias t3
                   +- SubqueryAlias tbl
                      +- LocalRelation [v#1]
    ```
    In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and then the reference in the top `Project` went missing. This PR corrects the reference (the `exprId` and the widened `dataType`) after adding aliases in the rule.
    
    This backport for 3.0 comes from apache#29485 and apache#29643
    
    ### Why are the changes needed?
    
    bugfixes
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit tests
    
    Closes apache#29680 from maropu/SPARK-32638-BRANCH3.0.
    
    Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
    Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    cloud-fan and maropu committed Sep 8, 2020
    Commit 3f20f14

Commits on Sep 9, 2020

  1. [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config
    
    ### What changes were proposed in this pull request?
    
    If the user forgets to specify .amount on a resource config like spark.executor.resource.gpu, the error message thrown is very confusing:
    
    ```
    ERROR SparkContext: Error initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String index out of range:
    -1 at java.lang.String.substring(String.java:1967) at
    ```
    
    This change throws a readable error instead.
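
    For reference, a hedged sketch of the config shapes involved; the key names follow the standard `spark.executor.resource.<name>.*` pattern, and the discovery-script path is only a placeholder:
    ```scala
    import org.apache.spark.SparkConf

    // Missing ".amount": this is the shape that used to trigger the confusing
    // StringIndexOutOfBoundsException at SparkContext initialization.
    val incomplete = new SparkConf().set("spark.executor.resource.gpu", "1")

    // Complete resource config with an explicit amount.
    val complete = new SparkConf()
      .set("spark.executor.resource.gpu.amount", "1")
      .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    ```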
    
    ### Why are the changes needed?
    
    confusing error for users
    ### Does this PR introduce _any_ user-facing change?
    
    just error message
    
    ### How was this patch tested?
    
    Tested manually on standalone cluster
    
    Closes apache#29685 from tgravescs/SPARK-32824.
    
    Authored-by: Thomas Graves <tgraves@nvidia.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit e8634d8)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    tgravescs authored and HyukjinKwon committed Sep 9, 2020
    Commit e86d90b
  2. [SPARK-32823][WEB UI] Fix the master ui resources reporting

    ### What changes were proposed in this pull request?
    
    Fixes the master UI so that resource totals are summed properly across multiple workers, e.g. the field:
    Resources in use: 0 / 8 gpu
    
    The bug here is that it was creating MutableResourceInfo and then reducing with the + operator. The + operator in MutableResourceInfo simply adds the addresses from one to the addresses of the other, but it uses a HashSet, so if the addresses are the same you lose the correct amount. For example, if worker1 has GPU addresses 0,1,2,3 and worker2 has addresses 0,1,2,3, you only see 4 total GPUs when there are 8.
    
    In this case we don't really need to create the MutableResourceInfo at all, because we just want the sums for used and total, so the use of it is removed. The other uses of it are per worker, so those should be fine.
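
    A small illustration of why set-union summing undercounts, independent of the Spark classes involved:
    ```scala
    // Each worker reports 4 GPUs, but both use the local address labels "0".."3".
    val worker1Gpus = Set("0", "1", "2", "3")
    val worker2Gpus = Set("0", "1", "2", "3")

    // Union collapses identical labels: 4 instead of 8.
    val mergedBySet = (worker1Gpus ++ worker2Gpus).size        // 4

    // Summing the per-worker counts gives the intended cluster-wide total.
    val mergedByCount = worker1Gpus.size + worker2Gpus.size    // 8
    ```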
    
    ### Why are the changes needed?
    
    fix UI
    
    ### Does this PR introduce _any_ user-facing change?
    
    UI
    
    ### How was this patch tested?
    
    tested manually on standalone cluster with multiple workers and multiple GPUs and multiple fpgas
    
    Closes apache#29683 from tgravescs/SPARK-32823.
    
    Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
    Co-authored-by: Thomas Graves <tgraves@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 514bf56)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Sep 9, 2020
    Commit 86b9dd9
  3. [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession
    
    ### What changes were proposed in this pull request?
    
    If no active SparkSession is available, let `FileSourceScanExec.needsUnsafeRowConversion` look at default SQL config of ParquetSource vectorized reader instead of failing the query execution.
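
    A hedged sketch of the fallback pattern described above; the real code lives in `FileSourceScanExec`, and the helper name here is illustrative:
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.internal.SQLConf

    // Prefer the active session's SQL config, but fall back to the default
    // SQLConf value instead of failing when no SparkSession is active.
    def parquetVectorizedReaderEnabled: Boolean =
      SparkSession.getActiveSession
        .map(_.sessionState.conf.parquetVectorizedReaderEnabled)
        .getOrElse(SQLConf.get.parquetVectorizedReaderEnabled)
    ```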
    
    ### Why are the changes needed?
    
    Fix a bug that if no active SparkSession is available, file-based data source scan for Parquet Source will throw exception.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this change fixes the bug.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#29667 from viirya/SPARK-32813.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit de0dc52)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Sep 9, 2020
    Commit 4c0f9d8
  4. [SPARK-32810][SQL][TESTS][FOLLOWUP][3.0] Check path globbing in JSON/CSV datasources v1 and v2
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to move the test `SPARK-32810: CSV and JSON data sources should be able to read files with escaped glob metacharacter in the paths` from `DataFrameReaderWriterSuite` to `CSVSuite` and to `JsonSuite`. This allows running the same test in `CSVv1Suite`/`CSVv2Suite` and in `JsonV1Suite`/`JsonV2Suite`.
    
    ### Why are the changes needed?
    To improve test coverage by checking JSON/CSV datasources v1 and v2.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running affected test suites:
    ```
    $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.csv.*"
    $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
    ```
    
    Closes apache#29690 from MaxGekk/globbing-paths-when-inferring-schema-dsv2-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 9, 2020
    Commit 837843b
  5. [SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources
    
    ### What changes were proposed in this pull request?
    
    Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same.
    
    ### Why are the changes needed?
    
    Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when
    - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts).
        - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches.
    - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch.
    
    This occurs because a no-data-batch has the same start and end offsets, and when a batch is executed, if the start and end offsets are the same then calling `source.getBatch` is skipped, as it is assumed the generated plan will be empty. This only affects V1 data sources, which rely on this invariant to detect in the source whether the query is being started from scratch or restarted.
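
    A hedged, simplified sketch of the contract from the source's point of view; the types below are stand-ins, not the real streaming `Offset`/`Source` classes:
    ```scala
    // Stand-in offset type for illustration.
    final case class FakeOffset(value: Long)

    var lastGetBatchCall: Option[(Option[FakeOffset], FakeOffset)] = None

    def getBatch(start: Option[FakeOffset], end: FakeOffset): Unit = {
      // A V1 source uses this call to detect whether the query starts from
      // scratch (start == None) or is restarted at a known offset.
      lastGetBatchCall = Some((start, end))
    }

    // After the fix, the engine calls getBatch on restart even when the
    // re-executed batch is a no-data-batch with start == end.
    getBatch(Some(FakeOffset(42)), FakeOffset(42))
    assert(lastGetBatchCall.contains((Some(FakeOffset(42)), FakeOffset(42))))
    ```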
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    New unit test with a mock v1 source that fails without the fix.
    
    Closes apache#29696 from tdas/SPARK-32794-3.0.
    
    Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tdas committed Sep 9, 2020
    Commit e632e7c

Commits on Sep 10, 2020

  1. [SPARK-32836][SS][TESTS] Fix DataStreamReaderWriterSuite to check writer options correctly
    
    ### What changes were proposed in this pull request?
    
    This PR aims to fix the test coverage at `DataStreamReaderWriterSuite`.
    
    ### Why are the changes needed?
    
    Currently, the test case checks `DataStreamReader` options instead of `DataStreamWriter` options.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the revised test case.
    
    Closes apache#29701 from dongjoon-hyun/SPARK-32836.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 06a9945)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Sep 10, 2020
    Commit 5a81f60
  2. [SPARK-32832][SS] Use CaseInsensitiveMap for DataStreamReader/Writer options
    
    This PR aims to fix non-deterministic behavior of DataStreamReader/Writer options like the following.
    ```scala
    scala> spark.readStream.format("parquet").option("paTh", "1").option("PATH", "2").option("Path", "3").option("patH", "4").option("path", "5").load()
    org.apache.spark.sql.AnalysisException: Path does not exist: 1;
    ```
    
    This will make the behavior deterministic.
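
    A small sketch of the property the fix relies on, assuming the catalyst `CaseInsensitiveMap` keeps its usual case-insensitive lookup:
    ```scala
    import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

    val options = CaseInsensitiveMap(Map("PATH" -> "5"))

    // Lookups ignore case, so the reader resolves one deterministic value
    // no matter how the caller spelled the option name.
    assert(options.get("path").contains("5"))
    assert(options.get("paTh").contains("5"))
    ```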
    
    Yes, but the previous behavior was non-deterministic.

    Pass the newly added test cases.
    
    Closes apache#29702 from dongjoon-hyun/SPARK-32832.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 2f85f95)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Sep 10, 2020
    Commit 44acb5a
  3. [SPARK-32819][SQL][3.0] ignoreNullability parameter should be effective recursively
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to check `ignoreNullability` parameter recursively in `equalsStructurally` method. This backports apache#29698 to branch-3.0.
    
    ### Why are the changes needed?
    
    `equalsStructurally` is used to check type equality. We can optionally ask it to ignore the nullability check. But the parameter `ignoreNullability` is not passed recursively down to nested types, so it produces a confusing error like:
    
    ```
    data type mismatch: argument 3 requires array<array<string>> type, however ... is of array<array<string>> type.
    ```
    
    when running the query `select aggregate(split('abcdefgh',''), array(array('')), (acc, x) -> array(array( x ) ) )`.
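
    An illustrative-only re-implementation of such a structural check with the flag propagated into nested types (the real logic lives in `DataType.equalsStructurally`):
    ```scala
    import org.apache.spark.sql.types._

    def structurallyEqual(a: DataType, b: DataType, ignoreNullability: Boolean): Boolean =
      (a, b) match {
        case (ArrayType(ea, na), ArrayType(eb, nb)) =>
          (ignoreNullability || na == nb) && structurallyEqual(ea, eb, ignoreNullability)
        case (MapType(ka, va, na), MapType(kb, vb, nb)) =>
          (ignoreNullability || na == nb) &&
            structurallyEqual(ka, kb, ignoreNullability) &&
            structurallyEqual(va, vb, ignoreNullability)
        case (StructType(fa), StructType(fb)) =>
          fa.length == fb.length && fa.zip(fb).forall { case (x, y) =>
            (ignoreNullability || x.nullable == y.nullable) &&
              structurallyEqual(x.dataType, y.dataType, ignoreNullability)
          }
        case _ => a == b
      }

    // Nested nullability differences are ignored when requested.
    val left  = ArrayType(ArrayType(StringType, containsNull = true), containsNull = true)
    val right = ArrayType(ArrayType(StringType, containsNull = false), containsNull = true)
    assert(structurallyEqual(left, right, ignoreNullability = true))
    ```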
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, fixed a bug when running user query.
    
    ### How was this patch tested?
    
    Unit tests.
    
    Closes apache#29705 from viirya/SPARK-32819-3.0.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Sep 10, 2020
    Commit 5708045

Commits on Sep 11, 2020

  1. [SPARK-32840][SQL][3.0] Invalid interval value can happen to be just adhesive with the unit
    
    This PR backports apache#29708 to 3.0.
    
    ### What changes were proposed in this pull request?
    In this PR, we add an upfront check on STRING-form interval values before parsing multiple-unit intervals, and fail directly if the interval value contains alphabetic characters, to prevent correctness issues like `interval '1 day 2' day` = `3 days`.
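
    A rough sketch of the kind of upfront check described, assuming the STRING-form value has already been extracted (the real implementation throws a ParseException):
    ```scala
    // Reject a STRING-form interval value such as "1 day 2" whose trailing
    // number would otherwise be silently glued onto the following unit.
    def checkIntervalValue(value: String): Unit = {
      if (value.exists(_.isLetter)) {
        throw new IllegalArgumentException(
          s"Cannot parse interval value '$value': it must not contain alphabetic characters")
      }
    }

    checkIntervalValue("1")          // ok
    // checkIntervalValue("1 day 2") // fails instead of yielding 3 days
    ```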
    
    ### Why are the changes needed?
    
    fix correctness issue
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. In Spark 3.0.0, `interval '1 day 2' day` = `3 days`, but now we fail with a ParseException.

    ### How was this patch tested?
    
    add a test.
    
    Closes apache#29716 from yaooqinn/SPARK-32840-30.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Sep 11, 2020
    Commit 4fdd818
  2. [SPARK-32677][SQL][DOCS][MINOR] Improve code comment in CreateFunctionCommand
    
    ### What changes were proposed in this pull request?
    
    We made a mistake in apache#29502, as there is no code comment to explain why we can't load the UDF class when creating functions. This PR improves the code comment.
    
    ### Why are the changes needed?
    
    To avoid making the same mistake.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes apache#29713 from cloud-fan/comment.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 328d81a)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    cloud-fan authored and maropu committed Sep 11, 2020
    Commit cf14897
  3. [SPARK-32845][SS][TESTS] Add sinkParameter to check sink options robustly in DataStreamReaderWriterSuite
    
    This PR aims to add `sinkParameter`  to check sink options robustly and independently in DataStreamReaderWriterSuite
    
    `LastOptions.parameters` is designed to catch three cases: `sourceSchema`, `createSource`, `createSink`. However, `StreamQuery.stop` invokes `queryExecutionThread.join`, `runStream`, and `createSource` immediately, and resets the options stored by `createSink`.

    To catch the `createSink` options, the test suite currently tries a workaround pattern. However, we have sometimes observed flakiness in this pattern. If we track the `createSink` options separately, we don't need this workaround and can eliminate this flakiness.
    
    ```scala
    val query = df.writeStream.
       ...
       .start()
    assert(LastOptions.parameters(..))
    query.stop()
    ```
    
    No. This is a test-only change.
    
    Pass the newly updated test case.
    
    Closes apache#29730 from dongjoon-hyun/SPARK-32845.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit b4be6a6)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Sep 11, 2020
    Commit ec45d10

Commits on Sep 12, 2020

  1. [SPARK-32779][SQL][FOLLOW-UP] Delete Unused code

    ### What changes were proposed in this pull request?
    Follow-up PR as per the review comments in [29649](https://github.com/apache/spark/pull/29649/files/8d45542e915bea1b321f42988b407091065a2539#r487140171)
    
    ### Why are the changes needed?
    Delete the unused code
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UT
    
    Closes apache#29736 from sandeep-katta/deadlockfollowup.
    
    Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 2009f95)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sandeep-katta authored and dongjoon-hyun committed Sep 12, 2020
    Commit 2e04689

Commits on Sep 13, 2020

  1. [SPARK-32865][DOC] python section in quickstart page doesn't display SPARK_VERSION correctly
    
    ### What changes were proposed in this pull request?
    
    In https://github.com/apache/spark/blame/master/docs/quick-start.md#L402, it should be `{{site.SPARK_VERSION}}` rather than `{site.SPARK_VERSION}`
    
    ### Why are the changes needed?
    
    SPARK_VERSION isn't displayed correctly, as shown below
    
    ![image](https://user-images.githubusercontent.com/1892692/93006726-d03c8680-f514-11ea-85e3-1d7cfb682ef2.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    tested locally, as shown below
    
    ![image](https://user-images.githubusercontent.com/1892692/93006712-a6835f80-f514-11ea-8d78-6831c9d65265.png)
    
    Closes apache#29738 from bowenli86/doc.
    
    Authored-by: bowen.li <bowenli86@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 0549c20)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    bowenli86 authored and dongjoon-hyun committed Sep 13, 2020
    Commit d4d2f5c

Commits on Sep 14, 2020

  1. [SPARK-32876][SQL] Change default fallback versions to 3.0.1 and 2.4.7 in HiveExternalCatalogVersionsSuite
    
    ### What changes were proposed in this pull request?
    
    The Jenkins job fails to get the versions. This was fixed by adding temporary fallbacks at apache#28536.
    This still doesn't work without the temporary fallbacks. See apache#29694
    
    This PR adds new fallbacks since 2.3 is EOL and Spark 3.0.1 and 2.4.7 are released.
    
    ### Why are the changes needed?
    
    To test correctly in Jenkins.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only
    
    ### How was this patch tested?
    
    Jenkins and GitHub Actions builds should test.
    
    Closes apache#29748 from HyukjinKwon/SPARK-32876.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 0696f04)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    HyukjinKwon authored and dongjoon-hyun committed Sep 14, 2020
    Commit 828603d
  2. [SPARK-32872][CORE] Prevent BytesToBytesMap at MAX_CAPACITY from exceeding growth threshold
    
    ### What changes were proposed in this pull request?
    
    When BytesToBytesMap is at `MAX_CAPACITY` and reaches its growth threshold, `numKeys >= growthThreshold` is true but `longArray.size() / 2 < MAX_CAPACITY` is false. This correctly prevents the map from growing, but `canGrowArray` incorrectly remains true. Therefore the map keeps accepting new keys and exceeds its growth threshold. If we attempt to spill the map in this state, the UnsafeKVExternalSorter will not be able to reuse the long array for sorting. By this point the task has typically consumed all available memory, so the allocation of the new pointer array is likely to fail.
    
    This PR fixes the issue by setting `canGrowArray` to false in this case. This prevents the map from accepting new elements when it cannot grow to accommodate them.
    
    ### Why are the changes needed?
    
    Without this change, hash aggregations will fail when the number of groups per task is greater than `MAX_CAPACITY / 2 = 2^28` (approximately 268 million), and when the grouping aggregation is the only memory-consuming operator in its stage.
    
    For example, the final aggregation in `SELECT COUNT(DISTINCT id) FROM tbl` fails when `tbl` contains 1 billion distinct values and when `spark.sql.shuffle.partitions=1`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Reproducing this issue requires building a very large BytesToBytesMap. Because this is infeasible to do in a unit test, this PR was tested manually by adding the following test to AbstractBytesToBytesMapSuite. Before this PR, the test fails in 8.5 minutes. With this PR, the test passes in 1.5 minutes.
    
    ```java
    public abstract class AbstractBytesToBytesMapSuite {
      // ...
      @Test
      public void respectGrowthThresholdAtMaxCapacity() {
        TestMemoryManager memoryManager2 =
            new TestMemoryManager(
                new SparkConf()
                .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true)
                .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L)
                .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false)
                .set(package$.MODULE$.SHUFFLE_COMPRESS(), false));
        TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0);
        final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker
        final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes);
    
        try {
          // Insert keys into the map until it stops accepting new keys.
          for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY; i++) {
            if (i % (1024 * 1024) == 0) System.out.println("Inserting element " + i);
            final long[] value = new long[]{i};
            BytesToBytesMap.Location loc = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8);
            Assert.assertFalse(loc.isDefined());
            boolean success =
                loc.append(value, Platform.LONG_ARRAY_OFFSET, 8, value, Platform.LONG_ARRAY_OFFSET, 8);
            if (!success) break;
          }
    
          // The map should grow to its max capacity.
          long capacity = map.getArray().size() / 2;
          Assert.assertTrue(capacity == BytesToBytesMap.MAX_CAPACITY);
    
          // The map should stop accepting new keys once it has reached its growth
          // threshold, which is half the max capacity.
          Assert.assertTrue(map.numKeys() == BytesToBytesMap.MAX_CAPACITY / 2);
    
          map.free();
        } finally {
          map.free();
        }
      }
    }
    ```
    
    Closes apache#29744 from ankurdave/SPARK-32872.
    
    Authored-by: Ankur Dave <ankurdave@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 72550c3)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    ankurdave authored and dongjoon-hyun committed Sep 14, 2020
    Commit 990d49a

Commits on Sep 15, 2020

  1. [SPARK-32715][CORE] Fix memory leak when failed to store pieces of broadcast
    
    ### What changes were proposed in this pull request?
    In TorrentBroadcast.scala
    ```scala
    L133: if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false))
    L137: TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
    L147: if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true))
    ```
    If the original value is saved successfully (TorrentBroadcast.scala: L133) but the subsequent `blockifyObject()` (L137) or store-piece (L147) steps fail, there is no opportunity to release the broadcast from memory.

    This patch removes all pieces of the broadcast when it fails to blockify the value or fails to store some pieces of a broadcast.
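
    A hedged sketch of the cleanup pattern, with `storePieces` and `removeAllPieces` as hypothetical stand-ins for the real block-manager calls:
    ```scala
    // Hypothetical helpers standing in for blockifyObject/putBytes and for the
    // cleanup this patch adds; the real code operates on the BlockManager.
    def storePieces(value: AnyRef): Unit = ???
    def removeAllPieces(broadcastId: Long): Unit = ???

    def writeBlocks(broadcastId: Long, value: AnyRef): Unit = {
      try {
        storePieces(value)
      } catch {
        case e: Throwable =>
          // If blockifying or storing any piece fails, drop whatever pieces
          // were already stored so the broadcast does not leak memory.
          removeAllPieces(broadcastId)
          throw e
      }
    }
    ```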
    
    ### Why are the changes needed?
    We use Spark thrift-server as a long-running service. A bad query submitted a heavy BroadcastNestedLoopJoin operation and made the driver full GC. We killed the bad query, but we found the driver's memory usage was still high and full GCs were still frequent. By investigating a GC dump and the logs, we found that the broadcast may leak memory.
    
    > 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure)
    2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
    116G->112G(170G), 184.9121920 secs]
    [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
    1: 676531691 72035438432 [B
    2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
    3: 99551 12018117568 [Ljava.lang.Object;
    4: 26570 4349629040 [I
    5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
    6: 1708819 256299456 [C
    7: 2338 179615208 [J
    8: 1703669 54517408 java.lang.String
    9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
    10: 177396 25545024 java.net.URI
    ...
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manually tested. This UT is hard to write and the patch is straightforward.
    
    Closes apache#29558 from LantaoJin/SPARK-32715.
    
    Authored-by: LantaoJin <jinlantao@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 7a9b066)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LantaoJin authored and dongjoon-hyun committed Sep 15, 2020
    Commit fe6ff15

Commits on Sep 16, 2020

  1. [SPARK-32688][SQL][TEST] Add special values to LiteralGenerator for float and double
    
    ### What changes were proposed in this pull request?
    
    The `LiteralGenerator` for float and double datatypes was supposed to yield special values (NaN, +-inf) among others, but the `Gen.chooseNum` method does not yield values that are outside the defined range. `Gen.chooseNum` for a wide range of floats and doubles also does not yield values in the "everyday" range, as stated in typelevel/scalacheck#113.

    There is a similar class, `RandomDataGenerator`, that is used in some other tests. Added `-0.0` and `-0.0f` as special values there too.
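
    A sketch of the generator shape being described, mixing explicit special values into `Gen.chooseNum` with ScalaCheck (names are illustrative, not the exact `LiteralGenerator` code):
    ```scala
    import org.scalacheck.Gen

    // Gen.chooseNum alone does not produce these boundary values, so list
    // them explicitly and mix them in.
    val specialDoubles: Gen[Double] = Gen.oneOf(
      Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity,
      Double.MaxValue, Double.MinValue, 0.0, -0.0)

    val anyDouble: Gen[Double] = Gen.frequency(
      9 -> Gen.chooseNum(-1e9, 1e9),  // "everyday" range
      1 -> specialDoubles)            // NaN, +-inf, +-0.0, extremes
    ```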
    
    These changes revealed an inconsistency with the equality check between `-0.0` and `0.0`.
    
    ### Why are the changes needed?
    
    The `LiteralGenerator` is mostly used in the `checkConsistencyBetweenInterpretedAndCodegen` method in `MathExpressionsSuite`. This change would have caught the bug fixed in apache#29495 .
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Locally reverted apache#29495 and verified that the existing test cases caught the bug.
    
    Closes apache#29515 from tanelk/SPARK-32688.
    
    Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 6051755)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    tanelk authored and maropu committed Sep 16, 2020
    Commit cb6a0d0
  2. [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
    
    ### What changes were proposed in this pull request?
    
    This proposes to enhance the user documentation of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API removes all lines that are identical to the header.
    
    ### Why are the changes needed?
    
    This behavior can confuse users. We should explicitly document it.
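
    For reference, a small usage sketch of the API being documented, assuming an active `spark` session:
    ```scala
    import spark.implicits._

    // A Dataset[String] of CSV rows in which the header line repeats.
    val csvRows = Seq("name,age", "alice,29", "name,age", "bob,31").toDS()

    // With header=true, every line equal to the header is removed, not just
    // the first one -- the behavior the documentation now calls out.
    spark.read.option("header", "true").csv(csvRows).show()
    // Expected: only the alice and bob rows remain.
    ```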
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Only doc change.
    
    ### How was this patch tested?
    
    Only doc change.
    
    Closes apache#29765 from viirya/SPARK-32888.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 550c1c9)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Sep 16, 2020
    Commit 75a225e
  3. [SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate
    
    ### What changes were proposed in this pull request?
    
    In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below:
    
    ```python
    import warnings
    from pyspark.sql import SparkSession, SQLContext
    warnings.simplefilter('always', DeprecationWarning)
    spark.stop()
    SparkSession.builder.getOrCreate()
    ```
    
    it shows the deprecation warning as below:
    
    ```
    /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
      DeprecationWarning)
    ```
    
    via https://github.com/apache/spark/blob/d3304268d3046116d39ec3d54a8e319dce188f36/python/pyspark/sql/session.py#L222
    
    We shouldn't print the deprecation warning from it. This is the only place ^.
    
    ### Why are the changes needed?
    
    To avoid mistakenly informing users that `SparkSession.builder.getOrCreate` is deprecated.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`.
    
    ### How was this patch tested?
    
    Manually tested as above.
    
    Closes apache#29768 from HyukjinKwon/SPARK-32897.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
    (cherry picked from commit 657e39a)
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
    HyukjinKwon authored and ueshin committed Sep 16, 2020
    Commit aa9563e

Commits on Sep 17, 2020

  1. [SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls
    
    ### What changes were proposed in this pull request?
    
    This PR changes the way `UnsafeExternalSorter.SpillableIterator` checks whether it has spilled already, by checking whether `inMemSorter` is null. It also allows it to spill other `UnsafeSorterIterator`s than `UnsafeInMemorySorter.SortedIterator`.
    
    ### Why are the changes needed?
    
    Before this PR `UnsafeExternalSorter.SpillableIterator` could not spill when there are NULLs in the input and radix sorting is used. Currently, Spark determines whether UnsafeExternalSorter.SpillableIterator has not spilled yet by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`. When radix sorting is used and there are NULLs in the input however, `upstream` will be an instance of `UnsafeExternalSorter.ChainedIterator` instead, and Spark will assume that the `SpillableIterator` iterator has spilled already, and therefore cannot spill again when it's supposed to spill.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    A test was added to `UnsafeExternalSorterSuite` (and therefore also to `UnsafeExternalSorterRadixSortSuite`). I manually confirmed that the test failed in `UnsafeExternalSorterRadixSortSuite` without this patch.
    
    Closes apache#29772 from tomvanbussel/SPARK-32900.
    
    Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
    Signed-off-by: herman <herman@databricks.com>
    (cherry picked from commit e5e54a3)
    Signed-off-by: herman <herman@databricks.com>
    tomvanbussel authored and hvanhovell committed Sep 17, 2020
    Commit 2e94d9a
  2. [SPARK-32887][DOC] Correct the typo for SHOW TABLE

    ### What changes were proposed in this pull request?
    Correct the typo in Show Table document
    
    ### Why are the changes needed?
    The current documentation of SHOW TABLE results in a parse error, so it is misleading to users
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the SHOW TABLE documentation is corrected now
    
    ### How was this patch tested?
    NA
    
    Closes apache#29758 from Udbhav30/showtable.
    
    Authored-by: Udbhav30 <u.agrawal30@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 88e87bc)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Udbhav30 authored and dongjoon-hyun committed Sep 17, 2020
    Commit b3b6f38
  3. [SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`
    
    This is a backport for [pr#29580](apache#29580) to branch 3.0.
    
    ### What changes were proposed in this pull request?
    
    Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.
    
    This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints.
    
    To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.
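
    A minimal sketch of the intended control flow, with the inbox internals reduced to a counter (illustrative only, not the real Inbox code):
    ```scala
    final class ToyInbox {
      private var numActiveThreads = 0

      def process(handle: () => Unit): Unit = {
        synchronized { numActiveThreads += 1 }
        try {
          handle()
        } finally {
          // The fix: decrement even when handle() throws a fatal error, so
          // other threads can still process messages from this inbox.
          synchronized { numActiveThreads -= 1 }
        }
      }

      def activeThreads: Int = synchronized(numActiveThreads)
    }
    ```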
    
    ### Why are the changes needed?
    
    `numActiveThreads` is not correct when fatal error happens and will cause the described problem.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Add a new test.
    
    Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.
    
    Authored-by: Zhenhua Wang <wzh_zju@163.com>
    Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
    wzhfy authored and Mridul Muralidharan committed Sep 17, 2020
    Commit 17a5195
  4. [SPARK-32635][SQL] Fix foldable propagation

    ### What changes were proposed in this pull request?
    This PR rewrites `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.
    
    Before this PR, in the case of this example (with the setting `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation`):
    ```scala
    val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
    val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
    val aub = a.union(b)
    val c = aub.filter($"col1" === "2").cache()
    val d = Seq("2").toDF( "col4")
    val r = d.join(aub, $"col2" === $"col4").select("col4")
    val l = c.select("col2")
    val df = l.join(r, $"col2" === $"col4", "LeftOuter")
    df.show()
    ```
    foldable propagation happens incorrectly:
    ```
     Join LeftOuter, (col2#6 = col4#34)                                                              Join LeftOuter, (col2#6 = col4#34)
    !:- Project [col2#6]                                                                             :- Project [1 AS col2#6]
     :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)   :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
     :        +- Union                                                                               :        +- Union
     :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]                                    :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
     :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))                            :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
     :           :     +- *(1) LocalTableScan [value#1]                                              :           :     +- *(1) LocalTableScan [value#1]
     :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]                                 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
     :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))                          :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
     :                 +- *(2) LocalTableScan [value#10]                                             :                 +- *(2) LocalTableScan [value#10]
     +- Project [col4#34]                                                                            +- Project [col4#34]
        +- Join Inner, (col2#6 = col4#34)                                                               +- Join Inner, (col2#6 = col4#34)
           :- Project [value#31 AS col4#34]                                                                :- Project [value#31 AS col4#34]
           :  +- LocalRelation [value#31]                                                                  :  +- LocalRelation [value#31]
           +- Project [col2#6]                                                                             +- Project [col2#6]
              +- Union false, false                                                                           +- Union false, false
                 :- Project [1 AS col2#6]                                                                        :- Project [1 AS col2#6]
                 :  +- LocalRelation [value#1]                                                                   :  +- LocalRelation [value#1]
                 +- Project [2 AS col2#15]                                                                       +- Project [2 AS col2#15]
                    +- LocalRelation [value#10]                                                                     +- LocalRelation [value#10]
    
    ```
    and so the result is wrong:
    ```
    +----+----+
    |col2|col4|
    +----+----+
    |   1|null|
    +----+----+
    ```
    
    After this PR foldable propagation will not happen incorrectly and the result is correct:
    ```
    +----+----+
    |col2|col4|
    +----+----+
    |   2|   2|
    +----+----+
    ```
    
    ### Why are the changes needed?
    To fix a correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, fixes a correctness issue.
    
    ### How was this patch tested?
    Existing and new UTs.
    
    Closes apache#29771 from peter-toth/SPARK-32635-fix-foldable-propagation.
    
    Authored-by: Peter Toth <peter.toth@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 4ced588)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    peter-toth authored and maropu committed Sep 17, 2020
    Commit ecc2f5d

Commits on Sep 18, 2020

  1. [SPARK-32908][SQL] Fix target error calculation in percentile_approx()

    ### What changes were proposed in this pull request?
    1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf), which states that the error is `e = max(gi, deltai)/2` (see page 59). There is also a clear explanation in [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
    2. Added a test to check different accuracies.
    3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.
    
    ### Why are the changes needed?
    To fix incorrect percentile calculation, see an example in SPARK-32908.
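
    For context, a small usage sketch showing the accuracy parameter whose error bound the fix recalculates, assuming an active `spark` session:
    ```scala
    import spark.implicits._

    val df = (1 to 1000).toDF("value")

    // percentile_approx(col, percentage, accuracy): a larger accuracy gives a
    // smaller target error, now computed as e = max(g_i, delta_i) / 2.
    df.selectExpr(
      "percentile_approx(value, 0.5, 100)   AS p50_low_accuracy",
      "percentile_approx(value, 0.5, 10000) AS p50_high_accuracy"
    ).show()
    ```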
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    - By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
    - Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.
    
    Closes apache#29784 from MaxGekk/fix-percentile_approx-2.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 75dd864)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 18, 2020
    Commit 5581a92
  2. [SPARK-32906][SQL] Struct field names should not change after normalizing floats
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix a minor bug when normalizing floats for struct types;
    ```
    scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
    scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
    scala> val agg = df.distinct()
    scala> agg.explain()
    == Physical Plan ==
    *(2) HashAggregate(keys=[k#40], functions=[])
    +- Exchange hashpartitioning(k#40, 200), true, [id=apache#62]
       +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1)))) AS k#40], functions=[])
          +- *(1) LocalTableScan [k#40]
    
    scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: HashAggregateExec => a.output.head }
    scala> aggOutput.foreach { attr => println(attr.prettyJson) }
    ### Final Aggregate ###
    [ {
      "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
      "num-children" : 0,
      "name" : "k",
      "dataType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "_1",
                    ^^^
          "type" : "double",
          "nullable" : false,
          "metadata" : { }
        } ]
      },
      "nullable" : true,
      "metadata" : { },
      "exprId" : {
        "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
        "id" : 40,
        "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
      },
      "qualifier" : [ ]
    } ]
    
    ### Partial Aggregate ###
    [ {
      "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
      "num-children" : 0,
      "name" : "k",
      "dataType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "col1",
                    ^^^^
          "type" : "double",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "nullable" : true,
      "metadata" : { },
      "exprId" : {
        "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
        "id" : 40,
        "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
      },
      "qualifier" : [ ]
    } ]
    ```
    
    ### Why are the changes needed?
    
    bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests.
    
    Closes apache#29780 from maropu/FixBugInNormalizedFloatingNumbers.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    (cherry picked from commit b49aaa3)
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    maropu authored and viirya committed Sep 18, 2020
    Commit 2d55de5
  3. [SPARK-32905][CORE][YARN] ApplicationMaster fails to receive UpdateDelegationTokens message
    
    ### What changes were proposed in this pull request?
    
    With a long-running application in kerberized mode, the AMEndpoint handles the `UpdateDelegationTokens` message incorrectly: it is a one-way message that should be handled in the `receive` function.
    
    ```java
    20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
    20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
    org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
    	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
    	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
    org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
    	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
    	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    ```
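
    A hedged sketch of the distinction: one-way messages (sent with `send()`) arrive at `receive`, while ask-style messages arrive at `receiveAndReply`. The types below are toy stand-ins, not the real Spark RPC classes:
    ```scala
    // Toy model of the RPC endpoint contract.
    case class TokensUpdated(tokens: Array[Byte])

    trait ToyEndpoint {
      def receive: PartialFunction[Any, Unit]
      def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit]
    }

    class SketchAMEndpoint extends ToyEndpoint {
      // One-way messages must be matched here; before the fix the token-update
      // case was only handled on the ask path and never matched, producing the
      // "does not implement 'receive'" error shown above.
      override def receive: PartialFunction[Any, Unit] = {
        case TokensUpdated(_) => () // update delegation tokens here
      }

      // Ask-style messages are matched here and must reply.
      override def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit] = {
        case msg => reply(s"unexpected ask: $msg")
      }
    }
    ```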
    
    ### Why are the changes needed?
    
    Bugfix: without a proper token refresher, long-running apps can potentially fail in a kerberized cluster.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    Passing jenkins
    
    and verify manually
    
    I am running the sub-module `kyuubi-spark-sql-engine` of https://github.com/yaooqinn/kyuubi
    
    The simplest way to reproduce the bug and verify this fix is to follow these steps
    
    #### 1 build the `kyuubi-spark-sql-engine` module
    ```
    mvn clean package -pl :kyuubi-spark-sql-engine
    ```
    #### 2. config the spark with Kerberos settings towards your secured cluster
    
    #### 3. start it in the background
    ```
    nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &
    ```
    
    #### 4. check the AM log and see
    
    "Updating delegation tokens ..." for SUCCESS
    
    "Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE
    
    Closes apache#29777 from yaooqinn/SPARK-32905.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 9e9d4b6)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    yaooqinn authored and cloud-fan committed Sep 18, 2020
    Commit ffcd757
  4. [SPARK-32930][CORE] Replace deprecated isFile/isDirectory methods

    ### What changes were proposed in this pull request?
    
    This PR aims to replace deprecated `isFile` and `isDirectory` methods.
    
    ```diff
    - fs.isDirectory(hadoopPath)
    + fs.getFileStatus(hadoopPath).isDirectory
    ```
    
    ```diff
    - fs.isFile(new Path(inProgressLog))
    + fs.getFileStatus(new Path(inProgressLog)).isFile
    ```
    
    ### Why are the changes needed?
    
    It shows deprecation warnings.
    
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/1244/consoleFull
    
    ```
    [warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:815: method isFile in class FileSystem is deprecated: see corresponding Javadoc for more information.
    [warn]             if (!fs.isFile(new Path(inProgressLog))) {
    ```
    
    ```
    [warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/SparkContext.scala:1884: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information.
    [warn]           if (fs.isDirectory(hadoopPath)) {
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the Jenkins.
    
    Closes apache#29796 from williamhyun/filesystem.
    
    Authored-by: William Hyun <williamhyun3@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 7892887)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    williamhyun authored and HyukjinKwon committed Sep 18, 2020
    Commit 20cd7bb
  5. [SPARK-32635][SQL][FOLLOW-UP] Add a new test case in catalyst module

    ### What changes were proposed in this pull request?
    This is a follow-up PR to apache#29771 and just adds a new test case.
    
    ### Why are the changes needed?
    To have better test coverage.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    New UT.
    
    Closes apache#29802 from peter-toth/SPARK-32635-fix-foldable-propagation-followup.
    
    Authored-by: Peter Toth <peter.toth@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3309a2b)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    peter-toth authored and dongjoon-hyun committed Sep 18, 2020
    Commit 7746c20
  6. [SPARK-32898][CORE] Fix wrong executorRunTime when task killed before…

    … real start
    
    ### What changes were proposed in this pull request?
    
    Only calculate the executorRunTime when taskStartTimeNs > 0. Otherwise, set executorRunTime to 0.
    
    ### Why are the changes needed?
    
    bug fix.
    
    It's possible for a task to be killed (e.g., by another successful attempt) before it reaches `taskStartTimeNs = System.nanoTime()`. In this case, `taskStartTimeNs` is still 0 since it has never really been initialized, and we would get a wrong executorRunTime by calculating `System.nanoTime() - taskStartTimeNs`.
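
    For illustration, a minimal sketch of the guard (hypothetical variable names, not the actual `Executor` code):

    ```scala
    // Report 0 run time when the task never actually started.
    val taskStartTimeNs: Long = 0L  // stays 0 if the task was killed before it really started
    val executorRunTimeNs: Long =
      if (taskStartTimeNs > 0) System.nanoTime() - taskStartTimeNs else 0L
    ```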
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users will see the correct executorRunTime.
    
    ### How was this patch tested?
    
    Pass existing tests.
    
    Closes apache#29789 from Ngone51/fix-SPARK-32898.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f1dc479)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Ngone51 authored and dongjoon-hyun committed Sep 18, 2020
    Commit 03fb144

Commits on Sep 21, 2020

  1. [SPARK-32886][WEBUI] fix 'undefined' link in event timeline view

    ### What changes were proposed in this pull request?
    
    Fix ".../jobs/undefined" link from "Event Timeline" in jobs page. Job page link in "Event Timeline" view is constructed by fetching job page link defined in job list below. when job count exceeds page size of job table, only links of jobs in job table can be fetched from page. Other jobs' link would be 'undefined', and links of them in "Event Timeline" are broken, they are redirected to some wired URL like ".../jobs/undefined". This PR is fixing this wrong link issue. With this PR, job link in "Event Timeline" view would always redirect to correct job page.
    
    ### Why are the changes needed?
    
    Wrong link (".../jobs/undefined") in "Event Timeline" of jobs page. for example, the first job in below page is not in table below, as job count(116) exceeds page size(100). When clicking it's item in "Event Timeline", page is redirected to ".../jobs/undefined", which is wrong. Links in "Event Timeline" should always be correct.
    ![undefinedlink](https://user-images.githubusercontent.com/10524738/93184779-83fa6d80-f6f1-11ea-8a80-1a304ca9cbb2.JPG)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually tested.
    
    Closes apache#29757 from zhli1142015/fix-link-event-timeline-view.
    
    Authored-by: Zhen Li <zhli@microsoft.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit d01594e)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    zhli1142015 authored and srowen committed Sep 21, 2020
    Commit 0a4b668
  2. [SPARK-32718][SQL][3.0] Remove unnecessary keywords for interval units

    Backport apache#29560 to 3.0, as it's kind of a bug fix for the ANSI mode. People can't use `year`,  `month`, etc. functions under ANSI mode.
    
    Closes apache#29823 from cloud-fan/backport.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and dongjoon-hyun committed Sep 21, 2020
    Commit b27bbbb

Commits on Sep 22, 2020

  1. [SPARK-32659][SQL][FOLLOWUP][3.0] Broadcast Array instead of Set in I…

    …nSubqueryExec
    
    ### What changes were proposed in this pull request?
    
    This is a followup of apache#29475.
    
    This PR updates the code to broadcast the Array instead of Set, which was the behavior before apache#29475
    
    ### Why are the changes needed?
    
    The size of a Set can be much bigger than that of an Array. It's safer to keep the behavior the same as before and build the set on the executor side.
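
    A minimal sketch of the idea (a toy RDD pipeline in a spark-shell style session, not the actual `InSubqueryExec` code):

    ```scala
    // Assumes `spark` is a live SparkSession, as in spark-shell.
    val sc = spark.sparkContext
    val values: Array[Int] = Array(1, 2, 3)        // compact form shipped to executors
    val broadcastValues = sc.broadcast(values)     // broadcast the Array, not a Set
    val filtered = sc.parallelize(Seq(1, 4, 3, 7)).mapPartitions { iter =>
      val set = broadcastValues.value.toSet        // the Set is built on the executor side
      iter.filter(set.contains)
    }
    filtered.collect()                             // Array(1, 3)
    ```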
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    existing tests
    
    Closes apache#29840 from cloud-fan/backport.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and dongjoon-hyun committed Sep 22, 2020
    Commit 8a481d8

Commits on Sep 23, 2020

  1. [MINOR][SQL][3.0] Improve examples for percentile_approx()

    ### What changes were proposed in this pull request?
    In the PR, I propose to replace current examples for `percentile_approx()` with **only one** input value by example **with multiple values** in the input column.
    
    ### Why are the changes needed?
    Current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.
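
    For example, an illustrative query over multiple input values (assuming an active SparkSession `spark`; not necessarily the exact example added by this PR):

    ```scala
    // Returns an element that approximates the median of the four input values.
    spark.sql(
      "SELECT percentile_approx(col, 0.5) FROM VALUES (0), (1), (2), (10) AS tab(col)"
    ).show()
    ```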
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    - by running `ExpressionInfoSuite`
    - `./dev/scalastyle`
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit b53da23)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29848 from MaxGekk/example-percentile_approx-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 23, 2020
    Commit 58124bd
  2. [SPARK-32306][SQL][DOCS][3.0] Clarify the result of `percentile_appro…

    …x()`
    
    ### What changes were proposed in this pull request?
    More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
    
    ### Why are the changes needed?
    To improve Spark docs and avoid misunderstanding of the function behavior.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    `./dev/scalastyle`
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    (cherry picked from commit 7c14f17)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29845 from MaxGekk/doc-percentile_approx-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 23, 2020
    Commit 542dc97

Commits on Sep 24, 2020

  1. [SPARK-32977][SQL][DOCS] Fix JavaDoc on Default Save Mode

    ### What changes were proposed in this pull request?
    
    The default save mode is always `ErrorIfExists`, regardless of the DataSource version. Fixing the JavaDoc to reflect this.
    
    ### Why are the changes needed?
    
    To fix documentation
    
    ### Does this PR introduce _any_ user-facing change?
    
    Doc change.
    
    ### How was this patch tested?
    
    Manual.
    
    Closes apache#29853 from RussellSpitzer/SPARK-32977.
    
    Authored-by: Russell Spitzer <russell.spitzer@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit b3f0087)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    RussellSpitzer authored and dongjoon-hyun committed Sep 24, 2020
    Commit 21b6b69

Commits on Sep 25, 2020

  1. [SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type

    ### What changes were proposed in this pull request?
    
    Add test to cover Hive UDF whose input contains complex decimal type.
    Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`.
    
    ### Why are the changes needed?
    
    For better test coverage of the Hive behaviors that we are (or are not) compatible with.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes apache#29863 from ulysses-you/SPARK-32877-test.
    
    Authored-by: ulysses <youxiduo@weidian.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f2fc966)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ulysses-you authored and dongjoon-hyun committed Sep 25, 2020
    Commit 4b84e57

Commits on Sep 26, 2020

  1. [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed…

    … class name in TreeNode
    
    ### What changes were proposed in this pull request?
    
    Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.
    
    ### Why are the changes needed?
    
    On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw an `java.lang.InternalError: Malformed class name` error.
    
    Similar to apache#29050, we should use  Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
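
    A hedged sketch of the defensive pattern (the real `Utils.getSimpleName` may differ in details):

    ```scala
    // Fall back to the fully-qualified name when getSimpleName throws on some JDK8u builds.
    def safeSimpleName(cls: Class[_]): String =
      try cls.getSimpleName
      catch {
        case _: InternalError => cls.getName  // "Malformed class name" on nested Scala classes
      }
    ```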
    
    ### Does this PR introduce _any_ user-facing change?
    
    Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes.
    
    ### How was this patch tested?
    
    Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.
    
    Manually tested on JDK8u and JDK11u and observed expected behavior:
    - JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
    - JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.
    
    Closes apache#29875 from rednaxelafx/spark-32999-getsimplename.
    
    Authored-by: Kris Mok <kris.mok@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 9a155d4)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    rednaxelafx authored and dongjoon-hyun committed Sep 26, 2020
    Commit 4425c3a

Commits on Sep 29, 2020

  1. [SPARK-33015][SQL] Compute the current date only once

    ### What changes were proposed in this pull request?
    Compute the current date at the specified time zone using timestamp taken at the start of query evaluation.
    
    ### Why are the changes needed?
    According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation but it can be computed multiple times. As a consequence of that, the function can return different values if the query is executed at the border of two dates.
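
    An illustration of the expected behaviour (assuming an active SparkSession `spark`; all occurrences within one query should yield the same value):

    ```scala
    // Both columns must contain the same date, even if the query runs across midnight,
    // because current_date() is evaluated once at the start of query evaluation.
    spark.sql("SELECT current_date() AS d1, current_date() AS d2").show()
    ```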
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.
    
    Closes apache#29889 from MaxGekk/fix-current_date.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 68cd567)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Sep 29, 2020
    Commit 424f16e
  2. [MINOR][DOCS] Document when current_date and current_timestamp ar…

    …e evaluated
    
    ### What changes were proposed in this pull request?
    Explicitly document that `current_date` and `current_timestamp` are evaluated at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
    
    ### Why are the changes needed?
    Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
    https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L71-L91
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    by running `./dev/scalastyle`.
    
    Closes apache#29892 from MaxGekk/doc-current_date.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 1b60ff5)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Sep 29, 2020
    Commit 118de10
  3. [SPARK-33021][PYTHON][TESTS] Move functions related test cases into t…

    …est_functions.py
    
    Move functions-related test cases from `test_context.py` to `test_functions.py`.
    
    To group the similar test cases.
    
    Nope, test-only.
    
    Jenkins and GitHub Actions should test.
    
    Closes apache#29898 from HyukjinKwon/SPARK-33021.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon committed Sep 29, 2020
    Commit 97d8634
  4. [SPARK-33015][SQL][FOLLOWUP][3.0] Use millisToDays() in the ComputeCu…

    …rrentTime rule
    
    ### What changes were proposed in this pull request?
    Use `millisToDays()` instead of `microsToDays()` because the former one is not available in `branch-3.0`.
    
    ### Why are the changes needed?
    To fix the build failure:
    ```
    [ERROR] [Error] /home/jenkins/workspace/spark-branch-3.0-maven-snapshots/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala:85: value microsToDays is not a member of object org.apache.spark.sql.catalyst.util.DateTimeUtils
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./build/sbt clean package` and `ComputeCurrentTimeSuite`.
    
    Closes apache#29901 from MaxGekk/fix-current_date-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 29, 2020
    Commit 2160dc5
  5. [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExter…

    …nalSorter
    
    ### What changes were proposed in this pull request?
    
    This PR changes `UnsafeExternalSorter` to no longer allocate any memory while spilling. In particular it removes the allocation of a new pointer array in `UnsafeInMemorySorter`. Instead the new pointer array is allocated whenever the next record is inserted into the sorter.
    
    ### Why are the changes needed?
    
    Without this change the `UnsafeExternalSorter` could throw an OOM while spilling. The following sequence of events would have triggered an OOM:
    
    1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
    2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
    3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
    4. `UnsafeInMemorySorter` frees the old pointer array, and tries to allocate a new small pointer array.
    5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array.
    6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill.
    7. `UnsafeInMemorySorter` receives less memory than it requested, and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail.
    
    With the changes in the PR the following will happen instead:
    
    1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
    2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
    3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
    4. `UnsafeInMemorySorter` frees the old pointer array.
    5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the new large array or by throwing a `SparkOutOfMemoryError`).
    6. `UnsafeExternalSorter` either frees the new large array or it ignores the `SparkOutOfMemoryError` depending on what happened in the previous step.
    7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Tests were added in `UnsafeExternalSorterSuite` and `UnsafeInMemorySorterSuite`.
    
    Closes apache#29785 from tomvanbussel/SPARK-32901.
    
    Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
    Signed-off-by: herman <herman@databricks.com>
    (cherry picked from commit f167002)
    Signed-off-by: herman <herman@databricks.com>
    tomvanbussel authored and hvanhovell committed Sep 29, 2020
    Commit d3cc564
  6. [MINOR][DOCS] Fixing log message for better clarity

    Fixing log message for better clarity.
    
    Closes apache#29870 from akshatb1/master.
    
    Lead-authored-by: Akshat Bordia <akshat.bordia31@gmail.com>
    Co-authored-by: Akshat Bordia <akshat.bordia@citrix.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 7766fd1)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    2 people authored and srowen committed Sep 29, 2020
    Commit 39bfae2
  7. [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes

    ### What changes were proposed in this pull request?
    
    This pr fix estimate statistics issue if child has 0 bytes.
    
    ### Why are the changes needed?
    The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This will generate an incorrect BroadcastJoin, resulting in a driver OOM. For example:
    ![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
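
    For reference, the configuration combination under which the zero-byte estimate described above can surface:

    ```scala
    // The settings quoted above, set programmatically on an existing SparkSession.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.planStats.enabled", "true")
    ```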
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual test.
    
    Closes apache#29894 from wangyum/SPARK-33018.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 711d8dd)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangyum authored and cloud-fan committed Sep 29, 2020
    Commit ae8b35a
  8. [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.al…

    …gorithm.version=1 by default
    
    ### What changes were proposed in this pull request?
    
    Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of only documenting a warning, this PR aims to use a consistent and safer version of the Apache Hadoop file output committer algorithm, which is `v1`. This will prevent a silent correctness regression during migration from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. Of course, if there is a user-provided configuration, `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will still be used.
    
    ### Why are the changes needed?
    
    Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from `v1` to `v2` and now there exists a discussion to remove `v2`. We had better provide a consistent default behavior of `v1` across various Spark distributions.
    
    - [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This changes the default behavior. Users can override this conf.
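
    A hedged example of how a user could opt back into the previous behaviour (the setting name is as quoted above):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Explicitly request the v2 committer if its performance trade-off is preferred.
    val spark = SparkSession.builder()
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()
    ```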
    
    ### How was this patch tested?
    
    Manual.
    
    **BEFORE (spark-3.0.1-bin-hadoop3.2)**
    ```scala
    scala> sc.version
    res0: String = 3.0.1
    
    scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
    res1: String = 2
    ```
    
    **AFTER**
    ```scala
    scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
    res0: String = 1
    ```
    
    Closes apache#29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit cc06266)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Sep 29, 2020
    Commit f3b80f8

Commits on Sep 30, 2020

  1. [SPARK-31753][SQL][DOCS][FOLLOW-UP] Add missing keywords in the SQL docs

    ### What changes were proposed in this pull request?
    Update the sql-ref docs; the following keywords will be added in this PR (an illustrative DDL follows the list).
    
    CLUSTERED BY
    SORTED BY
    INTO num_buckets BUCKETS
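
    An illustrative DDL using these keywords (hypothetical table and column names, not taken from the docs change itself; assumes an active SparkSession `spark`):

    ```scala
    spark.sql("""
      CREATE TABLE students (id INT, name STRING)
      USING parquet
      CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS
    """)
    ```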
    
    ### Why are the changes needed?
    Let more users know how to use these SQL keywords.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    ![image](https://user-images.githubusercontent.com/46367746/94428281-0a6b8080-01c3-11eb-9ff3-899f8da602ca.png)
    ![image](https://user-images.githubusercontent.com/46367746/94428285-0d667100-01c3-11eb-8a54-90e7641d917b.png)
    ![image](https://user-images.githubusercontent.com/46367746/94428288-0f303480-01c3-11eb-9e1d-023538aa6e2d.png)
    
    ### How was this patch tested?
    generate html test
    
    Closes apache#29883 from GuoPhilipse/add-sql-missing-keywords.
    
    Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
    Co-authored-by: GuoPhilipse <guofei_ok@126.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 3bdbb55)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    2 people authored and maropu committed Sep 30, 2020
    Commit db6ba04

Commits on Oct 1, 2020

  1. [SQL][DOC][MINOR] Corrects input table names in the examples of CREAT…

    …E FUNCTION doc
    
    ### What changes were proposed in this pull request?
    Fix Typo
    
    ### Why are the changes needed?
    To maintain consistency: the correct table name should be used in the SELECT command.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Now the CREATE FUNCTION doc shows the correct table name.
    
    ### How was this patch tested?
    Manually. Doc changes.
    
    Closes apache#29920 from iRakson/fixTypo.
    
    Authored-by: iRakson <raksonrakesh@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit d3dbe1a)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    iRakson authored and maropu committed Oct 1, 2020
    Commit bc29602

Commits on Oct 2, 2020

  1. [SPARK-32996][WEB-UI][3.0] Handle empty ExecutorMetrics in ExecutorMe…

    …tricsJsonSerializer
    
    ### What changes were proposed in this pull request?
    This is a backport PR for branch-3.0. This change was raised to the `master` branch in apache#29872.
    
    When `peakMemoryMetrics` in `ExecutorSummary` is `Option.empty`, the `ExecutorMetricsJsonSerializer#serialize` method does not execute the `jsonGenerator.writeObject` method. This causes the JSON to be generated with the `peakMemoryMetrics` key added to the serialized string, but no corresponding value.
    This causes an error to be thrown when it is the next key `attributes`' turn to be added to the JSON:
    `com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value`
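
    A hedged sketch of the kind of fix, using a generic custom Jackson serializer (illustrative only, not Spark's actual `ExecutorMetricsJsonSerializer`):

    ```scala
    import com.fasterxml.jackson.core.JsonGenerator
    import com.fasterxml.jackson.databind.{JsonSerializer, SerializerProvider}

    // Always emit a value (null for None) so the generator never ends up with a
    // field name that has no matching value.
    class OptionSerializer extends JsonSerializer[Option[AnyRef]] {
      override def serialize(value: Option[AnyRef], gen: JsonGenerator,
          serializers: SerializerProvider): Unit = value match {
        case Some(v) => gen.writeObject(v)
        case None    => gen.writeNull()
      }
    }
    ```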
    
    ### Why are the changes needed?
    At the start of the Spark job, if `peakMemoryMetrics` is `Option.empty`, then it causes
    a `com.fasterxml.jackson.core.JsonGenerationException` to be thrown when we navigate to the Executors tab in Spark UI.
    Complete stacktrace:
    
    > com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
    > 	at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2080)
    > 	at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:161)
    > 	at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:725)
    > 	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:721)
    > 	at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:166)
    > 	at com.fasterxml.jackson.databind.ser.std.CollectionSerializer.serializeContents(CollectionSerializer.java:145)
    > 	at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:26)
    > 	at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents$(IterableSerializerModule.scala:25)
    > 	at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
    > 	at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
    > 	at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:250)
    > 	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
    > 	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
    > 	at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:4094)
    > 	at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3404)
    > 	at org.apache.spark.ui.exec.ExecutorsPage.allExecutorsDataScript$1(ExecutorsTab.scala:64)
    > 	at org.apache.spark.ui.exec.ExecutorsPage.render(ExecutorsTab.scala:76)
    > 	at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
    > 	at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
    > 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
    > 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    > 	at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
    > 	at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
    > 	at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
    > 	at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    > 	at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    > 	at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    > 	at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    > 	at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    > 	at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
    > 	at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    > 	at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    > 	at org.sparkproject.jetty.server.Server.handle(Server.java:505)
    > 	at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    > 	at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    > 	at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    > 	at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103)
    > 	at org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    > 	at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    > 	at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
    > 	at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
    > 	at java.base/java.lang.Thread.run(Thread.java:834)
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test
    
    Closes apache#29914 from shrutig/SPARK-32996-3.0.
    
    Authored-by: Shruti Gumma <shruti_gumma@apple.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    shrutig authored and viirya committed Oct 2, 2020
    Commit 41e1919
  2. [SPARK-33051][INFRA][R] Uses setup-r to install R in GitHub Actions b…

    …uild
    
    ### What changes were proposed in this pull request?
    
    In SPARK-32493, the R installation was switched to manual installation because setup-r was broken. This seems to be fixed upstream, so we had better switch it back.
    
    ### Why are the changes needed?
    
    To avoid maintaining the installation steps ourselves.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions build in this PR should test it.
    
    Closes apache#29931 from HyukjinKwon/recover-r-build.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit b205be5)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 2, 2020
    Commit 31684d6

Commits on Oct 3, 2020

  1. [SPARK-33043][ML] Handle spark.driver.maxResultSize=0 in RowMatrix he…

    …uristic computation
    
    ### What changes were proposed in this pull request?
    
    RowMatrix contains a computation based on spark.driver.maxResultSize. However, when this value is set to 0, the computation fails (log of 0). The fix is simply to correctly handle this setting, which means unlimited result size, by using a tree depth of 1 in the RowMatrix method.
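
    A hedged sketch of the guard (an illustrative heuristic, not the exact RowMatrix formula):

    ```scala
    // Treat maxResultSize == 0 (unlimited) specially instead of feeding it into a log.
    def treeAggregateDepth(maxResultSizeBytes: Long, estimatedResultBytes: Long): Int =
      if (maxResultSizeBytes <= 0L) {
        1  // unlimited driver result size: a flat aggregation is fine
      } else {
        val ratio = estimatedResultBytes.toDouble / maxResultSizeBytes
        math.max(1, math.ceil(math.log(ratio) / math.log(2)).toInt)
      }
    ```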
    
    ### Why are the changes needed?
    
    Simple bug fix to make several Spark ML functions which use RowMatrix run correctly in this case.
    
    ### Does this PR introduce _any_ user-facing change?
    
    None other than the bug fix, of course.
    
    ### How was this patch tested?
    
    Existing RowMatrix tests plus a new test.
    
    Closes apache#29925 from srowen/SPARK-33043.
    
    Authored-by: Sean Owen <srowen@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit f86171a)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    srowen committed Oct 3, 2020
    Commit c9b6271

Commits on Oct 4, 2020

  1. [SPARK-33065][TESTS] Expand the stack size of a thread in a test in L…

    …ocalityPlacementStrategySuite for Java 11 with sbt
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue that a test in `LocalityPlacementStrategySuite` fails with Java 11 due to `StackOverflowError`.
    
    ```
    [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (170 milliseconds)
    [info]   StackOverflowError should not be thrown; however, got:
    [info]
    [info]   java.lang.StackOverflowError
    [info]          at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
    [info]          at java.base/java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1541)
    [info]          at java.base/java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:668)
    [info]          at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:591)
    [info]          at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
    [info]          at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    [info]          at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ```
    
    The solution is to expand the stack size of a thread in the test from 32KB to 256KB.
    Currently, the stack size is specified as 32KB but the actual stack size can be greater than 32KB.
    According to the HotSpot code, the platform's minimum stack size takes precedence over the specified size.
    
    Java 8: https://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c92ba514724d/src/os/linux/vm/os_linux.cpp#l900
    Java 11: https://hg.openjdk.java.net/jdk-updates/jdk11u/file/73edf743a93a/src/hotspot/os/posix/os_posix.cpp#l1555
    
    For Linux on x86_64, the minimum stack size seems to be 224KB and 136KB for Java 8 and Java 11 respectively. So, the actual stack size should be 224KB rather than 32KB for Java 8 on x86_64/Linux.
    As the test passes for Java 8 but doesn't for Java 11, 224KB is enough while 136KB is not.
    So I think specifying 256KB is reasonable for the new stack size.
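
    A minimal sketch of how a thread's stack size is requested (the `Runnable` body is a placeholder, not the actual test code):

    ```scala
    // The fourth constructor argument is the requested stack size in bytes;
    // the JVM may silently round it up to its platform minimum.
    val body = new Runnable { override def run(): Unit = { /* recursive test body */ } }
    val requestedStackSize: Long = 256 * 1024
    val thread = new Thread(null, body, "allocation-test-thread", requestedStackSize)
    thread.start()
    thread.join()
    ```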
    
    ### Why are the changes needed?
    
    To pass the test for Java 11.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Following command with Java 11.
    ```
    build/sbt -Pyarn clean package "testOnly org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite"
    ```
    
    Closes apache#29943 from sarutak/fix-stack-size.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit fab5321)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    sarutak authored and dongjoon-hyun committed Oct 4, 2020
    Commit 75003fc

Commits on Oct 6, 2020

  1. [SPARK-33069][INFRA] Skip test result report if no JUnit XML files ar…

    …e found
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to skip test reporting ("Report test results") if no JUnit XML files are found.
    
    Currently, we're running and skipping the tests dynamically. For example,
    - if there are only changes in SparkR at the underlying commit, it only runs the SparkR tests, skips the other tests, and generates JUnit XML files for the SparkR test cases.
    - if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and does not generate any JUnit XML files.
    
    When test reporting ("Report test results") job is triggered after the main build ("Build and test
    ") is finished, and there are no JUnit XML files found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example.
    
    This PR works around it by simply skipping the testing report when there are no JUnit XML files are found.
    Please see apache#29906 (comment) for more details.
    
    ### Why are the changes needed?
    
    To avoid false alarm for test results.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested in my fork.
    
    Positive case:
    
    https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
    https://github.com/HyukjinKwon/spark/actions/runs/288996327
    
    Negative case:
    
    https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
    https://github.com/HyukjinKwon/spark/actions/runs/289000058
    
    Closes apache#29946 from HyukjinKwon/test-junit-files.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit a0aa8f3)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 6, 2020
    Commit 46a62ca
  2. [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conve…

    …rsion failures
    
    ### What changes were proposed in this pull request?
    
    This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release.
    
    ### Why are the changes needed?
    
    Current tests will fail with Arrow 2.0.0 because of a change in the error message when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is shown for both past and upcoming Arrow versions.
    
    If the user enters an invalid schema, the error produced by pyarrow is not consistent and is either `TypeError` or `ArrowInvalid`, with the latter being caught and re-raised as a `RuntimeError` with the extra info.
    
    The error handling is improved by:
    
    - narrowing the exception type to `TypeError`s, of which `ArrowInvalid` is a subclass and which is what is raised on safe conversion failures.
    - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place.
    - The original exception is chained to better show it to the user.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error.
    
    ### How was this patch tested?
    
    Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot
    
    Closes apache#29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 0812d6c)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Oct 6, 2020
    Commit 4f71231
  3. [SPARK-27428][CORE][TEST] Increase receive buffer size used in Statsd…

    …SinkSuite
    
    ### What changes were proposed in this pull request?
    
    Increase size of socket receive buffer in these tests.
    
    ### Why are the changes needed?
    
    The socket receive buffer size set in this test was too small for
    the StatsdSinkSuite tests to run reliably on some systems. For a
    test in this suite to run reliably the buffer needs to be large
    enough to hold all the data in the packets being sent in a test
    along with any additional kernel or protocol overhead. The amount
    of kernel overhead per packet can vary from system to system but is
    typically far higher than the protocol overhead.
    
    If the receive buffer is too small and fills up then packets are
    silently dropped. This leads to the test failing with a timeout.
    
    If the socket defaults to a larger receive buffer (normally true)
    then we should keep that size.
    
    As well as increasing the minimum buffer size I've also decoupled
    the datagram packet buffer size from the receive buffer size. The
    receive buffer should in general be far larger to account for the
    fact that multiple packets might be buffered, as well as the
    aforementioned overhead. Any truncated data in individual packets
    will be picked up by the tests.
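
    A hedged sketch of the pattern described above (the sizes are illustrative, not the suite's actual values):

    ```scala
    import java.net.{DatagramPacket, DatagramSocket}

    val socket = new DatagramSocket()
    val minReceiveBufferSize = 4 * 1024 * 1024
    // Only grow the receive buffer; keep a larger OS default if one is already set.
    if (socket.getReceiveBufferSize < minReceiveBufferSize) {
      socket.setReceiveBufferSize(minReceiveBufferSize)  // a hint; the kernel may cap it
    }
    // The per-packet buffer is decoupled from (and much smaller than) the receive buffer.
    val packetBuffer = new Array[Byte](8192)
    val packet = new DatagramPacket(packetBuffer, packetBuffer.length)
    ```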
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this only affects the tests.
    
    ### How was this patch tested?
    Existing tests on IBM Z and x86.
    
    Closes apache#29819 from mundaym/fix-statsd.
    
    Authored-by: Michael Munday <mike.munday@ibm.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit b5e4b8c)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    mundaym authored and srowen committed Oct 6, 2020
    Commit d51b8d6

Commits on Oct 7, 2020

  1. Revert "[SPARK-33073][PYTHON] Improve error handling on Pandas to Arr…

    …ow conversion failures"
    
    This reverts commit 4f71231.
    HyukjinKwon committed Oct 7, 2020
    Commit 2076abc
  2. [SPARK-33035][SQL][3.0] Updates the obsoleted entries of attribute ma…

    …pping in QueryPlan#transformUpWithNewOutput
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this;
    ```scala
    case class TestRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
        case p @ Project(projList, _) =>
          val newPlan = p.copy(projectList = projList.map { _.transform {
            // Assigns a new `ExprId` for references
            case a: AttributeReference => Alias(a, a.name)()
          }}.asInstanceOf[Seq[NamedExpression]])
    
          val attrMapping = p.output.zip(newPlan.output)
          newPlan -> attrMapping
      }
    }
    ```
    Then, this rule is applied into a plan below;
    ```scala
    (3) Project [a#5, b#6]
    +- (2) Project [a#5, b#6]
       +- (1) Project [a#5, b#6]
          +- LocalRelation <empty>, [a#5, b#6]
    ```
    In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project`  first by using attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modified the code to update the attribute mapping entries that are obsoleted by generated entries in a given rule.
    
    This is the backport of apache#29911.
    
    ### Why are the changes needed?
    
    bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests in `QueryPlanSuite`.
    
    Closes apache#29953 from maropu/SPARK-33035-FOLLOWUP.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    maropu authored and dongjoon-hyun committed Oct 7, 2020
    Commit 23207fc
  3. [SPARK-33073][PYTHON][3.0] Improve error handling on Pandas to Arrow …

    …conversion failures
    
    ### What changes were proposed in this pull request?
    
    This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release.
    
    ### Why are the changes needed?
    
    Current tests will fail with Arrow 2.0.0 because of a change in the error message when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is shown for both past and upcoming Arrow versions.
    
    If the user enters an invalid schema, the error produced by pyarrow is not consistent and is either `TypeError` or `ArrowInvalid`, with the latter being caught and re-raised as a `RuntimeError` with the extra info.
    
    The error handling is improved by:
    
    - narrowing the exception type to `TypeError`s, of which `ArrowInvalid` is a subclass and which is what is raised on safe conversion failures.
    - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place.
    - The original exception is chained to better show it to the user (only for Spark 3.1+ which requires Python 3)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error.
    
    ### How was this patch tested?
    
    Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot, and Python 2 with 0.15.1
    
    Closes apache#29962 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073-branch-3.0.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Oct 7, 2020
    Commit 7981f67
  4. [SPARK-32067][K8S] Use unique ConfigMap name for executor pod template

    ### What changes were proposed in this pull request?
    
    The pod template configmap always had the same name. This PR makes it unique.
    
    ### Why are the changes needed?
    
    If you schedule 2 Spark jobs, they will both use the same configmap name, which will result in conflicts. This PR fixes that.
    
    **BEFORE**
    ```
    $ kubectl get cm --all-namespaces -w | grep podspec
    podspec-configmap                              1      65s
    ```
    
    **AFTER**
    ```
    $ kubectl get cm --all-namespaces -w | grep podspec
    aaece65ef82e4a30b7b7800aad600d4f   spark-test-app-aac9f37502b2ca55-driver-podspec-conf-map   1      0s
    ```
    
    This can be seen when running the integration tests
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit tests and the integration tests test if this works
    
    Closes apache#29934 from stijndehaes/bugfix/SPARK-32067-unique-name-for-template-configmap.
    
    Authored-by: Stijn De Haes <stijndehaes@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3099fd9)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    stijndehaes authored and dongjoon-hyun committed Oct 7, 2020
    Commit 45475af

Commits on Oct 8, 2020

  1. [SPARK-33089][SQL] make avro format propagate Hadoop config from DS o…

    …ptions to underlying HDFS file system
    
    ### What changes were proposed in this pull request?
    
    In `AvroUtils`'s `inferSchema()`, propagate Hadoop config from DS options to underlying HDFS file system.
    
    ### Why are the changes needed?
    
    There is a bug that when running:
    ```scala
    spark.read.format("avro").options(conf).load(path)
    ```
    The underlying file system will not receive the `conf` options.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    unit test added
    
    Closes apache#29971 from yuningzh-db/avro_options.
    
    Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit bbc887b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    yuningzh-db authored and HyukjinKwon committed Oct 8, 2020
    Commit a7e4318
  2. [SPARK-33091][SQL] Avoid using map instead of foreach to avoid potent…

    …ial side effect at callers of OrcUtils.readCatalystSchema
    
    ### What changes were proposed in this pull request?
    
    This is a kind of followup of SPARK-32646. A new JIRA was filed to control the fixed versions properly.

    When you use `map`, it might be lazily evaluated and not executed. To avoid this, we had better use `foreach`. See also SPARK-16694. The current code does not look like it causes any bug for now, but it is best to fix it to avoid potential issues.
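
    A small illustration of the pitfall with a lazy collection (plain Scala, not the `OrcUtils` call sites themselves):

    ```scala
    val items = Seq(1, 2, 3).view   // a lazy view: transformations are deferred
    var sum = 0
    items.map(x => sum += x)        // lazy: the body may never run unless the view is forced
    println(sum)                    // 0
    items.foreach(x => sum += x)    // eager: the side effect runs immediately
    println(sum)                    // 6
    ```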
    
    ### Why are the changes needed?
    
    To avoid potential issues from `map` being lazy and not executed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Ran related tests. CI in this PR should verify.
    
    Closes apache#29974 from HyukjinKwon/SPARK-32646.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 5effa8e)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    HyukjinKwon authored and maropu committed Oct 8, 2020
    Commit 782ab8e
  3. [SPARK-33096][K8S] Use LinkedHashMap instead of Map for newlyCreatedE…

    …xecutors
    
    ### What changes were proposed in this pull request?
    
    This PR aims to use `LinkedHashMap` instead of `Map` for `newlyCreatedExecutors`.
    
    ### Why are the changes needed?
    
    This makes log messages (INFO/DEBUG) more readable. This is helpful when `spark.kubernetes.allocation.batch.size` is large and especially when K8s dynamic allocation is used.
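
    A small illustration of why `LinkedHashMap` helps here (toy data, not the actual allocator state):

    ```scala
    import scala.collection.mutable

    // LinkedHashMap preserves insertion order, so iterating over the pending executor
    // ids logs them as 1,2,...,10 instead of in hash order.
    val newlyCreatedExecutors = mutable.LinkedHashMap[Long, Long]()
    (1L to 10L).foreach(id => newlyCreatedExecutors(id) = System.currentTimeMillis())
    println(newlyCreatedExecutors.keys.mkString(","))  // 1,2,3,4,5,6,7,8,9,10
    ```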
    
    **BEFORE**
    ```
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (5,10,6,9,2,7,3,8,4).
    ```
    
    **AFTER**
    ```
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (2,3,4,5,6,7,8,9,10).
    ```
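
    The ordering difference comes purely from the map implementation; a minimal sketch of the behaviour in plain Scala (not the Spark code):
    ```scala
    import scala.collection.mutable

    // LinkedHashMap preserves insertion order, so iterating over executor ids
    // yields them in creation order; a plain HashMap gives an arbitrary order.
    val plain  = mutable.HashMap.empty[Long, String]
    val linked = mutable.LinkedHashMap.empty[Long, String]
    (2L to 10L).foreach { id =>
      plain(id)  = s"executor-$id"
      linked(id) = s"executor-$id"
    }
    println(plain.keys.mkString(","))   // arbitrary order, e.g. 5,10,6,9,2,7,3,8,4
    println(linked.keys.mkString(","))  // insertion order: 2,3,4,5,6,7,8,9,10
    ```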
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CI or `build/sbt -Pkubernetes "kubernetes/test"`
    
    Closes apache#29979 from dongjoon-hyun/SPARK-K8S-LOG.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 4987db8)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 8, 2020
    Commit: c1b660e

Commits on Oct 9, 2020

  1. [SPARK-33101][ML][3.0] Make LibSVM format propagate Hadoop config fro…

    …m DS options to underlying HDFS file system
    
    ### What changes were proposed in this pull request?
    Propagate LibSVM options to Hadoop configs in the LibSVM datasource.
    
    ### Why are the changes needed?
    There is a bug that when running:
    ```scala
    spark.read.format("libsvm").options(conf).load(path)
    ```
    The underlying file system will not receive the `conf` options.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, for example, users should be able to read files from Azure Data Lake successfully:
    ```scala
    def hadoopConf1() = Map[String, String](
      s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
      s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
      s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
      s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
    val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
    ```
    and not get the following exception because the settings above are not propagated to the filesystem:
    ```java
    java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
    	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
    	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
    	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
    	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
    	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    ```
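
    Conceptually, "propagating DS options" means handing each option over to the Hadoop configuration that the file system is built from; a simplified sketch (assumed shape, not the actual Spark code):
    ```scala
    import org.apache.hadoop.conf.Configuration

    // Copy every datasource option onto a Hadoop Configuration so that the
    // underlying FileSystem sees settings such as the ADLS credentials above.
    def withOptions(base: Configuration, options: Map[String, String]): Configuration = {
      val conf = new Configuration(base)
      options.foreach { case (k, v) => conf.set(k, v) }
      conf
    }
    ```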
    
    ### How was this patch tested?
    Added UT to `LibSVMRelationSuite`.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 1234c66)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29986 from MaxGekk/ml-option-propagation-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 9, 2020
    Commit: dcffa56
  2. [SPARK-33094][SQL][3.0] Make ORC format propagate Hadoop config from …

    …DS options to underlying HDFS file system
    
    ### What changes were proposed in this pull request?
    Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.
    
    ### Why are the changes needed?
    There is a bug that when running:
    ```scala
    spark.read.format("orc").options(conf).load(path)
    ```
    The underlying file system will not receive the conf options.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    Added UT to `OrcSourceSuite`.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit c5f6af9)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29985 from MaxGekk/orc-option-propagation-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 9, 2020
    Commit: 9892b3e

Commits on Oct 12, 2020

  1. [SPARK-33118][SQL] CREATE TEMPORARY TABLE fails with location

    We have a problem when you use CREATE TEMPORARY TABLE with LOCATION
    
    ```scala
    spark.range(3).write.parquet("/tmp/testspark1")
    
    sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/tmp/testspark1')")
    sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/tmp/testspark1'")
    ```
    ```scala
    org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
      at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
      at scala.Option.getOrElse(Option.scala:189)
      at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
      at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
      at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
      at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
      at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
      at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
      at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
      at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
    ```
    This bug was introduced by SPARK-30507.
    In the SQL parser, `visitCreateTable` --> `visitCreateTableClauses` --> `cleanTableOptions` extracts the path from the options, but in this case `CreateTempViewUsing` needs the path in the options map.
    
    This PR fixes the problem.
    
    No
    
    Unit testing and manual testing
    
    Closes apache#30014 from planga82/bugfix/SPARK-33118_create_temp_table_location.
    
    Authored-by: Pablo <pablo.langa@stratio.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 819f12e)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    pablolanga-stratio authored and dongjoon-hyun committed Oct 12, 2020
    Commit: 0601fc7

Commits on Oct 13, 2020

  1. [SPARK-33115][BUILD][DOCS] Fix javadoc errors in kvstore and `unsaf…

    …e` modules
    
    ### What changes were proposed in this pull request?
    
    Fix Javadoc generation errors in `kvstore` and `unsafe` modules according to error message hints.
    
    ### Why are the changes needed?
    
    Fixes `doc` task failures which prevented other tasks successful executions (eg `publishLocal` task depends on `doc` task).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    The meaning of the Javadoc text stays the same.
    
    ### How was this patch tested?
    
    Run `build/sbt kvstore/Compile/doc`, `build/sbt unsafe/Compile/doc` and `build/sbt doc` without errors.
    
    Closes apache#30007 from gemelen/feature/doc-task-fix.
    
    Authored-by: Denis Pyshev <git@gemelen.net>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1b0875b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gemelen authored and HyukjinKwon committed Oct 13, 2020
    Commit: 9430ae6

Commits on Oct 14, 2020

  1. [SPARK-33134][SQL][3.0] Return partial results only for root JSON obj…

    …ects
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.
    
    ### Why are the changes needed?
    1. To not raise an exception to users in the PERMISSIVE mode
    2. To fix a regression and to have the same behavior as Spark 2.4.x
    3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for malformed nested complex JSON fields.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Before the changes, the code below:
    ```scala
        val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
        val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
        val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
        pokerhand_events.show
    ```
    throws the exception even in the default **PERMISSIVE** mode:
    ```java
    java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
      at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
    ```
    
    After the changes:
    ```
    +-----+
    |event|
    +-----+
    | null|
    +-----+
    ```
    
    ### How was this patch tested?
    Added a test to `JsonFunctionsSuite`.
    
    Closes apache#30032 from MaxGekk/json-skip-row-wrong-schema-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 14, 2020
    Commit: 205b65e
  2. [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand…

    ….outputResolved
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix a bug where `DataType.equalsIgnoreCompatibleNullability` is called with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The parameters of `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, so the right order of the matching variables is `inAttr` and `outAttr`.
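
    To see why the argument order matters, here is a simplified, hypothetical model of the nullability check (not Spark's implementation):
    ```scala
    // Writing data of type `from` into a column of type `to` is fine when the
    // types match and `to` is at least as nullable as `from`.
    case class Col(nullable: Boolean)
    def compatible(from: Col, to: Col): Boolean = !from.nullable || to.nullable

    val nonNullableData = Col(nullable = false)
    val nullableTable   = Col(nullable = true)
    println(compatible(nonNullableData, nullableTable)) // true: correct argument order
    println(compatible(nullableTable, nonNullableData)) // false: swapped order rejects a valid write
    ```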
    
    ### Why are the changes needed?
    
    Spark throws an AnalysisException due to an unresolved operator in v2 writes; the operator is unresolved because the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, end users no longer hit an unresolved-operator error in v2 writes when they try to write a dataframe containing non-nullable complex types to a table whose matching complex types are nullable.
    
    ### How was this patch tested?
    
    New UT added.
    
    Closes apache#30033 from HeartSaVioR/SPARK-33136.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 8e5cb1d)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HeartSaVioR authored and dongjoon-hyun committed Oct 14, 2020
    Commit: 2ebea13

Commits on Oct 15, 2020

  1. [SPARK-33146][CORE] Check for non-fatal errors when loading new appli…

    …cations in SHS
    
    ### What changes were proposed in this pull request?
    
    Adds an additional check for non-fatal errors when attempting to add a new entry to the history server application listing.
    
    ### Why are the changes needed?
    
    A bad rolling event log folder (missing appstatus file or no log files) would cause no applications to be loaded by the Spark history server. Figuring out why invalid event log folders are created in the first place will be addressed in separate issues, this just lets the history server skip the invalid folder and successfully load all the valid applications.
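
    The shape of the change, as a minimal sketch (the helper name and signature are illustrative assumptions, not the actual SHS code):
    ```scala
    import scala.util.control.NonFatal

    // A non-fatal failure on one event-log entry is logged and skipped instead
    // of aborting the whole listing scan.
    def loadAll(entries: Seq[String])(loadOne: String => Unit): Unit =
      entries.foreach { entry =>
        try loadOne(entry)
        catch {
          case NonFatal(e) =>
            println(s"Skipping invalid event log entry $entry: ${e.getMessage}")
        }
      }
    ```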
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    New UT
    
    Closes apache#30037 from Kimahriman/bug/rolling-log-crashing-history.
    
    Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit 9ab0ec4)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Adam Binford authored and HeartSaVioR committed Oct 15, 2020
    Commit: d9669bd
  2. [SPARK-33153][SQL][TESTS] Ignore Spark 2.4 in HiveExternalCatalogVers…

    …ionsSuite on Python 3.8/3.9
    
    ### What changes were proposed in this pull request?
    
    This PR aims to ignore Apache Spark 2.4.x distribution in HiveExternalCatalogVersionsSuite if Python version is 3.8 or 3.9.
    
    ### Why are the changes needed?
    
    Currently, `HiveExternalCatalogVersionsSuite` is broken on the latest OS like `Ubuntu 20.04` because its default Python version is 3.8. PySpark 2.4.x doesn't work on Python 3.8 due to SPARK-29536.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually.
    ```
    $ python3 --version
    Python 3.8.5
    
    $ build/sbt "hive/testOnly *.HiveExternalCatalogVersionsSuite"
    ...
    [info] All tests passed.
    [info] Passed: Total 1, Failed 0, Errors 0, Passed 1
    ```
    
    Closes apache#30044 from dongjoon-hyun/SPARK-33153.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit ec34a00)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 15, 2020
    Commit: 0b7b811
  3. Revert "[SPARK-33146][CORE] Check for non-fatal errors when loading n…

    …ew applications in SHS"
    
    This reverts commit d9669bd.
    dongjoon-hyun committed Oct 15, 2020
    Commit: e40c147

Commits on Oct 16, 2020

  1. [SPARK-33163][SQL][TESTS] Check the metadata key 'org.apache.spark.le…

    …gacyDateTime' in Avro/Parquet files
    
    ### What changes were proposed in this pull request?
    Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs:
    - spark.sql.legacy.avro.datetimeRebaseModeInWrite
    - spark.sql.legacy.parquet.datetimeRebaseModeInWrite
    
    This is a follow up apache#28137.
    
    ### Why are the changes needed?
    1. To improve test coverage
    2. To make sure that the metadata key is actually saved to Avro/Parquet files
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the added tests:
    ```
    $ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
    $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite"
    $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite"
    ```
    
    Closes apache#30061 from MaxGekk/parquet-test-metakey.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 38c05af)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 16, 2020
    Commit: d0f1120
  2. [SPARK-33165][SQL][TEST] Remove dependencies(scalatest,scalactic) fro…

    …m Benchmark
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to remove `assert` from `Benchmark` to make it easier to run benchmark code via `spark-submit`.
    
    ### Why are the changes needed?
    
    Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we need to pass the proper jars of `scalatest` and `scalactic`;
     - scalatest-core_2.12-3.2.0.jar
     - scalatest-compatible-3.2.0.jar
     - scalactic_2.12-3.0.jar
    ```
    ./bin/spark-submit --jars scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
    ```
    
    This update lets developers submit benchmark code without these dependencies;
    ```
    ./bin/spark-submit --jars ./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked.
    
    Closes apache#30064 from maropu/RemoveDepInBenchmark.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit a5c17de)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    maropu authored and HyukjinKwon committed Oct 16, 2020
    Commit: 160f458
  3. [SPARK-32761][SQL][3.0] Allow aggregating multiple foldable distinct …

    …expressions
    
    ### What changes were proposed in this pull request?
    For queries with multiple foldable distinct columns, since they will be eliminated during
    execution, it's not mandatory to let `RewriteDistinctAggregates` handle this case. And
    in the current code, `RewriteDistinctAggregates` *does* miss some "aggregating with
    multiple foldable distinct expressions" cases.
    For example: `select count(distinct 2), count(distinct 2, 3)` will be missed.
    
    But in the planner, this will trigger an error that "multiple distinct expressions" are not allowed.
    As the foldable distinct columns can be eliminated finally, we can allow this in the aggregation
    planner check.
    
    ### Why are the changes needed?
    bug fix
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    added test case
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit a410658)
    
    Closes apache#30052 from linhongliu-db/SPARK-32761-3.0.
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    linhongliu-db authored and cloud-fan committed Oct 16, 2020
    Commit: 37d6b3c
  4. [SPARK-33165][SQL][TESTS][FOLLOW-UP] Use scala.Predef.assert instead

    ### What changes were proposed in this pull request?
    
    This PR proposes to use `scala.Predef.assert` instead of `org.scalatest.Assertions.assert` removed at apache#30064
    
    ### Why are the changes needed?
    
    Just to keep the same behaviour.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only
    
    ### How was this patch tested?
    
    Recover the existing asserts.
    
    Closes apache#30065 from HyukjinKwon/SPARK-33165.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 16, 2020
    Commit: 698ac6a
  5. [SPARK-33171][INFRA] Mark ParquetV*FilterSuite/ParquetV*SchemaPruning…

    …Suite as ExtendedSQLTest
    
    ### What changes were proposed in this pull request?
    
    This PR aims to mark the following test suites as `ExtendedSQLTest`.
    - ParquetV1FilterSuite/ParquetV2FilterSuite
    - ParquetV1SchemaPruningSuite/ParquetV2SchemaPruningSuite
    
    ### Why are the changes needed?
    
    Currently, `sql - other tests` is the longest job. This PR will move the above tests to `sql - slow tests` job.
    
    **BEFORE**
    - https://github.com/apache/spark/runs/1264150802 (1 hour 37 minutes)
    
    **AFTER**
    - https://github.com/apache/spark/pull/30068/checks?check_run_id=1265879896 (1 hour 21 minutes)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the Github Action with the reduced time.
    
    Closes apache#30068 from dongjoon-hyun/MOVE3.
    
    Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun and dongjoon-hyun committed Oct 16, 2020
    Commit: b66bd79
  6. [SPARK-32436][CORE] Initialize numNonEmptyBlocks in HighlyCompressedM…

    …apStatus.readExternal
    
    ### What changes were proposed in this pull request?
    
    This PR aims to initialize `numNonEmptyBlocks` in `HighlyCompressedMapStatus.readExternal`.
    
    In Scala 2.12, this is initialized to `-1` via the following.
    ```scala
    protected def this() = this(null, -1, null, -1, null, -1)  // For deserialization only
    ```
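
    For context, a simplified illustration of why the explicit initialization helps (not the real Spark class):
    ```scala
    import java.io.{Externalizable, ObjectInput, ObjectOutput}

    // A field that is not part of the serialized form is re-initialized inside
    // readExternal instead of relying on constructor defaults, which this commit
    // reports behave differently between Scala 2.12 and 2.13.
    class Status(private var numNonEmptyBlocks: Int, private var avgSize: Long)
        extends Externalizable {
      def this() = this(-1, -1L) // for deserialization only

      override def writeExternal(out: ObjectOutput): Unit = out.writeLong(avgSize)

      override def readExternal(in: ObjectInput): Unit = {
        numNonEmptyBlocks = -1   // the fix: initialize explicitly
        avgSize = in.readLong()
      }
    }
    ```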
    
    ### Why are the changes needed?
    
    In Scala 2.13, this causes several UT failures because `HighlyCompressedMapStatus.readExternal` doesn't initialize this field. The following is one example.
    
    - org.apache.spark.scheduler.MapStatusSuite
    ```
    MapStatusSuite:
    - compressSize
    - decompressSize
    *** RUN ABORTED ***
      java.lang.NoSuchFieldError: numNonEmptyBlocks
      at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:181)
      at org.apache.spark.scheduler.HighlyCompressedMapStatus$.apply(MapStatus.scala:281)
      at org.apache.spark.scheduler.MapStatus$.apply(MapStatus.scala:73)
      at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$8(MapStatusSuite.scala:64)
      at scala.runtime.java8.JFunction1$mcVD$sp.apply(JFunction1$mcVD$sp.scala:18)
      at scala.collection.immutable.List.foreach(List.scala:333)
      at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$7(MapStatusSuite.scala:61)
      at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.scala:18)
      at scala.collection.immutable.List.foreach(List.scala:333)
      at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$6(MapStatusSuite.scala:60)
      ...
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a private class.
    
    ### How was this patch tested?
    
    1. Pass the GitHub Action or Jenkins with the existing tests.
    2. Test with Scala-2.13 with `MapStatusSuite`.
    ```
    $ dev/change-scala-version.sh 2.13
    $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.MapStatusSuite
    ...
    MapStatusSuite:
    - compressSize
    - decompressSize
    - MapStatus should never report non-empty blocks' sizes as 0
    - large tasks should use org.apache.spark.scheduler.HighlyCompressedMapStatus
    - HighlyCompressedMapStatus: estimated size should be the average non-empty block size
    - SPARK-22540: ensure HighlyCompressedMapStatus calculates correct avgSize
    - RoaringBitmap: runOptimize succeeded
    - RoaringBitmap: runOptimize failed
    - Blocks which are bigger than SHUFFLE_ACCURATE_BLOCK_THRESHOLD should not be underestimated.
    - SPARK-21133 HighlyCompressedMapStatus#writeExternal throws NPE
    Run completed in 7 seconds, 971 milliseconds.
    Total number of tests run: 10
    Suites: completed 2, aborted 0
    Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    ```
    
    Closes apache#29231 from dongjoon-hyun/SPARK-32436.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit f9f1867)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 16, 2020
    Commit: 1bec8a3

Commits on Oct 17, 2020

  1. [SPARK-33131][SQL][3.0] Fix grouping sets with having clause can not …

    …resolve qualified col name
    
    This is [apache#30029](apache#30029) backport for branch-3.0.
    
    ### What changes were proposed in this pull request?
    
    Correct the resolution of having clause.
    
    ### Why are the changes needed?
    
    Grouping sets construct a new aggregate that loses the qualified name of the grouping expression. Here is an example:
    ```
    -- Works resolved by `ResolveReferences`
    select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1
    
    -- Works because of the extra expression c1
    select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
    
    -- Failed
    select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
    ```
    
    It works with `Aggregate` without grouping sets through `ResolveReferences`, but grouping sets do not work since the exprId has been changed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, bug fix.
    
    ### How was this patch tested?
    
    add test.
    
    Closes apache#30077 from ulysses-you/SPARK-33131-branch-3.0.
    
    Authored-by: ulysses <youxiduo@weidian.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ulysses-you authored and dongjoon-hyun committed Oct 17, 2020
    Commit: fab10f0

Commits on Oct 18, 2020

  1. [SPARK-33170][SQL] Add SQL config to control fast-fail behavior in Fi…

    …leFormatWriter
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to add a config to control the fast-fail behavior in FileFormatWriter, set to false by default.
    
    ### Why are the changes needed?
    
    In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry.
    
    Per the latest discussion, it is important to be able to keep the original behavior, which is to retry tasks even when `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` could be recoverable in some cases.
    
    We are going to add a config that controls this behavior; it defaults to false, so fast-fail is disabled by default.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. By default, the task in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown. This is the behavior before Spark 3.0. Users can control the fast-fail behavior by enabling it.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30073 from viirya/SPARK-33170.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3010e90)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    viirya authored and dongjoon-hyun committed Oct 18, 2020
    Commit: 56a60ca
  2. [MINOR][DOCS][EXAMPLE] Fix the Python manual_load_options_csv example

    ### What changes were proposed in this pull request?
    This pull request changes the `sep` parameter's value from `:` to `;` in the example of `examples/src/main/python/sql/datasource.py`. This code snippet is shown on the Spark SQL Guide documentation. The `sep` parameter's value should be `;` since the data in https://github.com/apache/spark/blob/master/examples/src/main/resources/people.csv is separated by `;`.
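
    For reference, the same read expressed in Scala (a sketch assuming an active `SparkSession` named `spark`; the options mirror the Python example in the guide):
    ```scala
    // Read the semicolon-separated CSV used by the example.
    val people = spark.read
      .format("csv")
      .option("sep", ";")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("examples/src/main/resources/people.csv")
    people.show()
    ```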
    
    ### Why are the changes needed?
    To fix the example code so that it can be executed properly.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    This code snippet is shown on the Spark SQL Guide documentation: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options
    
    ### How was this patch tested?
    By building the documentation and checking the Spark SQL Guide documentation manually in the local environment.
    
    Closes apache#30082 from kjmrknsn/fix-example-python-datasource.
    
    Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    kjmrknsn authored and HyukjinKwon committed Oct 18, 2020
    Commit: 7e65b12
  3. [SPARK-33176][K8S] Use 11-jre-slim as default in K8s Dockerfile

    ### What changes were proposed in this pull request?
    
    This PR aims to use `openjdk:11-jre-slim` as default in K8s Dockerfile.
    
    ### Why are the changes needed?
    
    Although Apache Spark supports both Java8/Java11, there is a difference.
    
    1. A Java8-built distribution can run on both Java8 and Java11
    2. A Java11-built distribution can run on Java11, but not on Java8.
    
    In short, we had better use Java11 in Dockerfile to embrace both cases without any issues.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This will remove the chance of user frustration when they build with JDK11 and build the image without overriding the Java base image.
    
    ### How was this patch tested?
    
    Pass the K8s IT.
    
    Closes apache#30083 from dongjoon-hyun/SPARK-33176.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 18, 2020
    Commit: 05fbbb1

Commits on Oct 19, 2020

  1. [SPARK-33123][INFRA] Ignore GitHub only changes in Amplab Jenkins build

    ### What changes were proposed in this pull request?
    
    This PR aims to ignore GitHub only changes in Amplab Jenkins build.
    
    ### Why are the changes needed?
    
    This will save server resources.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this is a dev-only change.
    
    ### How was this patch tested?
    
    Manually. I used the following doctest during testing and removed it at the clean-up.
    
    E2E tests:
    
    ```
    cd dev
    cat test.py
    ```
    
    ```python
    import importlib
    runtests = importlib.import_module("run-tests")
    print([x.name for x in runtests.determine_modules_for_files([".github/workflows/build_and_test.yml"])])
    ```
    
    ```python
    $ GITHUB_ACTIONS=1 python test.py
    ['root']
    $ python test.py
    []
    ```
    
    Unittests:
    
    ```bash
    $ GITHUB_ACTIONS=1 python3 -m doctest dev/run-tests.py
    $ python3 -m doctest dev/run-tests.py
    ```
    
    Closes apache#30020 from williamhyun/SPARK-33123.
    
    Lead-authored-by: William Hyun <williamhyun3@gmail.com>
    Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit e6c53c2)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    williamhyun and HyukjinKwon committed Oct 19, 2020
    Commit: 0bff1f6
  2. [SPARK-32557][CORE] Logging and swallowing the exception per entry in…

    … History server
    
    ### What changes were proposed in this pull request?
    This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry.
    
    ### Why are the changes needed?
    As discussed in apache#29350 , one entry failure shouldn't affect others.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually tested.
    
    Closes apache#29374 from yanxiaole/SPARK-32557.
    
    Authored-by: Yan Xiaole <xiaole.yan@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    yanxiaole authored and HeartSaVioR committed Oct 19, 2020
    Commit: 15ed312
  3. Revert "Revert "[SPARK-33146][CORE] Check for non-fatal errors when l…

    …oading new applications in SHS""
    
    This reverts commit e40c147.
    HeartSaVioR committed Oct 19, 2020
    Commit: 02f80cf
  4. Revert "[SPARK-33069][INFRA] Skip test result report if no JUnit XML …

    …files are found"
    
    This reverts commit 46a62ca.
    HyukjinKwon committed Oct 19, 2020
    Commit: b1d5a08

Commits on Oct 20, 2020

  1. [SPARK-33181][SQL][DOCS] Document Load Table Directly from File in SQ…

    …L Select Reference
    
    ### What changes were proposed in this pull request?
    
    Add the link to the feature: "Run SQL on files directly" to SQL reference documentation page
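
    A minimal sketch of the feature the new link points to (the path and the active `SparkSession` named `spark` are assumptions):
    ```scala
    // "Run SQL on files directly": the FROM item is <format>.`<path>`.
    spark.sql("SELECT * FROM parquet.`/path/to/data.parquet`").show()
    ```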
    
    ### Why are the changes needed?
    
    To make SQL Reference complete
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Previously, reading files directly in SQL was not included in the documentation (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html) and was not listed in from_items. The new link is added to the SELECT statement documentation, as shown below:
    
    ![image](https://user-images.githubusercontent.com/16770242/96517999-c34f3900-121e-11eb-8d56-c4ba0432855e.png)
    ![image](https://user-images.githubusercontent.com/16770242/96518808-8126f700-1220-11eb-8c98-fb398eee0330.png)
    
    ### How was this patch tested?
    
    Manually built and tested
    
    Closes apache#30095 from liaoaoyuan97/master.
    
    Authored-by: liaoaoyuan97 <al3468@columbia.edu>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit f65a244)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    liaoaoyuan97 authored and HyukjinKwon committed Oct 20, 2020
    Commit: c3af7c6
  2. [SPARK-33190][INFRA][TESTS] Set upper bound of PyArrow version in Git…

    …Hub Actions
    
    A new PyArrow release was uploaded to PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:
    
    ```
    ======================================================================
    ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
        .select('id', 'result').collect()
      File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
        sock_info = self._jdf.collectToPython()
      File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
        raise converted from None
    pyspark.sql.utils.PythonException:
      An exception was thrown from the Python worker. Please see the stack trace below.
    Traceback (most recent call last):
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
        process()
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
        serializer.dump_stream(out_iter, outfile)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
        return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
        for batch in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
        for series in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
        return f(keys, vals)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
        return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
        result = f(key, pd.concat(value_series, axis=1))
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
        return f(*args, **kwargs)
      File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
        "{} != {}".format(expected_key[i][1], window_range)
    AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
    ```
    
    https://github.com/apache/spark/runs/1278917457
    
    This PR proposes to set the upper bound of PyArrow in GitHub Actions build. This should be removed when we properly support PyArrow 2.0.0+ (SPARK-33189).
    
    To make build pass.
    
    No, dev-only.
    
    GitHub Actions in this build will test it out.
    
    Closes apache#30098 from HyukjinKwon/hot-fix-test.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit eb9966b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 20, 2020
    Commit: 3b5b533
  3. [MINOR][DOCS] Fix the description about to_avro and from_avro functions

    ### What changes were proposed in this pull request?
    This pull request changes the description about `to_avro` and `from_avro` functions to include Python as a supported language as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].
    
    ### Why are the changes needed?
    Same as above.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.
    
    ### How was this patch tested?
    Tested manually by building and checking the document in the local environment.
    
    Closes apache#30105 from kjmrknsn/fix-docs-sql-data-sources-avro.
    
    Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 46ad325)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    kjmrknsn authored and HyukjinKwon committed Oct 20, 2020
    Commit: 4373c71

Commits on Oct 21, 2020

  1. [SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested t…

    …imestamps in pyarrow
    
    Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.
    
    The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.
    
    No
    
    Existing tests
    
    Closes apache#30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 47a6568)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Oct 21, 2020
    Commit: 5e33155
  2. [SPARK-32785][SQL][DOCS][FOLLOWUP][3.0] Update migration guide for in…

    …complete interval literals
    
    ### What changes were proposed in this pull request?
    
    Address comments  apache#29635 (comment) to improve migration guide
    
    ### Why are the changes needed?
    
    improve migration guide
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, only a doc update.
    
    ### How was this patch tested?
    
    passing GitHub action
    
    Closes apache#30117 from yaooqinn/SPARK-32785-F30.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Oct 21, 2020
    Commit: a36b3c4

Commits on Oct 22, 2020

  1. [SPARK-33189][FOLLOWUP][3.0] Fix syntax error in python/run-tests.py

    ### What changes were proposed in this pull request?
    
    This PR aims to fix syntax error.
    
    ### Why are the changes needed?
    
    ```
    ========================================================================
    Running Python style checks
    ========================================================================
    pycodestyle checks failed.
    *** Error compiling './python/run-tests.py'...
      File "./python/run-tests.py", line 80
        'PYARROW_IGNORE_TIMEZONE': '1',
                                ^
    SyntaxError: invalid syntax
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the Jenkins.
    
    Closes apache#30125 from dongjoon-hyun/SPARK-33189-2.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Oct 22, 2020
    Commit: e31fe6c
  2. [SPARK-32247][INFRA] Install and test scipy with PyPy in GitHub Actions

    ### What changes were proposed in this pull request?
    
    This PR proposes to install `scipy` as well in PyPy. It will test several ML specific test cases in PyPy as well. For example, https://github.com/apache/spark/blob/31a16fbb405a19dc3eb732347e0e1f873b16971d/python/pyspark/mllib/tests/test_linalg.py#L487
    
    It was not installed when GitHub Actions build was added because it failed to install for an unknown reason. Seems like it's fixed in the latest scipy.
    
    ### Why are the changes needed?
    
    To improve test coverage.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions build in this PR will test it out.
    
    Closes apache#30054 from HyukjinKwon/SPARK-32247.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon committed Oct 22, 2020
    Commit: 933dc6c

Commits on Oct 24, 2020

  1. [SPARK-30821][K8S] Handle executor failure with multiple containers

    Handle executor failure with multiple containers
    
    Added a spark property spark.kubernetes.executor.checkAllContainers,
    with default being false. When it's true, the executor snapshot will
    take all containers in the executor into consideration when deciding
    whether the executor is in "Running" state, if the pod restart policy is
    "Never". Also, added the new spark property to the doc.
    
    ### What changes were proposed in this pull request?
    
    Checking of all containers in the executor pod when reporting executor status, if the `spark.kubernetes.executor.checkAllContainers` property is set to true.
    
    ### Why are the changes needed?
    
    Currently, a pod remains "running" as long as there is at least one running container. This prevents Spark from noticing when a container has failed in an executor pod with multiple containers. With this change, user can configure the behavior to be different. Namely, if any container in the executor pod has failed, either the executor process or one of its sidecars, the pod is considered to be failed, and it will be rescheduled.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, new spark property added.
    User is now able to choose whether to turn on this feature using the `spark.kubernetes.executor.checkAllContainers` property.
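
    A usage sketch of the new property (illustrative only; the property name is the one introduced by this commit):
    ```scala
    import org.apache.spark.SparkConf

    // Opt in to checking every container in the executor pod.
    val conf = new SparkConf()
      .set("spark.kubernetes.executor.checkAllContainers", "true")
    ```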
    
    ### How was this patch tested?
    
    Unit test was added and all passed.
    I tried to run integration test by following the instruction [here](https://spark.apache.org/developer-tools.html) (section "Testing K8S") and also [here](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md), but I wasn't able to run it smoothly as it fails to talk with minikube cluster. Maybe it's because my minikube version is too new (I'm using v1.13.1)...? Since I've been trying it for two days and still can't make it work, I decided to submit this PR and hopefully the Jenkins test will pass.
    
    Closes apache#29924 from huskysun/exec-sidecar-failure.
    
    Authored-by: Shiqi Sun <s.sun@salesforce.com>
    Signed-off-by: Holden Karau <hkarau@apple.com>
    (cherry picked from commit f659527)
    Signed-off-by: Holden Karau <hkarau@apple.com>
    huskysun authored and holdenk committed Oct 24, 2020
    Commit: f7c7f4f

Commits on Oct 25, 2020

  1. [SPARK-33228][SQL] Don't uncache data when replacing a view having th…

    …e same logical plan
    
    ### What changes were proposed in this pull request?
    
    SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache when replacing an existing view. But this change drops the cache even when replacing a view having the same logical plan. A sequence of queries to reproduce this is as follows;
    ```
    // Spark v2.4.6+
    scala> val df = spark.range(1).selectExpr("id a", "id b")
    scala> df.cache()
    scala> df.explain()
    == Physical Plan ==
    *(1) ColumnarToRow
    +- InMemoryTableScan [a#2L, b#3L]
          +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
                +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                   +- *(1) Range (0, 1, step=1, splits=4)
    
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    == Physical Plan ==
    *(1) ColumnarToRow
    +- InMemoryTableScan [a#2L, b#3L]
          +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
                +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                   +- *(1) Range (0, 1, step=1, splits=4)
    
    // If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache's swept away
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    == Physical Plan ==
    *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
    +- *(1) Range (0, 1, step=1, splits=4)
    
    // Until v2.4.6
    scala> val df = spark.range(1).selectExpr("id a", "id b")
    scala> df.cache()
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    20/10/23 22:33:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    == Physical Plan ==
    *(1) InMemoryTableScan [a#2L, b#3L]
       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                +- *(1) Range (0, 1, step=1, splits=4)
    
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    == Physical Plan ==
    *(1) InMemoryTableScan [a#2L, b#3L]
       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                +- *(1) Range (0, 1, step=1, splits=4)
    ```
    
    ### Why are the changes needed?
    
    bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests.
    
    Closes apache#30140 from maropu/FixBugInReplaceView.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 87b4984)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    maropu authored and dongjoon-hyun committed Oct 25, 2020
    Commit: 80716d1

Commits on Oct 26, 2020

  1. [SPARK-33197][SQL] Make changes to spark.sql.analyzer.maxIterations t…

    …ake effect at runtime
    
    ### What changes were proposed in this pull request?
    
    Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime.
    
    ### Why are the changes needed?
    
    `spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
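
    For example (a sketch assuming an active `SparkSession` named `spark`):
    ```scala
    // With this patch, the analyzer picks up the session value on the next query
    // instead of the value captured earlier.
    spark.conf.set("spark.sql.analyzer.maxIterations", "50")
    println(spark.conf.get("spark.sql.analyzer.maxIterations"))  // 50
    ```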
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
    
    ### How was this patch tested?
    
    modified unit test
    
    Closes apache#30108 from yuningzh-db/dynamic-analyzer-max-iterations.
    
    Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit a21945c)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yuningzh-db authored and maropu committed Oct 26, 2020
    Commit: 590ccb3
  2. [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.s…

    …ql.sources.writeJobUUID"
    
    ### What changes were proposed in this pull request?
    
    This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter match the format that JobID requires, but as they are generated per-thread, they may not always be unique within a cluster.
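
    From a committer author's point of view, reading the ID could look roughly like this (a hypothetical sketch; the property name follows this commit's title and should be treated as an assumption):
    ```scala
    import org.apache.hadoop.mapreduce.JobContext

    // Look up the unique job UUID that Spark sets in the job configuration.
    def jobUuid(context: JobContext): Option[String] =
      Option(context.getConfiguration.get("spark.sql.sources.writeJobUUID"))
    ```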
    
    ### Why are the changes needed?
    
    If a committer (e.g. the S3A staging committer) uses the job-attempt ID as its unique ID, then any two jobs started within the same second have the same ID and can clash.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away"
    
    ### How was this patch tested?
    
    Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
    
    Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID
    
    Closes apache#30141 from steveloughran/SPARK-33230-jobId.
    
    Authored-by: Steve Loughran <stevel@cloudera.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 02fa19f)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    steveloughran authored and dongjoon-hyun committed Oct 26, 2020
    Commit: 22392be

Commits on Oct 27, 2020

  1. [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder…

    … is Stream
    
    ### What changes were proposed in this pull request?
    
    The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.
    
    ```sql
    SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
    FROM table_4
    GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
    ```
    
    When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later.
    
    Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.
    
    The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so.
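
    A standalone sketch (plain Scala, not the Spark source) of why mapping over a `Stream` defers its side effects until the result is traversed:

    ```scala
    // Only the head is evaluated eagerly by Stream.map; the remaining side
    // effects run later, when the stream is actually traversed.
    var calls = 0
    val mapped = Stream(1, 2, 3).map { x => calls += 1; x * 2 }
    println(calls)   // 1 -- only the head has been evaluated so far
    mapped.toList    // forces the rest of the stream
    println(calls)   // 3 -- the remaining side effects happened lazily
    ```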
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.
    
    Closes apache#30160 from ankurdave/SPARK-33260.
    
    Authored-by: Ankur Dave <ankurdave@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3f2a2b5)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ankurdave authored and dongjoon-hyun committed Oct 27, 2020
  2. [SPARK-33246][SQL][DOCS] Correct documentation for null semantics of …

    …"NULL AND False"
    
    ### What changes were proposed in this pull request?
    
    The documentation of the Spark SQL null semantics states that "NULL AND False" yields NULL.  This is incorrect.  "NULL AND False" yields False.
    
    ```
    Seq[(java.lang.Boolean, java.lang.Boolean)](
      (null, false)
    )
      .toDF("left_operand", "right_operand")
      .withColumn("AND", 'left_operand && 'right_operand)
      .show(truncate = false)
    
    +------------+-------------+-----+
    |left_operand|right_operand|AND  |
    +------------+-------------+-----+
    |null        |false        |false|
    +------------+-------------+-----+
    ```
    
    I propose the documentation be updated to reflect that "NULL AND False" yields False.
    
    This contribution is my original work and I license it to the project under the project’s open source license.
    
    ### Why are the changes needed?
    
    This change improves the accuracy of the documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.  This PR introduces a fix to the documentation.
    
    ### How was this patch tested?
    
    Since this is only a documentation change, no tests were added.
    
    Closes apache#30161 from stwhit/SPARK-33246.
    
    Authored-by: Stuart White <stuart@spotright.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 7d11d97)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    Stuart White authored and maropu committed Oct 27, 2020
  3. [SPARK-32090][SQL] Improve UserDefinedType.equal() to make it be symm…

    …etrical
    
    ### What changes were proposed in this pull request?
    
    This PR fixes `UserDefinedType.equal()` by comparing the UDT classes instead of checking `acceptsType()`.
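
    A minimal sketch (illustrative types, not the actual Spark classes) of the symmetric check this moves to, comparing user classes rather than calling `acceptsType()`:

    ```scala
    // Symmetric equality: comparing the two user classes gives the same answer
    // regardless of operand order, unlike an isAssignableFrom-style check.
    trait Udt { def userClass: Class[_] }

    def udtEquals(a: Udt, b: Udt): Boolean =
      a.userClass == b.userClass
    ```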
    
    ### Why are the changes needed?
    
    It's weird that an equality comparison between two UDT types can give a different result when the operands are switched:
    
    ```scala
    // ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
    val udt1 = new ExampleBaseTypeUDT
    val udt2 = new ExampleSubTypeUDT
    println(udt1 == udt2) // true
    println(udt2 == udt1) // false
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    Before:
    ```scala
    // ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
    val udt1 = new ExampleBaseTypeUDT
    val udt2 = new ExampleSubTypeUDT
    println(udt1 == udt2) // true
    println(udt2 == udt1) // false
    ```
    
    After:
    ```scala
    // ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
    val udt1 = new ExampleBaseTypeUDT
    val udt2 = new ExampleSubTypeUDT
    println(udt1 == udt2) // false
    println(udt2 == udt1) // false
    ```
    
    ### How was this patch tested?
    
    Added a unit test.
    
    Closes apache#28923 from Ngone51/fix-udt-equal.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    Ngone51 authored and maropu committed Oct 27, 2020

Commits on Oct 28, 2020

  1. [SPARK-33264][SQL][DOCS] Add a dedicated page for SQL-on-file in SQL …

    …documents
    
    ### What changes were proposed in this pull request?
    
    This PR intends to add a dedicated page for SQL-on-file in SQL documents.
    This comes from the comment: https://github.com/apache/spark/pull/30095/files#r508965149
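
    For reference, the feature the new page documents lets you query a file directly without registering a table first; a small spark-shell example (the path is illustrative):

    ```scala
    // SQL-on-file: the data source name and the file path form the "table".
    spark.sql("SELECT * FROM parquet.`/tmp/events`").show()
    ```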
    
    ### Why are the changes needed?
    
    For better documentations.
    
    ### Does this PR introduce _any_ user-facing change?
    
    <img width="544" alt="Screen Shot 2020-10-28 at 9 56 59" src="https://user-images.githubusercontent.com/692303/97378051-c1fbcb80-1904-11eb-86c0-a88c5269d41c.png">
    
    ### How was this patch tested?
    
    N/A
    
    Closes apache#30165 from maropu/DocForFile.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit c2bea04)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    maropu committed Oct 28, 2020
  2. [SPARK-33208][SQL] Update the document of SparkSession#sql

    Change-Id: I82db1f9e8f667573aa3a03e05152cbed0ea7686b
    
    ### What changes were proposed in this pull request?
    Update the documentation of SparkSession#sql to mention that this API eagerly runs DDL/DML commands, but not SELECT queries.
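
    A small spark-shell illustration of the documented behavior (the table name is illustrative):

    ```scala
    spark.sql("CREATE TABLE t(i INT) USING parquet")  // DDL runs eagerly here
    val df = spark.sql("SELECT * FROM t")             // SELECT is lazy: no job yet
    df.collect()                                      // the query executes here
    ```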
    
    ### Why are the changes needed?
    To clarify the behavior of SparkSession#sql.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Not needed.
    
    Closes apache#30168 from waitinfuture/master.
    
    Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit b26ae98)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    waitinfuture authored and cloud-fan committed Oct 28, 2020
  3. [SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values co…

    …ntains null
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix the NPE issue on the `In` filter when one of the values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which uses the hash code of the key, hence it can trigger the NPE.
    
    ### Why are the changes needed?
    
    This is an obvious bug, as the `In` filter doesn't account for null values when calculating its hash code.
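
    A standalone sketch of the failure mode (plain Scala, not the Spark source): hashing a value array that contains null blows up unless the null is handled explicitly.

    ```scala
    val values: Array[Any] = Array(1, 2, null)

    // values.map(_.hashCode()) throws a NullPointerException on the null entry.
    // A null-safe variant treats null as a fixed hash contribution:
    val hash = values.map(v => if (v == null) 0 else v.hashCode()).toSeq.hashCode()
    ```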
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Previously, a query having `null` in an "in" condition against a data source V2 table supporting filter push-down failed with an NPE, whereas after this PR the query will not fail.
    
    ### How was this patch tested?
    
    UT added. The new UT fails without the PR and passes with the PR.
    
    Closes apache#30170 from HeartSaVioR/SPARK-33267.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit a744fea)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HeartSaVioR authored and dongjoon-hyun committed Oct 28, 2020
  4. [SPARK-32119][CORE][3.0] ExecutorPlugin doesn't work with Standalone …

    …Cluster and Kubernetes with --jars
    
    ### What changes were proposed in this pull request?
    
    This is a backport PR for branch-3.0.
    
    This PR changes Executor to load jars and files added by --jars and --files on Executor initialization.
    To avoid downloading those jars/files twice, they are associated with `startTime` as their upload timestamp.
    
    ### Why are the changes needed?
    
    ExecutorPlugin can't work with Standalone Cluster and Kubernetes
    when a jar which contains the plugins, and files used by the plugins, are added via the --jars and --files options of spark-submit.
    
    This is because jars and files added by --jars and --files are not loaded on Executor initialization.
    I confirmed it works with YARN because jars/files are distributed via the distributed cache.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. jars/files added by --jars and --files are downloaded on each executor on initialization.
    
    ### How was this patch tested?
    
    Added a new testcase.
    
    Closes apache#29621 from sarutak/fix-plugin-issue-3.0.
    
    Lead-authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Co-authored-by: Luca Canali <luca.canali@cern.ch>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    2 people authored and dongjoon-hyun committed Oct 28, 2020

Commits on Oct 29, 2020

  1. [SQL][MINOR] Update from_unixtime doc

    This PR fixes the from_unixtime documentation to show that fmt is an optional parameter.
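
    For reference, both call forms are valid (spark-shell):

    ```scala
    spark.sql("SELECT from_unixtime(0)").show()                // default format
    spark.sql("SELECT from_unixtime(0, 'yyyy-MM-dd')").show()  // explicit fmt
    ```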
    
    Yes, documentation update.
    **Before change:**
    ![image](https://user-images.githubusercontent.com/4176173/97497659-18c6cc80-1928-11eb-93d8-453ef627ac7c.png)
    
    **After change:**
    ![image](https://user-images.githubusercontent.com/4176173/97496153-c5537f00-1925-11eb-8102-457e85e019d5.png)
    
    Style check using: ./dev/run-tests
    Manual check and screenshotting with: ./sql/create-docs.sh
    Manual verification of behavior with latest spark-sql binary.
    
    Closes apache#30176 from Obbay2/from_unixtime_doc.
    
    Authored-by: Nathan Wreggit <obbay2@hotmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    Obbay2 authored and HyukjinKwon committed Oct 29, 2020

Commits on Oct 30, 2020

  1. [SPARK-33292][SQL] Make Literal ArrayBasedMapData string representati…

    …on disambiguous
    
    ### What changes were proposed in this pull request?
    
    This PR aims to wrap `ArrayBasedMapData` literal representation with `map(...)`.
    
    ### Why are the changes needed?
    
    A literal ArrayBasedMapData has an inconsistent string representation between the `LogicalPlan` and the `Optimized Logical Plan`/`Physical Plan`. Also, the representation in the `Optimized Logical Plan` and `Physical Plan` is ambiguous, e.g. `[1 AS a#0, keys: [key1], values: [value1] AS b#1]`.
    
    **BEFORE**
    ```scala
    scala> spark.version
    res0: String = 2.4.7
    
    scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
    == Parsed Logical Plan ==
    'Project [1 AS a#0, 'map(key1, value1) AS b#1]
    +- OneRowRelation
    
    == Analyzed Logical Plan ==
    a: int, b: map<string,string>
    Project [1 AS a#0, map(key1, value1) AS b#1]
    +- OneRowRelation
    
    == Optimized Logical Plan ==
    Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
    +- OneRowRelation
    
    == Physical Plan ==
    *(1) Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
    +- Scan OneRowRelation[]
    ```
    
    **AFTER**
    ```scala
    scala> spark.version
    res0: String = 3.1.0-SNAPSHOT
    
    scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
    == Parsed Logical Plan ==
    'Project [1 AS a#4, 'map(key1, value1) AS b#5]
    +- OneRowRelation
    
    == Analyzed Logical Plan ==
    a: int, b: map<string,string>
    Project [1 AS a#4, map(key1, value1) AS b#5]
    +- OneRowRelation
    
    == Optimized Logical Plan ==
    Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
    +- OneRowRelation
    
    == Physical Plan ==
    *(1) Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
    +- *(1) Scan OneRowRelation[]
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This changes the query plan's string representation in the `explain` command and the UI. However, this is a bug fix.
    
    ### How was this patch tested?
    
    Pass the CI with the newly added test case.
    
    Closes apache#30190 from dongjoon-hyun/SPARK-33292.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 838791b)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 30, 2020
  2. [SPARK-33268][SQL][PYTHON][3.0] Fix bugs for casting data from/to Pyt…

    …honUserDefinedType
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix bugs when casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows:
    ```
    >>> from pyspark.sql import Row
    >>> from pyspark.sql.functions import col
    >>> from pyspark.sql.types import *
    >>> from pyspark.testing.sqlutils import *
    >>>
    >>> row = Row(point=ExamplePoint(1.0, 2.0))
    >>> df = spark.createDataFrame([row])
    >>> df.select(col("point").cast(PythonOnlyUDT()))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
        jdf = self._jdf.select(self._jcols(*cols))
      File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
      File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
        return f(*a, **kw)
      File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
    : java.lang.NullPointerException
    	at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
    	at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
    	at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
    	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
    	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
    ```
    The root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.
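
    A minimal sketch of the null guard described above (illustrative signature, not the actual Spark source):

    ```scala
    // acceptsType must not dereference a null userClass.
    def acceptsType(thisUserClass: Class[_], otherUserClass: Class[_]): Boolean =
      thisUserClass != null && otherUserClass != null &&
        thisUserClass.isAssignableFrom(otherUserClass)
    ```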
    
    This backport comes from apache#30169.
    
    ### Why are the changes needed?
    
    Bug fixes.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests.
    
    Closes apache#30191 from maropu/SPARK-33268-BRANCH3.0.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    maropu authored and dongjoon-hyun committed Oct 30, 2020
  3. [SPARK-33183][SQL][3.0] Fix Optimizer rule EliminateSorts and add a p…

    …hysical rule to remove redundant sorts
    
    Backport apache#30093 for branch-3.0. I've updated the configuration version to 2.4.8.
    
    ### What changes were proposed in this pull request?
    This PR aims to fix a correctness bug in the optimizer rule EliminateSorts. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.
    
    ### Why are the changes needed?
    A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.
    ```
    Sort(orders, global = True, ...)
      Sort(orders, global = False, ...)
    ```
    Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.
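
    A small illustration of the invariant (assuming a spark-shell session): the outer global sort must survive even though its child already produces a per-partition ordering.

    ```scala
    val df = spark.range(100)
      .repartition(4)
      .sortWithinPartitions("id")  // local ordering only
      .sort("id")                  // global ordering -- must not be eliminated

    df.explain()  // the plan should still contain the global Sort node
    ```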
    
    ### Does this PR introduce any user-facing change?
    Yes
    
    ### How was this patch tested?
    Unit tests
    
    Closes apache#30195 from allisonwang-db/SPARK-33183-branch-3.0.
    
    Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    allisonwang-db authored and cloud-fan committed Oct 30, 2020

Commits on Oct 31, 2020

  1. [SPARK-33290][SQL] REFRESH TABLE should invalidate cache even though …

    …the table itself may not be cached
    
    ### What changes were proposed in this pull request?
    
    In `CatalogImpl.refreshTable`, this moves the `uncacheQuery` call out of the condition `if (cache.nonEmpty)` so that it will be called whether the table itself is cached or not.
    
    ### Why are the changes needed?
    
    In the case like the following:
    ```sql
    CREATE TABLE t ...;
    CREATE VIEW t1 AS SELECT * FROM t;
    REFRESH TABLE t;
    ```
    
    If the table `t` is refreshed, the view `t1` which is depending on `t` will not be invalidated. This could lead to incorrect result and is similar to [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765).
    
    On the other hand, if we have:
    
    ```sql
    CREATE TABLE t ...;
    CACHE TABLE t;
    CREATE VIEW t1 AS SELECT * FROM t;
    REFRESH TABLE t;
    ```
    
    Then the view `t1` will be refreshed. The behavior is somewhat inconsistent.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. With this change, any cache that depends on the refreshed table will be invalidated. Previously this only happened if the table itself was cached.
    
    ### How was this patch tested?
    
    Added a new UT for the case.
    
    Closes apache#30187 from sunchao/SPARK-33290.
    
    Authored-by: Chao Sun <sunchao@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 32b78d3)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    sunchao authored and dongjoon-hyun committed Oct 31, 2020
  2. [SPARK-33306][SQL] Timezone is needed when cast date to string

    ### What changes were proposed in this pull request?
    When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark will cast a date to a string when comparing a date with a string. In Spark 3, a timezone is needed when casting a date to a string, as in https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L309.
    
    However, the timezone may not be set because `CastBase.needsTimeZone` returns false for this kind of cast.
    
    A simple way to reproduce this is
    ```
    spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true
    
    ```
    when we execute the following sql,
    ```
    select a.d1 from
    (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
    join
    (select concat('2000-01-0', id) as d2 from range(1, 2)) b
    on a.d1 = b.d2
    ```
    it will throw
    ```
    java.util.NoSuchElementException: None.get
      at scala.None$.get(Option.scala:529)
      at scala.None$.get(Option.scala:527)
      at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
      at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
      at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
      at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
      at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
      at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
    ```
    
    ### Why are the changes needed?
    As described above, this is a bug.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Add more UT
    
    Closes apache#30213 from WangGuangxin/SPARK-33306.
    
    Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 69c27f4)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    WangGuangxin authored and dongjoon-hyun committed Oct 31, 2020

Commits on Nov 2, 2020

  1. [SPARK-33277][PYSPARK][SQL][3.0] Use ContextAwareIterator to stop con…

    …suming after the task ends
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30177.
    
    As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
    
    ### Why are the changes needed?
    
    A Python/Pandas UDF right after an off-heap vectorized reader could cause an executor crash.
    
    E.g.,:
    
    ```py
    spark.range(0, 100000, 1, 1).write.parquet(path)
    
    spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
    
    def f(x):
        return 0
    
    fUdf = udf(f, LongType())
    
    spark.read.parquet(path).select(fUdf('id')).head()
    ```
    
    This is because the Python evaluation consumes the parent iterator in a separate thread, and it can consume more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, this can cause a segmentation fault which crashes the executor.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests, and manually.
    
    Closes apache#30217 from ueshin/issues/SPARK-33277/3.0/python_pandas_udf.
    
    Authored-by: Takuya UESHIN <ueshin@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    ueshin authored and HyukjinKwon committed Nov 2, 2020
  2. [SPARK-33313][TESTS][R][3.0][2.4] Add testthat 3.x support

    ### What changes were proposed in this pull request?
    
    This PR proposes to port back apache#30219 but keeps testthat 1.x support in other branches.
    
    This PR modifies `R/pkg/tests/run-all.R` by:
    
    - ~Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.~
    - Add `testthat` 3.x support to avoid CI failures.
    
    ### Why are the changes needed?
    
    An internal API that is currently used has been removed in the latest `testthat` release.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`
    
    Closes apache#30220 from HyukjinKwon/SPARK-33313.
    
    Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
    Co-authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon and zero323 committed Nov 2, 2020

Commits on Nov 3, 2020

  1. [SPARK-33156][INFRA][3.0] Upgrade GithubAction image from 18.04 to 20.04

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `Github Action` runner image from `Ubuntu 18.04 (LTS)` to `Ubuntu 20.04 (LTS)`.
    
    ### Why are the changes needed?
    
    `ubuntu-latest` in `GitHub Action` is still `Ubuntu 18.04 (LTS)`.
    - https://github.com/actions/virtual-environments#available-environments
    
    This upgrade will prepare for the AmpLab Jenkins upgrade.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the `Github Action` in this PR.
    
    Closes apache#30231 from dongjoon-hyun/SPARK-33156-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 3, 2020
  2. [SPARK-24266][K8S][3.0] Restart the watcher when we receive a version…

    … changed from k8s
    
    ### What changes were proposed in this pull request?
    
    This is a straight application of apache#28423 onto branch-3.0
    
    Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.
    
    For more relevant information see here: fabric8io/kubernetes-client#1075
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    This was tested in apache#28423 by running spark-submit to a k8s cluster.
    
    Closes apache#29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.
    
    Authored-by: Stijn De Haes <stijndehaes@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    stijndehaes authored and dongjoon-hyun committed Nov 3, 2020
  3. [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to s…

    …ort the table will cause the header content to be lost
    
    ### What changes were proposed in this pull request?
    In the old version of Spark, sorting on the Storage UI page works as expected, but in the new version sorting causes the header content to be lost, so this PR fixes that bug.
    
    ### Why are the changes needed?
    
    The header of the table on the page is similar to the following; **note that each th contains a span element**:
    
    ```html
    <thead>
        <tr>
            ....
            <th width="" class="">
                  <span data-toggle="tooltip" title="" data-original-title="StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) andreplication factor of the persisted RDD">
                    Storage Level
                  </span>
            </th>
           .....
        </tr>
    </thead>
    ```
    
    Since [PR#26136](apache#26136), if a `th` in the table itself contains a `span` element, the `span` will be deleted after clicking to sort, and the original header content will be lost.
    
    There are three problems in `sorttable.js`:
    
    1. `sortrevind.class = "sorttable_sortrevind"` in [sorttable.js#107](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L107) and `sortfwdind.class = "sorttable_sortfwdind"` in [sorttable.js#125](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L125):
    the sorttable_xx value should be assigned to `className` instead of `class`, as the JavaScript uses `rowlists[j].className.search` rather than `rowlists[j].class.search` to determine whether the component has a sorting flag or not.
    2. `rowlists[j].className.search(/\sorttable_sortrevind\b/)` in [sorttable.js#120](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L120) was wrong. The original intention is to search whether `className` contains the word `sorttable_sortrevind`, but the expression is wrong: it should be `\bsorttable_sortrevind\b` instead of `\sorttable_sortrevind\b`.
    3. The if check in the following code snippet ([sorttable.js#141](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L141)) was wrong. **If the `search` function does not find the target, it returns -1, but Boolean(-1) actually evaluates to true.** This statement causes spans to be deleted even if they do not contain `sorttable_sortfwdind` or `sorttable_sortrevind`.
    ```javascript
    rowlists = this.parentNode.getElementsByTagName("span");
    for (var j=0; j < rowlists.length; j++) {
                  if (rowlists[j].className.search(/\bsorttable_sortfwdind\b/)
                      || rowlists[j].className.search(/\sorttable_sortrevind\b/) ) {
                      rowlists[j].parentNode.removeChild(rowlists[j]);
                  }
              }
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    NO.
    
    ### How was this patch tested?
    The manual test result of the ui page is as below:
    
    ![fix sorted](https://user-images.githubusercontent.com/52202080/97543194-daeaa680-1a02-11eb-8b11-8109c3e4e9a3.gif)
    
    Closes apache#30182 from akiyamaneko/ui_storage_sort_error.
    
    Authored-by: neko <echohlne@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 56c623e)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    echohlne authored and srowen committed Nov 3, 2020

Commits on Nov 4, 2020

  1. [SPARK-33333][BUILD][3.0] Upgrade Jetty to 9.4.28.v20200408

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade Jetty to 9.4.28.v20200408 in `branch-3.0` for `Apache Spark 3.0.2` like `master` branch `Apache Spark 3.1`.
    
    ### Why are the changes needed?
    
    To bring the bug fixes.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes apache#30240 from dongjoon-hyun/SPARK-33333.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    dongjoon-hyun authored and viirya committed Nov 4, 2020
  2. [SPARK-33338][SQL] GROUP BY using literal map should not fail

    ### What changes were proposed in this pull request?
    
    This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`.
    
    ### Why are the changes needed?
    
    This is a regression from Apache Spark 1.6.x.
    ```scala
    scala> sc.version
    res1: String = 1.6.3
    
    scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
    +---+
    |_c0|
    +---+
    | v1|
    +---+
    ```
    
    Apache Spark 2.x ~ 3.0.1 raises a `RuntimeException` for the following queries.
    ```sql
    CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
    SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
    SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
    SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
    ```
    
    **BEFORE**
    ```scala
    Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]apache#6]
    	at scala.sys.package$.error(package.scala:27)
    	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
    	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
    	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ```
    
    **AFTER**
    ```sql
    spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
    v1
    Time taken: 1.278 seconds, Fetched 1 row(s)
    spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
    v1
    Time taken: 0.313 seconds, Fetched 1 row(s)
    spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
    v1
    Time taken: 0.265 seconds, Fetched 1 row(s)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test case.
    
    Closes apache#30246 from dongjoon-hyun/SPARK-33338.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 42c0b17)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 4, 2020

Commits on Nov 5, 2020

  1. [SPARK-33162][INFRA][3.0] Use pre-built image at GitHub Action PySpar…

    …k jobs
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30059 .
    
    This PR aims to use a pre-built image for the GitHub Action PySpark jobs. To isolate the changes, the `pyspark` jobs are split from the main job. The docker image is built by the following.
    
    | Item                   | URL                |
    | --------------- | ------------- |
    | Dockerfile         | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile |
    | Builder               | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/.github/workflows/build.yml |
    | Image Location | https://hub.docker.com/r/dongjoon/apache-spark-github-action-image |
    
    Please note that:
    1. The community will still use `build_and_test.yml` to add new features, as we did until now. The `Dockerfile` will be updated regularly.
    2. When Apache Spark gets an official docker repository location, we will use it.
    3. Also, it would be best to keep this docker file and builder script in a new Apache Spark dev branch instead of an outside GitHub repository.
    
    ### Why are the changes needed?
    
    This will reduce the Python and its package installation time.
    
    **BEFORE (branch-3.0)**
    ![Screen Shot 2020-11-04 at 2 28 49 PM](https://user-images.githubusercontent.com/9700541/98174664-17f2e500-1eaa-11eb-9222-018eead9c418.png)
    
    **AFTER (branch-3.0)**
    ![Screen Shot 2020-11-04 at 2 29 43 PM](https://user-images.githubusercontent.com/9700541/98174758-378a0d80-1eaa-11eb-8e6a-929158c2fea3.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action on this PR without `package installation steps`.
    
    Closes apache#30253 from dongjoon-hyun/GHA-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 5, 2020
  2. [SPARK-33239][INFRA][3.0] Use pre-built image at GitHub Action SparkR…

    … job
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30066 .
    
    This PR aims to use a pre-built image for Github Action SparkR job.
    
    ### Why are the changes needed?
    
    This will reduce the execution time and the flakiness.
    
    **BEFORE (branch-3.0: 21 minutes 7 seconds)**
    ![Screen Shot 2020-11-04 at 8 53 50 PM](https://user-images.githubusercontent.com/9700541/98199386-e39a1b80-1edf-11eb-8dec-c6819ebb3f0d.png)
    
    **AFTER**
    No R and R package installation steps.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action `sparkr` job in this PR.
    
    Closes apache#30258 from dongjoon-hyun/SPARK-33239-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 5, 2020
  3. Revert "[SPARK-33277][PYSPARK][SQL][3.0] Use ContextAwareIterator to …

    …stop consuming after the task ends"
    
    This reverts commit 92ba08d.
    HyukjinKwon committed Nov 5, 2020
  4. [MINOR][SS][DOCS] Update join type in stream static joins code examples

    ### What changes were proposed in this pull request?
    Update join type in stream static joins code examples in structured streaming programming guide.
    1) Scala, Java and Python examples have a common issue.
        The join keyword is "right_join"; it should be "left_outer".
    
        _Reasons:_
        a) This code snippet is an example of a "left outer join", as the streaming df is on the left and the static df is on the right. Also, a right outer join between a stream df (left) and a static df (right) is not supported.
        b) The keywords "right_join"/"left_join" are unsupported; they should be "right_outer"/"left_outer".
    
    So, all of these code snippets have been updated to "left_outer".
    
    2) The R example is correct, but it shows a "right_outer" join with the static df on the left and the streaming df on the right.
    It is changed to "left_outer" to make it consistent with the other three examples in Scala, Java and Python.
    
    ### Why are the changes needed?
    To fix the mistake in example code of documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, it is a user-facing change (but documentation update only).
    
    **Screenshots 1: Scala/Java/python example (similar issue)**
    _Before:_
    <img width="941" alt="Screenshot 2020-11-05 at 12 16 09 AM" src="https://user-images.githubusercontent.com/62717942/98155351-19e59400-1efc-11eb-8142-e6a25a5e6497.png">
    
    _After:_
    <img width="922" alt="Screenshot 2020-11-05 at 12 17 12 AM" src="https://user-images.githubusercontent.com/62717942/98155503-5d400280-1efc-11eb-96e1-5ba0f3c35c82.png">
    
    **Screenshots 2: R example (Make it consistent with above change)**
    _Before:_
    <img width="896" alt="Screenshot 2020-11-05 at 12 19 57 AM" src="https://user-images.githubusercontent.com/62717942/98155685-ac863300-1efc-11eb-93bc-b7ca4dd34634.png">
    
    _After:_
    <img width="919" alt="Screenshot 2020-11-05 at 12 20 51 AM" src="https://user-images.githubusercontent.com/62717942/98155739-c0ca3000-1efc-11eb-8f95-a7538fa784b7.png">
    
    ### How was this patch tested?
    The change was tested locally.
    1) cd docs/
        SKIP_API=1 jekyll build
    2) Verify docs/_site/structured-streaming-programming-guide.html file in browser.
    
    Closes apache#30252 from sarveshdave1/doc-update-stream-static-joins.
    
    Authored-by: Sarvesh Dave <sarveshdave1@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit e66201b)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    sarveshdave1 authored and HeartSaVioR committed Nov 5, 2020
  5. [SPARK-33362][SQL] skipSchemaResolution should still require query to…

    … be resolved
    
    ### What changes were proposed in this pull request?
    
    Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved.
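
    A sketch of the intended invariant (illustrative names, not the exact Spark source): even when schema resolution is skipped, the table and the query must themselves be resolved.

    ```scala
    def resolved(tableResolved: Boolean,
                 queryResolved: Boolean,
                 outputResolved: Boolean,
                 skipSchemaResolution: Boolean): Boolean =
      tableResolved && queryResolved && (skipSchemaResolution || outputResolved)
    ```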
    
    ### Why are the changes needed?
    
    To prevent potential bugs where we skip resolving the input query.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    a new test
    
    Closes apache#30265 from cloud-fan/ds-minor-2.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 26ea417)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and dongjoon-hyun committed Nov 5, 2020

Commits on Nov 8, 2020

  1. [SPARK-32860][DOCS][SQL] Updating documentation about map support in …

    …Encoders
    
    ### What changes were proposed in this pull request?
    
    Javadocs updated for the encoder to include maps as a collection type
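
    For reference, a map-typed field is handled by the built-in encoders (spark-shell, with `spark.implicits._` in scope; the case class is illustrative):

    ```scala
    import spark.implicits._

    case class Event(id: Long, tags: Map[String, String])

    val ds = Seq(Event(1L, Map("env" -> "prod"))).toDS()
    ds.show(false)
    ```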
    
    ### Why are the changes needed?
    
    The javadocs were not updated along with the fix for SPARK-16706.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the javadocs are updated
    
    ### How was this patch tested?
    
    sbt was run to ensure it meets scalastyle
    
    Closes apache#30274 from hannahkamundson/SPARK-32860.
    
    Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
    Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1090b1b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Nov 8, 2020

Commits on Nov 9, 2020

  1. [SPARK-33371][PYTHON][3.0] Update setup.py and tests for Python 3.9

    ### What changes were proposed in this pull request?
    
    This PR is a backport of apache#30277
    
    This PR proposes to fix PySpark to officially support Python 3.9. The main code already works; we should just note that we support Python 3.9.
    
    Also, this PR includes some minor fixes in the test code:
    - `Thread.isAlive` is removed in Python 3.9, and `Thread.is_alive` exists in Python 3.6+, see https://docs.python.org/3/whatsnew/3.9.html#removed
    - Fixed `TaskContextTestsWithWorkerReuse.test_barrier_with_python_worker_reuse` and `TaskContextTests.test_barrier` to be less flaky. These become more flaky in Python 3.9 for some reason.
    
    NOTE that PyArrow does not support Python 3.9 yet.
    
    ### Why are the changes needed?
    
    To officially support Python 3.9.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it officially supports Python 3.9.
    
    ### How was this patch tested?
    
    Manually ran the tests:
    
    ```
    $  ./run-tests --python-executable=python
    Running PySpark tests. Output is in /.../spark/python/unit-tests.log
    Will test against the following Python executables: ['python']
    Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
    python python_implementation is CPython
    python version is: Python 3.9.0
    Starting test(python): pyspark.ml.tests.test_base
    Starting test(python): pyspark.ml.tests.test_evaluation
    Starting test(python): pyspark.ml.tests.test_algorithms
    Starting test(python): pyspark.ml.tests.test_feature
    Finished test(python): pyspark.ml.tests.test_base (12s)
    Starting test(python): pyspark.ml.tests.test_image
    Finished test(python): pyspark.ml.tests.test_evaluation (15s)
    Starting test(python): pyspark.ml.tests.test_linalg
    Finished test(python): pyspark.ml.tests.test_feature (25s)
    Starting test(python): pyspark.ml.tests.test_param
    Finished test(python): pyspark.ml.tests.test_image (17s)
    Starting test(python): pyspark.ml.tests.test_persistence
    Finished test(python): pyspark.ml.tests.test_param (17s)
    Starting test(python): pyspark.ml.tests.test_pipeline
    Finished test(python): pyspark.ml.tests.test_linalg (30s)
    Starting test(python): pyspark.ml.tests.test_stat
    Finished test(python): pyspark.ml.tests.test_pipeline (6s)
    Starting test(python): pyspark.ml.tests.test_training_summary
    Finished test(python): pyspark.ml.tests.test_stat (12s)
    Starting test(python): pyspark.ml.tests.test_tuning
    Finished test(python): pyspark.ml.tests.test_algorithms (68s)
    Starting test(python): pyspark.ml.tests.test_wrapper
    Finished test(python): pyspark.ml.tests.test_persistence (51s)
    Starting test(python): pyspark.mllib.tests.test_algorithms
    Finished test(python): pyspark.ml.tests.test_training_summary (33s)
    Starting test(python): pyspark.mllib.tests.test_feature
    Finished test(python): pyspark.ml.tests.test_wrapper (19s)
    Starting test(python): pyspark.mllib.tests.test_linalg
    Finished test(python): pyspark.mllib.tests.test_feature (26s)
    Starting test(python): pyspark.mllib.tests.test_stat
    Finished test(python): pyspark.mllib.tests.test_stat (22s)
    Starting test(python): pyspark.mllib.tests.test_streaming_algorithms
    Finished test(python): pyspark.mllib.tests.test_algorithms (53s)
    Starting test(python): pyspark.mllib.tests.test_util
    Finished test(python): pyspark.mllib.tests.test_linalg (54s)
    Starting test(python): pyspark.sql.tests.test_arrow
    Finished test(python): pyspark.sql.tests.test_arrow (0s) ... 61 tests were skipped
    Starting test(python): pyspark.sql.tests.test_catalog
    Finished test(python): pyspark.mllib.tests.test_util (11s)
    Starting test(python): pyspark.sql.tests.test_column
    Finished test(python): pyspark.sql.tests.test_catalog (16s)
    Starting test(python): pyspark.sql.tests.test_conf
    Finished test(python): pyspark.sql.tests.test_column (17s)
    Starting test(python): pyspark.sql.tests.test_context
    Finished test(python): pyspark.sql.tests.test_context (6s) ... 3 tests were skipped
    Starting test(python): pyspark.sql.tests.test_dataframe
    Finished test(python): pyspark.sql.tests.test_conf (11s)
    Starting test(python): pyspark.sql.tests.test_datasources
    Finished test(python): pyspark.sql.tests.test_datasources (19s)
    Starting test(python): pyspark.sql.tests.test_functions
    Finished test(python): pyspark.sql.tests.test_dataframe (35s) ... 3 tests were skipped
    Starting test(python): pyspark.sql.tests.test_group
    Finished test(python): pyspark.sql.tests.test_functions (32s)
    Starting test(python): pyspark.sql.tests.test_pandas_cogrouped_map
    Finished test(python): pyspark.sql.tests.test_pandas_cogrouped_map (1s) ... 15 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_grouped_map
    Finished test(python): pyspark.sql.tests.test_group (19s)
    Starting test(python): pyspark.sql.tests.test_pandas_map
    Finished test(python): pyspark.sql.tests.test_pandas_grouped_map (0s) ... 21 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf
    Finished test(python): pyspark.sql.tests.test_pandas_map (0s) ... 6 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg
    Finished test(python): pyspark.sql.tests.test_pandas_udf (0s) ... 6 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_scalar
    Finished test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg (0s) ... 13 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_typehints
    Finished test(python): pyspark.sql.tests.test_pandas_udf_scalar (0s) ... 50 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_window
    Finished test(python): pyspark.sql.tests.test_pandas_udf_typehints (0s) ... 10 tests were skipped
    Starting test(python): pyspark.sql.tests.test_readwriter
    Finished test(python): pyspark.sql.tests.test_pandas_udf_window (0s) ... 14 tests were skipped
    Starting test(python): pyspark.sql.tests.test_serde
    Finished test(python): pyspark.sql.tests.test_serde (19s)
    Starting test(python): pyspark.sql.tests.test_session
    Finished test(python): pyspark.mllib.tests.test_streaming_algorithms (120s)
    Starting test(python): pyspark.sql.tests.test_streaming
    Finished test(python): pyspark.sql.tests.test_readwriter (25s)
    Starting test(python): pyspark.sql.tests.test_types
    Finished test(python): pyspark.ml.tests.test_tuning (208s)
    Starting test(python): pyspark.sql.tests.test_udf
    Finished test(python): pyspark.sql.tests.test_session (31s)
    Starting test(python): pyspark.sql.tests.test_utils
    Finished test(python): pyspark.sql.tests.test_streaming (35s)
    Starting test(python): pyspark.streaming.tests.test_context
    Finished test(python): pyspark.sql.tests.test_types (34s)
    Starting test(python): pyspark.streaming.tests.test_dstream
    Finished test(python): pyspark.sql.tests.test_utils (14s)
    Starting test(python): pyspark.streaming.tests.test_kinesis
    Finished test(python): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
    Starting test(python): pyspark.streaming.tests.test_listener
    Finished test(python): pyspark.streaming.tests.test_listener (11s)
    Starting test(python): pyspark.tests.test_appsubmit
    Finished test(python): pyspark.sql.tests.test_udf (39s)
    Starting test(python): pyspark.tests.test_broadcast
    Finished test(python): pyspark.streaming.tests.test_context (23s)
    Starting test(python): pyspark.tests.test_conf
    Finished test(python): pyspark.tests.test_conf (15s)
    Starting test(python): pyspark.tests.test_context
    Finished test(python): pyspark.tests.test_broadcast (33s)
    Starting test(python): pyspark.tests.test_daemon
    Finished test(python): pyspark.tests.test_daemon (5s)
    Starting test(python): pyspark.tests.test_install_spark
    Finished test(python): pyspark.tests.test_context (44s)
    Starting test(python): pyspark.tests.test_join
    Finished test(python): pyspark.tests.test_appsubmit (68s)
    Starting test(python): pyspark.tests.test_profiler
    Finished test(python): pyspark.tests.test_join (7s)
    Starting test(python): pyspark.tests.test_rdd
    Finished test(python): pyspark.tests.test_profiler (9s)
    Starting test(python): pyspark.tests.test_rddbarrier
    Finished test(python): pyspark.tests.test_rddbarrier (7s)
    Starting test(python): pyspark.tests.test_readwrite
    Finished test(python): pyspark.streaming.tests.test_dstream (107s)
    Starting test(python): pyspark.tests.test_serializers
    Finished test(python): pyspark.tests.test_serializers (8s)
    Starting test(python): pyspark.tests.test_shuffle
    Finished test(python): pyspark.tests.test_readwrite (14s)
    Starting test(python): pyspark.tests.test_taskcontext
    Finished test(python): pyspark.tests.test_install_spark (65s)
    Starting test(python): pyspark.tests.test_util
    Finished test(python): pyspark.tests.test_shuffle (8s)
    Starting test(python): pyspark.tests.test_worker
    Finished test(python): pyspark.tests.test_util (5s)
    Starting test(python): pyspark.accumulators
    Finished test(python): pyspark.accumulators (5s)
    Starting test(python): pyspark.broadcast
    Finished test(python): pyspark.broadcast (6s)
    Starting test(python): pyspark.conf
    Finished test(python): pyspark.tests.test_worker (14s)
    Starting test(python): pyspark.context
    Finished test(python): pyspark.conf (4s)
    Starting test(python): pyspark.ml.classification
    Finished test(python): pyspark.tests.test_rdd (60s)
    Starting test(python): pyspark.ml.clustering
    Finished test(python): pyspark.context (21s)
    Starting test(python): pyspark.ml.evaluation
    Finished test(python): pyspark.tests.test_taskcontext (69s)
    Starting test(python): pyspark.ml.feature
    Finished test(python): pyspark.ml.evaluation (26s)
    Starting test(python): pyspark.ml.fpm
    Finished test(python): pyspark.ml.clustering (45s)
    Starting test(python): pyspark.ml.functions
    Finished test(python): pyspark.ml.fpm (24s)
    Starting test(python): pyspark.ml.image
    Finished test(python): pyspark.ml.functions (17s)
    Starting test(python): pyspark.ml.linalg.__init__
    Finished test(python): pyspark.ml.linalg.__init__ (0s)
    Starting test(python): pyspark.ml.recommendation
    Finished test(python): pyspark.ml.classification (74s)
    Starting test(python): pyspark.ml.regression
    Finished test(python): pyspark.ml.image (8s)
    Starting test(python): pyspark.ml.stat
    Finished test(python): pyspark.ml.stat (29s)
    Starting test(python): pyspark.ml.tuning
    Finished test(python): pyspark.ml.regression (53s)
    Starting test(python): pyspark.mllib.classification
    Finished test(python): pyspark.ml.tuning (35s)
    Starting test(python): pyspark.mllib.clustering
    Finished test(python): pyspark.ml.feature (103s)
    Starting test(python): pyspark.mllib.evaluation
    Finished test(python): pyspark.mllib.classification (33s)
    Starting test(python): pyspark.mllib.feature
    Finished test(python): pyspark.mllib.evaluation (21s)
    Starting test(python): pyspark.mllib.fpm
    Finished test(python): pyspark.ml.recommendation (103s)
    Starting test(python): pyspark.mllib.linalg.__init__
    Finished test(python): pyspark.mllib.linalg.__init__ (1s)
    Starting test(python): pyspark.mllib.linalg.distributed
    Finished test(python): pyspark.mllib.feature (26s)
    Starting test(python): pyspark.mllib.random
    Finished test(python): pyspark.mllib.fpm (23s)
    Starting test(python): pyspark.mllib.recommendation
    Finished test(python): pyspark.mllib.clustering (50s)
    Starting test(python): pyspark.mllib.regression
    Finished test(python): pyspark.mllib.random (13s)
    Starting test(python): pyspark.mllib.stat.KernelDensity
    Finished test(python): pyspark.mllib.stat.KernelDensity (1s)
    Starting test(python): pyspark.mllib.stat._statistics
    Finished test(python): pyspark.mllib.linalg.distributed (42s)
    Starting test(python): pyspark.mllib.tree
    Finished test(python): pyspark.mllib.stat._statistics (19s)
    Starting test(python): pyspark.mllib.util
    Finished test(python): pyspark.mllib.regression (33s)
    Starting test(python): pyspark.profiler
    Finished test(python): pyspark.mllib.recommendation (36s)
    Starting test(python): pyspark.rdd
    Finished test(python): pyspark.profiler (9s)
    Starting test(python): pyspark.resource.tests.test_resources
    Finished test(python): pyspark.mllib.tree (19s)
    Starting test(python): pyspark.serializers
    Finished test(python): pyspark.mllib.util (21s)
    Starting test(python): pyspark.shuffle
    Finished test(python): pyspark.resource.tests.test_resources (9s)
    Starting test(python): pyspark.sql.avro.functions
    Finished test(python): pyspark.shuffle (1s)
    Starting test(python): pyspark.sql.catalog
    Finished test(python): pyspark.rdd (22s)
    Starting test(python): pyspark.sql.column
    Finished test(python): pyspark.serializers (12s)
    Starting test(python): pyspark.sql.conf
    Finished test(python): pyspark.sql.conf (6s)
    Starting test(python): pyspark.sql.context
    Finished test(python): pyspark.sql.catalog (14s)
    Starting test(python): pyspark.sql.dataframe
    Finished test(python): pyspark.sql.avro.functions (15s)
    Starting test(python): pyspark.sql.functions
    Finished test(python): pyspark.sql.column (24s)
    Starting test(python): pyspark.sql.group
    Finished test(python): pyspark.sql.context (20s)
    Starting test(python): pyspark.sql.pandas.conversion
    Finished test(python): pyspark.sql.pandas.conversion (13s)
    Starting test(python): pyspark.sql.pandas.group_ops
    Finished test(python): pyspark.sql.group (36s)
    Starting test(python): pyspark.sql.pandas.map_ops
    Finished test(python): pyspark.sql.pandas.group_ops (21s)
    Starting test(python): pyspark.sql.pandas.serializers
    Finished test(python): pyspark.sql.pandas.serializers (0s)
    Starting test(python): pyspark.sql.pandas.typehints
    Finished test(python): pyspark.sql.pandas.typehints (0s)
    Starting test(python): pyspark.sql.pandas.types
    Finished test(python): pyspark.sql.pandas.types (0s)
    Starting test(python): pyspark.sql.pandas.utils
    Finished test(python): pyspark.sql.pandas.utils (0s)
    Starting test(python): pyspark.sql.readwriter
    Finished test(python): pyspark.sql.dataframe (56s)
    Starting test(python): pyspark.sql.session
    Finished test(python): pyspark.sql.functions (57s)
    Starting test(python): pyspark.sql.streaming
    Finished test(python): pyspark.sql.pandas.map_ops (12s)
    Starting test(python): pyspark.sql.types
    Finished test(python): pyspark.sql.types (10s)
    Starting test(python): pyspark.sql.udf
    Finished test(python): pyspark.sql.streaming (16s)
    Starting test(python): pyspark.sql.window
    Finished test(python): pyspark.sql.session (19s)
    Starting test(python): pyspark.streaming.util
    Finished test(python): pyspark.streaming.util (0s)
    Starting test(python): pyspark.util
    Finished test(python): pyspark.util (0s)
    Finished test(python): pyspark.sql.readwriter (24s)
    Finished test(python): pyspark.sql.udf (13s)
    Finished test(python): pyspark.sql.window (14s)
    Tests passed in 780 seconds
    
    ```
    
    Closes apache#30288 from HyukjinKwon/SPARK-33371-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon authored and dongjoon-hyun committed Nov 9, 2020
    Configuration menu
    Copy the full SHA
    808dd8f View commit details
    Browse the repository at this point in the history
  2. [SPARK-33372][SQL] Fix InSet bucket pruning

    ### What changes were proposed in this pull request?
    
    This PR fixes `InSet` bucket pruning, because its values are plain Catalyst values rather than `Literal` expressions:
    https://github.com/apache/spark/blob/cbd3fdea62dab73fc4a96702de8fd1f07722da66/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L255
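    For reference, a minimal sketch of the difference between the two predicate shapes (illustrative only; the attribute and values here are made up, not the PR's test case):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{AttributeReference, In, InSet, Literal}
    import org.apache.spark.sql.types.LongType
    
    val a = AttributeReference("a", LongType)()
    // In keeps its values as Literal child expressions...
    val inPredicate = In(a, Seq(Literal(1L), Literal(2L)))
    // ...while InSet (produced by OptimizeIn for long value lists) keeps raw Catalyst
    // values in `hset`, so bucket pruning must not pattern-match on Literal here.
    val inSetPredicate = InSet(a, Set(1L, 2L))
    ```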
    
    ### Why are the changes needed?
    
    Fix bug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test and manual test:
    
    ```scala
    spark.sql("select id as a, id as b from range(10000)").write.bucketBy(100, "a").saveAsTable("t")
    spark.sql("select * from t where a in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)").show
    ```
    
    Before this PR | After this PR
    -- | --
    ![image](https://user-images.githubusercontent.com/5399861/98380788-fb120980-2083-11eb-8fae-4e21ad873e9b.png) | ![image](https://user-images.githubusercontent.com/5399861/98381095-5ba14680-2084-11eb-82ca-2d780c85305c.png)
    
    Closes apache#30279 from wangyum/SPARK-33372.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 69799c5)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangyum authored and cloud-fan committed Nov 9, 2020
    Configuration menu
    Copy the full SHA
    c157fa3 View commit details
    Browse the repository at this point in the history

Commits on Nov 10, 2020

  1. [SPARK-33397][YARN][DOC] Fix generating md to html for available-patt…

    …erns-for-shs-custom-executor-log-url
    
    ### What changes were proposed in this pull request?
    
    1. replace `{{}}` with `&#123;&#123;&#125;&#125;`
    2. using `<code></code>` in td-tag
    
    ### Why are the changes needed?
    
    to fix this.
    ![image](https://user-images.githubusercontent.com/8326978/98544155-8c74bc00-22ce-11eb-8889-8dacb726b762.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    yes, you will see the correct online doc with this change
    
    ![image](https://user-images.githubusercontent.com/8326978/98545256-2e48d880-22d0-11eb-9dd9-b8cae3df8659.png)
    
    ### How was this patch tested?
    
    shown as the above pic via jekyll serve.
    
    Closes apache#30298 from yaooqinn/SPARK-33397.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 036c11b)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    a418495 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33405][BUILD][3.0] Upgrade commons-compress to 1.20

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `commons-compress` from 1.8 to 1.20.
    
    ### Why are the changes needed?
    
    - https://commons.apache.org/proper/commons-compress/security-reports.html
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes apache#30305 from dongjoon-hyun/SPARK-33405-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    1aa8f4f View commit details
    Browse the repository at this point in the history
  3. [SPARK-33391][SQL] element_at with CreateArray not respect one based …

    …index
    
    ### What changes were proposed in this pull request?
    
    element_at with CreateArray does not respect the one-based index.
    
    Repro steps:
    
    ```
    var df = spark.sql("select element_at(array(3, 2, 1), 0)")
    df.printSchema()
    
    df = spark.sql("select element_at(array(3, 2, 1), 1)")
    df.printSchema()
    
    df = spark.sql("select element_at(array(3, 2, 1), 2)")
    df.printSchema()
    
    df = spark.sql("select element_at(array(3, 2, 1), 3)")
    df.printSchema()
    
    root
     |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
    
    root
     |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
    
    root
     |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
    
    root
     |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
    
    The correct nullability should be:
    index 0: true  (out of bounds, so the result is nullable by default)
    index 1: false
    index 2: false
    index 3: false
    
    ```
    
    Expression evaluation respects the one-based index, but the nullability check in `computeNullabilityFromArray` computes it with a zero-based index.
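    
    A minimal sketch of the intended nullability computation (illustrative helper, not the actual `computeNullabilityFromArray` code):
    
    ```scala
    // Given the nullabilities of the CreateArray children and a literal one-based index,
    // the result is nullable only if the index is out of bounds or the addressed child
    // is itself nullable.
    def nullabilityForOneBasedIndex(childNullable: Seq[Boolean], index: Int): Boolean =
      if (index < 1 || index > childNullable.length) true   // out of bounds => null result
      else childNullable(index - 1)                          // one-based -> zero-based
    ```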
    
    ### Why are the changes needed?
    
    Correctness issue.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added UT and existing UT.
    
    Closes apache#30296 from leanken/leanken-SPARK-33391.
    
    Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit e3a768d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    leanken-zz authored and cloud-fan committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    b905d65 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33339][PYTHON] Pyspark application will hang due to non Except…

    …ion error
    
    ### What changes were proposed in this pull request?
    
    When a SystemExit exception occurs during processing, the Python worker exits abnormally while the executor task is still waiting to read from the socket, causing it to hang.
    The SystemExit exception may be raised by the user's code, but Spark should at least throw an error to remind the user instead of getting stuck.
    We can run a simple test to reproduce this case:
    
    ```
    from pyspark.sql import SparkSession
    def err(line):
      raise SystemExit
    spark = SparkSession.builder.appName("test").getOrCreate()
    spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
    spark.stop()
    ```
    
    ### Why are the changes needed?
    
    To make sure a PySpark application won't hang if there's a non-Exception error in the Python worker.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    added a new test and also manually tested the case above
    
    Closes apache#30248 from li36909/pyspark.
    
    Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local>
    Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 27bb40b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    lrz and HyukjinKwon committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    4a1c143 View commit details
    Browse the repository at this point in the history

Commits on Nov 11, 2020

  1. [SPARK-33417][SQL][TEST] Correct the behaviour of query filters in TP…

    …CDSQueryBenchmark
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use the option `--query-filter` to select which TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But the current master handles this option incorrectly: for example, if we pass `--query-filter q6` intending to run only the TPCDS q6 query, `TPCDSQueryBenchmark` runs both `q6` and `q6-v2.7`, because the `filterQueries` method does not respect the name suffix. So there is currently no way to run TPCDS q6 only.
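    
    A rough sketch of suffix-aware filtering (illustrative signature; the real `filterQueries` lives in `TPCDSQueryBenchmark`):
    
    ```scala
    // When filtering the "-v2.7" query set, the caller passes nameSuffix = "-v2.7" and the
    // base query names; a variant is then selected only if the filter names it explicitly
    // (e.g. "q6-v2.7"), so a plain "q6" in the filter no longer matches "q6-v2.7".
    def filterQueries(
        origQueries: Seq[String],
        queryFilter: Set[String],
        nameSuffix: String = ""): Seq[String] = {
      if (queryFilter.isEmpty) {
        origQueries
      } else if (nameSuffix.nonEmpty) {
        origQueries.filter(name => queryFilter.contains(s"$name$nameSuffix"))
      } else {
        origQueries.filter(queryFilter.contains)
      }
    }
    ```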
    
    ### Why are the changes needed?
    
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked.
    
    Closes apache#30324 from maropu/FilterBugInTPCDSQueryBenchmark.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 4b36797)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    maropu committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    577dbb9 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33412][SQL][3.0] OverwriteByExpression should resolve its dele…

    …te condition based on the table relation not the input query
    
    backport apache#30318 to 3.0
    
    Closes apache#30328 from cloud-fan/backport.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    1e2984b View commit details
    Browse the repository at this point in the history
  3. [SPARK-33402][CORE] Jobs launched in same second have duplicate MapRe…

    …duce JobIDs
    
    ### What changes were proposed in this pull request?
    
    1. Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes in a unique job UUID in `spark.sql.sources.writeJobUUID`
    1. `SparkHadoopWriterUtils.createJobTrackerID` generates a JobID by appending a random long number to the supplied timestamp to ensure the probability of a collision is near-zero (a minimal sketch follows this list).
    1. With tests of uniqueness, round trips and negative jobID rejection.
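    
    A minimal sketch of the ID generation described in item 2 (format and helper name are illustrative, not the exact code added by this PR):
    
    ```scala
    import java.text.SimpleDateFormat
    import java.util.{Date, Locale}
    import scala.util.Random
    
    // Append a non-negative random long to the second-granularity timestamp, so two jobs
    // started within the same second still get distinct job tracker IDs.
    def createUniqueJobTrackerID(startTime: Date): String = {
      val base = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(startTime)
      val rand = Random.nextLong() & Long.MaxValue   // mask the sign bit => non-negative
      s"$base$rand"
    }
    ```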
    
    ### Why are the changes needed?
    
    Without this, if more than one job is started in the same second, any committer that expects application attempt IDs to be unique is at risk of clashing with the other jobs.
    
    With the fix,
    
    * those committers which use the ID set in `spark.sql.sources.writeJobUUID` as a priority ID will pick that up instead and so be unique.
    * committers which use the Hadoop JobID for unique paths and filenames will get the randomly generated jobID. Assuming all clocks in a cluster are in sync, the probability of a collision between two jobs launched in the same second drops from 1 to 1/(2^63).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit tests
    
    There's a new test suite SparkHadoopWriterUtilsSuite which creates jobID, verifies they are unique even for the same timestamp and that they can be marshalled to string and parsed back in the hadoop code, which contains some (brittle) assumptions about the format of job IDs.
    
    Functional Integration Tests
    
    1. Hadoop-trunk built with [HADOOP-17318], publishing to local maven repository
    1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs.
    1. Spark + Object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) were built against that local spark version
    1. And executed against AWS london.
    
    The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a committers fail fast if they don't get a job ID down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option. It again uses the current Date value for an app attempt, which is not guaranteed to be unique.
    
    With the change applied to spark, the relevant tests work, therefore the committers are getting unique job IDs.
    
    Closes apache#30319 from steveloughran/BUG/SPARK-33402-jobuuid.
    
    Authored-by: Steve Loughran <stevel@cloudera.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 318a173)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    steveloughran authored and dongjoon-hyun committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    3edec10 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33404][SQL][3.0] Fix incorrect results in date_trunc expression

    **Backport apache#30303 to 3.0**
    
    ### What changes were proposed in this pull request?
    The following query produces incorrect results:
    ```
    SELECT date_trunc('minute', '1769-10-17 17:10:02')
    ```
    Spark currently incorrectly returns
    ```
    1769-10-17 17:10:02
    ```
    against the expected return value of
    ```
    1769-10-17 17:10:00
    ```
    **Steps to repro**
    Run the following commands in spark-shell:
    ```
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
    ```
    This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`.
    This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`.
    
    This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit.
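    
    For intuition, a small sketch with the JDK time API (not Spark's internal `DateTimeUtils`), assuming the session time zone is America/Los_Angeles; before standardized time zones, its local-mean-time offset was not a whole number of minutes, which is why even second/minute truncation needs zone knowledge:
    
    ```scala
    import java.time.{ZoneId, ZonedDateTime}
    import java.time.temporal.ChronoUnit
    
    val zone = ZoneId.of("America/Los_Angeles")
    val ts = ZonedDateTime.of(1769, 10, 17, 17, 10, 2, 0, zone)
    // Truncation is applied to the local (zone-adjusted) wall-clock time.
    val truncated = ts.truncatedTo(ChronoUnit.MINUTES)  // 1769-10-17T17:10 in that zone
    ```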
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added new tests to `DateTimeUtilsSuite` which previously failed and pass now.
    
    Closes apache#30339 from utkarsh39/date_trunc_fix_3.0.
    
    Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    utkarsh39 authored and dongjoon-hyun committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    00be83a View commit details
    Browse the repository at this point in the history

Commits on Nov 12, 2020

  1. [SPARK-33408][K8S][R][3.0] Use R 3.6.3 in K8s R image

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade K8s R image to use R 3.6.3 which is the same version installed in Jenkins Servers.
    
    ### Why are the changes needed?
    
    Jenkins Server is using `R 3.6.3`.
    ```
    + SPARK_HOME=/home/jenkins/workspace/SparkPullRequestBuilder-K8s
    + /usr/bin/R CMD check --as-cran --no-tests SparkR_3.1.0.tar.gz
    * using log directory ‘/home/jenkins/workspace/SparkPullRequestBuilder-K8s/R/SparkR.Rcheck’
    * using R version 3.6.3 (2020-02-29)
    ```
    
    The OpenJDK docker image is using `R 3.5.2 (2018-12-20)`, which is old, and currently `spark-3.0.1` fails to run SparkR.
    ```
    $ cd spark-3.0.1-bin-hadoop3.2
    
    $ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile -n build
    
    $ bin/spark-submit --master k8s://https://192.168.64.49:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=spark-r:latest local:///opt/spark/examples/src/main/r/dataframe.R
    
    $ k logs dataframe-r-b1c14b75b0c09eeb-driver
    ...
    + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.RRunner local:///opt/spark/examples/src/main/r/dataframe.R
    20/11/10 06:03:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Error: package or namespace load failed for ‘SparkR’ in rbind(info, getNamespaceInfo(env, "S3methods")):
     number of columns of matrices must match (see arg 2)
    In addition: Warning message:
    package ‘SparkR’ was built under R version 4.0.2
    Execution halted
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Pass K8s IT.
    
    Closes apache#30310 from dongjoon-hyun/SPARK-33408.
    
    Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Nov 12, 2020
    Configuration menu
    Copy the full SHA
    2eadedc View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOC] spark.executor.memoryOverhead is not cluster-mode only

    ### What changes were proposed in this pull request?
    
    Remove "in cluster mode" from the description of `spark.executor.memoryOverhead`
    
    ### Why are the changes needed?
    
    Fix a correctness issue in the documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users will no longer be confused by the description of `spark.executor.memoryOverhead`.
    
    ### How was this patch tested?
    
    pass GA doc generation
    
    Closes apache#30311 from yaooqinn/minordoc.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 4335af0)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Nov 12, 2020
    Configuration menu
    Copy the full SHA
    5ee76e6 View commit details
    Browse the repository at this point in the history

Commits on Nov 13, 2020

  1. [SPARK-33435][SQL][3.0] DSv2: REFRESH TABLE should invalidate caches …

    …referencing the table
    
    ### What changes were proposed in this pull request?
    
    This is a backport for PR apache#30359.
    
    This changes `RefreshTableExec` in DSv2 to also invalidate caches that reference the target table to be refreshed. The change itself is similar to what's done in apache#30211. Note, though, that since we currently don't support caching a DSv2 table directly, this doesn't add recache logic as in the DSv1 implementation. I marked it as a TODO for now.
    
    Note there are some conflicts in the backport: in branch-3.0's `DataSourceV2Strategy` we don't have a `ResolvedTable` when analyzing `RefreshTable`, so instead `RefreshTableExec` loads the table from the catalog if it exists; the rest is the same.
    
    ### Why are the changes needed?
    
    Currently the behavior in DSv1 and DSv2 is inconsistent w.r.t. refreshing a table: in DSv1 we invalidate both the metadata cache and all table caches related to the table, but in DSv2 we only do the former. This addresses the issue and makes the behavior consistent.
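    
    A minimal usage-level illustration (assuming a DSv2 catalog registered as `testcat` and an existing table `testcat.ns.tbl`; the names are made up):
    
    ```scala
    // Cache a query that reads the v2 table.
    spark.sql("CACHE TABLE cached_v2 AS SELECT * FROM testcat.ns.tbl")
    
    // After this change, refreshing the table also invalidates `cached_v2`,
    // because the cached plan references the refreshed table.
    spark.sql("REFRESH TABLE testcat.ns.tbl")
    ```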
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now refreshing a v2 table also invalidate all the related caches.
    
    ### How was this patch tested?
    
    Added a new UT.
    
    Closes apache#30360 from sunchao/SPARK-33435-branch-3.0.
    
    Authored-by: Chao Sun <sunchao@apple.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sunchao authored and dongjoon-hyun committed Nov 13, 2020
    Configuration menu
    Copy the full SHA
    e684720 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33439][INFRA] Use SERIAL_SBT_TESTS=1 for SQL modules

    ### What changes were proposed in this pull request?
    
    This PR aims to decrease the parallelism of `SQL` module like `Hive` module.
    
    ### Why are the changes needed?
    
    GitHub Action `sql - slow tests` become flaky.
    - https://github.com/apache/spark/runs/1393670291
    - https://github.com/apache/spark/runs/1393088031
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a dev-only change.
    Although this will increase the running time, it's better than flakiness.
    
    ### How was this patch tested?
    
    Pass the GitHub Action stably.
    
    Closes apache#30365 from dongjoon-hyun/SPARK-33439.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit a70a2b0)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Nov 13, 2020
    Configuration menu
    Copy the full SHA
    921daa8 View commit details
    Browse the repository at this point in the history

Commits on Nov 16, 2020

  1. [SPARK-33358][SQL] Return code when command process failed

    Exit the Spark SQL CLI processing loop if one of the commands (sub SQL statements) fails to process.
    
    This is a regression at Apache Spark 3.0.0.
    
    ```
    $ cat 1.sql
    select * from nonexistent_table;
    select 2;
    ```
    
    **Apache Spark 2.4.7**
    ```
    spark-2.4.7-bin-hadoop2.7:$ bin/spark-sql -f 1.sql
    20/11/15 16:14:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Error in query: Table or view not found: nonexistent_table; line 1 pos 14
    ```
    
    **Apache Spark 3.0.1**
    ```
    $ bin/spark-sql -f 1.sql
    Error in query: Table or view not found: nonexistent_table; line 1 pos 14;
    'Project [*]
    +- 'UnresolvedRelation [nonexistent_table]
    
    2
    Time taken: 2.786 seconds, Fetched 1 row(s)
    ```
    
    **Apache Hive 1.2.2**
    ```
    apache-hive-1.2.2-bin:$ bin/hive -f 1.sql
    
    Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
    FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'nonexistent_table'
    ```
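    
    The intended behaviour can be sketched roughly as follows (illustrative helper; the real change is in the Spark SQL CLI driver's statement-processing loop):
    
    ```scala
    // Process statements in order and stop at the first failure, propagating a
    // non-zero return code instead of continuing with the remaining statements.
    def processStatements(statements: Seq[String], run: String => Int): Int =
      statements.foldLeft(0)((rc, stmt) => if (rc != 0) rc else run(stmt))
    ```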
    
    Yes. This is a fix of regression.
    
    Pass the UT.
    
    Closes apache#30263 from artiship/SPARK-33358.
    
    Authored-by: artiship <meilziner@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 1ae6d64)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    artiship authored and dongjoon-hyun committed Nov 16, 2020
    Configuration menu
    Copy the full SHA
    45bdb58 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33451][DOCS] Change to 'spark.sql.adaptive.skewJoin.skewedPart…

    …itionThresholdInBytes' in documentation
    
    ### What changes were proposed in this pull request?
    
    In the 'Optimizing Skew Join' section of the following two pages:
    1. [https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html)
    2. [https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html)
    
    The configuration 'spark.sql.adaptive.skewedPartitionThresholdInBytes' should be changed to 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'; the former is missing the 'skewJoin' part.
    
    ### Why are the changes needed?
    
    To document the correct name of configuration
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this is a user-facing doc change.
    
    ### How was this patch tested?
    
    Jenkins / CI builds in this PR.
    
    Closes apache#30376 from aof00/doc_change.
    
    Authored-by: aof00 <x14562573449@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 0933f1c)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    Southwest16 authored and HyukjinKwon committed Nov 16, 2020
    Configuration menu
    Copy the full SHA
    265363d View commit details
    Browse the repository at this point in the history

Commits on Nov 17, 2020

  1. [MINOR][GRAPHX][3.0] Correct typos in the sub-modules: graphx, extern…

    …al, and examples
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix typos in the sub-modules: graphx, external, and examples.
    Split per holdenk apache#30323 (comment)
    
    NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356
    
    Backport of apache#30326
    
    ### Why are the changes needed?
    
    Misspelled words make it harder to read / understand content.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    No testing was performed
    
    Closes apache#30342 from jsoref/branch-3.0-30326.
    
    Authored-by: Josh Soref <jsoref@users.noreply.github.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    jsoref authored and maropu committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    26c0404 View commit details
    Browse the repository at this point in the history

Commits on Nov 19, 2020

  1. [SPARK-33464][INFRA][3.0] Add/remove (un)necessary cache and restruct…

    …ure GitHub Actions yaml
    
    ### What changes were proposed in this pull request?
    
    This PR backports apache#30391. Note that it's a partial backport.
    
    This PR proposes:
    - Add `~/.sbt` directory into the build cache, see also sbt/sbt#3681
    - ~Move `hadoop-2` below to put up together with `java-11` and `scala-213`, see apache#30391 (comment)
    - Remove unnecessary `.m2` cache if you run SBT tests only.
    - Remove `rm ~/.m2/repository/org/apache/spark`. If you don't `sbt publishLocal` or `mvn install`, we don't need to care about it.
    - ~Use Java 8 in Scala 2.13 build. We can switch the Java version to 11 used for release later.~
    - Add caches into linters. The linter scripts use `sbt` (for example, in `./dev/lint-scala`) and `mvn` (for example, in `./dev/lint-java`). The Jekyll build also requires `sbt package`, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven and the build tools.
    - Use the same syntax of Java version, 1.8 -> 8.
    
    ### Why are the changes needed?
    
    - Remove unnecessary stuff
    - Cache what we can in the build
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    It will be tested in GitHub Actions build at the current PR
    
    Closes apache#30416 from HyukjinKwon/SPARK-33464-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    HyukjinKwon authored and dongjoon-hyun committed Nov 19, 2020
    Configuration menu
    Copy the full SHA
    c301d9c View commit details
    Browse the repository at this point in the history
  2. [SPARK-27421][SQL] Fix filter for int column and value class java.lan…

    …g.String when pruning partition column
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the "filter for int column and value class java.lang.String" failure when pruning a partition column.
    
    How to reproduce this issue:
    ```scala
    spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET")
    spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test")
    spark.sql("SELECT * FROM test_view WHERE id = '0'").explain
    ```
    ```
    20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test
    20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String
    20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0']
    java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
     at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
     at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745)
     at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
     at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
     at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
     at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
     at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743)
    ```
    
    ### Why are the changes needed?
    
    Fix bug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30380 from wangyum/SPARK-27421.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Yuming Wang <yumwang@ebay.com>
    (cherry picked from commit 014e1fb)
    Signed-off-by: Yuming Wang <yumwang@ebay.com>
    wangyum committed Nov 19, 2020
    Configuration menu
    Copy the full SHA
    1101938 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33483][INFRA][TESTS][3.0] Fix rat exclusion patterns and add a…

    … LICENSE
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the RAT exclusion rule which originated from SPARK-1144 (Apache Spark 1.0).
    
    ### Why are the changes needed?
    
    This prevents the situation like apache#30415.
    
    Currently, it misses the `catalog` directory due to the `.log` rule.
    ```
    $ dev/check-license
    Could not find Apache license headers in the following files:
     !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
     !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CI with the new rule.
    
    Closes apache#30424 from dongjoon-hyun/SPARK-33483-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Nov 19, 2020
    Configuration menu
    Copy the full SHA
    6b7172b View commit details
    Browse the repository at this point in the history

Commits on Nov 20, 2020

  1. [SPARK-33422][DOC] Fix the correct display of left menu item

    ### What changes were proposed in this pull request?
    Limit the height of the left menu area so that a vertical scroll bar is displayed.
    
    ### Why are the changes needed?
    
    The bottom menu item cannot be displayed when the left menu tree is long
    
    ### Does this PR introduce any user-facing change?
    
    Yes, when there are more menu items than fit in the view, you can reach them by pulling down the vertical scroll bar.
    
    before:
    ![image](https://user-images.githubusercontent.com/28332082/98805115-16995d80-2452-11eb-933a-3b72c14bea78.png)
    
    after:
    ![image](https://user-images.githubusercontent.com/28332082/98805418-7e4fa880-2452-11eb-9a9b-8d265078297c.png)
    
    ### How was this patch tested?
    NA
    
    Closes apache#30335 from liucht-inspur/master.
    
    Authored-by: liucht <liucht@inspur.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit cbc8be2)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    liucht-inspur authored and HyukjinKwon committed Nov 20, 2020
    Configuration menu
    Copy the full SHA
    d7c2dae View commit details
    Browse the repository at this point in the history
  2. [SPARK-33472][SQL][3.0] Adjust RemoveRedundantSorts rule order

    Backport apache#30373 for branch-3.0.
    
    ### What changes were proposed in this pull request?
    This PR switches the order of the rules `RemoveRedundantSorts` and `EnsureRequirements` so that `EnsureRequirements` is invoked before `RemoveRedundantSorts`, to avoid an IllegalArgumentException when instantiating `PartitioningCollection`.
    
    ### Why are the changes needed?
    `RemoveRedundantSorts` rule uses SparkPlan's `outputPartitioning` to check whether a sort node is redundant. Currently, it is added before `EnsureRequirements`. Since `PartitioningCollection` requires left and right partitioning to have the same number of partitions, which is not necessarily true before applying `EnsureRequirements`, the rule can fail with the following exception:
    ```
    IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test
    
    Closes apache#30438 from allisonwang-db/spark-33472-3.0.
    
    Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    allisonwang-db authored and dongjoon-hyun committed Nov 20, 2020
    Configuration menu
    Copy the full SHA
    1e525c1 View commit details
    Browse the repository at this point in the history

Commits on Nov 23, 2020

  1. [MINOR][INFRA] Suppress warning in check-license

    ### What changes were proposed in this pull request?
    This PR aims to suppress the warning `File exists` in check-license
    
    ### Why are the changes needed?
    
    **BEFORE**
    ```
    % dev/check-license
    Attempting to fetch rat
    RAT checks passed.
    
    % dev/check-license
    mkdir: target: File exists
    RAT checks passed.
    ```
    
    **AFTER**
    ```
    % dev/check-license
    Attempting to fetch rat
    RAT checks passed.
    
    % dev/check-license
    RAT checks passed.
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually do dev/check-license twice.
    
    Closes apache#30460 from williamhyun/checklicense.
    
    Authored-by: William Hyun <williamhyun3@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit a459238)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    williamhyun authored and HyukjinKwon committed Nov 23, 2020
    Configuration menu
    Copy the full SHA
    b70584f View commit details
    Browse the repository at this point in the history

Commits on Nov 24, 2020

  1. [SPARK-33524][SQL][TESTS] Change InMemoryTable not to use Tuple.has…

    …hCode for `BucketTransform`
    
    This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`.
    
    SPARK-32168 made `InMemoryTable` handle `BucketTransform` as a hash of a `Tuple`, which depends on the Scala version.
    - https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159
    
    **Scala 2.12.10**
    ```scala
    $ bin/scala
    Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
    Type in expressions for evaluation. Or try :help.
    
    scala> (1, 1).hashCode
    res0: Int = -2074071657
    ```
    
    **Scala 2.13.3**
    ```scala
    Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
    Type in expressions for evaluation. Or try :help.
    
    scala> (1, 1).hashCode
    val res0: Int = -1669302457
    ```
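    
    Since the tuple hash differs across Scala versions, a version-independent bucket computation could look like the following sketch (illustrative only; not necessarily the exact approach taken here):
    
    ```scala
    // Combine the column values with a hand-rolled hash instead of Tuple.hashCode,
    // so the bucket id is stable across Scala 2.12 and 2.13.
    def stableBucket(values: Seq[Any], numBuckets: Int): Int = {
      val h = values.foldLeft(0)((acc, v) => 31 * acc + (if (v == null) 0 else v.##))
      ((h % numBuckets) + numBuckets) % numBuckets   // keep the result in [0, numBuckets)
    }
    ```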
    
    Yes. This is a correctness issue.
    
    Pass the UT with both Scala 2.12/2.13.
    
    Closes apache#30477 from dongjoon-hyun/SPARK-33524.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 8380e00)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Nov 24, 2020
    Configuration menu
    Copy the full SHA
    200417e View commit details
    Browse the repository at this point in the history
  2. [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-j…

    …enkins script
    
    ### What changes were proposed in this pull request?
    It seems that Jenkins test tasks in many PRs have failed. The failed cases include:
    
    -  `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 get binary type`
    
    The error message as follows:
    
    ```
    Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("Stacktracesbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
    	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
    	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
    	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
    	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
    	at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
    ```
    
    But they pass in GitHub Actions, so this may be related to the `LANG` of the Jenkins build machine. This PR adds `export LANG="en_US.UTF-8"` to the `run-tests-jenkins` script.
    
    ### Why are the changes needed?
    Ensure LANG in Jenkins test process is `en_US.UTF-8` to pass `HIVE_CLI_SERVICE_PROTOCOL_VX` related tests
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Jenkins tests pass
    
    Closes apache#30487 from LuciferYang/SPARK-33535.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 048a982)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LuciferYang authored and dongjoon-hyun committed Nov 24, 2020
    Configuration menu
    Copy the full SHA
    efae8b6 View commit details
    Browse the repository at this point in the history

Commits on Nov 25, 2020

  1. [SPARK-33565][PYTHON][BUILD][3.0] Remove py38 spark3

    ### What changes were proposed in this pull request?
    Remove Python 3.8 from python/run-tests.py to stop build breaks.
    
    ### Why are the changes needed?
    the python tests are running against the bare-bones system install of python3, rather than an anaconda environment.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    jenkins
    
    see also apache#30506
    
    Closes apache#30509 from shaneknapp/remove-py38-spark3.
    
    Authored-by: shane knapp <incomplete@gmail.com>
    Signed-off-by: shane knapp <incomplete@gmail.com>
    shaneknapp committed Nov 25, 2020
    Configuration menu
    Copy the full SHA
    8eedc41 View commit details
    Browse the repository at this point in the history

Commits on Nov 26, 2020

  1. [SPARK-33565][INFRA][FOLLOW-UP][3.0] Keep the test coverage with Pyth…

    …on 3.8 in GitHub Actions
    
    ### What changes were proposed in this pull request?
    
    This is a backport PR of apache#30510
    
    This PR proposes to keep the test coverage with Python 3.8 in GitHub Actions. It is not tested for now in Jenkins due to an env issue.
    
    **Before this change in GitHub Actions:**
    
    ```
    ========================================================================
    Running PySpark tests
    ========================================================================
    Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
    Will test against the following Python executables: ['/usr/bin/python3', 'python2.7', 'pypy3']
    ...
    ```
    
    **After this change in GitHub Actions:**
    
    ```
    
    ========================================================================
    Running PySpark tests
    ========================================================================
    Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
    Will test against the following Python executables: ['python3.8', 'python2.7', 'pypy3']
    ```
    
    ### Why are the changes needed?
    
    To keep the test coverage with Python 3.8 in GitHub Actions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions in this build will test.
    
    Closes apache#30511 from HyukjinKwon/SPARK-33565-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Nov 26, 2020
    Configuration menu
    Copy the full SHA
    7503c4a View commit details
    Browse the repository at this point in the history

Commits on Nov 29, 2020

  1. [SPARK-33585][SQL][DOCS] Fix the comment for SQLContext.tables() an…

    …d mention the `database` column
    
    ### What changes were proposed in this pull request?
    Change the comments for `SQLContext.tables()` to "The returned DataFrame has three columns, database, tableName and isTemporary".
    
    ### Why are the changes needed?
    Currently, the comment mentions only 2 columns, but `tables()` actually returns 3:
    ```scala
    scala> spark.range(10).createOrReplaceTempView("view1")
    scala> val tables = spark.sqlContext.tables()
    tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
    
    scala> tables.printSchema
    root
     |-- database: string (nullable = false)
     |-- tableName: string (nullable = false)
     |-- isTemporary: boolean (nullable = false)
    
    scala> tables.show
    +--------+---------+-----------+
    |database|tableName|isTemporary|
    +--------+---------+-----------+
    | default|       t1|      false|
    | default|       t2|      false|
    | default|      ymd|      false|
    |        |    view1|       true|
    +--------+---------+-----------+
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./dev/scalastyle`
    
    Closes apache#30526 from MaxGekk/sqlcontext-tables-doc.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit a088a80)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 29, 2020
    Configuration menu
    Copy the full SHA
    f67f80b View commit details
    Browse the repository at this point in the history

Commits on Nov 30, 2020

  1. [SPARK-33579][UI] Fix executor blank page behind proxy

    ### What changes were proposed in this pull request?
    
    Fix some "hardcoded" API URLs in the Web UI.
    More specifically, we avoid the use of `location.origin` when constructing URLs for internal API calls within the JavaScript.
    Instead, we use the `apiRoot` global variable.
    
    ### Why are the changes needed?
    
    On one hand, it allows us to build relative URLs. On the other hand, `apiRoot` reflects the Spark property `spark.ui.proxyBase` which can be set to change the root path of the Web UI.
    
    If `spark.ui.proxyBase` is actually set, the original URLs become incorrect, and we end up with a blank executors page.
    I encountered this bug when accessing the Web UI behind a proxy (in my case a Kubernetes Ingress).
    
    See the following link for more context:
    jupyterhub/jupyter-server-proxy#57 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, as all the changes introduced are in the JavaScript for the Web UI.
    
    ### How was this patch tested?
    I modified/debugged the JavaScript as in the commit with the help of the developer tools in Google Chrome, while accessing the Web UI of my Spark app behind my k8s ingress.
    
    Closes apache#30523 from pgillet/fix-executors-blank-page-behind-proxy.
    
    Authored-by: Pascal Gillet <pascal.gillet@stack-labs.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    (cherry picked from commit 6e5446e)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Pascal Gillet authored and sarutak committed Nov 30, 2020
    Configuration menu
    Copy the full SHA
    f6638cf View commit details
    Browse the repository at this point in the history
  2. [SPARK-33588][SQL][3.0] Respect the spark.sql.caseSensitive config …

    …while resolving partition spec in v1 `SHOW TABLE EXTENDED`
    
    ### What changes were proposed in this pull request?
    Perform partition spec normalization in `ShowTablesCommand` according to the table schema before getting partitions from the catalog. The normalization via `PartitioningUtils.normalizePartitionSpec()` adjusts the column names in partition specification, w.r.t. the real partition column names and case sensitivity.
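    
    A simplified sketch of the normalization (illustrative; the real code goes through `PartitioningUtils.normalizePartitionSpec` and the analyzer's resolver):
    
    ```scala
    // Map user-supplied partition keys onto the table's partition column names,
    // matching case-insensitively when spark.sql.caseSensitive is false.
    def normalizeSpec(
        spec: Map[String, String],
        partitionColumns: Seq[String],
        caseSensitive: Boolean): Map[String, String] =
      spec.map { case (key, value) =>
        val resolved = partitionColumns
          .find(c => if (caseSensitive) c == key else c.equalsIgnoreCase(key))
          .getOrElse(throw new IllegalArgumentException(s"$key is not a partition column"))
        resolved -> value
      }
    ```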
    
    ### Why are the changes needed?
    Even when `spark.sql.caseSensitive` is `false` which is the default value, v1 `SHOW TABLE EXTENDED` is case sensitive:
    ```sql
    spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
             > USING parquet
             > partitioned by (year, month);
    spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
    spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
    Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`';
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the `SHOW TABLE EXTENDED` command respects the SQL config. And for example above, it returns correct result:
    ```sql
    spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
    default	tbl1	false	Partition Values: [year=2015, month=1]
    Location: file:/Users/maximgekk/spark-warehouse/tbl1/year=2015/month=1
    Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
    InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
    Storage Properties: [serialization.format=1, path=file:/Users/maximgekk/spark-warehouse/tbl1]
    Partition Parameters: {transient_lastDdlTime=1606595118, totalSize=623, numFiles=1}
    Created Time: Sat Nov 28 23:25:18 MSK 2020
    Last Access: UNKNOWN
    Partition Statistics: 623 bytes
    ```
    
    ### How was this patch tested?
    By running the modified test suite via:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 0054fc9)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30549 from MaxGekk/show-table-case-sensitive-spec-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 30, 2020
    Configuration menu
    Copy the full SHA
    03291c8 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33440][CORE] Use current timestamp with warning log in HadoopF…

    …SDelegationTokenProvider when the issue date for token is not set up properly
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use current timestamp with warning log when the issue date for token is not set up properly. The next section will explain the rationalization with details.
    
    ### Why are the changes needed?
    
    Unfortunately, not every implementation respects the `issue date` in `AbstractDelegationTokenIdentifier`, which Spark relies on in its calculation. The default value of the issue date is 0L, which is far from the actual issue date; this breaks the logic for calculating the next renewal date under some circumstances, leading to a 0 interval (i.e. immediate) when rescheduling token renewal.
    
    In HadoopFSDelegationTokenProvider, Spark calculates token renewal interval as below:
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L123-L134
    
    The interval is calculated as `token.renew() - identifier.getIssueDate`, which is correct as long as both `token.renew()` and `identifier.getIssueDate` produce correct values, but it goes wrong when `identifier.getIssueDate` returns 0L (the default value), like below:
    
    ```
    20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 1603175657000 for token S3ADelegationToken/IDBroker
    20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400048 for token HDFS_DELEGATION_TOKEN
    ```
    
    Fortunately we pick the minimum value as a safety guard (so in this case, `86400048` is picked up), but that safety guard has an unintended bad impact in this case.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L58-L71
    
    Spark takes the interval calculated above (the "minimum" of the intervals) and blindly adds it to the token's issue date to calculate the next renewal date for the token, then picks the "minimum" value again. In the problematic case, the value would be `86400048` (86400048 + 0), which is far smaller than the current timestamp.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L228-L234
    
    The current timestamp is then subtracted from the next renewal date to get the interval, which is multiplied by the configured ratio to produce the final schedule interval. In the problematic case, this value goes negative.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L180-L188
    
    There's a safety guard that disallows a negative value, but it simply clamps to 0, meaning "schedule immediately". This triggers the next calculation of the next renewal date and schedule interval, which leads to the same behaviour, so the delegation token is updated immediately and continuously.
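    
    A worked numeric sketch of the failure mode, using the interval from the log above (the current timestamp is approximate and the variable names are illustrative):
    
    ```scala
    val now              = 1602570000000L                   // ~2020-10-13, current time in ms
    val minRenewInterval = 86400048L                        // ~1 day, the "minimum" interval picked above
    val badIssueDate     = 0L                               // identifier.getIssueDate left at its default
    val nextRenewalDate  = badIssueDate + minRenewInterval  // 86400048 => early January 1970
    val scheduleDelay    = nextRenewalDate - now            // hugely negative => clamped to 0
    // => the renewal is scheduled immediately, which recomputes the same values and
    //    schedules again, producing the tight renewal loop. Falling back to the current
    //    timestamp (with a warning) when the issue date is 0L breaks this loop.
    ```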
    
    As we fetch the token just before the calculation happens, the actual issue date is likely only slightly earlier, so it's not that dangerous to use the current timestamp as the issue date for a token whose issue date has not been set up properly. Still, it's better not to leave the token implementation as it is, so we log a warning to let end users consult the token implementer.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. End users won't encounter the tight token-renewal scheduling loop after this PR. From the end user's perspective, there's nothing they need to change.
    
    ### How was this patch tested?
    
    Manually tested with problematic environment.
    
    Closes apache#30366 from HeartSaVioR/SPARK-33440.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit f5d2165)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    HeartSaVioR committed Nov 30, 2020
    Configuration menu
    Copy the full SHA
    242581f View commit details
    Browse the repository at this point in the history

Commits on Dec 1, 2020

  1. [SPARK-33611][UI] Avoid encoding twice on the query parameter of rewr…

    …itten proxy URL
    
    ### What changes were proposed in this pull request?
    
    When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
    ```
    > val uri = "http://localhost:8081/test"
    > val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
    > val rewrittenURI = URI.create(uri.toString())
    
    > new URI(rewrittenURI.getScheme(),
          rewrittenURI.getAuthority(),
          rewrittenURI.getPath(),
          query,
          rewrittenURI.getFragment()).toString
    result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
    ```
    
    In Spark's stage page, the URL of "/taskTable" contains the query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be a NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
    Other than that, the other parameters may not work as expected after being encoded twice.
    
    This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
    ```
    > val uri = "http://localhost:8081/test"
    > val query = "order%5B0%5D%5Bcolumn%5D=0"
    > URI.create(s"$uri?$query").toString
    result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0
    
    > URI.create(s"$uri?$query").getQuery
    result: order[0][column]=0
    ```
    
    ### Why are the changes needed?
    
    Fix a potential bug when Spark's reverse proxy is enabled.
    The bug itself is similar to apache#29271.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Add a new unit test.
    Also, Manual UI testing for master, worker and app UI with an nginx proxy
    
    Spark config:
    ```
    spark.ui.port 8080
    spark.ui.reverseProxy=true
    spark.ui.reverseProxyUrl=/path/to/spark/
    ```
    nginx config:
    ```
    server {
        listen 9000;
        set $SPARK_MASTER http://127.0.0.1:8080;
        # split spark UI path into prefix and local path within master UI
        location ~ ^(/path/to/spark/) {
            # strip prefix when forwarding request
            rewrite /path/to/spark(/.*) $1  break;
            #rewrite /path/to/spark/ "/" ;
            # forward to spark master UI
            proxy_pass $SPARK_MASTER;
            proxy_intercept_errors on;
            error_page 301 302 307 = handle_redirects;
        }
        location handle_redirects {
            set $saved_redirect_location '$upstream_http_location';
            proxy_pass $saved_redirect_location;
        }
    }
    ```
    
    Closes apache#30552 from gengliangwang/decodeProxyRedirect.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    (cherry picked from commit 5d0045e)
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    gengliangwang committed Dec 1, 2020
    Configuration menu
    Copy the full SHA
    6abfeb6 View commit details
    Browse the repository at this point in the history

Commits on Dec 2, 2020

  1. [SPARK-33504][CORE] The application log in the Spark history server c…

    …ontains sensitive attributes should be redacted
    
    ### What changes were proposed in this pull request?
    Make sure that sensitive attributes are redacted in the history server log.
    
    ### Why are the changes needed?
    We found that sensitive attributes such as passwords in SparkListenerJobStart and SparkListenerStageSubmitted events were not redacted, so they could be viewed directly.
    The screenshot can be viewed in the attachment of JIRA SPARK-33504.
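
    A hedged, standalone sketch of the redaction idea (this is not Spark's internal `Utils.redact`; the key pattern is illustrative):

    ```scala
    import scala.util.matching.Regex

    // Hedged sketch: mask values whose keys look sensitive before the event is
    // written to the event log.
    val sensitiveKeys: Regex = "(?i)secret|password|token".r

    def redactProperties(props: Map[String, String]): Map[String, String] =
      props.map { case (key, value) =>
        if (sensitiveKeys.findFirstIn(key).isDefined) key -> "*********(redacted)" else key -> value
      }

    // redactProperties(Map("spark.ssl.keyPassword" -> "hunter2", "spark.app.name" -> "demo"))
    // => Map(spark.ssl.keyPassword -> *********(redacted), spark.app.name -> demo)
    ```
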
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    Manual tests work well; I have also added a unit test case.
    
    Closes apache#30446 from akiyamaneko/eventlog_unredact.
    
    Authored-by: neko <echohlne@gmail.com>
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    (cherry picked from commit 28dad1b)
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    echohlne authored and tgravescs committed Dec 2, 2020
    Configuration menu
    Copy the full SHA
    e59179b View commit details
    Browse the repository at this point in the history
  2. Revert "[SPARK-33504][CORE] The application log in the Spark history …

    …server contains sensitive attributes should be redacted"
    
    ### What changes were proposed in this pull request?
    
    Revert SPARK-33504 on branch-3.0 due to a compilation error. Original PR: apache#30446
    
    This reverts commit e59179b.
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#30576 from tgravescs/revert33504.
    
    Authored-by: Thomas Graves <tgraves@nvidia.com>
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    tgravescs committed Dec 2, 2020
    Configuration menu
    Copy the full SHA
    3fb9f6f View commit details
    Browse the repository at this point in the history
  3. [SPARK-33631][DOCS][TEST] Clean up spark.core.connection.ack.wait.tim…

    …eout from configuration.md
    
    SPARK-9767 removed `ConnectionManager` and related files; the configuration `spark.core.connection.ack.wait.timeout` previously used by `ConnectionManager` is no longer used by other Spark code, but it still exists in `configuration.md`.
    
    So this PR cleans up the unused configuration item `spark.core.connection.ack.wait.timeout` from `configuration.md`.
    
    Clean up useless configuration from `configuration.md`.
    
    No
    
    Pass the Jenkins or GitHub Action
    
    Closes apache#30569 from LuciferYang/SPARK-33631.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 92bfbcb)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LuciferYang authored and dongjoon-hyun committed Dec 2, 2020
    Configuration menu
    Copy the full SHA
    6f4587a View commit details
    Browse the repository at this point in the history

Commits on Dec 3, 2020

  1. [SPARK-33636][PYTHON][ML][3.0] Add labelsArray to PySpark StringIndexer

    ### What changes were proposed in this pull request?
    
    This is a followup to add missing `labelsArray` to PySpark `StringIndexer`.
    
    ### Why are the changes needed?
    
    `labelsArray` is for the multi-column case of `StringIndexer`. We should provide this accessor on the PySpark side too.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30580 from viirya/SPARK-33636.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    viirya committed Dec 3, 2020
    Configuration menu
    Copy the full SHA
    13ca88c View commit details
    Browse the repository at this point in the history
  2. [SPARK-33629][PYTHON] Make spark.buffer.size configuration visible on…

    … driver side
    
    `spark.buffer.size` is not applied on the driver side in PySpark. In this PR I've fixed this issue.
    
    Apply the mentioned config on driver side.
    
    No.
    
    Existing unit tests + manually.
    
    Added the following code temporarily:
    ```
    def local_connect_and_auth(port, auth_secret):
    ...
                sock.connect(sa)
                print("SPARK_BUFFER_SIZE: %d" % int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) <- This is the addition
                sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536)))
    ...
    ```
    
    Test:
    ```
    
    echo "spark.buffer.size 10000" >> conf/spark-defaults.conf
    
    $ ./bin/pyspark
    Python 3.8.5 (default, Jul 21 2020, 10:48:26)
    [Clang 11.0.3 (clang-1103.0.32.62)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    20/12/03 13:38:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    20/12/03 13:38:14 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
          /_/
    
    Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
    Spark context Web UI available at http://192.168.0.189:4040
    Spark context available as 'sc' (master = local[*], app id = local-1606999094506).
    SparkSession available as 'spark'.
    >>> sc.setLogLevel("TRACE")
    >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
    ...
    SPARK_BUFFER_SIZE: 10000
    ...
    [[0], [2], [3], [4], [6]]
    >>>
    ```
    
    Closes apache#30592 from gaborgsomogyi/SPARK-33629.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit bd71186)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gaborgsomogyi authored and HyukjinKwon committed Dec 3, 2020
    Configuration menu
    Copy the full SHA
    c4318a1 View commit details
    Browse the repository at this point in the history

Commits on Dec 4, 2020

  1. [SPARK-33660][DOCS][SS] Fix Kafka Headers Documentation

    ### What changes were proposed in this pull request?
    
    Update the Kafka headers documentation; the type is no longer a map but an array (see the example below).
    
    [jira](https://issues.apache.org/jira/browse/SPARK-33660)
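
    For reference, a hedged example of reading headers with the documented array type (broker and topic names are placeholders; the `includeHeaders` option and the `headers` column are assumed to match the Kafka source docs):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("kafka-headers-example").getOrCreate()

    // headers is ARRAY<STRUCT<key: STRING, value: BINARY>>, i.e. an array of
    // (key, value) pairs rather than a map.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .option("includeHeaders", "true")
      .load()
      .select(col("key"), col("value"), col("headers"))
    ```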
    
    ### Why are the changes needed?
    To help users
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    
    It is only documentation
    
    Closes apache#30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.
    
    Authored-by: german <germanschiavon@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit d671e05)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Gschiavon authored and HeartSaVioR committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    6121c8f View commit details
    Browse the repository at this point in the history
  2. [SPARK-33571][SQL][DOCS][3.0] Add a ref to INT96 config from the doc …

    …for `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read`
    
    ### What changes were proposed in this pull request?
    For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by:
    1. Explicitly document on which parquet types, those configs influence on
    2. Refer to corresponding configs for `INT96`
    
    ### Why are the changes needed?
    To avoid user confusions like reposted in SPARK-33571, and make the config description more precise.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./dev/scalastyle`.
    
    Closes apache#30604 from MaxGekk/clarify-rebase-docs-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    8743571 View commit details
    Browse the repository at this point in the history

Commits on Dec 6, 2020

  1. [MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataC…

    …onsumer.scala
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.
    
    ### Why are the changes needed?
    
    To fix a string interpolation bug.
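
    For illustration, a hedged sketch of the kind of bug being fixed (not the actual lines from `CommandUtils.scala` or `KafkaDataConsumer.scala`): a string literal that uses `$`-placeholders without the `s` interpolator is emitted verbatim.

    ```scala
    val partition = 7

    // Bug: without the `s` interpolator, the placeholder is not substituted.
    val wrong = "Failed to fetch offset for partition $partition"

    // Fix: prefix the literal with `s` so $partition is interpolated.
    val right = s"Failed to fetch offset for partition $partition"

    println(wrong) // Failed to fetch offset for partition $partition
    println(right) // Failed to fetch offset for partition 7
    ```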
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the string will be correctly constructed.
    
    ### How was this patch tested?
    
    Existing tests since they were used in exception/log messages.
    
    Closes apache#30609 from imback82/fix_cache_str_interporlation.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 154f604)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    imback82 authored and HyukjinKwon committed Dec 6, 2020
    Configuration menu
    Copy the full SHA
    66b1bdb View commit details
    Browse the repository at this point in the history
  2. [SPARK-33667][SQL][3.0] Respect the spark.sql.caseSensitive config …

    …while resolving partition spec in v1 `SHOW PARTITIONS`
    
    ### What changes were proposed in this pull request?
    Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t the case sensitivity flag  **spark.sql.caseSensitive**.
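
    A hedged sketch of the normalization step (simplified and standalone, not the actual Spark rule; the error message is illustrative):

    ```scala
    // Hedged sketch: resolve each user-supplied partition key against the table's
    // partition columns, honoring the case-sensitivity flag.
    def normalizePartitionSpec(
        spec: Map[String, String],
        partCols: Seq[String],
        caseSensitive: Boolean): Map[String, String] = {
      spec.map { case (key, value) =>
        val resolved = partCols
          .find(c => if (caseSensitive) c == key else c.equalsIgnoreCase(key))
          .getOrElse(throw new IllegalArgumentException(
            s"Non-partitioning column: $key (expected one of ${partCols.mkString(", ")})"))
        resolved -> value
      }
    }

    // normalizePartitionSpec(Map("YEAR" -> "2015", "Month" -> "1"), Seq("year", "month"), caseSensitive = false)
    // => Map(year -> 2015, month -> 1)
    ```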
    
    ### Why are the changes needed?
    V1 SHOW PARTITIONS is in fact case sensitive, and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
    ```sql
    spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
             > USING parquet
             > PARTITIONED BY (year, month);
    spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
    spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
    Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS;
    ```
    The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the command above works as expected:
    ```sql
    spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
    year=2015/month=1
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite"
    ```
    
    Closes apache#30626 from MaxGekk/show-partitions-case-sensitivity-test-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 6, 2020
    Configuration menu
    Copy the full SHA
    a11a07a View commit details
    Browse the repository at this point in the history

Commits on Dec 7, 2020

  1. [SPARK-33675][INFRA][3.0] Add GitHub Action job to publish snapshot

    ### What changes were proposed in this pull request?
    
    This PR aims to add `GitHub Action` job to publish snapshot from `branch-3.0`.
    
    https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.0.2-SNAPSHOT/
    
    ### Why are the changes needed?
    
    This will remove our maintenance burden for `branch-3.0` and will stop automatically when we don't have any commit on `branch-3.0`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    N/A
    
    Closes apache#30630 from dongjoon-hyun/SPARK-33675-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    8029d66 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33681][K8S][TESTS][3.0] Increase K8s IT timeout to 3 minutes

    ### What changes were proposed in this pull request?
    
    This PR aims to increase the timeout of K8s integration test of `branch-3.0/2.4` from 2 minutes to 3 minutes which is consistent with `master/branch-3.1`.
    
    ### Why are the changes needed?
    
    This will reduce the chance of this kind of failure.
    - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36905/console
    ```
    ...
      20/12/07 00:11:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      20/12/07 00:11:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      " did not contain "PySpark Worker Memory Check is: True" The application did not complete.. (KubernetesSuite.scala:249)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the K8s IT Jenkins job.
    
    Closes apache#30632 from dongjoon-hyun/SPARK-33681.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    313a460 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    1bb37dc View commit details
    Browse the repository at this point in the history
  4. [SPARK-33592][ML][PYTHON][3.0] Backport Fix: Pyspark ML Validator par…

    …ams in estimatorParamMaps may be lost after saving and reloading
    
    Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
    
    When saving validator estimatorParamMaps, check all nested stages in the tuned estimator to get the correct param parent.
    
    Two typical cases to manually test:
    ~~~python
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression()
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    
    paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [10, 100]) \
        .addGrid(lr.maxIter, [100, 200]) \
        .build()
    tvs = TrainValidationSplit(estimator=pipeline,
                               estimatorParamMaps=paramGrid,
                               evaluator=MulticlassClassificationEvaluator())
    
    tvs.save(tvsPath)
    loadedTvs = TrainValidationSplit.load(tvsPath)
    
    ~~~
    
    ~~~python
    lr = LogisticRegression()
    ova = OneVsRest(classifier=lr)
    grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
    evaluator = MulticlassClassificationEvaluator()
    tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)
    
    tvs.save(tvsPath)
    loadedTvs = TrainValidationSplit.load(tvsPath)
    
    ~~~
    
    Bug fix.
    
    No
    
    Unit test.
    
    Closes apache#30539 from WeichenXu123/fix_tuning_param_maps_io.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    (cherry picked from commit 8016123)
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    
    ### What changes were proposed in this pull request?
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#30590 from WeichenXu123/SPARK-33592-bp-3.0.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    WeichenXu123 committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    8acbe5b View commit details
    Browse the repository at this point in the history
  5. [SPARK-33670][SQL][3.0] Verify the partition provider is Hive in v1 S…

    …HOW TABLE EXTENDED
    
    ### What changes were proposed in this pull request?
    Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified.
    
    This PR is a kind of follow-up to apache#16373 and apache#15515.
    
    ### Why are the changes needed?
    To output a user-friendly error with a recommendation like
    **"
    ... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`
    "**
    instead of silently outputting an empty result.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    
    ### How was this patch tested?
    By running the affected test suites, in particular:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite"
    $ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 29096a8)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30640 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    9555658 View commit details
    Browse the repository at this point in the history

Commits on Dec 8, 2020

  1. [SPARK-32680][SQL][3.0] Don't Preprocess V2 CTAS with Unresolved Query

    ### What changes were proposed in this pull request?
    The analyzer rule `PreprocessTableCreation` preprocesses table-creation-related logical plans. But for
    CTAS, if the sub-query can't be resolved, preprocessing it will cause "Invalid call to toAttribute on unresolved
    object" (instead of a user-friendly error message: "table or view not found").
    This PR fixes this wrong preprocessing for CTAS using the V2 catalog.
    
    ### Why are the changes needed?
    bug fix
    
    ### Does this PR introduce _any_ user-facing change?
    The error message for CTAS with a non-exists table changed from:
    `UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to
    `AnalysisException: Table or view not found: xxx`
    
    ### How was this patch tested?
    added test
    
    Closes apache#30649 from linhongliu-db/fix-ctas-3.0.
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    linhongliu-db authored and cloud-fan committed Dec 8, 2020
    Configuration menu
    Copy the full SHA
    46a0ec5 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains a…

    …ny escapeChar
    
    ### What changes were proposed in this pull request?
    `LikeSimplification` rule does not work correctly for many cases that have patterns containing escape characters, for example:
    
    `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
    `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`
    
    For simplicity, this PR simply skips this rule if `pattern` contains any `escapeChar`.
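
    A hedged sketch of the guard (simplified; the real check lives in the `LikeSimplification` rule):

    ```scala
    // Hedged sketch: only attempt pattern simplification when the pattern does not
    // contain the escape character at all.
    def canSimplify(pattern: String, escapeChar: Char): Boolean =
      !pattern.exists(_ == escapeChar)

    // canSimplify("m%aca", '%')  => false (skip the rule)
    // canSimplify("abc%", '\\')  => true  (safe to simplify)
    ```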
    
    ### Why are the changes needed?
    The result can be corrupted otherwise.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added Unit test.
    
    Closes apache#30625 from luluorta/SPARK-33677.
    
    Authored-by: luluorta <luluorta@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 99613cd)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    luluorta authored and maropu committed Dec 8, 2020
    Configuration menu
    Copy the full SHA
    ea7c2a1 View commit details
    Browse the repository at this point in the history
  3. [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++

    ### What changes were proposed in this pull request?
    
    Currently, Spark treats 0.0 and -0.0 semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set.
    
    The comparison expressions in Spark take care of the special floating numbers and implement the correct semantic. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places:
    1. GROUP BY
    2. join keys
    3. window partition keys
    
    This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++
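
    A hedged sketch of the normalization applied before hashing values (standalone, not Spark's internal rule):

    ```scala
    // Hedged sketch: collapse -0.0 into 0.0 and all NaN bit patterns into one
    // canonical NaN so that semantically equal values hash identically in
    // HyperLogLog++.
    def normalize(d: Double): Double = {
      if (d == 0.0d) 0.0d            // -0.0 == 0.0, so this maps -0.0 to +0.0
      else if (d.isNaN) Double.NaN   // one canonical NaN
      else d
    }

    // java.lang.Double.doubleToLongBits(normalize(-0.0d)) ==
    //   java.lang.Double.doubleToLongBits(normalize(0.0d))  // true after normalization
    ```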
    
    ### Why are the changes needed?
    
    Fix the query result
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the result of HyperLogLog++ becomes correct now.
    
    ### How was this patch tested?
    
    a new test case, and a few more test cases that pass before this PR to improve test coverage.
    
    Closes apache#30673 from cloud-fan/bug.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 6fd2345)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    cloud-fan authored and dongjoon-hyun committed Dec 8, 2020
    Configuration menu
    Copy the full SHA
    eae6a3e View commit details
    Browse the repository at this point in the history

Commits on Dec 10, 2020

  1. [SPARK-33727][K8S] Fall back from gnupg.net to openpgp.org

    ### What changes were proposed in this pull request?
    
    While building R docker image if we can't fetch the key from gnupg.net fall back to openpgp.org
    
    ### Why are the changes needed?
    
    gnupg.net key servers are flaky and sometimes fail to resolve or return keys.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Tried to add key on my desktop, it failed, then tried to add key with openpgp.org and it succeed.
    
    Closes apache#30696 from holdenk/SPARK-33727-gnupg-server-is-flaky.
    
    Authored-by: Holden Karau <hkarau@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 991b797)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    holdenk authored and HyukjinKwon committed Dec 10, 2020
    Configuration menu
    Copy the full SHA
    a4c5e54 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33725][BUILD][3.0] Upgrade snappy-java to 1.1.8.2

    ### What changes were proposed in this pull request?
    
    This upgrades snappy-java to 1.1.8.2.
    
    ### Why are the changes needed?
    
    Minor version upgrade that includes:
    
    - [Fixed](xerial/snappy-java#265) an initialization issue when using a recent Mac OS X version
    - Support Apple Silicon (M1, Mac-aarch64)
    - Fixed the pure-java Snappy fallback logic when no native library for your platform is found.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30698 from viirya/upgrade-snappy-3.0.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Dec 10, 2020
    Configuration menu
    Copy the full SHA
    2921c4e View commit details
    Browse the repository at this point in the history
  3. [SPARK-33732][K8S][TESTS][3.0] Kubernetes integration tests doesn't w…

    …ork with Minikube 1.9+
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30700 .
    
    This PR changes `Minikube.scala` for Kubernetes integration tests to work with Minikube 1.9+.
    `Minikube.scala` assumes that `apiserver.key` and `apiserver.crt` are in `~/.minikube/`.
    But as of Minikube 1.9, they are in `~/.minikube/profiles/<profile>`.
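
    A hedged sketch of the path resolution (file names and locations follow the description above; not the exact `Minikube.scala` code):

    ```scala
    import java.nio.file.{Files, Path, Paths}

    // Hedged sketch: prefer the per-profile location used by Minikube 1.9+ and
    // fall back to the legacy location under ~/.minikube/.
    def apiServerCert(minikubeHome: String, profile: String): Path = {
      val newLocation = Paths.get(minikubeHome, "profiles", profile, "apiserver.crt")
      val legacyLocation = Paths.get(minikubeHome, "apiserver.crt")
      if (Files.exists(newLocation)) newLocation else legacyLocation
    }
    ```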
    
    ### Why are the changes needed?
    
    Currently, Kubernetes integration tests don't work with Minikube 1.9+.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed the following test passes.
    ```
    $ build/sbt -Pkubernetes -Pkubernetes-integration-tests package 'kubernetes-integration-tests/testOnly -- -z "SparkPi with no"'
    ```
    
    Closes apache#30702 from sarutak/minikube-1.9-branch-3.0.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 10, 2020
    Configuration menu
    Copy the full SHA
    83af036 View commit details
    Browse the repository at this point in the history

Commits on Dec 11, 2020

  1. [SPARK-33749][BUILD][PYTHON] Exclude target directory in pycodestyle …

    …and flake8
    
    Once you build and run K8S tests, Python lint fails as below:
    
    ```bash
    $ ./dev/lint-python
    ```
    
    Before this PR:
    
    ```
    starting python compilation test...
    python compilation succeeded.
    
    downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py...
    starting pycodestyle test...
    pycodestyle checks failed:
    ./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/pyspark/cloudpickle/cloudpickle.py:15:101: E501 line too long (105 > 100 characters)
    ./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/docs/source/conf.py:60:101: E501 line too long (124 > 100 characters)
    ...
    ```
    
    After this PR:
    
    ```
    starting python compilation test...
    python compilation succeeded.
    
    downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py...
    starting pycodestyle test...
    pycodestyle checks passed.
    
    starting flake8 test...
    flake8 checks passed.
    
    starting mypy test...
    mypy checks passed.
    
    starting sphinx-build tests...
    sphinx-build checks passed.
    ```
    
    This PR excludes target directory to avoid such cases in the future.
    
    To make it easier to run linters
    
    No, dev-only.
    
    Manually tested via running `./dev/lint-python`.
    
    Closes apache#30718 from HyukjinKwon/SPARK-33749.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit cd7a306)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    728bdb7 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33740][SQL][3.0] hadoop configs in hive-site.xml can override…

    … pre-existing hadoop ones
    
    Backport  apache#30709 to 3.0
    
    ### What changes were proposed in this pull request?
    
    `org.apache.hadoop.conf.Configuration#setIfUnset` will ignore those with defaults too.
    
    ### Why are the changes needed?
    
    Fix a regression.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    new tests
    
    Closes apache#30720 from yaooqinn/SPARK-33740-30.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    yaooqinn authored and dongjoon-hyun committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    9439e11 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33757][INFRA][R] Fix the R dependencies build error on GitHub …

    …Actions and AppVeyor
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the R dependencies build error on GitHub Actions and AppVeyor.
    The reason seems to be that the `usethis` package was updated on 2020/12/10.
    https://cran.r-project.org/web/packages/usethis/index.html
    
    ### Why are the changes needed?
    
    To keep the build clean.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Should be done by GitHub Actions.
    
    Closes apache#30737 from sarutak/fix-r-dependencies-build-error.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit fb2e3af)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    sarutak authored and HyukjinKwon committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    2534165 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33742][SQL][3.0] Throw PartitionsAlreadyExistException from Hi…

    …veExternalCatalog.createPartitions()
    
    ### What changes were proposed in this pull request?
    Throw `PartitionsAlreadyExistException` from `createPartitions()` in Hive external catalog when a partition exists. Currently, `HiveExternalCatalog.createPartitions()` throws `AlreadyExistsException` wrapped by `AnalysisException`.
    
    In the PR, I propose to catch `AlreadyExistsException` in `HiveClientImpl` and replace it by `PartitionsAlreadyExistException`.
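
    A hedged sketch of the translation (the exception classes below are stand-ins for the real Hive/Spark classes, and the wrapper is not the actual `HiveClientImpl` code):

    ```scala
    class AlreadyExistsException(msg: String) extends RuntimeException(msg)
    class PartitionsAlreadyExistException(db: String, table: String, specs: Seq[Map[String, String]])
      extends RuntimeException(
        s"The following partitions already exist in table '$table' database '$db': ${specs.mkString(", ")}")

    // Hedged sketch: translate the metastore exception into the one the in-memory
    // catalogs throw, so callers see a consistent error type.
    def createPartitions(db: String, table: String, specs: Seq[Map[String, String]])(doCreate: => Unit): Unit = {
      try doCreate catch {
        case _: AlreadyExistsException =>
          throw new PartitionsAlreadyExistException(db, table, specs)
      }
    }
    ```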
    
    ### Why are the changes needed?
    The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `PartitionsAlreadyExistException`. To improve user experience with Spark SQL, it would be better to throw the same exception.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    By running existing test suites:
    ```
    $ build/sbt -Phive -Phive-thriftserver "hive/test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
    $ build/sbt -Phive -Phive-thriftserver "hive/test:testOnly org.apache.spark.sql.hive.execution.HiveDDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit fab2995)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30730 from MaxGekk/hive-partition-exception-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    fe38821 View commit details
    Browse the repository at this point in the history

Commits on Dec 13, 2020

  1. [MINOR][UI] Correct JobPage's skipped/pending tableHeaderId

    ### What changes were proposed in this pull request?
    
    Currently, the Spark Web UI job page's header links for pending/skipped stages are inconsistent with their statuses. See the picture below:
    ![image](https://user-images.githubusercontent.com/9404831/101998894-1e843180-3c8c-11eb-8d94-10df9edb68e7.png)
    
    ### Why are the changes needed?
    
    The code determining the `pendingOrSkippedTableId` has the wrong logic. As explained in the code:
    > If the job is completed, then any pending stages are displayed as "skipped" [code pointer](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L266)
    
    This PR fixes the logic for `pendingOrSkippedTableId` which aligns with the stage statuses.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Verified that header link is consistent with stage status with the fix.
    
    Closes apache#30749 from linzebing/ui_bug.
    
    Authored-by: linzebing <linzebing1995@gmail.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    (cherry picked from commit 0277fdd)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    linzebing authored and sarutak committed Dec 13, 2020
    Configuration menu
    Copy the full SHA
    14e77ab View commit details
    Browse the repository at this point in the history

Commits on Dec 14, 2020

  1. [SPARK-33757][INFRA][R][FOLLOWUP] Provide more simple solution

    ### What changes were proposed in this pull request?
    
    This PR proposes a better solution for the R build failure on GitHub Actions.
    The issue is solved in apache#30737 but I noticed the following two things.
    
    * We can use the latest `usethis` if we install additional libraries on the GitHub Actions environment.
    * For tests on AppVeyor, `usethis` is not necessary, so I partially revert the previous change.
    
    ### Why are the changes needed?
    
    For a simpler solution.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Confirmed on GitHub Actions and AppVeyor on my account.
    
    Closes apache#30753 from sarutak/followup-SPARK-33757.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit b135db3)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 14, 2020
    Configuration menu
    Copy the full SHA
    7cd1aab View commit details
    Browse the repository at this point in the history
  2. [SPARK-33770][SQL][TESTS][3.1][3.0] Fix the `ALTER TABLE .. DROP PART…

    …ITION` tests that delete files out of partition path
    
    ### What changes were proposed in this pull request?
    Modify the tests that add partitions with `LOCATION`, and where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) a folder outside of the "base" path in `LOCATION`.
    
    The problem belongs to Hive's MetaStore method `drop_partition_common`:
    https://github.com/apache/hive/blob/8696c82d07d303b6dbb69b4d443ab6f2b241b251/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4876
    which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. In the case when the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside of the base path.
    
    ### Why are the changes needed?
    To fix test failures like apache#30643 (comment):
    ```
    org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values
    sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist;
    	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
    	at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
    ...
    Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist
    	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381)
    	at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source)
    ```
    
    The issue can be reproduced by the following steps:
    1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location`
    2. Create a sub-folder in the base folder and drop permissions for it:
    ```
    $ mkdir /Users/maximgekk/tmp/part-location/aaa
    $ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa
    $ ls -al /Users/maximgekk/tmp/part-location
    total 0
    drwxr-xr-x   3 maximgekk  staff    96 Dec 13 18:42 .
    drwxr-xr-x  33 maximgekk  staff  1056 Dec 13 18:32 ..
    d---------   2 maximgekk  staff    64 Dec 13 18:42 aaa
    ```
    3. Create a table with a partition folder in the base folder:
    ```sql
    spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int);
    spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl';
    ```
    4. Try to drop this partition:
    ```
    spark-sql> alter table tbl drop partition (part0=1,part1=2);
    20/12/13 18:46:07 ERROR HiveClientImpl:
    ======================
    Attempt to drop the partition specs in table 'tbl' database 'default':
    Map(part0 -> 1, part1 -> 2)
    In this attempt, the following partitions have been dropped successfully:
    
    The remaining partitions have not been dropped:
    [1, 2]
    ======================
    
    Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
    org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
    ```
    The command fails because it tries to access the sub-folder `aaa`, which is outside of the partition path `/Users/maximgekk/tmp/part-location/tbl`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the affected tests from local IDEA which does not have access to folders out of partition paths.
    
    Lead-authored-by: Max Gekk <max.gekk@gmail.com>
    Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 9160d59)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30756 from MaxGekk/fix-drop-partition-location-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 14, 2020
    Configuration menu
    Copy the full SHA
    d652b47 View commit details
    Browse the repository at this point in the history

Commits on Dec 16, 2020

  1. [SPARK-33786][SQL][3.0] The storage level for a cache should be respe…

    …cted when a table name is altered
    
    ### What changes were proposed in this pull request?
    
    This is a back port of apache#30774.
    
    This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`.
    
    ### Why are the changes needed?
    
    Currently, when a table name is altered, the table's cache is refreshed (if it exists), but the storage level is not retained. For example:
    ```scala
            def getStorageLevel(tableName: String): StorageLevel = {
              val table = spark.table(tableName)
              val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get
              cachedData.cachedRepresentation.cacheBuilder.storageLevel
            }
    
            Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
            sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
            sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
            val oldStorageLevel = getStorageLevel("old")
    
            sql("ALTER TABLE old RENAME TO new")
            val newStorageLevel = getStorageLevel("new")
    ```
    `oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now the storage level for the cache will be retained.
    
    ### How was this patch tested?
    
    Added a unit test.
    
    Closes apache#30793 from imback82/alter_table_rename_cache_fix_3.0.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Dec 16, 2020
    Configuration menu
    Copy the full SHA
    f2c8079 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33788][SQL][3.1][3.0][2.4] Throw NoSuchPartitionsException fro…

    …m HiveExternalCatalog.dropPartitions()
    
    ### What changes were proposed in this pull request?
    Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existing partitions of a table in the V1 Hive external catalog.
    
    ### Why are the changes needed?
    The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `NoSuchPartitionsException`. To improve user experience with Spark SQL, it would be better to throw the same exception.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the command throws `NoSuchPartitionsException` instead of the general exception `AnalysisException`.
    
    ### How was this patch tested?
    By running new UT via:
    ```
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveDDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 3dfdcf4)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30802 from MaxGekk/hive-drop-partition-exception-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 16, 2020
    Configuration menu
    Copy the full SHA
    a77b70d View commit details
    Browse the repository at this point in the history
  3. [SPARK-33793][TESTS][3.0] Introduce withExecutor to ensure proper cle…

    …anup in tests
    
    Backport of: apache#30783
    
    ### What changes were proposed in this pull request?
    This PR introduces a helper method `withExecutor` that handles the creation of an Executor object and ensures that it is always stopped in a finally block. The tests in ExecutorSuite have been refactored to use this method.
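
    A hedged sketch of the pattern (the real helper lives in Spark's test code and takes Spark-specific `Executor` arguments; `Executor` below is a stand-in trait):

    ```scala
    trait Executor { def stop(): Unit }

    // Hedged sketch: build the executor, run the test body, and guarantee cleanup
    // in a finally block even if the body throws.
    def withExecutor[T](newExecutor: => Executor)(body: Executor => T): T = {
      val executor = newExecutor
      try {
        body(executor)
      } finally {
        executor.stop()
      }
    }
    ```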
    
    ### Why are the changes needed?
    Recently it was discovered that leaked Executors (which are not explicitly stopped after a test) can cause other tests to fail because the JVM is killed after 10 min. It is therefore crucial that tests always stop the Executor. By introducing this helper method, a simple pattern is established that can be easily adopted in new tests, which reduces the risk of regressions.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Run the ExecutorSuite locally.
    
    Closes apache#30801 from sander-goos/SPARK-33793-close-executors-3.0.
    
    Authored-by: Sander Goos <sander.goos@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    sander-goos authored and HyukjinKwon committed Dec 16, 2020
    Configuration menu
    Copy the full SHA
    da272f7 View commit details
    Browse the repository at this point in the history

Commits on Dec 17, 2020

  1. [SPARK-33733][SQL][3.0] PullOutNondeterministic should check and coll…

    …ect deterministic field
    
    backport [apache#30703](apache#30703) for branch-3.0.
    
    ### What changes were proposed in this pull request?
    
    The deterministic field is wider than `NonDeterministic`; we should keep the same range between pull-out and check analysis.
    
    ### Why are the changes needed?
    
    For example
    ```
    select * from values(1), (4) as t(c1) order by java_method('java.lang.Math', 'abs', c1)
    ```
    
    We will get an exception since `java_method`'s deterministic field is false but it is not a `NonDeterministic`
    ```
    Exception in thread "main" org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
    Project, Filter, Aggregate or Window, found:
     java_method('java.lang.Math', 'abs', t.`c1`) ASC NULLS FIRST
    in operator Sort [java_method(java.lang.Math, abs, c1#1) ASC NULLS FIRST], true
                   ;;
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes apache#30771 from ulysses-you/SPARK-33733-branch-3.0.
    
    Authored-by: ulysses-you <ulyssesyou18@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ulysses-you authored and cloud-fan committed Dec 17, 2020
    Configuration menu
    Copy the full SHA
    b86ea0f View commit details
    Browse the repository at this point in the history
  2. [SPARK-33819][CORE][3.0] SingleFileEventLogFileReader/RollingEventLog…

    …FilesFileReader should be `package private`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to convert `EventLogFileReader`'s derived classes into `package private`.
    - SingleFileEventLogFileReader
    - RollingEventLogFilesFileReader
    
    `EventLogFileReader` itself is used in `scheduler` module during tests.
    
    ### Why are the changes needed?
    
    These classes were designed to be internal. This PR hides them explicitly to reduce the maintenance burden.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but these were exposed accidentally.
    
    ### How was this patch tested?
    
    Pass CIs.
    
    Closes apache#30820 from dongjoon-hyun/SPARK-33819-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 17, 2020
    Configuration menu
    Copy the full SHA
    cd683b3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33774][UI][CORE] "Back to Master" returns 500 error in Standalo…

    …ne cluster
    
    ### What changes were proposed in this pull request?
    
    Initialize the `masterWebUiUrl` with `webUi.webUrl` instead of the `masterPublicAddress`.
    
    ### Why are the changes needed?
    
    Since [SPARK-21642](https://issues.apache.org/jira/browse/SPARK-21642), `WebUI` has changed from `localHostName` to `localCanonicalHostName` as the hostname to set up the web UI. However, the `masterPublicAddress` is from `RpcEnv`'s host address, which still uses `localHostName`. As a result, it returns the wrong Master web URL to the Worker.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, when users click "Back to Master" in the Worker page:
    
    Before this PR:
    
    <img width="3258" alt="WeChat4acbfd163f51c76a5f9bc388c7479785" src="https://user-images.githubusercontent.com/16397174/102057951-b9664280-3e29-11eb-8749-5ee293902bdf.png">
    
    After this PR:
    
    ![image](https://user-images.githubusercontent.com/16397174/102058016-d438b700-3e29-11eb-8641-a23a6b2f542e.png)
    
    (Return to the Master page successfully.)
    
    ### How was this patch tested?
    
    Tested manually.
    
    Closes apache#30759 from Ngone51/fix-back-to-master.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 34e4d87)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    Ngone51 authored and srowen committed Dec 17, 2020
    Configuration menu
    Copy the full SHA
    99eb027 View commit details
    Browse the repository at this point in the history

Commits on Dec 18, 2020

  1. [SPARK-33822][SQL] Use the CastSupport.cast method in HashJoin

    ### What changes were proposed in this pull request?
    
    This PR intends to fix the bug that throws an unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](apache@031c5ef)):
    ```
    java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
      at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ...
    ```
    
    I've checked the AQE code and found that `EnsureRequirements` wrongly puts `BroadcastExchange` on top of `BroadcastQueryStage` in the `reOptimize` phase, as follows:
    ```
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#2183]
      +- BroadcastQueryStage 2
        +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#1963]
    ```
    A root cause is that a `Cast` class in a required child's distribution does not have a `timeZoneId` field (`timeZoneId=None`), and a `Cast` class in `child.outputPartitioning` has it. So, this difference can make the distribution requirement check fail in `EnsureRequirements`:
    https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50
    
    The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
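
    A conceptual illustration of the mismatch (heavily simplified, not Spark's `Cast` class): two casts that are semantically the same but differ in an optional field do not compare equal, so an equality-based distribution check fails.

    ```scala
    // Hedged sketch: a stand-in case class showing why the requirement check fails.
    case class Cast(child: String, dataType: String, timeZoneId: Option[String])

    val required = Cast("input[0]", "bigint", timeZoneId = None)        // built in HashJoin
    val provided = Cast("input[0]", "bigint", timeZoneId = Some("UTC")) // from outputPartitioning

    assert(required != provided) // the mismatch that made EnsureRequirements re-add an exchange
    ```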
    
    ### Why are the changes needed?
    
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked that q5 passed.
    
    Closes apache#30818 from maropu/BugfixInAQE.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 51ef443)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    maropu authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    3ef6827 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    5ce0b9f View commit details
    Browse the repository at this point in the history
  3. [SPARK-33822][SQL][3.0] Use the CastSupport.cast method in HashJoin

    ### What changes were proposed in this pull request?
    
    This PR intends to fix the bug that throws an unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](apache@031c5ef)):
    ```
    java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
      at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ...
    ```
    
    I've checked the AQE code and found that `EnsureRequirements` wrongly puts a `BroadcastExchange` on top of a `BroadcastQueryStage` in the `reOptimize` phase, as follows:
    ```
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#2183]
      +- BroadcastQueryStage 2
        +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#1963]
    ```
    The root cause is that the `Cast` expression in the required child distribution does not have a `timeZoneId` (`timeZoneId=None`), while the `Cast` in `child.outputPartitioning` does. This difference makes the distribution requirement check fail in `EnsureRequirements`:
    https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50
    
    The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
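    
    For illustration, here is a minimal standalone sketch of the mismatch (this is not the `HashJoin` code itself; `Literal(1)` and the types are placeholders):
    
    ```scala
    // A Cast built without a timeZoneId is not equal to a Cast built with one,
    // even though both describe the same conversion, so the required and actual
    // distributions stop matching in EnsureRequirements.
    import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
    import org.apache.spark.sql.internal.SQLConf
    import org.apache.spark.sql.types.LongType
    
    val withoutTz = Cast(Literal(1), LongType)  // timeZoneId = None
    val withTz = Cast(Literal(1), LongType, Option(SQLConf.get.sessionLocalTimeZone))
    assert(withoutTz != withTz)  // CastSupport.cast always fills in the time zone
    ```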
    
    This is a backport PR for apache#30818.
    
    ### Why are the changes needed?
    
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked that q5 passed with AQE enabled.
    
    Closes apache#30830 from maropu/SPARK-33822-BRANCH3.0.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    maropu authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    1615b0e View commit details
    Browse the repository at this point in the history
  4. [SPARK-33831][UI] Update to jetty 9.4.34

    Update Jetty to 9.4.34
    
    Picks up fixes and improvements, including a possible CVE fix.
    
    https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
    https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
    
    No.
    
    Existing tests.
    
    Closes apache#30828 from srowen/SPARK-33831.
    
    Authored-by: Sean Owen <srowen@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 131a23d)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    srowen authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    faf4a0e View commit details
    Browse the repository at this point in the history
  5. [SPARK-33593][SQL][3.0] Vector reader got incorrect data with binary …

    …partition value
    
    ### What changes were proposed in this pull request?
    
    Currently, when the Parquet vectorized reader is enabled, using a binary type as a partition column returns an incorrect value, as shown by the UT below:
    ```scala
    test("Parquet vector reader incorrect with binary partition value") {
      Seq(false, true).foreach(tag => {
        withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
          withTable("t1") {
            sql(
              """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
                | USING PARQUET PARTITIONED BY (part)""".stripMargin)
            sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
            if (tag) {
              checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
                Row("a", "Spark SQL", ""))
            } else {
              checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
                Row("a", "Spark SQL", "Spark SQL"))
            }
          }
        }
      })
    }
    ```
    
    ### Why are the changes needed?
    Fix a data correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes apache#30839 from AngersZhuuuu/SPARK-33593-3.0.
    
    Authored-by: angerszhu <angers.zhu@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    AngersZhuuuu authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    f67c3c2 View commit details
    Browse the repository at this point in the history
  6. [SPARK-33841][CORE][3.0] Fix issue with jobs disappearing intermitten…

    …tly from the SHS under high load
    
    ### What changes were proposed in this pull request?
    
    Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale, and check for this mark before deleting an event log. This fixes the issue where a job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
    
    ### Why are the changes needed?
    
    The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which is designated to improve the concurrent performance of the History Server. The [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because of missing proper synchronization for `processing` event log entries. Since SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries do not have a chance to be [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time and thus any entity that completes processing right after [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entities](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) will be identified as stale and will be deleted from the UI until the next `checkForLogs` run. This is because [updated `lastProcessed` time is used as criteria](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and event log entries that missed to be updated with a new time, will match that criteria.
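    
    In rough pseudocode, the guard described above looks like this (a toy model with hypothetical names, not the actual FsHistoryProvider code):
    
    ```scala
    // An entry that finishes processing between the filtering step and the stale
    // check keeps an old lastProcessed timestamp; remembering which entries were
    // "processing" at the start of the run prevents deleting them in that run.
    case class LogEntry(path: String, lastProcessed: Long)
    
    def staleEntries(
        entries: Seq[LogEntry],
        scanStartTime: Long,
        processingAtStart: Set[String]): Seq[LogEntry] =
      entries.filter(e => e.lastProcessed < scanStartTime && !processingAtStart.contains(e.path))
    ```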
    
    The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. Around 800 (82.6 MB) copies of an event log file were created using the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) script. The SHS then showed strange behavior when counting the total number of applications: at first, the number increased as expected, but on the next page refresh the total decreased. No errors were logged by SHS.
    
    241 entities are displayed at `20:50:42`:
    ![1-241-entities-at-20-50](https://user-images.githubusercontent.com/61428392/102611539-c2138d00-4137-11eb-9bbd-d77b22041f3b.png)
    203 entities are displayed at `20:52:17`:
    ![2-203-entities-at-20-52](https://user-images.githubusercontent.com/61428392/102611561-cdff4f00-4137-11eb-91ed-7405fe58a695.png)
    The number of loaded applications over time:
    ![4-loaded-applications](https://user-images.githubusercontent.com/61428392/102611586-d8b9e400-4137-11eb-8747-4007fc5469de.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, SHS users won't face the behavior when the number of displayed applications decreases periodically.
    
    ### How was this patch tested?
    
    Tested using [shs-monitor](https://github.com/vladhlinsky/shs-monitor) script:
    * Build SHS with the proposed change
    * Download Hadoop AWS and AWS Java SDK
    * Prepare S3 bucket and user for programmatic access, grant required roles to the user. Get access key and secret key
    * Configure SHS to read event logs from S3
    * Start [monitor](https://github.com/vladhlinsky/shs-monitor/blob/main/monitor.sh) script to query SHS API
    * Run 8 [producers](https://github.com/vladhlinsky/shs-monitor/blob/main/producer.sh) for ~10 mins, create 805(83.1 MB) event log copies
    * Wait for SHS to load all the applications
    * Verify that the number of loaded applications increases continuously over time
    ![5-loaded-applications-fixed](https://user-images.githubusercontent.com/61428392/102617363-bf1d9a00-4141-11eb-9bae-f982d02fd30f.png)
    
    For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) repository.
    
    Closes apache#30842 from vladhlinsky/SPARK-33841-branch-3.0.
    
    Authored-by: Vlad Glinsky <vladhlinsky@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    vhlinskyi authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    7881622 View commit details
    Browse the repository at this point in the history

Commits on Dec 20, 2020

  1. [SPARK-33756][SQL] Make BytesToBytesMap's MapIterator idempotent

    ### What changes were proposed in this pull request?
    Make the `hasNext` method of BytesToBytesMap's MapIterator idempotent
    
    ### Why are the changes needed?
    `hasNext` may be called multiple times; if not guarded, a second call after reaching the end of the iterator throws a NoSuchElementException.
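    
    As an illustration of the pattern (a simplified sketch, not the actual `BytesToBytesMap` code), an idempotent `hasNext` caches the lookahead instead of advancing underlying state on every call:
    
    ```scala
    // Repeated hasNext calls after exhaustion keep returning false instead of
    // advancing the underlying iterator and failing.
    class IdempotentIterator[A](underlying: Iterator[A]) extends Iterator[A] {
      private var lookahead: Option[A] = None
    
      override def hasNext: Boolean = lookahead.isDefined || {
        if (underlying.hasNext) { lookahead = Some(underlying.next()); true } else false
      }
    
      override def next(): A = {
        if (!hasNext) throw new NoSuchElementException("iterator exhausted")
        val value = lookahead.get
        lookahead = None
        value
      }
    }
    ```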
    
    ### Does this PR introduce _any_ user-facing change?
    NO.
    
    ### How was this patch tested?
    Update a unit test to cover this case.
    
    Closes apache#30728 from advancedxy/SPARK-33756.
    
    Authored-by: Xianjin YE <advancedxy@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 1339168)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    advancedxy authored and srowen committed Dec 20, 2020
    Configuration menu
    Copy the full SHA
    faf8dd5 View commit details
    Browse the repository at this point in the history

Commits on Dec 21, 2020

  1. [SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show …

    …subquery code
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue that `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries.
    
    The following example is about `EXPLAIN CODEGEN`.
    ```
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    val df = spark.range(1, 100)
    df.createTempView("df")
    spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
    
    scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
    Found 1 WholeStageCodegen subtrees.
    == Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) ==
    *(1) Project [Subquery scalar-subquery#3, [id=apache#24] AS scalarsubquery()#5L]
    :  +- Subquery scalar-subquery#3, [id=apache#24]
    :     +- *(2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L])
    :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=apache#20]
    :           +- *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
    :              +- *(1) Range (1, 100, step=1, splits=12)
    +- *(1) Scan OneRowRelation[]
    
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ // codegenStageId=1
    /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 007 */   private Object[] references;
    /* 008 */   private scala.collection.Iterator[] inputs;
    /* 009 */   private scala.collection.Iterator rdd_input_0;
    /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
    /* 011 */
    /* 012 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
    /* 013 */     this.references = references;
    /* 014 */   }
    /* 015 */
    /* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 017 */     partitionIndex = index;
    /* 018 */     this.inputs = inputs;
    /* 019 */     rdd_input_0 = inputs[0];
    /* 020 */     project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 021 */
    /* 022 */   }
    /* 023 */
    /* 024 */   private void project_doConsume_0() throws java.io.IOException {
    /* 025 */     // common sub-expressions
    /* 026 */
    /* 027 */     project_mutableStateArray_0[0].reset();
    /* 028 */
    /* 029 */     if (false) {
    /* 030 */       project_mutableStateArray_0[0].setNullAt(0);
    /* 031 */     } else {
    /* 032 */       project_mutableStateArray_0[0].write(0, 1L);
    /* 033 */     }
    /* 034 */     append((project_mutableStateArray_0[0].getRow()));
    /* 035 */
    /* 036 */   }
    /* 037 */
    /* 038 */   protected void processNext() throws java.io.IOException {
    /* 039 */     while ( rdd_input_0.hasNext()) {
    /* 040 */       InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next();
    /* 041 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
    /* 042 */       project_doConsume_0();
    /* 043 */       if (shouldStop()) return;
    /* 044 */     }
    /* 045 */   }
    /* 046 */
    /* 047 */ }
    ```
    
    After this change, the corresponding code for subqueries is shown.
    ```
    Found 3 WholeStageCodegen subtrees.
    == Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) ==
    *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
    +- *(1) Range (1, 100, step=1, splits=12)
    
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ // codegenStageId=1
    /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 007 */   private Object[] references;
    /* 008 */   private scala.collection.Iterator[] inputs;
    /* 009 */   private boolean agg_initAgg_0;
    /* 010 */   private boolean agg_bufIsNull_0;
    /* 011 */   private long agg_bufValue_0;
    /* 012 */   private boolean range_initRange_0;
    /* 013 */   private long range_nextIndex_0;
    /* 014 */   private TaskContext range_taskContext_0;
    /* 015 */   private InputMetrics range_inputMetrics_0;
    /* 016 */   private long range_batchEnd_0;
    /* 017 */   private long range_numElementsTodo_0;
    /* 018 */   private boolean agg_agg_isNull_2_0;
    /* 019 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3];
    /* 020 */
    /* 021 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
    /* 022 */     this.references = references;
    /* 023 */   }
    /* 024 */
    /* 025 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 026 */     partitionIndex = index;
    /* 027 */     this.inputs = inputs;
    /* 028 */
    /* 029 */     range_taskContext_0 = TaskContext.get();
    /* 030 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
    /* 031 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 032 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 033 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 034 */
    /* 035 */   }
    /* 036 */
    /* 037 */   private void agg_doAggregateWithoutKey_0() throws java.io.IOException {
    /* 038 */     // initialize aggregation buffer
    /* 039 */     agg_bufIsNull_0 = true;
    /* 040 */     agg_bufValue_0 = -1L;
    /* 041 */
    /* 042 */     // initialize Range
    /* 043 */     if (!range_initRange_0) {
    /* 044 */       range_initRange_0 = true;
    /* 045 */       initRange(partitionIndex);
    /* 046 */     }
    /* 047 */
    /* 048 */     while (true) {
    /* 049 */       if (range_nextIndex_0 == range_batchEnd_0) {
    /* 050 */         long range_nextBatchTodo_0;
    /* 051 */         if (range_numElementsTodo_0 > 1000L) {
    /* 052 */           range_nextBatchTodo_0 = 1000L;
    /* 053 */           range_numElementsTodo_0 -= 1000L;
    /* 054 */         } else {
    /* 055 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
    /* 056 */           range_numElementsTodo_0 = 0;
    /* 057 */           if (range_nextBatchTodo_0 == 0) break;
    /* 058 */         }
    /* 059 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
    /* 060 */       }
    /* 061 */
    /* 062 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
    /* 063 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
    /* 064 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
    /* 065 */
    /* 066 */         agg_doConsume_0(range_value_0);
    /* 067 */
    /* 068 */         // shouldStop check is eliminated
    /* 069 */       }
    /* 070 */       range_nextIndex_0 = range_batchEnd_0;
    /* 071 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
    /* 072 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
    /* 073 */       range_taskContext_0.killTaskIfInterrupted();
    /* 074 */     }
    /* 075 */
    /* 076 */   }
    /* 077 */
    /* 078 */   private void initRange(int idx) {
    /* 079 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
    /* 080 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L);
    /* 081 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L);
    /* 082 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
    /* 083 */     java.math.BigInteger start = java.math.BigInteger.valueOf(1L);
    /* 084 */     long partitionEnd;
    /* 085 */
    /* 086 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
    /* 087 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
    /* 088 */       range_nextIndex_0 = Long.MAX_VALUE;
    /* 089 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
    /* 090 */       range_nextIndex_0 = Long.MIN_VALUE;
    /* 091 */     } else {
    /* 092 */       range_nextIndex_0 = st.longValue();
    /* 093 */     }
    /* 094 */     range_batchEnd_0 = range_nextIndex_0;
    /* 095 */
    /* 096 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
    /* 097 */     .multiply(step).add(start);
    /* 098 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
    /* 099 */       partitionEnd = Long.MAX_VALUE;
    /* 100 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
    /* 101 */       partitionEnd = Long.MIN_VALUE;
    /* 102 */     } else {
    /* 103 */       partitionEnd = end.longValue();
    /* 104 */     }
    /* 105 */
    /* 106 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
    /* 107 */       java.math.BigInteger.valueOf(range_nextIndex_0));
    /* 108 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
    /* 109 */     if (range_numElementsTodo_0 < 0) {
    /* 110 */       range_numElementsTodo_0 = 0;
    /* 111 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
    /* 112 */       range_numElementsTodo_0++;
    /* 113 */     }
    /* 114 */   }
    /* 115 */
    /* 116 */   private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException {
    /* 117 */     // do aggregate
    /* 118 */     // common sub-expressions
    /* 119 */
    /* 120 */     // evaluate aggregate functions and update aggregation buffers
    /* 121 */
    /* 122 */     agg_agg_isNull_2_0 = true;
    /* 123 */     long agg_value_2 = -1L;
    /* 124 */
    /* 125 */     if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 ||
    /* 126 */         agg_value_2 > agg_bufValue_0)) {
    /* 127 */       agg_agg_isNull_2_0 = false;
    /* 128 */       agg_value_2 = agg_bufValue_0;
    /* 129 */     }
    /* 130 */
    /* 131 */     if (!false && (agg_agg_isNull_2_0 ||
    /* 132 */         agg_value_2 > agg_expr_0_0)) {
    /* 133 */       agg_agg_isNull_2_0 = false;
    /* 134 */       agg_value_2 = agg_expr_0_0;
    /* 135 */     }
    /* 136 */
    /* 137 */     agg_bufIsNull_0 = agg_agg_isNull_2_0;
    /* 138 */     agg_bufValue_0 = agg_value_2;
    /* 139 */
    /* 140 */   }
    /* 141 */
    /* 142 */   protected void processNext() throws java.io.IOException {
    /* 143 */     while (!agg_initAgg_0) {
    /* 144 */       agg_initAgg_0 = true;
    /* 145 */       long agg_beforeAgg_0 = System.nanoTime();
    /* 146 */       agg_doAggregateWithoutKey_0();
    /* 147 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000);
    /* 148 */
    /* 149 */       // output the result
    /* 150 */
    /* 151 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);
    /* 152 */       range_mutableStateArray_0[2].reset();
    /* 153 */
    /* 154 */       range_mutableStateArray_0[2].zeroOutNullBytes();
    /* 155 */
    /* 156 */       if (agg_bufIsNull_0) {
    /* 157 */         range_mutableStateArray_0[2].setNullAt(0);
    /* 158 */       } else {
    /* 159 */         range_mutableStateArray_0[2].write(0, agg_bufValue_0);
    /* 160 */       }
    /* 161 */       append((range_mutableStateArray_0[2].getRow()));
    /* 162 */     }
    /* 163 */   }
    /* 164 */
    /* 165 */ }
    ```
    
    ### Why are the changes needed?
    
    For better debuggability.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`.
    
    ### How was this patch tested?
    
    New test.
    
    Closes apache#30859 from sarutak/explain-codegen-subqueries.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f4e1069)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    sarutak authored and dongjoon-hyun committed Dec 21, 2020
    Configuration menu
    Copy the full SHA
    78dbb4a View commit details
    Browse the repository at this point in the history
  2. [SPARK-33869][PYTHON][SQL][TESTS] Have a separate metastore directory…

    … for each PySpark test job
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to give each PySpark test job its own metastore directory, to avoid potential conflicts in catalog operations.
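    
    As a rough sketch of the idea (the configuration key and Derby URL below are assumptions for illustration, not the exact change to the test scripts), each job can point at its own Derby metastore directory:
    
    ```scala
    // Hypothetical sketch: give each test job a private metastore_db so concurrent
    // jobs do not race on the same Derby database.
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession
    
    val metastoreDir = Files.createTempDirectory("metastore_").toString
    val spark = SparkSession.builder()
      .master("local[2]")
      .config("javax.jdo.option.ConnectionURL",
        s"jdbc:derby:;databaseName=$metastoreDir/metastore_db;create=true")
      .enableHiveSupport()
      .getOrCreate()
    ```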
    
    ### Why are the changes needed?
    
    To make PySpark tests less flaky.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested by trying some sleeps in apache#30873.
    
    Closes apache#30875 from HyukjinKwon/SPARK-33869.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 38bbcca)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon authored and dongjoon-hyun committed Dec 21, 2020
    Configuration menu
    Copy the full SHA
    c9fe712 View commit details
    Browse the repository at this point in the history

Commits on Dec 22, 2020

  1. [SPARK-28863][SQL][FOLLOWUP][3.0] Make sure optimized plan will not b…

    …e re-analyzed
    
    backport apache#30777 to 3.0
    
    ----------
    
    ### What changes were proposed in this pull request?
    
    It's a known issue that re-analyzing an optimized plan can lead to various problems. We made several attempts to prevent it, but the current solution, `AlreadyOptimized`, is still not 100% safe, as people can inject catalyst rules that call the analyzer directly.
    
    This PR proposes a simpler and safer idea: set the `analyzed` flag to true after optimization, and have the analyzer skip plans whose `analyzed` flag is already true.
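    
    A toy model of the guard (hypothetical names, not the real `QueryExecution`/`Analyzer` code):
    
    ```scala
    // Once a plan is flagged as analyzed, the analyzer returns it untouched
    // instead of re-running its rules on an already-optimized plan.
    case class Plan(analyzed: Boolean)
    
    def analyze(plan: Plan): Plan =
      if (plan.analyzed) plan                 // skip: already analyzed (or optimized)
      else plan.copy(analyzed = true)         // run the analysis rules, then mark
    ```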
    
    ### Why are the changes needed?
    
    make the code simpler and safer
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    existing tests.
    
    Closes apache#30872 from cloud-fan/ds.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    cloud-fan authored and HyukjinKwon committed Dec 22, 2020
    Configuration menu
    Copy the full SHA
    0820beb View commit details
    Browse the repository at this point in the history
  2. [SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst matc…

    …h special Array value
    
    ### What changes were proposed in this pull request?
    
    Add cases to match arrays whose element type is primitive.
    
    ### Why are the changes needed?
    
    We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`:
    ```
    Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found.
    	at scala.Predef$.require(Predef.scala:281)
    	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215)
    	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292)
    	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140)
    ```
    The same problem occurs for other arrays whose element type is primitive.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes apache#30868 from ulysses-you/SPARK-33860.
    
    Authored-by: ulysses-you <ulyssesyou18@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1dd63dc)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    ulysses-you authored and HyukjinKwon committed Dec 22, 2020
    Configuration menu
    Copy the full SHA
    7af54fd View commit details
    Browse the repository at this point in the history
  3. [BUILD][MINOR] Do not publish snapshots from forks

    ### What changes were proposed in this pull request?
    The GitHub workflow `Publish Snapshot` publishes the master and 3.1 branches via Nexus. For this, the workflow uses the `secrets.NEXUS_USER` and `secrets.NEXUS_PW` secrets. These are not available in forks, where this workflow therefore fails every day:
    
    - https://github.com/G-Research/spark/actions/runs/431626797
    - https://github.com/G-Research/spark/actions/runs/433153049
    - https://github.com/G-Research/spark/actions/runs/434680048
    - https://github.com/G-Research/spark/actions/runs/436958780
    
    ### Why are the changes needed?
    Avoid attempting to publish snapshots from forked repositories.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Code review only.
    
    Closes apache#30884 from EnricoMi/branch-do-not-publish-snapshots-from-forks.
    
    Authored-by: Enrico Minack <github@enrico.minack.dev>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1d45025)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    EnricoMi authored and HyukjinKwon committed Dec 22, 2020
    Configuration menu
    Copy the full SHA
    73f5626 View commit details
    Browse the repository at this point in the history

Commits on Dec 23, 2020

  1. Revert "[SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatal…

    …yst match special Array value"
    
    This reverts commit 7af54fd.
    HyukjinKwon committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    4299a48 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33891][DOCS][CORE] Update dynamic allocation related documents

    ### What changes were proposed in this pull request?
    
    This PR aims to update the followings.
    - Remove the outdated requirement for `spark.shuffle.service.enabled` in `configuration.md`
    - Dynamic allocation section in `job-scheduling.md`
    
    ### Why are the changes needed?
    
    To make the document up-to-date.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it's a documentation update.
    
    ### How was this patch tested?
    
    Manual.
    
    **BEFORE**
    ![Screen Shot 2020-12-23 at 2 22 04 AM](https://user-images.githubusercontent.com/9700541/102986441-ae647f80-44c5-11eb-97a3-87c2d368952a.png)
    ![Screen Shot 2020-12-23 at 2 22 34 AM](https://user-images.githubusercontent.com/9700541/102986473-bcb29b80-44c5-11eb-8eae-6802001c6dfa.png)
    
    **AFTER**
    ![Screen Shot 2020-12-23 at 2 25 36 AM](https://user-images.githubusercontent.com/9700541/102986767-2df24e80-44c6-11eb-8540-e74856a4c313.png)
    ![Screen Shot 2020-12-23 at 2 21 13 AM](https://user-images.githubusercontent.com/9700541/102986366-8e34c080-44c5-11eb-8054-1efd07c9458c.png)
    
    Closes apache#30906 from dongjoon-hyun/SPARK-33891.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 47d1aa4)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    8c4e166 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consumin…

    …g after the task ends
    
    ### What changes were proposed in this pull request?
    
    This is a retry of apache#30177.
    
    This is not a complete fix, but a complete one would take a long time (apache#30242).
    As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.
    
    As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
    
    ### Why are the changes needed?
    
    Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
    
    E.g.,:
    
    ```py
    spark.range(0, 100000, 1, 1).write.parquet(path)
    
    spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
    
    def f(x):
        return 0
    
    fUdf = udf(f, LongType())
    
    spark.read.parquet(path).select(fUdf('id')).head()
    ```
    
    This is because the Python evaluation consumes the parent iterator in a separate thread, and it can keep consuming data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, this can cause a segmentation fault that crashes the executor.
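    
    A minimal sketch of the wrapper's idea (close to, but not necessarily identical to, the class added in this PR):
    
    ```scala
    // Stop yielding elements once the task has completed or been interrupted, so a
    // consumer running in another thread cannot read past the task's lifetime.
    import org.apache.spark.TaskContext
    
    class ContextAwareIterator[T](context: TaskContext, delegate: Iterator[T])
      extends Iterator[T] {
    
      override def hasNext: Boolean =
        !context.isCompleted() && !context.isInterrupted() && delegate.hasNext
    
      override def next(): T = delegate.next()
    }
    ```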
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests, and manually.
    
    Closes apache#30899 from ueshin/issues/SPARK-33277/context_aware_iterator.
    
    Authored-by: Takuya UESHIN <ueshin@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 5c9b421)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ueshin authored and dongjoon-hyun committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    83adba7 View commit details
    Browse the repository at this point in the history

Commits on Dec 24, 2020

  1. [SPARK-33900][WEBUI] Show shuffle read size / records correctly when …

    …only remotebytesread is available
    
    ### What changes were proposed in this pull request?
    Shuffle Read Size / Records should also be displayed when remoteBytesRead > 0 and localBytesRead = 0.
    
    current:
    ![image](https://user-images.githubusercontent.com/3898450/103079421-c4ca2280-460e-11eb-9e2f-49d35b5d324d.png)
    fix:
    ![image](https://user-images.githubusercontent.com/3898450/103079439-cc89c700-460e-11eb-9a41-6b2882980d11.png)
    
    ### Why are the changes needed?
    At present, the page only displays Shuffle Read Size / Records when localBytesRead > 0.
    When there is only remote reading, these metrics cannot be seen on the stage page.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    manual test
    
    Closes apache#30916 from cxzl25/SPARK-33900.
    
    Authored-by: sychen <sychen@ctrip.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    (cherry picked from commit 700f5ab)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    cxzl25 authored and sarutak committed Dec 24, 2020
    Configuration menu
    Copy the full SHA
    1445129 View commit details
    Browse the repository at this point in the history

Commits on Dec 27, 2020

  1. [SPARK-33911][SQL][DOCS][3.0] Update the SQL migration guide about ch…

    …anges in `HiveClientImpl`
    
    ### What changes were proposed in this pull request?
    Update the SQL migration guide about the changes made by:
    - apache#30778
    - apache#30711
    
    ### Why are the changes needed?
    To inform users about the recent changes in the upcoming releases.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    N/A
    
    Closes apache#30932 from MaxGekk/sql-migr-guide-hiveclientimpl-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 27, 2020
    Configuration menu
    Copy the full SHA
    65dd1d0 View commit details
    Browse the repository at this point in the history

Commits on Dec 30, 2020

  1. [MINOR][SS] Call fetchEarliestOffsets when it is necessary

    ### What changes were proposed in this pull request?
    
    This minor patch makes two variables that are computed by calling `fetchEarliestOffsets` `lazy`, because these values are not always needed.
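    
    The pattern in a nutshell (illustrative only; the names below are not the actual Kafka source fields):
    
    ```scala
    // A lazy val defers the expensive Kafka RPC until the value is first read, so
    // code paths that never touch it skip the call entirely.
    def fetchEarliestOffsets(): Map[Int, Long] = {
      println("expensive Kafka RPC")  // stands in for the real admin/consumer call
      Map(0 -> 0L)
    }
    
    lazy val earliestOffsets: Map[Int, Long] = fetchEarliestOffsets()
    ```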
    
    ### Why are the changes needed?
    
    To avoid unnecessary Kafka RPC calls.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30969 from viirya/ss-minor3.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 4a669f5)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Dec 30, 2020
    Configuration menu
    Copy the full SHA
    91a2260 View commit details
    Browse the repository at this point in the history

Commits on Dec 31, 2020

  1. [SPARK-33942][DOCS] Remove hiveClientCalls.count in CodeGenerator

    … metrics docs
    
    ### What changes were proposed in this pull request?
    Removed **hiveClientCalls.count** from the CodeGenerator metrics under Component instance = Executor.
    
    ### Why are the changes needed?
    The monitoring documentation displayed wrong information about this metric. I had set up metrics logging in Graphite following the referenced documentation, but this metric was not being reported, so I had to check whether the issue was at my application's end, in the Spark code, or in the documentation. The documentation had the wrong info.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md)
    
    Closes apache#30976 from coderbond007/SPARK-33942.
    
    Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 13e8c28)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Pradyumn Agrawal (pradyumn.ag) authored and dongjoon-hyun committed Dec 31, 2020
    Configuration menu
    Copy the full SHA
    b156c1f View commit details
    Browse the repository at this point in the history

Commits on Jan 1, 2021

  1. [SPARK-33931][INFRA][3.0] Recover GitHub Action build_and_test job

    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30959 .
    This PR aims to recover GitHub Action `build_and_test` job.
    
    ### Why are the changes needed?
    
    Currently, the `build_and_test` job fails to start (in master/branch-3.1 at least) because of the following:
    ```
    r-lib/actions/setup-rv1 is not allowed to be used in apache/spark.
    Actions in this workflow must be: created by GitHub, verified in the GitHub Marketplace,
    within a repository owned by apache or match the following:
    adoptopenjdk/*, apache/*, gradle/wrapper-validation-action.
    ```
    - https://github.com/apache/spark/actions/runs/449826457
    
    ![Screen Shot 2020-12-28 at 10 06 11 PM](https://user-images.githubusercontent.com/9700541/103262174-f1f13a80-4958-11eb-8ceb-631527155775.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a test infra.
    
    ### How was this patch tested?
    
    To check GitHub Action `build_and_test` job on this PR.
    
    Closes apache#30986 from dongjoon-hyun/SPARK-33931-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Jan 1, 2021
    Configuration menu
    Copy the full SHA
    39867a8 View commit details
    Browse the repository at this point in the history

Commits on Jan 3, 2021

  1. [SPARK-33963][SQL] Canonicalize HiveTableRelation w/o table stats

    ### What changes were proposed in this pull request?
    Skip table stats when canonicalizing `HiveTableRelation`.
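    
    A toy illustration of the canonicalization change (hypothetical types, not the real `HiveTableRelation`):
    
    ```scala
    // Canonicalization drops fields that should not affect plan equality, so two
    // relations differing only in statistics compare equal and cache lookups hit.
    case class Stats(sizeInBytes: BigInt)
    case class Relation(table: String, stats: Option[Stats]) {
      def canonicalized: Relation = copy(stats = None)
    }
    
    assert(Relation("t", Some(Stats(10))).canonicalized == Relation("t", None).canonicalized)
    ```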
    
    ### Why are the changes needed?
    The changes fix a regression compared to Spark 3.0; see SPARK-33963.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, Spark behaves as it did in version 3.0.1.
    
    ### How was this patch tested?
    By running new UT:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Closes apache#30995 from MaxGekk/fix-caching-hive-table.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit fc7d016)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Jan 3, 2021
    Configuration menu
    Copy the full SHA
    dda431a View commit details
    Browse the repository at this point in the history
  2. [SPARK-33398] Fix loading tree models prior to Spark 3.0

    ### What changes were proposed in this pull request?
    In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added to `NodeData`, which means a tree model trained in 2.4 cannot be loaded in 3.0/3.1/master.
    The `rawCount` field is only used in training, not in `transform`/`predict`/`featureImportance`, so this PR simply sets it to -1L.
    
    ### Why are the changes needed?
    To support loading old tree models in 3.0/3.1/master.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added test suites.
    
    Closes apache#30889 from zhengruifeng/fix_tree_load.
    
    Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 6b7527e)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    zhengruifeng authored and srowen committed Jan 3, 2021
    Configuration menu
    Copy the full SHA
    9f1bf4e View commit details
    Browse the repository at this point in the history

Commits on Jan 4, 2021

  1. [SPARK-33950][SQL][3.1][3.0] Refresh cache in v1 `ALTER TABLE .. DROP…

    … PARTITION`
    
    ### What changes were proposed in this pull request?
    Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after dropping partitions. In particular, this re-creates the cache associated with the modified table.
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
    spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
    spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
    spark-sql> CACHE TABLE tbl1;
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    ```
    The last query must not return `0	0` since that row was deleted by the previous command.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
    spark-sql> SELECT * FROM tbl1;
    1	1
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 67195d0)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31006 from MaxGekk/drop-partition-refresh-cache-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit eef0e4c)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    MaxGekk authored and dongjoon-hyun committed Jan 4, 2021
    Configuration menu
    Copy the full SHA
    e882c90 View commit details
    Browse the repository at this point in the history

Commits on Jan 5, 2021

  1. [SPARK-34000][CORE] Fix stageAttemptToNumSpeculativeTasks java.util.N…

    …oSuchElementException
    
    ### What changes were proposed in this pull request?
    As the log below shows, Stage 600 can be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 then throws `NoSuchElementException` when it enters `onTaskEnd()`.
    ```
    21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded)
    21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the
    previous stage needs to be re-run, or because a different copy of the task has already succeeded).
    21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default
    21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
    21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception
    java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
            at scala.collection.MapLike.default(MapLike.scala:235)
            at scala.collection.MapLike.default$(MapLike.scala:234)
            at scala.collection.AbstractMap.default(Map.scala:63)
            at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
            at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
            at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
            at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
            at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
            at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
            at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
            at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
            at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
            at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
            at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
            at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
            at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
            at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
            at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
    ```
    
    ### Why are the changes needed?
    To avoid throwing the java.util.NoSuchElementException
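    
    A hedged sketch of the defensive lookup (names simplified relative to the real `ExecutorAllocationListener`):
    
    ```scala
    // Using `get` instead of `apply` tolerates a stage that onStageCompleted has
    // already removed, so a late speculative-task event is dropped instead of
    // throwing NoSuchElementException.
    import scala.collection.mutable
    
    val stageAttemptToNumSpeculativeTasks = mutable.HashMap[(Int, Int), Int]()
    val stageAttempt = (600, 0)
    
    stageAttemptToNumSpeculativeTasks.get(stageAttempt) match {
      case Some(n) => stageAttemptToNumSpeculativeTasks(stageAttempt) = n - 1
      case None    => // stage already completed and cleaned up; ignore the late event
    }
    ```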
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    This is a protective patch; it is not easy to reproduce in a UT because the event order is not fixed in an async queue.
    
    Closes apache#31025 from LantaoJin/SPARK-34000.
    
    Authored-by: LantaoJin <jinlantao@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit a7d3fcd)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    LantaoJin authored and dongjoon-hyun committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    36e845b View commit details
    Browse the repository at this point in the history
  2. [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL docume…

    …ntation build
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use python3 instead of python in SQL documentation build.
    After SPARK-29672, we use `sql/create-docs.sh` everywhere in Spark dev. We should fix it in `sql/create-docs.sh` too.
    This blocks release because the release container does not have `python` but only `python3`.
    
    ### Why are the changes needed?
    
    To unblock the release.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    I manually ran the script
    
    Closes apache#31041 from HyukjinKwon/SPARK-34010.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 8d09f96)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    7a2f4da View commit details
    Browse the repository at this point in the history
  3. [SPARK-33935][SQL][3.0] Fix CBO cost function

    ### What changes were proposed in this pull request?
    
    Changed the cost function in CBO to match documentation.
    
    ### Why are the changes needed?
    
    The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as:
    ```
    The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
    ```
    The implementation in `JoinReorderDP.betterThan` does not match this documentation:
    ```
    def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
          if (other.planCost.card == 0 || other.planCost.size == 0) {
            false
          } else {
            val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
            val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
            relativeRows * conf.joinReorderCardWeight +
              relativeSize * (1 - conf.joinReorderCardWeight) < 1
          }
        }
    ```
    
    This different implementation has an unfortunate consequence:
    given two plans A and B, both A betterThan B and B betterThan A might return false. This happens when one plan has many rows with small sizes and the other has few rows with large sizes.
    
    Example values that exhibit this phenomenon with the default weight value (0.7):
    A.card = 500, B.card = 300
    A.size = 30, B.size = 80
    Both A betterThan B and B betterThan A would have a score above 1 and would return false.
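    
    Plugging these numbers into both formulas (standalone arithmetic, not the `JoinReorderDP` code):
    
    ```scala
    val weight = 0.7
    def relativeScore(card: Double, size: Double, otherCard: Double, otherSize: Double): Double =
      (card / otherCard) * weight + (size / otherSize) * (1 - weight)
    
    relativeScore(500, 30, 300, 80)  // ~1.28 > 1, so "A betterThan B" is false
    relativeScore(300, 80, 500, 30)  // ~1.22 > 1, so "B betterThan A" is false
    
    // With the documented cost rows * weight + size * (1 - weight):
    // cost(A) = 500 * 0.7 + 30 * 0.3 = 359, cost(B) = 300 * 0.7 + 80 * 0.3 = 234,
    // so B would be preferred.
    ```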
    
    This happens with several of the TPCDS queries.
    
    The new implementation does not have this behavior.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    New and existing UTs
    
    Closes apache#31042 from tanelk/SPARK-33935_cbo_cost_function_3.0.
    
    Authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Tanel Kiis authored and dongjoon-hyun committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    1179b8b View commit details
    Browse the repository at this point in the history
  4. [SPARK-33844][SQL][3.0] InsertIntoHiveDir command should check col na…

    …me too
    
    ### What changes were proposed in this pull request?
    
    In hive-1.2.1, the Hive serde simply splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` on commas.
    
    When we run the following UT with Spark 2.4:
    ```
      test("insert overwrite directory with comma col name") {
        withTempDir { dir =>
          val path = dir.toURI.getPath
    
          val v1 =
            s"""
               | INSERT OVERWRITE DIRECTORY '${path}'
               | STORED AS TEXTFILE
               | SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
             """.stripMargin
    
          sql(v1).explain(true)
    
          sql(v1).show()
        }
      }
    ```
    it fails as below, since a column name contains `,` and the column-name and column-type lists then differ in size:
    ```
    19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter:  [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
    	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
    	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
    	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
    	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
    	at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
    	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
    	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
    	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:121)
    	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    ```
    
    Since hive-2.3, COLUMN_NAME_DELIMITER is set to a special char when a column name contains ',':
    https://github.com/apache/hive/blob/6f4c35c9e904d226451c465effdc5bfd31d395a0/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L1180-L1188
    https://github.com/apache/hive/blob/6f4c35c9e904d226451c465effdc5bfd31d395a0/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L1044-L1075
    
    And in script transform, we already parse column names to avoid this problem:
    https://github.com/apache/spark/blob/554600c2af0dbc8979955807658fafef5dc66c08/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala#L257-L261
    
    So `InsertIntoHiveDirCommand` should do the same thing. I have verified that this method makes spark-2.4 work well.
    
    ### Why are the changes needed?
    Safer use of the Hive serde.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    
    Closes apache#31038 from AngersZhuuuu/SPARK-33844-3.0.
    
    Authored-by: angerszhu <angers.zhu@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    AngersZhuuuu authored and cloud-fan committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    9ba6db9 View commit details
    Browse the repository at this point in the history

Commits on Jan 6, 2021

  1. [SPARK-33635][SS] Adjust the order of check in KafkaTokenUtil.needTok…

    …enUpdate to remedy perf regression
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to adjust the order of checks in KafkaTokenUtil.needTokenUpdate so that short-circuiting applies to the non-delegation-token cases (insecure, and secured without a delegation token), which largely remedies the performance regression.
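    
    The general shape of the change, heavily simplified (the helpers and the config key below are hypothetical placeholders, not the actual `KafkaTokenUtil` members):
    
    ```scala
    // Checking the cheap "is a delegation token even configured?" condition first
    // lets && short-circuit before the costly token lookup on the common
    // non-delegation-token path, which is hit for every poll.
    def delegationTokenConfigured(params: Map[String, String]): Boolean =
      params.contains("spark.kafka.clusters.<id>.auth.bootstrap.servers")  // placeholder key
    def tokenNeedsRefresh(): Boolean = { Thread.sleep(5); false }          // stands in for the costly check
    
    def needTokenUpdate(params: Map[String, String]): Boolean =
      delegationTokenConfigured(params) && tokenNeedsRefresh()
    ```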
    
    ### Why are the changes needed?
    
    There is a serious performance regression between Spark 2.4 and Spark 3.0 on the read path of the Kafka data source.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually ran a reproducer (https://github.com/codegorillauk/spark-kafka-read, modified to just count instead of writing to a Kafka topic) and measured the time.
    
    > the branch applying the change with adding measurement
    
    https://github.com/HeartSaVioR/spark/commits/debug-SPARK-33635-v3.0.1
    
    > the branch only adding measurement
    
    https://github.com/HeartSaVioR/spark/commits/debug-original-ver-SPARK-33635-v3.0.1
    
    > the result (before the fix)
    
    count: 10280000
    Took 41.634007047 secs
    
    21/01/06 13:16:07 INFO KafkaDataConsumer: debug ver. 17-original
    21/01/06 13:16:07 INFO KafkaDataConsumer: Total time taken to retrieve: 82118 ms
    
    > the result (after the fix)
    
    count: 10280000
    Took 7.964058475 secs
    
    21/01/06 13:08:22 INFO KafkaDataConsumer: debug ver. 17
    21/01/06 13:08:22 INFO KafkaDataConsumer: Total time taken to retrieve: 987 ms
    
    Closes apache#31056 from HeartSaVioR/SPARK-33635.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit fa93090)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HeartSaVioR authored and dongjoon-hyun committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    98cb0cd View commit details
    Browse the repository at this point in the history
  2. [SPARK-33029][CORE][WEBUI][3.0] Fix the UI executor page incorrectly …

    …marking the driver as blacklisted
    
    This is a backport of apache#30954
    
    ### What changes were proposed in this pull request?
    Filter out the driver entity when updating the exclusion status of live executors (including the driver), so the driver won't be marked as blacklisted in the UI even if the node that hosts the driver has been marked as blacklisted.
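
    A minimal sketch of the filtering idea, using illustrative types rather than Spark's internal live entities:
    
    ```scala
    // When a host is excluded, flip the exclusion flag only for real executors on that host;
    // the driver entity is filtered out and stays active in the UI.
    case class LiveEntity(id: String, host: String, var excluded: Boolean = false)
    
    def markHostExcluded(entities: Seq[LiveEntity], excludedHost: String): Unit =
      entities
        .filter(e => e.host == excludedHost && e.id != "driver") // "driver" stands in for the driver entity id
        .foreach(_.excluded = true)
    ```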
    
    ### Why are the changes needed?
    Before this change, if we run Spark in standalone mode with spark.blacklist.enabled=true, the driver will be marked as blacklisted when the host that hosts the driver has been marked as blacklisted. This is incorrect because the exclude list feature excludes executors only, and the driver is still active.
    ![image](https://user-images.githubusercontent.com/26694233/103732959-3494c180-4fae-11eb-9da0-2c906309ea83.png)
    After the fix, the driver won't be marked as blacklisted.
    ![image](https://user-images.githubusercontent.com/26694233/103732974-3fe7ed00-4fae-11eb-90d1-7ee44d4ed7c9.png)
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual test. Reopen the UI and see the driver is no longer marked as blacklisted.
    
    Closes apache#31057 from baohe-zhang/SPARK-33029-3.0.
    
    Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Baohe Zhang authored and dongjoon-hyun committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    403bca4 View commit details
    Browse the repository at this point in the history
  3. [SPARK-34012][SQL][3.0] Keep behavior consistent when conf `spark.sql…

    …legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
    
    ### What changes were proposed in this pull request?
    In apache#22696 we made HAVING without GROUP BY mean a global aggregate.
    But since we previously treated HAVING as a `Filter`, that caused a lot of analysis errors. After apache#28294 we use `UnresolvedHaving` instead of `Filter` to solve that problem, but it broke the original behavior of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
    This PR fixes this issue and adds a UT.
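
    A small Scala example of the behavior this backport preserves; the config key is taken from the PR title, and it is assumed here that the flag can be toggled on a local session:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").appName("having-legacy").getOrCreate()
    spark.conf.set("spark.sql.legacy.parser.havingWithoutGroupByAsWhere", "true")
    // With the legacy flag on, HAVING without GROUP BY is treated as WHERE again:
    spark.sql("SELECT 1 FROM range(10) HAVING true").show() // 10 rows, no global aggregate
    ```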
    
    NOTE: This backport comes from apache#31039
    
    ### Why are the changes needed?
    Keep behavior consistent with the migration guide.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    added UT
    
    Closes apache#31049 from AngersZhuuuu/SPARK-34012-3.0.
    
    Authored-by: angerszhu <angers.zhu@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    AngersZhuuuu authored and maropu committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    aaa3dcc View commit details
    Browse the repository at this point in the history
  4. [SPARK-34011][SQL][3.1][3.0] Refresh cache in `ALTER TABLE .. RENAME …

    …TO PARTITION`
    
    ### What changes were proposed in this pull request?
    1. Invoke `refreshTable()` from `AlterTableRenamePartitionCommand.run()` after partitions renaming. In particular, this re-creates the cache associated with the modified table.
    2. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. RENAME TO PARTITION` command.
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
    spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
    spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
    spark-sql> CACHE TABLE tbl1;
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    ```
    The last query must not return `0	0` since `0  0` was renamed by the previous command.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
    spark-sql> SELECT * FROM tbl1;
    0	2
    1	1
    ```
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Closes apache#31060 from MaxGekk/rename-partition-refresh-cache-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f18d68a)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    MaxGekk authored and dongjoon-hyun committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    c9c3d6f View commit details
    Browse the repository at this point in the history

Commits on Jan 8, 2021

  1. [SPARK-33100][SQL][3.0] Ignore a semicolon inside a bracketed comment…

    … in spark-sql
    
    ### What changes were proposed in this pull request?
    Currently, spark-sql does not support parsing SQL statements that contain bracketed comments.
    For the sql statements:
    ```
    /* SELECT 'test'; */
    SELECT 'test';
    ```
    Would be split into two statements:
    The first one: `/* SELECT 'test'`
    The second one: `*/ SELECT 'test'`
    
    Then it would throw an exception because the first one is illegal.
    In this PR, we ignore the content in bracketed comments while splitting the sql statements.
    Besides, we ignore the comment without any content.
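
    A toy splitter illustrating the rule; the real spark-sql CLI code also has to handle quotes and line comments, so this is only a sketch:
    
    ```scala
    // Split on ';' only when we are not inside a bracketed /* ... */ comment.
    def splitStatements(sql: String): Seq[String] = {
      val out = collection.mutable.Buffer.empty[String]
      val cur = new StringBuilder
      var depth = 0 // nesting level of bracketed comments
      var i = 0
      while (i < sql.length) {
        if (sql.startsWith("/*", i)) { depth += 1; cur.append("/*"); i += 2 }
        else if (sql.startsWith("*/", i) && depth > 0) { depth -= 1; cur.append("*/"); i += 2 }
        else if (sql(i) == ';' && depth == 0) { out += cur.toString.trim; cur.clear(); i += 1 }
        else { cur.append(sql(i)); i += 1 }
      }
      if (cur.toString.trim.nonEmpty) out += cur.toString.trim
      out.toSeq.filter(_.nonEmpty)
    }
    
    // splitStatements("/* SELECT 'test'; */ SELECT 'test';") == Seq("/* SELECT 'test'; */ SELECT 'test'")
    ```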
    
    NOTE: This backport comes from apache#29982
    
    ### Why are the changes needed?
    Spark-sql might split statements that are inside bracketed comments, which is not correct.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added UT.
    
    Closes apache#31033 from turboFei/SPARK-33100.
    
    Authored-by: fwang12 <fwang12@ebay.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    turboFei authored and maropu committed Jan 8, 2021
    Configuration menu
    Copy the full SHA
    e7d5344 View commit details
    Browse the repository at this point in the history

Commits on Jan 11, 2021

  1. [SPARK-34055][SQL][3.0] Refresh cache in ALTER TABLE .. ADD PARTITION

    ### What changes were proposed in this pull request?
    Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.
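
    A minimal sketch of where the refresh call sits (a simplified helper, not the actual command implementation):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    def addPartitionAndRefresh(spark: SparkSession, table: String): Unit = {
      // ... mutate the catalog (add the partition) here ...
      spark.catalog.refreshTable(table) // re-caches the table so cached plans see the new partition
    }
    ```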
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
    spark-sql> insert into tbl partition (part=0) select 0;
    spark-sql> cache table tbl;
    spark-sql> select * from tbl;
    0	0
    spark-sql> show table extended like 'tbl' partition(part=0);
    default	tbl	false	Partition Values: [part=0]
    Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
    ...
    ```
    Create new partition by copying the existing one:
    ```
    $ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
    ```
    ```sql
    spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
    spark-sql> select * from tbl;
    0	0
    ```
    
    The last query must return `0	1` since it has been added by `ALTER TABLE .. ADD PARTITION`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
    spark-sql> select * from tbl;
    0	0
    0	1
    ```
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Closes apache#31116 from MaxGekk/add-partition-refresh-cache-2-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    471a089 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33591][SQL][3.0] Recognize null in partition spec values

    ### What changes were proposed in this pull request?
    1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
    2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__` (see the sketch after this list).
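
    A minimal sketch of the V1 value mapping (names are illustrative):
    
    ```scala
    // A parsed partition value of SQL NULL (None) is stored as Hive's default partition name.
    val HiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"
    
    def toV1PartitionValue(parsed: Option[String]): String =
      parsed.getOrElse(HiveDefaultPartition) // None <=> the user wrote PARTITION (p1 = null)
    ```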
    
    ### Why are the changes needed?
    Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
    ```sql
    spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
    spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
    spark-sql> SELECT isnull(p1) FROM tbl5;
    false
    ```
    Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the example above works as expected:
    ```sql
    spark-sql> SELECT isnull(p1) FROM tbl5;
    true
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *SQLQuerySuite"
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 157b72a)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31095 from MaxGekk/partition-spec-value-null-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    16cab5c View commit details
    Browse the repository at this point in the history
  3. [MINOR][3.1][3.0] Improve flaky NaiveBayes test

    ### What changes were proposed in this pull request?
    
    The current test may sometimes fail under a different BLAS library due to an absTol check, with errors like:
    ```
    Expected 0.7 and 0.6485507246376814 to be within 0.05 using absolute tolerance...
    
    ```
    
    * Change absTol to relTol: an `absTol` of 0.05 is a big difference in some cases (such as comparing 0.1 and 0.05)
    * Remove the `exp` when comparing params. The `exp` will amplify the relative error.
    
    ### Why are the changes needed?
    Flaky test
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    N/A
    
    Closes apache#31004 from WeichenXu123/improve_bayes_tests.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    
    Closes apache#31123 from WeichenXu123/bp-3.1-nb-test.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    (cherry picked from commit d33f0d4)
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    WeichenXu123 authored and zhengruifeng committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    4cbc177 View commit details
    Browse the repository at this point in the history
  4. [SPARK-34060][SQL][3.0] Fix Hive table caching while updating stats b…

    …y `ALTER TABLE .. DROP PARTITION`
    
    ### What changes were proposed in this pull request?
    Fix canonicalization of `HiveTableRelation` by normalizing `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.
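
    A minimal sketch of the normalization idea, using a simplified case class in place of Spark's `CatalogTable`:
    
    ```scala
    // Fields that change on every stats update are cleared before plans are compared,
    // so updating stats (e.g. after DROP PARTITION) does not change the cache key.
    case class TableMeta(name: String, sizeInBytes: Option[Long], createTime: Long)
    
    def normalizedForCanonicalization(t: TableMeta): TableMeta =
      t.copy(sizeInBytes = None, createTime = -1L) // exclude stats and temporary fields
    ```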
    
    ### Why are the changes needed?
    This fixes the issue demonstrated by the example below:
    ```scala
    scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
    scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
    scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
    scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
    scala> sql("CACHE TABLE tbl")
    scala> sql("SELECT * FROM tbl").show(false)
    +---+----+
    |id |part|
    +---+----+
    |0  |0   |
    |1  |1   |
    +---+----+
    
    scala> spark.catalog.isCached("tbl")
    scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
    scala> spark.catalog.isCached("tbl")
    res19: Boolean = false
    ```
    `ALTER TABLE .. DROP PARTITION` must keep the table in the cache.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:
    ```scala
    scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
    scala> spark.catalog.isCached("tbl")
    res19: Boolean = true
    ```
    
    ### How was this patch tested?
    By running new UT:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowCreateTableSuite"
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit d97e991)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31126 from MaxGekk/fix-caching-hive-table-2-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    ecfa015 View commit details
    Browse the repository at this point in the history

Commits on Jan 12, 2021

  1. [SPARK-34059][SQL][CORE][3.0] Use for/foreach rather than map to make…

    … sure execute it eagerly
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#31110. I ran intelliJ inspection again in this branch.
    
    This PR is basically a followup of apache#14332.
    Calling `map` alone might leave it not executed due to lazy evaluation, e.g.)
    
    ```
    scala> val foo = Seq(1,2,3)
    foo: Seq[Int] = List(1, 2, 3)
    
    scala> foo.map(println)
    1
    2
    3
    res0: Seq[Unit] = List((), (), ())
    
    scala> foo.view.map(println)
    res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)
    
    scala> foo.view.foreach(println)
    1
    2
    3
    ```
    
    We should use `foreach` instead to make sure it's executed where the output is unused or `Unit`.
    
    ### Why are the changes needed?
    
    To prevent the potential issues by not executing `map`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, the current code does not look like it causes any problem for now.
    
    ### How was this patch tested?
    
    I found these items by running IntelliJ inspection, double-checked them one by one, and fixed them. Ideally, these should be all instances across the codebase.
    
    Closes apache#31138 from HyukjinKwon/SPARK-34059-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Jan 12, 2021
    Configuration menu
    Copy the full SHA
    27c03b6 View commit details
    Browse the repository at this point in the history
  2. [SPARK-31952][SQL][3.0] Fix incorrect memory spill metric when doing …

    …Aggregate
    
    ### What changes were proposed in this pull request?
    
    This PR takes over apache#28780.
    
    1. Counted the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter`
    
    2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter`
    
    ### Why are the changes needed?
    
    As mentioned in apache#28780:
    
    > It happens when hash aggregate downgrades to sort based aggregate.
    `UnsafeExternalSorter.createWithExistingInMemorySorter` calls spill on an `InMemorySorter` immediately, but the memory pointed by `InMemorySorter` is acquired by outside `BytesToBytesMap`, instead the allocatedPages in `UnsafeExternalSorter`. So the memory spill bytes metric is always 0, but disk bytes spill metric is right.
    
    Besides, this PR also fixes the `UnsafeExternalSorter.merge` by accumulating the `totalSpillBytes` of two sorters. Thus, we can report the correct spilled size in `HashAggregateExec.finishAggregate`.
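
    A minimal sketch of the accounting described above, using an illustrative class rather than the real `UnsafeExternalSorter`:
    
    ```scala
    // Two places where spilled bytes were previously lost:
    //  (1) when a sorter is created from an existing in-memory sorter whose pages were
    //      acquired elsewhere (e.g. by BytesToBytesMap), and
    //  (2) when two sorters are merged and the absorbed sorter's counter was dropped.
    class SpillingSorter(var totalSpillBytes: Long = 0L) {
      def countSpillFromExistingInMemorySorter(spilledBytes: Long): Unit =
        totalSpillBytes += spilledBytes
    
      def merge(other: SpillingSorter): Unit =
        totalSpillBytes += other.totalSpillBytes
    }
    ```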
    
    Issues can be reproduced by the following step by checking the SQL metrics in UI:
    
    ```
    bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
    scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")
    ```
    
    Before:
    
    <img width="200" alt="WeChatfe5146180d91015e03b9a27852e9a443" src="https://user-images.githubusercontent.com/16397174/103625414-e6fc6280-4f75-11eb-8b93-c55095bdb5b8.png">
    
    After:
    
    <img width="200" alt="WeChat42ab0e73c5fbc3b14c12ab85d232071d" src="https://user-images.githubusercontent.com/16397174/103625420-e8c62600-4f75-11eb-8e1f-6f5e8ab561b9.png">
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users can see the correct spill metrics after this PR.
    
    ### How was this patch tested?
    
    Tested manually and added UTs.
    
    Closes apache#31140 from Ngone51/cp-spark-31952.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Ngone51 authored and dongjoon-hyun committed Jan 12, 2021
    Configuration menu
    Copy the full SHA
    a30d20f View commit details
    Browse the repository at this point in the history
  3. [SPARK-32691][3.0] Bump commons-crypto to v1.1.0

    The package commons-crypto 1.0.0 doesn't support the aarch64 platform, so change to v1.1.0.
    
    ### What changes were proposed in this pull request?
    Update the package commons-crypto to v1.1.0 to support aarch64 platform
    - https://issues.apache.org/jira/browse/CRYPTO-139
    
    NOTE: This backport comes from apache#30275
    
    ### Why are the changes needed?
    The commons-crypto 1.0.0 package available in the Maven repository doesn't support the aarch64 platform.
    CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv) takes a long time when NettyBlockRpcServer
    receives block data from a client; if the time exceeds the default value of 120s, an IOException is raised and the client
    retries replicating the block data to other executors. But in fact the replication is complete,
    so the replication count becomes incorrect.
    This makes DistributedSuite tests pass.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass the CIs.
    
    Closes apache#31078 from huangtianhua/origin/branch-3.0.
    
    Authored-by: huangtianhua <huangtianhua223@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    huangtianhua authored and dongjoon-hyun committed Jan 12, 2021
    Configuration menu
    Copy the full SHA
    7cfc45b View commit details
    Browse the repository at this point in the history

Commits on Jan 13, 2021

  1. [SPARK-34084][SQL][3.0] Fix auto updating of table stats in `ALTER TA…

    …BLE .. ADD PARTITION`
    
    ### What changes were proposed in this pull request?
    Fix an issue in `ALTER TABLE .. ADD PARTITION` which happens when:
    - A table doesn't have stats
    - `spark.sql.statistics.size.autoUpdate.enabled` is `true`
    
    In that case, `ALTER TABLE .. ADD PARTITION` does not update table stats automatically.
    
    ### Why are the changes needed?
    The changes fix the issue demonstrated by the example:
    ```sql
    spark-sql> create table tbl (col0 int, part int) partitioned by (part);
    spark-sql> insert into tbl partition (part = 0) select 0;
    spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
    spark-sql> alter table tbl add partition (part = 1);
    ```
    the `add partition` command should update table stats but it does not. There are no stats in the output of:
    ```
    spark-sql> describe table extended tbl;
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, `ALTER TABLE .. ADD PARTITION` updates stats even when a table doesn't have stats before the command:
    ```sql
    spark-sql> alter table tbl add partition (part = 1);
    spark-sql> describe table extended tbl;
    col0	int	NULL
    part	int	NULL
    # Partition Information
    # col_name	data_type	comment
    part	int	NULL
    
    # Detailed Table Information
    ...
    Statistics	2 bytes
    ```
    
    ### How was this patch tested?
    By running new UT and existing test suites:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.StatisticsSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 6c04795)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31158 from MaxGekk/fix-stats-in-add-partition-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 13, 2021
    Configuration menu
    Copy the full SHA
    0c4fdea View commit details
    Browse the repository at this point in the history

Commits on Jan 14, 2021

  1. [SPARK-34103][INFRA] Fix MiMaExcludes by moving SPARK-23429 from 2.4 …

    …to 3.0
    
    ### What changes were proposed in this pull request?
    
    This PR aims to fix `MiMaExcludes` rule by moving SPARK-23429 from 2.4 to 3.0.
    
    ### Why are the changes needed?
    
    SPARK-23429 was added at Apache Spark 3.0.0.
    This should land on `master` and `branch-3.1` and `branch-3.0`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the MiMa rule.
    
    Closes apache#31174 from dongjoon-hyun/SPARK-34103.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 9e93fdb)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Jan 14, 2021
    Configuration menu
    Copy the full SHA
    dbc18d6 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33557][CORE][MESOS][3.0] Ensure the relationship between STORA…

    …GE_BLOCKMANAGER_HEARTBEAT_TIMEOUT and NETWORK_TIMEOUT
    
    ### What changes were proposed in this pull request?
    As described in SPARK-33557, `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend` will always use `Network.NETWORK_TIMEOUT.defaultValueString` as the value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `NETWORK_TIMEOUT` is configured but `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is not, which is different from the relationship described in `configuration.md`.
    
    To fix this problem, the main changes of this PR are as follows (a sketch follows the list):
    
    - Remove the explicit default value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`
    
    - Use the actual value of `NETWORK_TIMEOUT` as `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is not configured, in `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend`
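
    A minimal sketch of the fallback rule; the heartbeat config key name below is an assumption for illustration, while `spark.network.timeout` and its `120s` default are real:
    
    ```scala
    import org.apache.spark.SparkConf
    
    // If the heartbeat timeout is not set explicitly, fall back to the *actual* network
    // timeout rather than the network timeout's default string.
    def blockManagerHeartbeatTimeoutMs(conf: SparkConf): Long =
      conf.getTimeAsMs(
        "spark.storage.blockManagerHeartbeatTimeoutMs", // assumed key name
        conf.get("spark.network.timeout", "120s"))
    ```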
    
    ### Why are the changes needed?
    To ensure the relationship between `NETWORK_TIMEOUT` and  `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` as we described in `configuration.md`
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    
    - Pass the Jenkins or GitHub Action
    
    - Manually tested by configuring `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` locally
    
    Closes apache#31175 from dongjoon-hyun/SPARK-33557.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    LuciferYang authored and HyukjinKwon committed Jan 14, 2021
    Configuration menu
    Copy the full SHA
    fcd10a6 View commit details
    Browse the repository at this point in the history

Commits on Jan 15, 2021

  1. [SPARK-34118][CORE][SQL][3.0] Replaces filter and check for emptiness…

    … with exists or forall
    
    ### What changes were proposed in this pull request?
    This PR uses `exists` or `forall` to simplify `filter` + emptiness checks; the result is semantically equivalent but simpler. The rules are as follows (a quick sanity check follows the list):
    
    - `seq.filter(p).size == 0` -> `!seq.exists(p)`
    - `seq.filter(p).length > 0` -> `seq.exists(p)`
    - `seq.filterNot(p).isEmpty` -> `seq.forall(p)`
    - `seq.filterNot(p).nonEmpty` -> `!seq.forall(p)`
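
    A quick sanity check of the rewrite rules on a sample predicate:
    
    ```scala
    val seq = Seq(1, 2, 3)
    val p = (n: Int) => n > 2
    
    assert((seq.filter(p).size == 0) == !seq.exists(p))
    assert((seq.filter(p).length > 0) == seq.exists(p))
    assert(seq.filterNot(p).isEmpty == seq.forall(p))
    assert(seq.filterNot(p).nonEmpty == !seq.forall(p))
    ```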
    
    ### Why are the changes needed?
    Code simplification.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass the Jenkins or GitHub Action
    
    Closes apache#31190 from LuciferYang/SPARK-34118-30.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    LuciferYang authored and HyukjinKwon committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    dc1816d View commit details
    Browse the repository at this point in the history
  2. [SPARK-33790][CORE][3.0] Reduce the rpc call of getFileStatus in Sing…

    …leFileEventLogFileReader
    
    ### What changes were proposed in this pull request?
    `FsHistoryProvider#checkForLogs` already has a `FileStatus` when constructing `SingleFileEventLogFileReader`, so there is no need to fetch the `FileStatus` again in `SingleFileEventLogFileReader#fileSizeForLastIndex`.
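
    A minimal sketch of the reuse pattern with an illustrative reader class (not the actual `SingleFileEventLogFileReader`):
    
    ```scala
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
    
    // Reuse the FileStatus obtained while listing the log directory; only fall back to
    // an extra getFileStatus RPC when no status was passed in.
    class EventLogReader(fs: FileSystem, path: Path, listedStatus: Option[FileStatus]) {
      def fileSize: Long = listedStatus.getOrElse(fs.getFileStatus(path)).getLen
    }
    ```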
    
    ### Why are the changes needed?
    This can eliminate a lot of RPC calls and improve the speed of the history server.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing unit tests.
    
    Closes apache#31187 from HeartSaVioR/SPARK-33790-branch-3.0.
    
    Lead-authored-by: sychen <sychen@ctrip.com>
    Co-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    cxzl25 and HeartSaVioR committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    d81f482 View commit details
    Browse the repository at this point in the history
  3. [SPARK-32598][SCHEDULER] Fix missing driver logs under UI App-Executo…

    …rs tab in standalone cluster mode
    
    ### What changes were proposed in this pull request?
    Fix [SPARK-32598] (missing driver logs under the UI Application Details > Executors tab in standalone cluster mode).
    
    The direct bug is: the original author forgot to implement `getDriverLogUrls` in `StandaloneSchedulerBackend`
    
    https://github.com/apache/spark/blob/1de272f98d0ff22d0dd151797f22b8faf310963a/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L70-L75
    
    So we set the driver log URLs as environment variables in `DriverRunner` and retrieve them in `StandaloneSchedulerBackend`.
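
    A minimal sketch of the retrieval side; the environment variable names below are hypothetical placeholders, not the ones defined by the fix:
    
    ```scala
    // The driver runner publishes its log locations via the environment; the standalone
    // scheduler backend turns them into the map returned by getDriverLogUrls.
    def driverLogUrlsFromEnv(env: Map[String, String]): Option[Map[String, String]] =
      for {
        out <- env.get("SPARK_DRIVER_LOG_URL_STDOUT") // hypothetical variable name
        err <- env.get("SPARK_DRIVER_LOG_URL_STDERR") // hypothetical variable name
      } yield Map("stdout" -> out, "stderr" -> err)
    ```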
    
    ### Why are the changes needed?
    Fix bug  [SPARK-32598].
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Users will now see driver logs (standalone cluster mode) under the UI Application Details > Executors tab.
    
    Before:
    ![image](https://user-images.githubusercontent.com/17903517/93901055-b5de8600-fd28-11ea-879a-d97e6f70cc6e.png)
    
    After:
    ![image](https://user-images.githubusercontent.com/17903517/93901080-baa33a00-fd28-11ea-8895-3787c5efbf88.png)
    
    ### How was this patch tested?
    Re-checked the real case in [SPARK-32598] and confirmed this user-facing bug is fixed.
    
    Closes apache#29644 from KevinSmile/kw-dev-master.
    
    Authored-by: KevinSmile <kevinwang013@hotmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit c75c29d)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    KevinSmile authored and srowen committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    70fa108 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33711][K8S][3.0] Avoid race condition between POD lifecycle ma…

    …nager and scheduler backend
    
    ### What changes were proposed in this pull request?
    
    Missing POD detection is extended with a timestamp (and time limit) based check to avoid wrongly flagging PODs as missing.
    
    The two new timestamps:
    - `fullSnapshotTs` is introduced for the `ExecutorPodsSnapshot` which only updated by the pod polling snapshot source
    - `registrationTs` is introduced for the `ExecutorData` and it is initialized at the executor registration at the scheduler backend
    
    Moreover a new config `spark.kubernetes.executor.missingPodDetectDelta` is used to specify the accepted delta between the two.
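
    A minimal sketch of the time-based guard; the timestamp and config names come from the description above, but the logic is illustrative rather than the actual lifecycle manager code:
    
    ```scala
    final case class ExecutorRegistration(registrationTs: Long)
    
    // Only trust "missing from snapshot" when the snapshot is a full one taken sufficiently
    // after the executor registered; otherwise registration may simply have raced the poll.
    def canBeDeclaredMissing(fullSnapshotTs: Long,
                             exec: ExecutorRegistration,
                             missingPodDetectDeltaMs: Long): Boolean =
      fullSnapshotTs - exec.registrationTs > missingPodDetectDeltaMs
    ```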
    
    ### Why are the changes needed?
    
    Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about single POD changes. This could wrongly lead the executor POD lifecycle manager to detect missing PODs (PODs known by the scheduler backend but missing from POD snapshots).
    
    A key indicator of this error is seeing this log message:
    
    > "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
    
    So one of the problems is running the missing POD detection check even when only a single POD has changed, without a full, consistent snapshot of all the PODs (see `ExecutorPodsPollingSnapshotSource`).
    The other problem is the race between the executor POD lifecycle manager and the scheduler backend: even with a full snapshot, registration at the scheduler backend could precede the snapshot polling (and the processing of those polled snapshots).
    
    ### Does this PR introduce any user-facing change?
    
    Yes. When a POD is missing, the reason message explaining the executor's exit is extended with both timestamps (the polling time and the executor registration time), and the new config is mentioned as well.
    
    ### How was this patch tested?
    
    The existing unit tests are extended.
    
    (cherry picked from commit 6bd7a62)
    
    Closes apache#31195 from attilapiros/SPARK-33711-branch-3.0.
    
    Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    attilapiros authored and dongjoon-hyun committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    f7591e5 View commit details
    Browse the repository at this point in the history

Commits on Jan 16, 2021

  1. [SPARK-34060][SQL][FOLLOWUP] Preserve serializability of canonicalize…

    …d CatalogTable
    
    ### What changes were proposed in this pull request?
    Replace `toMap` with `map(identity).toMap` while getting the canonicalized representation of `CatalogTable`. `CatalogTable` became non-serializable after apache#31112 due to the usage of `filterKeys`. The workaround was taken from scala/bug#7005.
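
    A small plain-Scala demonstration of the underlying pitfall (not Spark code):
    
    ```scala
    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    
    // filterKeys returns a lazy, non-serializable view (MapLike$$anon$1 in Scala 2.12);
    // map(identity).toMap materializes it into a plain serializable immutable.Map.
    val view = Map("k1" -> "v1", "k2" -> "v2").filterKeys(_ != "k2")
    val safe = view.map(identity).toMap
    
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(safe) // fine; writing `view` itself may throw NotSerializableException
    ```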
    
    ### Why are the changes needed?
    This prevents the errors like:
    ```
    [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
    [info]   Cause: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Should not.
    
    ### How was this patch tested?
    By running the test suite affected by apache#31112:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
    ```
    
    Closes apache#31197 from MaxGekk/fix-caching-hive-table-2-followup.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit c3d81fb)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    MaxGekk authored and dongjoon-hyun committed Jan 16, 2021
    Configuration menu
    Copy the full SHA
    1ab0f02 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Update Parquet website link

    ### What changes were proposed in this pull request?
    This PR aims to update the Parquet website link from http://parquet.io to https://parquet.apache.org
    
    ### Why are the changes needed?
    The old website goes to the incubator site.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    N/A
    
    Closes apache#31208 from williamhyun/minor-parquet.
    
    Authored-by: William Hyun <williamhyun3@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 1cf09b7)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    williamhyun authored and dongjoon-hyun committed Jan 16, 2021
    Configuration menu
    Copy the full SHA
    403a4ac View commit details
    Browse the repository at this point in the history

Commits on Jan 18, 2021

  1. [MINOR][DOCS] Fix typos in sql-ref-datatypes.md

    ### What changes were proposed in this pull request?
    Fixing typos in the docs sql-ref-datatypes.md.
    
    ### Why are the changes needed?
    To display '<element_type>' correctly.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually run jekyll.
    
    before this fix
    ![image](https://user-images.githubusercontent.com/2217224/104865408-3df33600-597f-11eb-857b-c6223ff9159a.png)
    
    after this fix
    ![image](https://user-images.githubusercontent.com/2217224/104865458-62e7a900-597f-11eb-8a21-6d838eecaaf2.png)
    
    Closes apache#31221 from kariya-mitsuru/fix-typo.
    
    Authored-by: Mitsuru Kariya <Mitsuru.Kariya@oss.nttdata.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 536a725)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    kariya-mitsuru authored and HyukjinKwon committed Jan 18, 2021
    Configuration menu
    Copy the full SHA
    d8ce224 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33819][CORE][FOLLOWUP][3.0] Restore the constructor of SingleF…

    …ileEventLogFileReader to remove Mima exclusion
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to remove Mima exclusion via restoring the old constructor of SingleFileEventLogFileReader. This partially adopts the remaining parts of apache#30814 which was excluded while porting back.
    
    ### Why are the changes needed?
    
    To remove unnecessary Mima exclusion.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass CIs.
    
    Closes apache#31225 from HeartSaVioR/SPARK-33819-followup-branch-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Jan 18, 2021
    Configuration menu
    Copy the full SHA
    70c0bc9 View commit details
    Browse the repository at this point in the history

Commits on Jan 19, 2021

  1. [SPARK-34027][SQL][3.0] Refresh cache in `ALTER TABLE .. RECOVER PART…

    …ITIONS`
    
    ### What changes were proposed in this pull request?
    Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`.
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
    spark-sql> insert into tbl partition (part=0) select 0;
    spark-sql> cache table tbl;
    spark-sql> select * from tbl;
    0	0
    spark-sql> show table extended like 'tbl' partition(part=0);
    default	tbl	false	Partition Values: [part=0]
    Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
    ...
    ```
    Create new partition by copying the existing one:
    ```
    $ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
    ```
    ```sql
    spark-sql> alter table tbl recover partitions;
    spark-sql> select * from tbl;
    0	0
    ```
    
    The last query must return `0	1` since it has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> alter table tbl recover partitions;
    spark-sql> select * from tbl;
    0	0
    0	1
    ```
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveSchemaInferenceSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit dee596e)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31236 from MaxGekk/recover-partitions-refresh-cache-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 19, 2021
    Configuration menu
    Copy the full SHA
    f705b65 View commit details
    Browse the repository at this point in the history
  2. [SPARK-34153][SQL][3.1][3.0] Remove unused getRawTable() from `Hive…

    …ExternalCatalog.alterPartitions()`
    
    Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`.
    
    It reduces the number of calls to Hive External catalog.
    
    No
    
    By existing test suites.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit bea10a6)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31241 from MaxGekk/remove-getRawTable-from-alterPartitions-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 246ff31)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Jan 19, 2021
    Configuration menu
    Copy the full SHA
    67b9f6c View commit details
    Browse the repository at this point in the history

Commits on Jan 20, 2021

  1. [MINOR][ML] Increase the timeout for StreamingLinearRegressionSuite t…

    …o 60s
    
    ### What changes were proposed in this pull request?
    
    Increase the timeout for StreamingLinearRegressionSuite to 60s to deflake the test.
    
    ### Why are the changes needed?
    
    Reduce merge conflict.
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#31248 from liangz1/increase-timeout.
    
    Authored-by: Liang Zhang <liang.zhang@databricks.com>
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    (cherry picked from commit f7ff7ff)
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    liangz1 authored and WeichenXu123 committed Jan 20, 2021
    Configuration menu
    Copy the full SHA
    5a93bcb View commit details
    Browse the repository at this point in the history
  2. [SPARK-34115][CORE] Check SPARK_TESTING as lazy val to avoid slowdown

    ### What changes were proposed in this pull request?
    Check SPARK_TESTING as lazy val to avoid slow down when there are many environment variables
    
    ### Why are the changes needed?
    If there are many environment variables, sys.env is very slow. As Utils.isTesting is called very often during DataFrame optimization, this can slow down evaluation considerably.
    
    An example for triggering the problem can be found in the bug ticket https://issues.apache.org/jira/browse/SPARK-34115
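
    A minimal sketch of the shape of the change (an illustrative object; the exact checks inside `Utils.isTesting` are assumed):
    
    ```scala
    object TestingFlag {
      // Read the environment once and cache the result instead of scanning sys.env on every call.
      private lazy val cached: Boolean =
        sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
    
      def isTesting: Boolean = cached
    }
    ```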
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    With the example provided in the ticket.
    
    Closes apache#31244 from nob13/bug/34115.
    
    Lead-authored-by: Norbert Schultz <norbert.schultz@reactivecore.de>
    Co-authored-by: Norbert Schultz <noschultz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit c3d8352)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Jan 20, 2021
    Configuration menu
    Copy the full SHA
    b5b1da9 View commit details
    Browse the repository at this point in the history
  3. [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanc…

    …eRelation.newInstance
    
    ### What changes were proposed in this pull request?
    
    Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.
    
    ### Why are the changes needed?
    
    ```scala
    val df = spark.range(2)
    df.join(df, df("id") <=> df("id")).show()
    ```
    
    For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:
    
    https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125
    
    However, `DetectAmbiguousSelfJoin` cannot apply this detection because the right-side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tag info for the copied node.
    
    Fortunately, the query is still considered a non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left-side plan and the reference is the same as the left-side plan. However, this is not the expected behavior, only a coincidence.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Updated a unit test
    
    Closes apache#31260 from Ngone51/fix-missing-tags.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit f498977)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Ngone51 authored and cloud-fan committed Jan 20, 2021
    Configuration menu
    Copy the full SHA
    89443ab View commit details
    Browse the repository at this point in the history

Commits on Jan 21, 2021

  1. [MINOR][TESTS] Increase tolerance to 0.2 for NaiveBayesSuite

    ### What changes were proposed in this pull request?
    This test fails flakily. I found it failing in 1 out of 80 runs.
    ```
      Expected -0.35667494393873245 and -0.41914521201224453 to be within 0.15 using relative tolerance.
    ```
    Increasing the relative tolerance to 0.2 should reduce flakiness.
    ```
    0.2 * 0.35667494393873245 = 0.071 > 0.062 = |-0.35667494393873245 - (-0.41914521201224453)|
    ```
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#31266 from Loquats/NaiveBayesSuite-reltol.
    
    Authored-by: Andy Zhang <yue.zhang@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit c8c70d5)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit dad201e)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Loquats authored and dongjoon-hyun committed Jan 21, 2021
    Configuration menu
    Copy the full SHA
    4690063 View commit details
    Browse the repository at this point in the history
  2. Revert "[SPARK-34178][SQL] Copy tags for the new node created by Mult…

    …iInstanceRelation.newInstance"
    
    This reverts commit 89443ab.
    HyukjinKwon committed Jan 21, 2021
    Configuration menu
    Copy the full SHA
    4e80f8c View commit details
    Browse the repository at this point in the history
  3. [SPARK-34181][DOC] Update Prerequisites for build doc of ruby 3.0 issue

    ### What changes were proposed in this pull request?
    When the Ruby version is 3.0, the Jekyll server fails with
    ```
    yi.zhu$ SKIP_API=1 jekyll serve --watch
    Configuration file: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_config.yml
                Source: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs
           Destination: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_site
     Incremental build: disabled. Enable with --incremental
          Generating...
                        done in 5.085 seconds.
     Auto-regeneration: enabled for '/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs'
                        ------------------------------------------------
          Jekyll 4.2.0   Please append `--trace` to the `serve` command
                         for any additional information or backtrace.
                        ------------------------------------------------
    <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- webrick (LoadError)
    	from <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve/servlet.rb:3:in `<top (required)>'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `require_relative'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `setup'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:100:in `process'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `each'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:86:in `block (2 levels) in init_with_program'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/exe/jekyll:15:in `<top (required)>'
    	from /usr/local/bin/jekyll:23:in `load'
    	from /usr/local/bin/jekyll:23:in `<main>'
    ```
    
    This issue is solved in jekyll/jekyll#8523
    
    ### Why are the changes needed?
    Fix build issue
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed.
    
    Closes apache#31263 from AngersZhuuuu/SPARK-34181.
    
    Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
    Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit faa4f0c)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    AngersZhuuuu authored and HyukjinKwon committed Jan 21, 2021
    Configuration menu
    Copy the full SHA
    785998b View commit details
    Browse the repository at this point in the history

Commits on Jan 22, 2021

  1. [SPARK-33813][SQL] Fix the issue that JDBC source can't treat MS SQL …

    …Server's spatial types
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the issue that reading tables which contain spatial data types from MS SQL Server fails.
    MS SQL Server supports two non-standard spatial JDBC types, `geometry` and `geography`, but Spark SQL can't handle them:
    
    ```
    java.sql.SQLException: Unrecognized SQL type -157
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
     at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
     at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
     at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)
    ```
    
    Considering what the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, I think those spatial types can be mapped to Catalyst's `BinaryType`.
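
    A hedged sketch of the same mapping expressed as a user-registered `JdbcDialect`; the built-in fix presumably lives in Spark's MS SQL Server dialect, so this is only an illustration of mapping the spatial type names to `BinaryType`:
    
    ```scala
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types.{BinaryType, DataType, MetadataBuilder}
    
    object MsSqlServerSpatialMapping extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
    
      // Map the non-standard spatial types to Catalyst's BinaryType; defer everything else.
      override def getCatalystType(
          sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
        if (typeName == "geometry" || typeName == "geography") Some(BinaryType) else None
    }
    
    // JdbcDialects.registerDialect(MsSqlServerSpatialMapping) // register before reading via JDBC
    ```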
    
    ### Why are the changes needed?
    
    To provide better support.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables.
    
    ### How was this patch tested?
    
    New test case added to `MsSqlServerIntegrationSuite`.
    
    Closes apache#31283 from sarutak/mssql-spatial-types.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    sarutak committed Jan 22, 2021
    Configuration menu
    Copy the full SHA
    c59a423 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    c36fe77 View commit details
    Browse the repository at this point in the history