
merging master #9

Merged: 145 commits merged on Dec 8, 2020

Commits on Nov 24, 2020

  1. [SPARK-33287][SS][UI] Expose state custom metrics information on SS UI

    ### What changes were proposed in this pull request?
    The Structured Streaming UI does not contain state custom metrics information. This PR adds it.
    
    ### Why are the changes needed?
    Missing state custom metrics information.
    
    ### Does this PR introduce _any_ user-facing change?
    Additional UI elements appear.
    
    ### How was this patch tested?
    Existing unit tests + manual test.
    ```
    #Compile Spark
    echo "spark.sql.streaming.ui.enabledCustomMetricList stateOnCurrentVersionSizeBytes" >> conf/spark-defaults.conf
    sbin/start-master.sh
    sbin/start-worker.sh spark://gsomogyi-MBP16:7077
    ./bin/spark-submit --master spark://gsomogyi-MBP16:7077 --deploy-mode client --class com.spark.Main ../spark-test/target/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar
    ```
    <img width="1119" alt="Screenshot 2020-11-18 at 12 45 36" src="https://user-images.githubusercontent.com/18561820/99527506-2f979680-299d-11eb-9187-4ae7fbd2596a.png">
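
    As a side note, a minimal sketch (app name is illustrative) of enabling the same custom metric without editing spark-defaults.conf, assuming the config is picked up when the session is built:

    ```
    import org.apache.spark.sql.SparkSession

    // Enable the stateOnCurrentVersionSizeBytes custom metric on the SS UI state operator charts.
    val spark = SparkSession.builder()
      .appName("ss-ui-custom-metrics")
      .config("spark.sql.streaming.ui.enabledCustomMetricList", "stateOnCurrentVersionSizeBytes")
      .getOrCreate()
    ```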
    
    Closes #30336 from gaborgsomogyi/SPARK-33287.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    gaborgsomogyi authored and HeartSaVioR committed Nov 24, 2020
    Commit: 95b6dab

Commits on Nov 25, 2020

  1. [SPARK-33457][PYTHON] Adjust mypy configuration

    ### What changes were proposed in this pull request?
    
    This pull request:
    
    - Adds following flags to the main mypy configuration:
      - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional)
      - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional)
      - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_calls)
    
    These flags are enabled only for public API and disabled for tests and internal modules.
    
    Additionally, this PR fixes missing annotations.
    
    ### Why are the changes needed?
    
    The primary reason to propose these changes is to use the standard configuration as used by the typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example #29122 (review)
    
    Additionally, it will allow us to detect cases where annotations have been unintentionally omitted.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Annotations only.
    
    ### How was this patch tested?
    
    `dev/lint-python`.
    
    Closes #30382 from zero323/SPARK-33457.
    
    Authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    zero323 authored and HyukjinKwon committed Nov 25, 2020
    Commit: 665817b
  2. [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
    
    ### What changes were proposed in this pull request?
    
    This PR proposes migration of `pyspark.mllib` to NumPy documentation style.
    
    ### Why are the changes needed?
    
    To improve documentation style.
    
    Before:
    
    ![old](https://user-images.githubusercontent.com/1554276/100097941-90234980-2e5d-11eb-8b4d-c25d98d85191.png)
    
    After:
    
    ![new](https://user-images.githubusercontent.com/1554276/100097966-987b8480-2e5d-11eb-9e02-07b18c327624.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this changes both rendered HTML docs and console representation (SPARK-33243).
    
    ### How was this patch tested?
    
    `dev/lint-python` and manual inspection.
    
    Closes #30413 from zero323/SPARK-33252.
    
    Authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    zero323 authored and HyukjinKwon committed Nov 25, 2020
    Commit: 01321bc
  3. [SPARK-33494][SQL][AQE] Do not use local shuffle reader for repartition

    ### What changes were proposed in this pull request?
    
    This PR updates `ShuffleExchangeExec` to carry more information about how much we can change the partitioning. For `repartition(col)`, we should preserve the user-specified partitioning and not apply the AQE local shuffle reader.
    
    ### Why are the changes needed?
    
    Similar to `repartition(number, col)`, we should respect the user-specified partitioning.
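
    A minimal sketch (column and path names are made up) of the kind of query this protects: the user explicitly repartitions by a column, so AQE should keep that hash partitioning instead of replacing the shuffle with a local reader:

    ```
    // Run in spark-shell with AQE on; the explicit repartition(col) below must keep
    // its hash partitioning on "bucket", so the local shuffle reader is not applied.
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    val df = spark.range(0, 1000).selectExpr("id", "id % 10 AS bucket")
    df.repartition($"bucket")
      .write.mode("overwrite")
      .parquet("/tmp/by_bucket")
    ```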
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    a new test
    
    Closes #30432 from cloud-fan/aqe.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Nov 25, 2020
    Commit: d1b4f06
  4. [SPARK-33543][SQL] Migrate SHOW COLUMNS command to use UnresolvedTableOrView to resolve the identifier
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to migrate `SHOW COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
    
    Note that `SHOW COLUMNS` is not yet supported for v2 tables.
    
    ### Why are the changes needed?
    
    To use `UnresolvedTableOrView` for table/view resolution. Note that `ShowColumnsCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR.
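
    A small sketch (names invented) of the temp-view-first resolution this relies on; the unqualified identifier resolves to the temp view, consistently with other commands:

    ```
    // Run in spark-shell: a table and a temp view share the name "t".
    spark.sql("CREATE TABLE t (col_in_table INT) USING parquet")
    spark.sql("CREATE TEMP VIEW t AS SELECT 1 AS col_in_view")

    // The unqualified name resolves to the temp view first, so its column is listed.
    spark.sql("SHOW COLUMNS IN t").show()
    ```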
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Updated existing tests.
    
    Closes #30490 from imback82/show_columns.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Nov 25, 2020
    Commit: b7f034d
  5. [SPARK-33224][SS][WEBUI] Add watermark gap information into SS UI page

    ### What changes were proposed in this pull request?
    
    This PR proposes to add the watermark gap information to the SS UI page. Please refer to the screenshots below to see what we'd like to show in the UI.
    
    ![Screen Shot 2020-11-19 at 6 56 38 PM](https://user-images.githubusercontent.com/1317309/99669306-3532d080-2ab2-11eb-9a93-03d2c6a54948.png)
    
    Please note that this PR doesn't plot the watermark value - knowing the gap between actual wall clock and watermark looks more useful than the absolute value.
    
    ### Why are the changes needed?
    
    Watermark is one of the major metrics end users need to track for stateful queries. Watermark defines "when" the output will be emitted in append mode, hence knowing the gap between the wall clock and the watermark (input data) is very helpful for setting expectations about the output.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, SS UI query page will contain the watermark gap information.
    
    ### How was this patch tested?
    
    Basic UT added. Manually tested with two queries:
    
    > simple case
    
    You'll see a consistent watermark gap of (15 seconds + a): 10 seconds come from the delay in the watermark definition, and 5 seconds from the trigger interval.
    
    ```
    import org.apache.spark.sql.streaming.Trigger
    
    spark.conf.set("spark.sql.shuffle.partitions", "10")
    
    val query = spark
      .readStream
      .format("rate")
      .option("rowsPerSecond", 1000)
      .option("rampUpTime", "10s")
      .load()
      .selectExpr("timestamp", "mod(value, 100) as mod", "value")
      .withWatermark("timestamp", "10 seconds")
      .groupBy(window($"timestamp", "1 minute", "10 seconds"), $"mod")
      .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .start()
    
    query.awaitTermination()
    ```
    
    ![Screen Shot 2020-11-19 at 7 00 21 PM](https://user-images.githubusercontent.com/1317309/99669049-dbcaa180-2ab1-11eb-8789-10b35857dda0.png)
    
    > complicated case
    
    This randomizes the timestamp, hence producing a random watermark gap, which won't be smaller than 15 seconds as described earlier.
    
    ```
    import org.apache.spark.sql.streaming.Trigger
    
    spark.conf.set("spark.sql.shuffle.partitions", "10")
    
    val query = spark
      .readStream
      .format("rate")
      .option("rowsPerSecond", 1000)
      .option("rampUpTime", "10s")
      .load()
      .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
      .selectExpr("tsMod", "mod(value, 100) as mod", "value")
      .withWatermark("tsMod", "10 seconds")
      .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
      .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .start()
    
    query.awaitTermination()
    ```
    
    ![Screen Shot 2020-11-19 at 6 56 47 PM](https://user-images.githubusercontent.com/1317309/99669029-d5d4c080-2ab1-11eb-9c63-d05b3e1ab391.png)
    
    Closes #30427 from HeartSaVioR/SPARK-33224.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    HeartSaVioR committed Nov 25, 2020
    Commit: edab094
  6. [SPARK-33533][SQL] Fix the regression bug that ConnectionProviders don't consider case-sensitivity for properties
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue where `BasicConnectionProvider` doesn't treat properties as case-sensitive.
    For example, the property `oracle.jdbc.mapDateToTimestamp` should be treated as case-sensitive, but it is not.
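
    A sketch of the kind of read where the property's casing matters (URL and table name are placeholders); the mixed-case key must reach the driver unchanged:

    ```
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/XEPDB1")
      .option("dbtable", "MY_TABLE")
      .option("oracle.jdbc.mapDateToTimestamp", "false")  // must keep this exact casing
      .load()
    ```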
    
    ### Why are the changes needed?
    
    This is a bug introduced by #29024.
    Because of this issue, `OracleIntegrationSuite` doesn't pass.
    
    ```
    [info] - SPARK-16625: General data types to be mapped to Oracle *** FAILED *** (32 seconds, 129 milliseconds)
    [info]   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false (OracleIntegrationSuite.scala:238)
    [info]   org.scalatest.exceptions.TestFailedException:
    [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
    [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
    [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
    [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
    [info]   at org.apache.spark.sql.jdbc.OracleIntegrationSuite.$anonfun$new$4(OracleIntegrationSuite.scala:238)
    [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
    [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
    [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
    [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
    [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
    [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
    [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
    [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
    [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
    [info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
    [info]   at scala.collection.immutable.List.foreach(List.scala:392)
    [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
    [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
    [info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
    [info]   at org.scalatest.Suite.run(Suite.scala:1112)
    [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
    [info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
    [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
    [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
    [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
    [info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
    [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
    [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
    [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
    [info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
    [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
    [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
    [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    [info]   at java.lang.Thread.run(Thread.java:748)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    With this change, I confirmed that `OracleIntegrationSuite` passes with the following command.
    ```
    $ git clone https://github.com/oracle/docker-images.git
    $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
    $ ./buildDockerImage.sh -v 18.4.0 -x
    $ ORACLE_DOCKER_IMAGE_NAME=oracle/database:18.4.0-xe build/sbt  -Pdocker-integration-tests -Phive -Phive-thriftserver "testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite"
    ```
    
    Closes #30485 from sarutak/fix-oracle-integration-suite.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Nov 25, 2020
    Commit: c3ce970
  7. [SPARK-33477][SQL] Hive Metastore support filter by date type

    ### What changes were proposed in this pull request?
    
    Hive Metastore supports strings and integral types in filters. It could also support dates. Please see [HIVE-5679](apache/hive@5106bf1) for more details.
    
    This PR adds support for it.
    
    ### Why are the changes needed?
    
    Improve query performance.
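
    A sketch (table and column names assumed) of a filter on a date partition column that can now be pushed to the metastore:

    ```
    spark.sql("CREATE TABLE logs (msg STRING) PARTITIONED BY (dt DATE) STORED AS PARQUET")

    // With this change, the dt = DATE '2020-11-25' predicate can be pushed to the
    // Hive metastore so only the matching partition's metadata is fetched.
    spark.sql("SELECT * FROM logs WHERE dt = DATE '2020-11-25'").show()
    ```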
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #30408 from wangyum/SPARK-33477.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    wangyum authored and HyukjinKwon committed Nov 25, 2020
    Commit: 781e19c
  8. [SPARK-33549][SQL] Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
    
    ### What changes were proposed in this pull request?
    
    Remove SQL configuration spark.sql.legacy.allowCastNumericToTimestamp
    
    ### Why are the changes needed?
    
    In the current master branch, there is a new configuration `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether to cast Numeric types to Timestamp or not. The default value is true.
    
    After #30260, the type conversion between Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for disallowing the conversion. Users just need to set `spark.sql.ansi.enabled` for the behavior.
    
    As the configuration has not been included in any release yet, we should remove it to make things simpler.
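
    A sketch of the single remaining switch; with ANSI mode on, the numeric-to-timestamp cast is rejected without any extra legacy config:

    ```
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // Disallowed under ANSI mode after #30260; no separate legacy flag is needed.
    spark.sql("SELECT CAST(1609459200 AS TIMESTAMP)").show()
    ```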
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, since the configuration is not released yet.
    
    ### How was this patch tested?
    
    Existing test cases
    
    Closes #30493 from gengliangwang/LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gengliangwang authored and cloud-fan committed Nov 25, 2020
    Commit: 19f3b89
  9. [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management
    
    ### What changes were proposed in this pull request?
    1. Add a new method `listPartitionByNames` to the `SupportsPartitionManagement` interface. It allows listing partitions by partition names and their values.
    2. Implement the new method in `InMemoryPartitionTable`, which is used in DSv2 tests.
    
    ### Why are the changes needed?
    Currently, the `SupportsPartitionManagement` interface exposes only `listPartitionIdentifiers`, which allows listing partitions by partition values and requires specifying values for a prefix of the partition schema fields. This restriction does not allow listing partitions by only some of the partition names.
    
    For example, the table `tableA` is partitioned by two columns, `year` and `month`:
    ```
    CREATE TABLE tableA (price int, year int, month int)
    USING _
    partitioned by (year, month)
    ```
    and has the following partitions:
    ```
    PARTITION(year = 2015, month = 1)
    PARTITION(year = 2015, month = 2)
    PARTITION(year = 2016, month = 2)
    PARTITION(year = 2016, month = 3)
    ```
    If we want to list all partitions with `month = 2`, we have to specify `year` for **listPartitionIdentifiers()**, which is not always possible as we don't know all `year` values in advance. The new method **listPartitionByNames()** allows specifying partition values only for `month`, returning two partitions:
    ```
    PARTITION(year = 2015, month = 2)
    PARTITION(year = 2016, month = 2)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the affected test suite `SupportsPartitionManagementSuite`.
    
    Closes #30452 from MaxGekk/column-names-listPartitionIdentifiers.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Nov 25, 2020
    Commit: 2c5cc36
  10. [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode
    
    ### What changes were proposed in this pull request?
    
    When using dynamic partition overwrite, each task has its working dir under a staging dir like `stagingDir/.spark-staging-{jobId}`, and each task commits to `outputPath/.spark-staging-{jobId}/{partitionId}/part-{taskId}-{jobId}{ext}`.
    When speculation is enabled, multiple task attempts are set up for one task; **they have the same task id and they commit to the same file concurrently**. Due to a host going down or node preemption, the partly-committed files aren't cleaned up, and a FileAlreadyExistsException is raised in this situation, resulting in job failure.

    I don't try to change the task commit process for dynamic partition overwrite, e.g. adding the attempt id to each attempt's task working dir and committing to the final output dir via a new outputCommitCoordinator, for the following reasons:

    1. `FileOutputCommitter` already has a commit coordinator for task attempts; we can leverage it rather than build a new one.
    2. Even if we implemented a coordinator solving task-attempt commit conflicts, in a severe case such as an application master failover, tasks with the same attempt id and the same task id would commit to the same files, so the `FileAlreadyExistsException` risk would still exist.
    
    In this PR, I leverage FileOutputCommitter to solve the problem:

    1. when initializing a write job description, set `outputPath/.spark-staging-{jobId}` as the output dir
    2. each task attempt writes output to `outputPath/.spark-staging-{jobId}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/{partitionId}/part-{taskId}-{jobId}{ext}`
    3. leveraging the `FileOutputCommitter` coordinator, the write job first commits output to `outputPath/.spark-staging-{jobId}/{partitionId}`
    4. for dynamic partition overwrite, the write job finally moves `outputPath/.spark-staging-{jobId}/{partitionId}` to `outputPath/{partitionId}`
    
    ### Why are the changes needed?
    
    Without this PR, dynamic partition overwrite can fail due to commit collisions.
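
    A sketch of the write pattern the fix protects (path and columns are made up); with speculation on, multiple attempts of the same task may race to commit the same partition file:

    ```
    // spark.speculation=true would typically be set at submit time.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    val df = spark.range(0, 1000).selectExpr("id AS value", "id % 4 AS part")
    df.write
      .mode("overwrite")
      .partitionBy("part")
      .parquet("/tmp/dyn_overwrite")  // only partitions present in df are overwritten
    ```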
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    added UT.
    
    Closes #29000 from WinkerDu/master-fix-dynamic-partition-multi-commit.
    
    Authored-by: duripeng <duripeng@baidu.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    WinkerDu authored and cloud-fan committed Nov 25, 2020
    Commit: 7c59aee
  11. [SPARK-31257][SPARK-33561][SQL] Unify create table syntax

    ### What changes were proposed in this pull request?
    
    * Unify the create table syntax in the parser by merging Hive and DataSource clauses
    * Add `SerdeInfo` and `external` boolean to statement plans and update AstBuilder to produce them
    * Add conversion from create statement plan to v1 create plans in ResolveSessionCatalog
    * Support new statement clauses in ResolveCatalogs conversion to v2 create plans
    * Remove SparkSqlParser rules for Hive syntax
    * Add "option." namespace to distinguish SERDEPROPERTIES and OPTIONS in table properties
    
    ### Why are the changes needed?
    
    * Current behavior is confusing.
    * A way to pass the Hive create options to DSv2 is needed for a Hive source.
    
    ### Does this PR introduce any user-facing change?
    
    Not by default, but v2 sources will be able to handle STORED AS and other Hive clauses.
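
    For illustration, a sketch contrasting the two CREATE TABLE flavors that the unified parser rules now handle (table names and location are placeholders):

    ```
    // DataSource syntax
    spark.sql("CREATE TABLE src_datasource (id INT, data STRING) USING parquet")

    // Hive syntax, now parsed by the same unified rules
    spark.sql("""
      CREATE EXTERNAL TABLE src_hive (id INT, data STRING)
      STORED AS PARQUET
      LOCATION '/tmp/src_hive'
    """)
    ```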
    
    ### How was this patch tested?
    
    Existing tests validate there are no behavior changes.
    
    Update unit tests for using a statement plan for Hive create syntax:
    * Move create tests from spark-sql DDLParserSuite into PlanResolutionSuite
    * Add parser tests to spark-catalyst DDLParserSuite
    
    Closes #28026 from rdblue/unify-create-table.
    
    Lead-authored-by: Ryan Blue <blue@apache.org>
    Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue and cloud-fan committed Nov 25, 2020
    Commit: 6f68ccf
  12. [SPARK-33496][SQL] Improve error message of ANSI explicit cast

    ### What changes were proposed in this pull request?
    
    After #30260, there are some type conversions disallowed under ANSI mode.
    We should tell users what they can do if they have to use the disallowed casting.
    
    ### Why are the changes needed?
    
    Make it more user-friendly.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the error message is improved on casting failure when ANSI mode is enabled.

    ### How was this patch tested?
    
    Unit tests.
    
    Closes #30440 from gengliangwang/improveAnsiCastErrorMSG.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    gengliangwang committed Nov 25, 2020
    Commit: d691d85
  13. [SPARK-33540][SQL] Subexpression elimination for interpreted predicate

    ### What changes were proposed in this pull request?
    
    This patch proposes to support subexpression elimination for interpreted predicate.
    
    ### Why are the changes needed?
    
    Similar to interpreted projection, there are use cases where the codegen predicate cannot be used, e.g. an overly complex schema, non-codegen expressions, etc. When there are frequently occurring expressions (subexpressions) in the predicate expression, performance is quite bad as we need to re-compute the same expressions. We should support subexpression elimination for the interpreted predicate, just as we do for interpreted projection.
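
    A sketch of a predicate with a repeated subexpression; when evaluation falls back to the interpreted path, elimination lets `upper(name)` be computed once per row instead of twice:

    ```
    import org.apache.spark.sql.functions.{col, length, upper}

    val df = spark.createDataFrame(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")

    // upper(col("name")) appears twice in the filter condition.
    df.filter(length(upper(col("name"))) > 2 && upper(col("name")) =!= "BOB").show()
    ```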
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this doesn't change user behavior.
    
    ### How was this patch tested?
    
    Unit test and benchmark.
    
    Closes #30497 from viirya/SPARK-33540.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    viirya authored and dongjoon-hyun committed Nov 25, 2020
    Commit: 9643eab
  14. [SPARK-31257][SPARK-33561][SQL][FOLLOWUP] Fix Scala 2.13 compilation

    ### What changes were proposed in this pull request?
    
    This PR is a follow-up to fix Scala 2.13 compilation.
    
    ### Why are the changes needed?
    
    To support Scala 2.13 in Apache Spark 3.1.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action Scala 2.13 compilation job.
    
    Closes #30502 from dongjoon-hyun/SPARK-31257.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Nov 25, 2020
    Commit: 7cf6a6f
  15. [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2

    ### What changes were proposed in this pull request?
    
    We support Hive metastore versions 0.12.0 through 3.1.2, but we only support hive-jdbc versions 0.12.0 through 2.3.7. It will throw a `TProtocolException` if we use hive-jdbc 3.x:
    
    ```
    [rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default
    Connecting to jdbc:hive2://localhost:10000/default
    Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 3.1.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.2 by Apache Hive
    0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet;
    Unexpected end of file when reading from HS2 server. The root cause might be too many concurrent connections. Please ask the administrator to check the number of active connections, and adjust hive.server2.thrift.max.worker.threads if applicable.
    Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
    ```
    ```
    org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
    	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
    	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
    	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
    	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    	at java.base/java.lang.Thread.run(Thread.java:832)
    ```
    
    This PR upgrades hive-service-rpc to 3.1.2 to fix this issue.
    
    ### Why are the changes needed?
    
    To support hive-jdbc 3.x.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual test:
    ```
    [rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default
    Connecting to jdbc:hive2://localhost:10000/default
    Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 3.1.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.2 by Apache Hive
    0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet;
    +---------+
    | Result  |
    +---------+
    +---------+
    No rows selected (1.051 seconds)
    0: jdbc:hive2://localhost:10000/default> insert into t1 values(1);
    +---------+
    | Result  |
    +---------+
    +---------+
    No rows selected (2.08 seconds)
    0: jdbc:hive2://localhost:10000/default> select * from t1;
    +-----+
    | id  |
    +-----+
    | 1   |
    +-----+
    1 row selected (0.605 seconds)
    ```
    
    Closes #30478 from wangyum/SPARK-33525.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    wangyum authored and dongjoon-hyun committed Nov 25, 2020
    Commit: 1de3fc4
  16. [SPARK-33565][BUILD][PYTHON] remove python3.8 and fix breakage

    ### What changes were proposed in this pull request?
    Remove Python 3.8 from python/run-tests.py and stop build breaks.
    
    ### Why are the changes needed?
    The Python tests are running against the bare-bones system install of python3, rather than an Anaconda environment.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    via jenkins
    
    Closes #30506 from shaneknapp/remove-py38.
    
    Authored-by: shane knapp <incomplete@gmail.com>
    Signed-off-by: shane knapp <incomplete@gmail.com>
    shaneknapp committed Nov 25, 2020
    Commit: c529426
  17. [SPARK-33523][SQL][TEST][FOLLOWUP] Fix benchmark case name in SubExprEliminationBenchmark
    
    ### What changes were proposed in this pull request?
    
    Fix the wrong benchmark case name.
    
    ### Why are the changes needed?
    
    The last commit to refactor the benchmark code missed a change of case name.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev only.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #30505 from viirya/SPARK-33523-followup.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    viirya authored and dongjoon-hyun committed Nov 25, 2020
    Commit: fb7b870

Commits on Nov 26, 2020

  1. [SPARK-33562][UI] Improve the style of the checkbox in executor page

    ### What changes were proposed in this pull request?
    
    1. Remove the fixed-width style of class `container-fluid-div` so that the UI looks clean when the text is long.
    2. Add one space between a checkbox and the text on the right side, which is consistent with the stage page.
    
    ### Why are the changes needed?
    
    The width of class `container-fluid-div` was set to 200px after #21688, which makes the checkboxes on the executor page look messy.
    ![image](https://user-images.githubusercontent.com/1097932/100242069-3bc5ab80-2ee9-11eb-8c7d-96c221398fee.png)
    
    We should remove the width limit.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No

    ### How was this patch tested?
    
    Manual test.
    After the changes:
    ![image](https://user-images.githubusercontent.com/1097932/100257802-2f4a4e80-2efb-11eb-9eb0-92d6988ad14b.png)
    
    Closes #30500 from gengliangwang/reviseStyle.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gengliangwang authored and HyukjinKwon committed Nov 26, 2020
    Commit: 919ea45
  2. [SPARK-33565][INFRA][FOLLOW-UP] Keep the test coverage with Python 3.8 in GitHub Actions
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to keep the test coverage with Python 3.8 in GitHub Actions. It is not tested for now in Jenkins due to an env issue.
    
    **Before this change in GitHub Actions:**
    
    ```
    ========================================================================
    Running PySpark tests
    ========================================================================
    Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
    Will test against the following Python executables: ['python3.6', 'pypy3']
    ...
    ```
    
    **After this change in GitHub Actions:**
    
    ```
    ========================================================================
    Running PySpark tests
    ========================================================================
    Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
    Will test against the following Python executables: ['python3.6', 'python3.8', 'pypy3']
    ```
    
    ### Why are the changes needed?
    
    To keep the test coverage with Python 3.8 in GitHub Actions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions in this build will test.
    
    Closes #30510 from HyukjinKwon/SPARK-33565.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Nov 26, 2020
    Commit: ed9e6fc
  3. [SPARK-33551][SQL] Do not use custom shuffle reader for repartition

    ### What changes were proposed in this pull request?
    
    This PR fixes an AQE issue where local shuffle reader, partition coalescing, or skew join optimization can be mistakenly applied to a shuffle introduced by repartition or a regular shuffle that logically replaces a repartition shuffle.
    The proposed solution checks for the presence of any repartition shuffle and filters out not applicable optimization rules for the final stage in an AQE plan.
    
    ### Why are the changes needed?
    
    Without the change, the output of a repartition query may not be correct.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added UT.
    
    Closes #30494 from maryannxue/csr-repartition.
    
    Authored-by: Maryann Xue <maryann.xue@gmail.com>
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    maryannxue authored and gatorsmile committed Nov 26, 2020
    Commit: dfa3978

Commits on Nov 27, 2020

  1. [SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR
    
    ### What changes were proposed in this pull request?
    
    This PR adds the following functions (introduced in Scala API with SPARK-33061):
    
    - `acosh`
    - `asinh`
    - `atanh`
    
    to Python and R.
    
    ### Why are the changes needed?
    
    Feature parity.
    
    ### Does this PR introduce _any_ user-facing change?
    
    New functions.
    
    ### How was this patch tested?
    
    New unit tests.
    
    Closes #30501 from zero323/SPARK-33563.
    
    Authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    zero323 authored and HyukjinKwon committed Nov 27, 2020
    Commit: d082ad0
  2. [SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV
    
    ### What changes were proposed in this pull request?
    There are some differences between Spark CSV, opencsv and commons-csv; the typical case is described in SPARK-33566: when a value contains both unescaped quotes and an unescaped qualifier, the parsing results differ.

    The reason for the difference is that Spark uses `STOP_AT_DELIMITER` as the default `UnescapedQuoteHandling` to build `CsvParser`, and it is not configurable.

    On the other hand, opencsv and commons-csv use a parsing mechanism similar to `STOP_AT_CLOSING_QUOTE` by default.

    So this PR makes the `unescapedQuoteHandling` option configurable to get the same parsing result as opencsv and commons-csv.
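
    A sketch (file path assumed) of opting into the opencsv/commons-csv-like behavior through the newly exposed option:

    ```
    val df = spark.read
      .option("header", "true")
      .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
      .csv("/tmp/tricky_quotes.csv")
    ```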
    
    ### Why are the changes needed?
    Make the unescapedQuoteHandling option configurable when reading CSV to make parsing more flexible.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    
    - Pass the Jenkins or GitHub Action
    
    - Add a new case similar to that described in SPARK-33566
    
    Closes #30518 from LuciferYang/SPARK-33566.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    LuciferYang authored and HyukjinKwon committed Nov 27, 2020
    Commit: 433ae90
  3. [SPARK-33575][SQL] Fix misleading exception for "ANALYZE TABLE ... FOR COLUMNS" on temporary views
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix the exception message for `ANALYZE TABLE ... FOR COLUMNS` on temporary views.
    
    The current behavior throws `NoSuchTableException` even if the temporary view exists:
    ```
    sql("CREATE TEMP VIEW t AS SELECT 1 AS id")
    sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS id")
    org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 't' not found in database 'db';
      at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInTempView(AnalyzeColumnCommand.scala:76)
      at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:54)
    ```
    
    After this PR, more reasonable exception is thrown:
    ```
    org.apache.spark.sql.AnalysisException: Temporary view `testView` is not cached for analyzing columns.;
    [info]   at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInTempView(AnalyzeColumnCommand.scala:74)
    [info]   at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:54)
    ```
    
    ### Why are the changes needed?
    
    To fix a misleading exception.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the exception thrown is changed as shown above.
    
    ### How was this patch tested?
    
    Updated existing test.
    
    Closes #30519 from imback82/analyze_table_message.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Nov 27, 2020
    Commit: 8792280
  4. [SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to improve the exception messages while `UnresolvedTableOrView` is handled based on this suggestion: #30321 (comment).
    
    Currently, when an identifier resolves to a temp view while a table/permanent view is expected, the following exception message is displayed (e.g., for `SHOW CREATE TABLE`):
    ```
    t is a temp view not table or permanent view.
    ```
    After this PR, the message will be:
    ```
    t is a temp view. 'SHOW CREATE TABLE' expects a table or permanent view.
    ```
    
    Also, if an identifier is not resolved, the following exception message is currently used:
    ```
    Table or view not found: t
    ```
    After this PR, the message will be:
    ```
    Table or permanent view not found for 'SHOW CREATE TABLE': t
    ```
    or
    ```
    Table or view not found for 'ANALYZE TABLE ... FOR COLUMNS ...': t
    ```
    
    ### Why are the changes needed?
    
    To improve the exception message.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the exception message will be changed as described above.
    
    ### How was this patch tested?
    
    Updated existing tests.
    
    Closes #30475 from imback82/unresolved_table_or_view.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Nov 27, 2020
    Commit: 2c41d9d
  5. [SPARK-28645][SQL] ParseException is thrown when the window is redefined

    ### What changes were proposed in this pull request?
    Currently in Spark one could redefine a window. For instance:
    
    `select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w AS (ORDER BY unique1);`
    The window `w` is defined twice. In PgSQL, on the other hand, an error is thrown:
    
    `ERROR:  window "w" is already defined`
    
    ### Why are the changes needed?
    The current implementation gives later window definitions a higher priority, but this wasn't Spark's intention and users cannot learn it from any Spark documentation.
    This PR fixes the bug.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    There is an example query output with/without this fix.
    ```
    SELECT
        employee_name,
        salary,
        first_value(employee_name) OVER w highest_salary,
        nth_value(employee_name, 2) OVER w second_highest_salary
    FROM
        basic_pays
    WINDOW
        w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING),
        w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 2 FOLLOWING)
    ORDER BY salary DESC
    ```
    The output before this fix:
    ```
    Larry Bott	11798	Larry Bott	Gerard Bondur
    Gerard Bondur	11472	Larry Bott	Gerard Bondur
    Pamela Castillo	11303	Larry Bott	Gerard Bondur
    Barry Jones	10586	Larry Bott	Gerard Bondur
    George Vanauf	10563	Larry Bott	Gerard Bondur
    Loui Bondur	10449	Larry Bott	Gerard Bondur
    Mary Patterson	9998	Larry Bott	Gerard Bondur
    Steve Patterson	9441	Larry Bott	Gerard Bondur
    Julie Firrelli	9181	Larry Bott	Gerard Bondur
    Jeff Firrelli	8992	Larry Bott	Gerard Bondur
    William Patterson	8870	Larry Bott	Gerard Bondur
    Diane Murphy	8435	Larry Bott	Gerard Bondur
    Leslie Jennings	8113	Larry Bott	Gerard Bondur
    Gerard Hernandez	6949	Larry Bott	Gerard Bondur
    Foon Yue Tseng	6660	Larry Bott	Gerard Bondur
    Anthony Bow	6627	Larry Bott	Gerard Bondur
    Leslie Thompson	5186	Larry Bott	Gerard Bondur
    ```
    The output after this fix:
    ```
    struct<>
    -- !query output
    org.apache.spark.sql.catalyst.parser.ParseException
    
    The definition of window 'w' is repetitive(line 8, pos 0)
    ```
    
    ### How was this patch tested?
    Jenkins test.
    
    Closes #30512 from beliefer/SPARK-28645.
    
    Lead-authored-by: gengjiaan <gengjiaan@360.cn>
    Co-authored-by: beliefer <beliefer@163.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    2 people authored and cloud-fan committed Nov 27, 2020
    Commit: e432550
  6. [SPARK-33498][SQL] Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
    
    ### What changes were proposed in this pull request?
    
    Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid, when ANSI mode is enabled. This patch updates GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast.
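
    A sketch of the new failure mode: with ANSI mode on, an unparsable input now raises an error instead of silently producing null:

    ```
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // Expected to fail under ANSI mode instead of returning NULL.
    spark.sql("SELECT to_timestamp('not-a-timestamp', 'yyyy-MM-dd HH:mm:ss')").show()
    ```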
    
    ### Why are the changes needed?
    
    For ANSI mode.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added UT and Existing UT.
    
    Closes #30442 from leanken/leanken-SPARK-33498.
    
    Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    leanken-zz authored and cloud-fan committed Nov 27, 2020
    Commit: b9f2f78
  7. [SPARK-33141][SQL] Capture SQL configs when creating permanent views

    ### What changes were proposed in this pull request?
    This PR makes CreateViewCommand/AlterViewAsCommand capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of view resolution. Users can set `spark.sql.legacy.useCurrentConfigsForView` to `true` to restore the previous behavior.
    
    ### Why are the changes needed?
    This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138) that proposes to unify temp view and permanent view behaviors. This PR makes permanent views mimic the temp view behavior, which "fixes" view semantics by directly storing the resolved LogicalPlan. For example, if a user uses Spark 2.4 to create a view that contains null values from division-by-zero expressions, she may not want other users' queries that reference her view to throw exceptions when running on Spark 3.x with ANSI mode on.
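
    A sketch of the motivating scenario above (view name is made up): the configs captured at creation keep the view's semantics stable for later readers:

    ```
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("CREATE OR REPLACE VIEW div_view AS SELECT 1 / 0 AS x")

    spark.conf.set("spark.sql.ansi.enabled", "true")
    // Resolved with the captured configs, so this still yields NULL rather than failing;
    // spark.sql.legacy.useCurrentConfigsForView=true restores the previous behavior.
    spark.sql("SELECT x FROM div_view").show()
    ```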
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    added UT + existing UTs (improved)
    
    Closes #30289 from luluorta/SPARK-33141.
    
    Authored-by: luluorta <luluorta@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    luluorta authored and cloud-fan committed Nov 27, 2020
    Commit: 35ded12
  8. Spelling r common dev mlib external project streaming resource managers python
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix typos in the sub-modules:
    * `R`
    * `common`
    * `dev`
    * `mlib`
    * `external`
    * `project`
    * `streaming`
    * `resource-managers`
    * `python`
    
    Split per srowen #30323 (comment)
    
    NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356
    
    ### Why are the changes needed?
    
    Misspelled words make it harder to read / understand content.
    
    ### Does this PR introduce _any_ user-facing change?
    
    There are various fixes to documentation, etc...
    
    ### How was this patch tested?
    
    No testing was performed
    
    Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.
    
    Authored-by: Josh Soref <jsoref@users.noreply.github.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    jsoref authored and srowen committed Nov 27, 2020
    Commit: 13fd272

Commits on Nov 28, 2020

  1. [SPARK-33570][SQL][TESTS] Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
    
    ### What changes were proposed in this pull request?
    
    This PR changes mariadb_docker_entrypoint.sh to set the proper version automatically for mariadb-plugin-gssapi-server.
    The proper version is based on that of mariadb-server.
    Also, this PR enables using an arbitrary docker image by setting the environment variable `MARIADB_CONTAINER_IMAGE_NAME`.
    
    ### Why are the changes needed?
    
    For `MariaDBKrbIntegrationSuite`, the version of `mariadb-plugin-gssapi-server` is currently set to `10.5.5` in `mariadb_docker_entrypoint.sh` but it's no longer available in the official apt repository and `MariaDBKrbIntegrationSuite` doesn't pass for now.
    It seems that only the most recent three versions are available for each major version and they are `10.5.6`, `10.5.7` and `10.5.8` for now.
    Further, the release cycle of MariaDB seems to be very rapid (1 to 2 months), so I don't think it's a good idea to pin a specific version of `mariadb-plugin-gssapi-server`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Confirmed that `MariaDBKrbIntegrationSuite` passes with the following commands.
    ```
    $  build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
    ```
    In this case, we can see what version of `mariadb-plugin-gssapi-server` is going to be installed in the following container log message.
    ```
    Installing mariadb-plugin-gssapi-server=1:10.5.8+maria~focal
    ```
    
    Or, we can set MARIADB_CONTAINER_IMAGE_NAME for a specific version of MariaDB.
    ```
    $ MARIADB_DOCKER_IMAGE_NAME=mariadb:10.5.6 build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
    ```
    ```
    Installing mariadb-plugin-gssapi-server=1:10.5.6+maria~focal
    ```
    
    Closes #30515 from sarutak/fix-MariaDBKrbIntegrationSuite.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    sarutak authored and maropu committed Nov 28, 2020
    Commit: cf98a76
  2. [SPARK-33580][CORE] resolveDependencyPaths should use classifier attribute of artifact
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to use the classifier attribute instead of the type to construct the artifact path.
    
    ### Why are the changes needed?
    
    `resolveDependencyPaths` currently uses the artifact type to decide whether to add the "-tests" postfix. However, the ivy path pattern in `resolveMavenCoordinates` is `[organization]_[artifact][revision](-[classifier]).[ext]`. We should use the classifier instead of the type to construct the file path.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test. Manual test.
    
    Closes #30524 from viirya/SPARK-33580.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    viirya authored and dongjoon-hyun committed Nov 28, 2020
    Commit: 3650a6b

Commits on Nov 29, 2020

  1. [MINOR][SQL] Remove getTables() from r.SQLUtils

    ### What changes were proposed in this pull request?
    Remove the unused method `getTables()` from `r.SQLUtils`. The method was used before the changes in #17483, but R's `tables.default` was rewritten using `listTables()`: https://github.com/apache/spark/pull/17483/files#diff-2c01472a7bcb1d318244afcd621d726e00d36cd15dffe7e44fa96c54fce4cd9aR220-R223
    
    ### Why are the changes needed?
    To improve code maintenance, and remove the dead code.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By R tests.
    
    Closes #30527 from MaxGekk/remove-getTables-in-r-SQLUtils.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 29, 2020
    Commit: bfe9380
  2. [SPARK-33581][SQL][TEST] Refactor HivePartitionFilteringSuite

    ### What changes were proposed in this pull request?
    
    This PR refactors HivePartitionFilteringSuite.
    
    ### Why are the changes needed?
    
    To make it easy to maintain.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    N/A
    
    Closes #30525 from wangyum/SPARK-33581.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Yuming Wang <yumwang@ebay.com>
    wangyum committed Nov 29, 2020
    Commit: ba178f8
  3. [SPARK-33590][DOCS][SQL] Add missing sub-bullets in Spark SQL Guide

    ### What changes were proposed in this pull request?
    
    Add the missing sub-bullets on the left side of the `Spark SQL Guide`.
    
    ### Why are the changes needed?
    
    The three sub-bullets on the left side are not consistent with the contents (five bullets) on the right side.
    
    ![image](https://user-images.githubusercontent.com/1315079/100546388-7a21e880-32a4-11eb-922d-62a52f4f9f9b.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, you can see more lines in the left menu.
    
    ### How was this patch tested?
    
    Manually built the doc as follows. The result can be verified in the attached screenshots:
    
    ```
    cd docs
    SKIP_API=1 jekyll build
    firefox _site/sql-pyspark-pandas-with-arrow.html
    ```
    
    ![image](https://user-images.githubusercontent.com/1315079/100546399-8ad25e80-32a4-11eb-80ac-44af0aebc717.png)
    
    Closes #30537 from kiszk/SPARK-33590.
    
    Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    kiszk authored and dongjoon-hyun committed Nov 29, 2020
    b94ff1e
  4. [SPARK-33587][CORE] Kill the executor on nested fatal errors

    ### What changes were proposed in this pull request?
    
    Currently we will kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, such as
    - java.util.concurrent.ExecutionException, com.google.common.util.concurrent.UncheckedExecutionException, com.google.common.util.concurrent.ExecutionError when using Guava cache or Java thread pool.
    - SparkException thrown from https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 or https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L296
    
    We will still keep the executor running. Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be left in a broken state after hitting a fatal error, and it's hard to predict the behavior of a broken component. Hence, it's better to detect the nested fatal error as well, kill the executor, and rely on Spark's fault tolerance to recover.
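    The check can be pictured with a small sketch (illustrative only, not the exact code added to `Executor`): walk the cause chain up to a configured depth and report whether any wrapped throwable is fatal.

    ```scala
    import scala.annotation.tailrec
    import scala.util.control.NonFatal

    // Illustrative sketch: scan up to `maxDepth` links of the cause chain for a
    // fatal error such as OutOfMemoryError. NonFatal(e) is false for fatal errors.
    def containsFatalError(t: Throwable, maxDepth: Int): Boolean = {
      @tailrec
      def loop(current: Throwable, depth: Int): Boolean = {
        if (current == null || depth > maxDepth) false
        else if (!NonFatal(current)) true
        else loop(current.getCause, depth + 1)
      }
      loop(t, 1)
    }

    // An OutOfMemoryError wrapped in an ExecutionException is now detected:
    // containsFatalError(new java.util.concurrent.ExecutionException(new OutOfMemoryError()), 2)
    ```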
    
    ### Why are the changes needed?
    
    Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be left in a broken state after hitting a fatal error, and it's hard to predict the behavior of a broken component. Hence, it's better to detect the nested fatal error as well, kill the executor, and rely on Spark's fault tolerance to recover.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yep. There is a slight internal behavior change on when to kill an executor. We will kill the executor when detecting a nested fatal error in the exception chain. `spark.executor.killOnFatalError.depth` is added to allow users to turn off this change if the slight behavior change impacts them.
    
    ### How was this patch tested?
    
    The new method `Executor.isFatalError` is tested by `spark.executor.killOnNestedFatalError`.
    
    Closes #30528 from zsxwing/SPARK-33587.
    
    Authored-by: Shixiong Zhu <zsxwing@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    zsxwing authored and dongjoon-hyun committed Nov 29, 2020
    c8286ec
  5. [SPARK-33588][SQL] Respect the spark.sql.caseSensitive config while resolving partition spec in v1 `SHOW TABLE EXTENDED`
    
    ### What changes were proposed in this pull request?
    Perform partition spec normalization in `ShowTablesCommand` according to the table schema before getting partitions from the catalog. The normalization via `PartitioningUtils.normalizePartitionSpec()` adjusts the column names in partition specification, w.r.t. the real partition column names and case sensitivity.
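    For illustration, a simplified version of the case-insensitive matching this normalization performs (not the actual `PartitioningUtils` code):

    ```scala
    // Map user-supplied partition spec keys onto the table's real partition column
    // names, honoring the spark.sql.caseSensitive setting.
    def normalizeSpec(
        spec: Map[String, String],
        partCols: Seq[String],
        caseSensitive: Boolean): Map[String, String] = {
      spec.map { case (key, value) =>
        val resolved = partCols
          .find(col => if (caseSensitive) col == key else col.equalsIgnoreCase(key))
          .getOrElse(throw new IllegalArgumentException(
            s"$key is not a partition column of (${partCols.mkString(", ")})"))
        resolved -> value
      }
    }

    // normalizeSpec(Map("YEAR" -> "2015", "Month" -> "1"), Seq("year", "month"), caseSensitive = false)
    // => Map(year -> 2015, month -> 1)
    ```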
    
    ### Why are the changes needed?
    Even when `spark.sql.caseSensitive` is `false` which is the default value, v1 `SHOW TABLE EXTENDED` is case sensitive:
    ```sql
    spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
             > USING parquet
             > partitioned by (year, month);
    spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
    spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
    Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`';
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the `SHOW TABLE EXTENDED` command respects the SQL config. And for example above, it returns correct result:
    ```sql
    spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
    default	tbl1	false	Partition Values: [year=2015, month=1]
    Location: file:/Users/maximgekk/spark-warehouse/tbl1/year=2015/month=1
    Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
    InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
    Storage Properties: [serialization.format=1, path=file:/Users/maximgekk/spark-warehouse/tbl1]
    Partition Parameters: {transient_lastDdlTime=1606595118, totalSize=623, numFiles=1}
    Created Time: Sat Nov 28 23:25:18 MSK 2020
    Last Access: UNKNOWN
    Partition Statistics: 623 bytes
    ```
    
    ### How was this patch tested?
    By running the modified test suite `v1/ShowTablesSuite`
    
    Closes #30529 from MaxGekk/show-table-case-sensitive-spec.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 29, 2020
    0054fc9
  6. [SPARK-33585][SQL][DOCS] Fix the comment for SQLContext.tables() and mention the `database` column
    
    ### What changes were proposed in this pull request?
    Change the comments for `SQLContext.tables()` to "The returned DataFrame has three columns, database, tableName and isTemporary".
    
    ### Why are the changes needed?
    Currently, the comment mentions only 2 columns but `tables()` returns 3 columns actually:
    ```scala
    scala> spark.range(10).createOrReplaceTempView("view1")
    scala> val tables = spark.sqlContext.tables()
    tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
    
    scala> tables.printSchema
    root
     |-- database: string (nullable = false)
     |-- tableName: string (nullable = false)
     |-- isTemporary: boolean (nullable = false)
    
    scala> tables.show
    +--------+---------+-----------+
    |database|tableName|isTemporary|
    +--------+---------+-----------+
    | default|       t1|      false|
    | default|       t2|      false|
    | default|      ymd|      false|
    |        |    view1|       true|
    +--------+---------+-----------+
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./dev/scalastyle`
    
    Closes #30526 from MaxGekk/sqlcontext-tables-doc.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 29, 2020
    a088a80

Commits on Nov 30, 2020

  1. [SPARK-33517][SQL][DOCS] Fix the correct menu items and page links in PySpark Usage Guide for Pandas with Apache Arrow
    
    ### What changes were proposed in this pull request?
    
    Change "Apache Arrow in Spark" to "Apache Arrow in PySpark"
    and the link to “/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-pyspark”
    
    ### Why are the changes needed?
    When I click on the menu item it doesn't point to the correct page, and from the parent menu I can infer that the correct menu item name and link should be "Apache Arrow in PySpark".
    like this:
    ![image](https://user-images.githubusercontent.com/28332082/99954725-2b64e200-2dbe-11eb-9576-cf6a3d758980.png)
    
    ### Does this PR introduce any user-facing change?
    Yes, clicking on the menu item will take you to the correct guide page
    
    ### How was this patch tested?
    Manually build the doc. This can be verified as below:
    
    cd docs
    SKIP_API=1 jekyll build
    open _site/sql-pyspark-pandas-with-arrow.html
    
    Closes #30466 from liucht-inspur/master.
    
    Authored-by: liucht <liucht@inspur.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    liucht-inspur authored and HyukjinKwon committed Nov 30, 2020
    3d54774
  2. [SPARK-33589][SQL] Close opened session if the initialization fails

    ### What changes were proposed in this pull request?
    
    This PR adds a try-catch when opening a session.
    
    ### Why are the changes needed?
    
    Close opened session if the initialization fails.
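    The pattern is roughly the following (a minimal sketch with assumed names, not the actual thrift-server code):

    ```scala
    // If initialization of a freshly opened session fails (e.g. the default database
    // does not exist), close the session before rethrowing instead of leaking it.
    trait Session { def init(): Unit; def close(): Unit }

    def openSession(newSession: () => Session): Session = {
      val session = newSession()
      try {
        session.init()
        session
      } catch {
        case e: Throwable =>
          try session.close() catch { case _: Throwable => () } // best-effort cleanup
          throw e
      }
    }
    ```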
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual test.
    
    Before this pr:
    
    ```
    [root@spark-3267648 spark]#  bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist
    NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
    Connecting to jdbc:hive2://localhost:10000/db_not_exist
    log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Database 'db_not_exist' not found; (state=08S01,code=0)
    Beeline version 2.3.7 by Apache Hive
    beeline>
    ```
    ![image](https://user-images.githubusercontent.com/5399861/100560975-73ba5d80-32f2-11eb-8f92-b2509e7a121f.png)
    
    After this pr:
    ```
    [root@spark-3267648 spark]#  bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist
    NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Connecting to jdbc:hive2://localhost:10000/db_not_exist
    Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Failed to open new session: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db_not_exist' not found; (state=08S01,code=0)
    Beeline version 2.3.7 by Apache Hive
    beeline>
    ```
    ![image](https://user-images.githubusercontent.com/5399861/100560917-479edc80-32f2-11eb-986f-7a997f1163fc.png)
    
    Closes #30536 from wangyum/SPARK-33589.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    wangyum authored and HyukjinKwon committed Nov 30, 2020
    f93d439
  3. [SPARK-33582][SQL] Hive Metastore support filter by not-equals

    ### What changes were proposed in this pull request?
    
    This PR makes partition predicate pushdown into the Hive metastore support the not-equals operator.
    
    Hive related changes:
    https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
    https://issues.apache.org/jira/browse/HIVE-2702
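    As a toy illustration of what can now be pushed down (assumed helper, not Spark's actual filter converter):

    ```scala
    // Convert a simple partition predicate into the filter string handed to the
    // Hive metastore; the "<>" case is what this change newly supports.
    sealed trait PartitionFilter
    case class EqualTo(col: String, value: String) extends PartitionFilter
    case class NotEqualTo(col: String, value: String) extends PartitionFilter

    def toMetastoreFilter(f: PartitionFilter): String = f match {
      case EqualTo(col, v)    => s"$col = '$v'"
      case NotEqualTo(col, v) => s"$col <> '$v'"
    }

    // toMetastoreFilter(NotEqualTo("ds", "2020-11-30"))  ==>  ds <> '2020-11-30'
    ```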
    
    ### Why are the changes needed?
    
    Improve query performance.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #30534 from wangyum/SPARK-33582.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    wangyum authored and HyukjinKwon committed Nov 30, 2020
    a5e13ac
  4. [SPARK-33567][SQL] DSv2: Use callback instead of passing Spark session and v2 relation for refreshing cache
    
    ### What changes were proposed in this pull request?
    
    This removes the Spark session and `DataSourceV2Relation` from V2 write plans, replacing them with an `afterWrite` callback.
    
    ### Why are the changes needed?
    
    Per discussion in #30429, it's better to not pass Spark session and `DataSourceV2Relation` through Spark plans. Instead we can use a callback which makes the interface cleaner.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes #30491 from sunchao/SPARK-33492-followup.
    
    Authored-by: Chao Sun <sunchao@apple.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    sunchao authored and cloud-fan committed Nov 30, 2020
    feda729
  5. [MINOR] Spelling bin core docs external mllib repl

    ### What changes were proposed in this pull request?
    
    This PR intends to fix typos in the sub-modules:
    * `bin`
    * `core`
    * `docs`
    * `external`
    * `mllib`
    * `repl`
    * `pom.xml`
    
    Split per srowen #30323 (comment)
    
    NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356
    
    ### Why are the changes needed?
    
    Misspelled words make it harder to read / understand content.
    
    ### Does this PR introduce _any_ user-facing change?
    
    There are various fixes to documentation, etc...
    
    ### How was this patch tested?
    
    No testing was performed
    
    Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.
    
    Authored-by: Josh Soref <jsoref@users.noreply.github.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    jsoref authored and maropu committed Nov 30, 2020
    4851453
  6. [SPARK-32976][SQL] Support column list in INSERT statement

    ### What changes were proposed in this pull request?
    
    #### JIRA expectations
    ```
       INSERT currently does not support named column lists.
    
       INSERT INTO <table> (col1, col2,…) VALUES( 'val1', 'val2', … )
       Note, we assume the column list contains all the column names. Issue an exception if the list is not complete. The column order could be different from the column order defined in the table definition.
    ```
    #### Implementation
    In this PR, we add a column list as an optional part of the `INSERT OVERWRITE/INTO` statements:
    ```
      /**
       * {{{
       *   INSERT OVERWRITE TABLE tableIdentifier [partitionSpec [IF NOT EXISTS]]? [identifierList] ...
       *   INSERT INTO [TABLE] tableIdentifier [partitionSpec]  [identifierList] ...
       * }}}
       */
    ```
    The column list represents all expected columns, with an explicit order, that you want to insert into the target table. **Particularly**, the current implementation assumes the column list contains all the column names; it will fail when the list is incomplete.
    
    In **Analyzer**, we add a code path to resolve the column list in the `ResolveOutputRelation` rule before it is transformed to a v1 or v2 command. It fails here if the list has any field that does not belong to the target table.
    
    Then, for v2 commands, e.g. `AppendData`, we use the resolved column list and the output of the target table to resolve the output of the source query in the `ResolveOutputRelation` rule. If the list has duplicated columns, we fail. If the list is not empty but its size does not match the target table, we fail. If no other exceptions occur, we use the column list to map the output of the source query to the output of the target table. The column list is then set to Nil and will not hit the rule again once resolved.
    
    For v1 commands, all of this happens in the `PreprocessTableInsertion` rule.
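    A small usage example of the new syntax (spark-shell style; the table name is illustrative):

    ```scala
    spark.sql("CREATE TABLE t (c1 INT, c2 STRING) USING parquet")

    // The column list names all target columns, possibly in a different order than
    // the table definition; values are mapped to the named columns.
    spark.sql("INSERT INTO t (c2, c1) VALUES ('a', 1)")
    spark.sql("SELECT * FROM t").show()   // expect a single row with c1 = 1, c2 = 'a'

    // Omitting a column from the list is rejected in the current implementation:
    // spark.sql("INSERT INTO t (c1) VALUES (1)")   // fails analysis
    ```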
    
    ### Why are the changes needed?
     new feature support
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `INSERT INTO/OVERWRITE TABLE` now supports specifying a column list.
    ### How was this patch tested?
    
    new tests
    
    Closes #29893 from yaooqinn/SPARK-32976.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    yaooqinn authored and cloud-fan committed Nov 30, 2020
    2da7259
  7. [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables

    ### What changes were proposed in this pull request?
    
    This PR proposes to support `CACHE/UNCACHE TABLE` commands for v2 tables.
    
    In addition, this PR proposes to migrate `CACHE/UNCACHE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or the [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
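    For example, assuming a v2 catalog has been registered under the name `testcat`:

    ```scala
    // CACHE/UNCACHE TABLE now resolve v2 identifiers as well (temp views still win).
    spark.sql("CACHE TABLE testcat.ns.tbl")
    spark.sql("SELECT COUNT(*) FROM testcat.ns.tbl").show()   // served from the cached relation
    spark.sql("UNCACHE TABLE testcat.ns.tbl")
    ```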
    
    ### Why are the changes needed?
    
    To support `CACHE/UNCACHE TABLE` commands for v2 tables.
    
    Note that `CACHE/UNCACHE TABLE` for v1 tables/views go through `SparkSession.table` to resolve identifier, which resolves temp views first, so there is no change in the behavior by moving to the new framework.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Now the user can run `CACHE/UNCACHE TABLE` commands on v2 tables.
    
    ### How was this patch tested?
    
    Added/updated existing tests.
    
    Closes #30403 from imback82/cache_table.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Nov 30, 2020
    0fd9f57
  8. [SPARK-33498][SQL][FOLLOW-UP] Deduplicate the unittest by using checkCastWithParseError
    
    ### What changes were proposed in this pull request?
    
    Dup code removed in SPARK-33498 as follow-up.
    
    ### Why are the changes needed?
    
    Nit.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing UT.
    
    Closes #30540 from leanken/leanken-SPARK-33498.
    
    Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    leanken-zz authored and HyukjinKwon committed Nov 30, 2020
    225c2e2
  9. [SPARK-28646][SQL] Fix bug of Count so as consistent with mainstream databases
    
    ### What changes were proposed in this pull request?
    Currently, Spark allows `count` to be called with no arguments even though it is not a parameterless aggregate function. For example, the following query actually works:
    `SELECT count() FROM tenk1;`
    On the other hand, mainstream databases will throw an error.
    **Oracle**
    `> ORA-00909: invalid number of arguments`
    **PgSQL**
    `ERROR:  count(*) must be used to call a parameterless aggregate function`
    **MySQL**
    `> 1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')`
    
    ### Why are the changes needed?
    Fix a bug so that Spark is consistent with mainstream databases.
    There is an example query output with/without this fix.
    `SELECT count() FROM testData;`
    The output before this fix:
    `0`
    The output after this fix:
    ```
    org.apache.spark.sql.AnalysisException
    cannot resolve 'count()' due to data type mismatch: count requires at least one argument.; line 1 pos 7
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    If no parameter is specified for `count`, an error is thrown.
    
    ### How was this patch tested?
    Jenkins test.
    
    Closes #30541 from beliefer/SPARK-28646.
    
    Lead-authored-by: gengjiaan <gengjiaan@360.cn>
    Co-authored-by: beliefer <beliefer@163.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Nov 30, 2020
    b665d58
  10. [SPARK-33480][SQL] Support char/varchar type

    ### What changes were proposed in this pull request?
    
    This PR adds the char/varchar type which is kind of a variant of string type:
    1. Char type is fixed-length string. When comparing char type values, we need to pad the shorter one to the longer length.
    2. Varchar type is string with a length limitation.
    
    To implement the char/varchar semantic, this PR:
    1. Do string length check when writing to char/varchar type columns.
    2. Do string padding when reading char type columns. We don't do it at the writing side to save storage space.
    3. Do string padding when comparing char type column with string literal or another char type column. (string literal is fixed length so should be treated as char type as well)
    
    To simplify the implementation, this PR doesn't propagate char/varchar type info through functions/operators(e.g. `substring`). That said, a column can only be char/varchar type if it's a table column, not a derived column like `SELECT substring(col)`.
    
    To be safe, this PR doesn't add char/varchar type to the query engine(expression input check, internal row framework, codegen framework, etc.). We will replace char/varchar type by string type with metadata (`Attribute.metadata` or `StructField.metadata`) that includes the original type string before it goes into the query engine. That said, the existing code will not see char/varchar type but only string type.
    
    char/varchar type may come from several places:
    1. v1 table from hive catalog.
    2. v2 table from v2 catalog.
    3. user-specified schema in `spark.read.schema` and `spark.readStream.schema`
    4. `Column.cast`
    5. schema string in places like `from_json`, pandas UDF, etc. These places use SQL parser which replaces char/varchar with string already, even before this PR.
    
    This PR covers all the above cases, implements the length check and padding feature by looking at string type with special metadata.
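    A rough example of the resulting behavior (the table name is illustrative and the exact error text may differ):

    ```scala
    spark.sql("CREATE TABLE chars_demo (c CHAR(5), v VARCHAR(3)) USING parquet")

    // Writes are length-checked for both types; reads pad CHAR values to their length.
    spark.sql("INSERT INTO chars_demo VALUES ('ab', 'xyz')")
    // spark.sql("INSERT INTO chars_demo VALUES ('ab', 'wxyz')")   // fails: exceeds VARCHAR(3)

    // Comparisons pad the shorter side, so the CHAR(5) column still equals the literal 'ab'.
    spark.sql("SELECT c = 'ab' FROM chars_demo").show()   // true
    ```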
    
    ### Why are the changes needed?
    
    char and varchar are standard SQL types. varchar is widely used in other databases instead of string type.
    
    ### Does this PR introduce _any_ user-facing change?
    
    For hive tables: now the table insertion fails if the value exceeds char/varchar length. Previously we truncate the value silently.
    
    For other tables:
    1. now char type is allowed.
    2. now we have length check when inserting to varchar columns. Previously we write the value as it is.
    
    ### How was this patch tested?
    
    new tests
    
    Closes #30412 from cloud-fan/char.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Nov 30, 2020
    5cfbddd
  11. [SPARK-33579][UI] Fix executor blank page behind proxy

    ### What changes were proposed in this pull request?
    
    Fix some "hardcoded" API urls in Web UI.
    More specifically, we avoid the use of `location.origin` when constructing URLs for internal API calls within the JavaScript.
    Instead, we use `apiRoot` global variable.
    
    ### Why are the changes needed?
    
    On one hand, it allows us to build relative URLs. On the other hand, `apiRoot` reflects the Spark property `spark.ui.proxyBase` which can be set to change the root path of the Web UI.
    
    If `spark.ui.proxyBase` is actually set, original URLs become incorrect, and we end up with an executors blank page.
    I encountered this bug when accessing the Web UI behind a proxy (in my case a Kubernetes Ingress).
    
    See the following link for more context:
    jupyterhub/jupyter-server-proxy#57 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, as all the changes introduced are in the JavaScript for the Web UI.
    
    ### How was this patch tested?
    I modified/debugged the JavaScript as in the commit with the help of the developer tools in Google Chrome, while accessing the Web UI of my Spark app behind my k8s ingress.
    
    Closes #30523 from pgillet/fix-executors-blank-page-behind-proxy.
    
    Authored-by: Pascal Gillet <pascal.gillet@stack-labs.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Pascal Gillet authored and sarutak committed Nov 30, 2020
    6e5446e
  12. [SPARK-33452][SQL] Support v2 SHOW PARTITIONS

    ### What changes were proposed in this pull request?
    1. Remove V2 logical node `ShowPartitionsStatement `, and replace it by V2 `ShowPartitions`.
    2. Implement V2 execution node `ShowPartitionsExec` similar to V1 `ShowPartitionsCommand`.
    
    ### Why are the changes needed?
    To have feature parity with Datasource V1.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    
    Before the change, `SHOW PARTITIONS` fails in V2 table catalogs with the exception:
    ```
    org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is only supported with v1 tables.
       at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$parseV1Table(ResolveSessionCatalog.scala:628)
       at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:466)
    ```
    
    ### How was this patch tested?
    By running the following test suites:
    1. Modified `ShowPartitionsParserSuite` where `ShowPartitionsStatement` is replaced by V2 `ShowPartitions`.
    2. `v2.ShowPartitionsSuite`
    
    Closes #30398 from MaxGekk/show-partitions-exec-v2.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Nov 30, 2020
    0a612b6
  13. [SPARK-33569][SQL] Remove getting partitions by an identifier prefix

    ### What changes were proposed in this pull request?
    1. Remove the method `listPartitionIdentifiers()` from the `SupportsPartitionManagement` interface. The method lists partitions by ident prefix.
    2. Rename `listPartitionByNames()` to `listPartitionIdentifiers()`.
    3. Re-implement the default method `partitionExists()` using new method.
    
    ### Why are the changes needed?
    Getting partitions by ident prefix only is not used, and it can be removed to improve code maintenance. Also this makes the `SupportsPartitionManagement` interface cleaner.
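    A simplified sketch of the reshaped interface (not the real `SupportsPartitionManagement` signatures, which use `InternalRow` idents):

    ```scala
    // listPartitionIdentifiers (formerly listPartitionByNames) lists partitions matching
    // the given column names/values; partitionExists becomes a non-empty check on it.
    trait PartitionManagement {
      def partitionColumnNames: Seq[String]
      def listPartitionIdentifiers(names: Array[String], ident: Seq[Any]): Seq[Seq[Any]]

      def partitionExists(ident: Seq[Any]): Boolean =
        listPartitionIdentifiers(partitionColumnNames.toArray, ident).nonEmpty
    }
    ```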
    
    ### Does this PR introduce _any_ user-facing change?
    Should not.
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt "test:testOnly org.apache.spark.sql.connector.catalog.*"
    ```
    
    Closes #30514 from MaxGekk/remove-listPartitionIdentifiers.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Nov 30, 2020
    6fd148f
  14. [SPARK-33569][SPARK-33452][SQL][FOLLOWUP] Fix a build error in `ShowPartitionsExec`
    
    ### What changes were proposed in this pull request?
    Use `listPartitionIdentifiers ` instead of `listPartitionByNames` in `ShowPartitionsExec`. The `listPartitionByNames` was renamed by #30514.
    
    ### Why are the changes needed?
    To fix build error.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running tests for the `SHOW PARTITIONS` command:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"
    ```
    
    Closes #30553 from MaxGekk/fix-build-show-partitions-exec.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Nov 30, 2020
    030b313
  15. [SPARK-33185][YARN][FOLLOW-ON] Leverage RM's RPC API instead of REST to fetch driver log links in yarn.Client
    
    ### What changes were proposed in this pull request?
    This is a follow-on to PR #30096 which initially added support for printing direct links to the driver stdout/stderr logs from the application report output in `yarn.Client` using the `spark.yarn.includeDriverLogsLink` configuration. That PR made use of the ResourceManager's REST APIs to fetch the necessary information to construct the links. This PR proposes removing the dependency on the REST API, since the new logic is the only place in `yarn.Client` which makes use of this API, and instead leverages the RPC API via `YarnClient`, which brings the code in line with the rest of `yarn.Client`.
    
    ### Why are the changes needed?
    
    While the old logic worked okay when running a Spark application in a "standard" environment with full access to Kerberos credentials, it can fail when run in an environment with restricted Kerberos credentials. In our case, this environment is represented by [Azkaban](https://azkaban.github.io/), but it likely affects other job scheduling systems as well. In such an environment, the application has delegation tokens which enabled it to communicate with services such as YARN, but the RM REST API is not typically covered by such delegation tokens (note that although YARN does actually support accessing the RM REST API via a delegation token as documented [here](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Delegation_Tokens_API), it is a new feature in alpha phase, and most deployments are likely not retrieving this token today).
    
    Besides this enhancement, leveraging the `YarnClient` APIs greatly simplifies the processing logic, such as removing all JSON parsing.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Very minimal user-facing changes on top of PR #30096. Basically expands the scope of environments in which that feature will operate correctly.
    
    ### How was this patch tested?
    
    In addition to redoing the `spark-submit` testing as mentioned in PR #30096, I also tested this logic in a restricted-credentials environment (Azkaban). It succeeds where the previous logic would fail with a 401 error.
    
    Closes #30450 from xkrogen/xkrogen-SPARK-33185-driverlogs-followon.
    
    Authored-by: Erik Krogen <xkrogen@apache.org>
    Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
    xkrogen authored and Mridul Muralidharan committed Nov 30, 2020
    f3c2583
  16. [SPARK-33545][CORE] Support Fallback Storage during Worker decommission

    ### What changes were proposed in this pull request?
    
    This PR aims to support storage migration to the fallback storage like cloud storage (`S3`) during worker decommission for the corner cases where the exceptions occur or there is no live peer left.
    
    Although this PR focuses on cloud storage like `S3` which has a TTL feature in order to simplify Spark's logic, we can use alternative fallback storages like HDFS/NFS(EFS) if the user provides a clean-up mechanism.
    
    ### Why are the changes needed?
    
    Currently, storage migration is not possible when there is no available executor. For example, when there is one executor, the executor cannot perform storage migration because it has no peer.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This is a new feature.
    
    ### How was this patch tested?
    
    Pass the CIs with newly added test cases.
    
    Closes #30492 from dongjoon-hyun/SPARK-33545.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Nov 30, 2020
    c699435
  17. [SPARK-33440][CORE] Use current timestamp with warning log in HadoopFSDelegationTokenProvider when the issue date for token is not set up properly
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use current timestamp with warning log when the issue date for token is not set up properly. The next section will explain the rationalization with details.
    
    ### Why are the changes needed?
    
    Unfortunately, not every implementation respects the `issue date` in `AbstractDelegationTokenIdentifier`, which Spark relies on in this calculation. The default value of the issue date is 0L, which is far from the actual issue date, breaking the logic for calculating the next renewal date under some circumstances and leading to a 0 (immediate) interval when rescheduling token renewal.
    
    In HadoopFSDelegationTokenProvider, Spark calculates token renewal interval as below:
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L123-L134
    
    The interval is calculated as `token.renew() - identifier.getIssueDate`, which is providing correct interval assuming both `token.renew()` and `identifier.getIssueDate` produce correct value, but it's going to be weird when `identifier.getIssueDate` provides 0L (default value), like below:
    
    ```
    20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 1603175657000 for token S3ADelegationToken/IDBroker
    20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400048 for token HDFS_DELEGATION_TOKEN
    ```
    
    Hopefully we pick the minimum value as a safety guard (so in this case, `86400048` is picked up), but that safety guard has an unintended bad impact in this case.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L58-L71
    
    Spark takes the interval calculated above (the "minimum" of the intervals) and blindly adds it to the token's issue date to calculate the next renewal date for the token, again picking the "minimum" value. In the problematic case, the value would be `86400048` (86400048 + 0), which is far smaller than the current timestamp.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L228-L234
    
    The current timestamp is then subtracted from the next renewal date to get the interval, which is multiplied by the configured ratio to produce the final schedule interval. In the problematic case, this value goes negative.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L180-L188
    
    There's a safety guard to not allow negative value, but that's simply 0 meaning schedule immediately. This triggers next calculation of next renewal date to calculate the schedule interval, lead to the same behavior, hence updating delegation token immediately and continuously.
    
    As we fetch the token just before the calculation happens, the actual issue date is likely only slightly earlier, hence it's not that dangerous to use the current timestamp as the issue date for a token whose issue date has not been set up properly. Still, it's better not to leave the token implementation as it is, so we log a warning message to let end users consult the token implementer.
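    The guard described above boils down to something like this (names are illustrative):

    ```scala
    // If a token identifier reports a bogus issue date (the 0L default, or anything
    // non-positive), fall back to "now" with a warning so the renewal interval does
    // not collapse into an immediate-reschedule loop.
    def issueDateOrNow(tokenKind: String, reportedIssueDate: Long): Long = {
      val now = System.currentTimeMillis()
      if (reportedIssueDate > 0L && reportedIssueDate <= now) {
        reportedIssueDate
      } else {
        println(s"WARN: token ($tokenKind) has unexpected issue date $reportedIssueDate, using $now")
        now
      }
    }

    // The next renewal date stays in the future instead of landing far in the past:
    def nextRenewalDate(tokenKind: String, reportedIssueDate: Long, renewalIntervalMs: Long): Long =
      issueDateOrNow(tokenKind, reportedIssueDate) + renewalIntervalMs
    ```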
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. End users won't encounter the tight token-renewal scheduling loop after this PR. From the end user's perspective, there's nothing they need to change.
    
    ### How was this patch tested?
    
    Manually tested with problematic environment.
    
    Closes #30366 from HeartSaVioR/SPARK-33440.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    HeartSaVioR committed Nov 30, 2020
    f5d2165

Commits on Dec 1, 2020

  1. [SPARK-33556][ML] Add array_to_vector function for dataframe column

    ### What changes were proposed in this pull request?
    
    Add array_to_vector function for dataframe column
    
    ### Why are the changes needed?
    Utility function for array to vector conversion.
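    Possible usage (spark-shell style), assuming the helper sits next to the existing `vector_to_array` in `org.apache.spark.ml.functions`:

    ```scala
    import org.apache.spark.ml.functions.array_to_vector
    import spark.implicits._

    // Converts an array<double> column into an ML vector column.
    val df = Seq((0, Array(1.0, 2.0, 3.0))).toDF("id", "arr")
    df.select($"id", array_to_vector($"arr").as("vec")).printSchema()
    // vec: vector
    ```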
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    scala unit test & doctest.
    
    Closes #30498 from WeichenXu123/array_to_vec.
    
    Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
    Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    WeichenXu123 and HyukjinKwon committed Dec 1, 2020
    596fbc1
  2. [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests

    ### What changes were proposed in this pull request?
    
    This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.
    
    - https://docs.python.org/3/library/unittest.html#deprecated-aliases
    
    ### Why are the changes needed?
    
    For consistency and eventual removal of deprecated APIs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests
    
    Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Dec 1, 2020
    aeb3649
  3. [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
    
    ### What changes were proposed in this pull request?
    Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
    
    When saving the validator's estimatorParamMaps, we now check all nested stages in the tuned estimator to get the correct param parent.
    
    Two typical cases to manually test:
    ~~~python
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression()
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    
    paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [10, 100]) \
        .addGrid(lr.maxIter, [100, 200]) \
        .build()
    tvs = TrainValidationSplit(estimator=pipeline,
                               estimatorParamMaps=paramGrid,
                               evaluator=MulticlassClassificationEvaluator())
    
    tvs.save(tvsPath)
    loadedTvs = TrainValidationSplit.load(tvsPath)
    
    # check `loadedTvs.getEstimatorParamMaps()` restored correctly.
    ~~~
    
    ~~~python
    lr = LogisticRegression()
    ova = OneVsRest(classifier=lr)
    grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
    evaluator = MulticlassClassificationEvaluator()
    tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)
    
    tvs.save(tvsPath)
    loadedTvs = TrainValidationSplit.load(tvsPath)
    
    # check `loadedTvs.getEstimatorParamMaps()` restored correctly.
    ~~~
    
    ### Why are the changes needed?
    Bug fix.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test.
    
    Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    WeichenXu123 authored and zhengruifeng committed Dec 1, 2020
    8016123
  4. [SPARK-33607][SS][WEBUI] Input Rate timeline/histogram aren't rendered if built with Scala 2.13
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue that the histogram and timeline aren't rendered in the `Streaming Query Statistics` page if we built Spark with Scala 2.13.
    
    ![before-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612855-f543d700-3356-11eb-90d9-ede57b8b3f4f.png)
    ![NaN_Error](https://user-images.githubusercontent.com/4736016/100612879-00970280-3357-11eb-97cf-43978bbe2d3a.png)
    
    The reason is [`maxRecordRate` can be `NaN`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L371) for Scala 2.13.
    
    The `NaN` is the result of [`query.recentProgress.map(_.inputRowsPerSecond).max`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L372) when the first element of `query.recentProgress.map(_.inputRowsPerSecond)` is `NaN`.
    Actually, the comparison logic for `Double` type was changed in Scala 2.13.
    scala/bug#12107
    scala/scala#6410
    
    So this issue happens as of Scala 2.13.
    
    The root cause of the `NaN` is [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L164).
    This `NaN` seems to be an initial value of `inputTimeSec` so I think `Double.PositiveInfinity` is suitable rather than `NaN` and this change can resolve this issue.
    
    ### Why are the changes needed?
    
    To make sure we can use the histogram/timeline with Scala 2.13.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    First, I built with the following commands.
    ```
    $ /dev/change-scala-version.sh 2.13
    $ build/sbt -Phive -Phive-thriftserver -Pscala-2.13 package
    ```
    
    Then, ran the following query (this is brought from #30427 ).
    ```
    import org.apache.spark.sql.streaming.Trigger
    val query = spark
      .readStream
      .format("rate")
      .option("rowsPerSecond", 1000)
      .option("rampUpTime", "10s")
      .load()
      .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
      .selectExpr("tsMod", "mod(value, 100) as mod", "value")
      .withWatermark("tsMod", "10 seconds")
      .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
      .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .start()
    ```
    
    Finally, I confirmed that the timeline and histogram are rendered.
    ![after-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612736-c9285600-3356-11eb-856d-7e53cc656c36.png)
    
    
    Closes #30546 from sarutak/ss-nan.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    sarutak authored and HeartSaVioR committed Dec 1, 2020
    c50fcac
  5. [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
    
    ### What changes were proposed in this pull request?
    
    This patch addresses the case where compact metadata file is read twice in FileStreamSource during restarting query.
    
    When restarting the query, there is a case where the query starts from a compaction batch and the batch has a source metadata file to read. One such case is when the previous run succeeded in reading from the inputs but did not finalize the batch for various reasons.
    
    The patch finds the latest compaction batch when restoring from metadata log, and put entries for the batch into the file entry cache which would avoid reading compact batch file twice.
    
    FileStreamSourceLog doesn't know about the offset/commit metadata in the checkpoint, so it doesn't know exactly which batch to start from, but in practice only a couple of the latest batches are candidates to be started from when restarting a query. This patch leverages that fact to skip the calculation if possible.
    
    ### Why are the changes needed?
    
    Spark incurs an unnecessary cost by reading the compact metadata file twice in some cases, which may not be ignorable when the query has processed a huge number of files so far.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    New UT.
    
    Closes #27649 from HeartSaVioR/SPARK-30900.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    HeartSaVioR committed Dec 1, 2020
    2af2da5
  6. [SPARK-33530][CORE] Support --archives and spark.archives option natively
    
    ### What changes were proposed in this pull request?
    
    TL;DR:
    - This PR completes the support of archives in Spark itself instead of Yarn-only
      - It makes `--archives` option work in other cluster modes too and adds `spark.archives` configuration.
    -  After this PR, PySpark users can leverage Conda to ship Python packages together as below:
        ```python
        conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
        conda activate pyspark_env
        conda pack -f -o pyspark_env.tar.gz
        PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
       ```
    - Issue a warning that undocumented and hidden behavior of partial archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive`.
    
    This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode:
    
    ```bash
    ./bin/spark-submit --help
    ```
    
    ```
    Options:
    ...
     Spark on YARN only:
      --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
      --archives ARCHIVES         Comma separated list of archives to be extracted into the
                                  working directory of each executor.
    ```
    
    This `archives` feature is useful often when you have to ship a directory and unpack into executors. One example is native libraries to use e.g. JNI. Another example is to ship Python packages together by Conda environment.
    
    Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).
    
    The neatest way is arguably to use Conda environment by shipping zipped Conda environment but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour that untars `tar.gz` but I don't think we should document such ways and promote people to more rely on it.
    
    Also, note that this PR does not target to add the feature parity of `spark.files.overwrite`, `spark.files.useFetchCache`, etc. yet. I documented that this is an experimental feature as well.
    
    ### Why are the changes needed?
    
    To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`.
    
    ### How was this patch tested?
    
    I added unittests. Also, manually tested in standalone cluster, local-cluster, and local modes.
    
    Closes #30486 from HyukjinKwon/native-archive.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Dec 1, 2020
    1a042cc
  7. [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to provide a new option to specify time-to-live (TTL) for output file entries in FileStreamSink. TTL is defined via current timestamp - the last modified time for the file.
    
    This patch filters out outdated output file entries in the metadata while compacting batches (other batches don't have the functionality to clean entries), which keeps the metadata from growing linearly; the filtered-out files will also "eventually" no longer be seen in reader queries which leverage File(Stream)Source.
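    Conceptually, the compaction-time filtering is just the following (a sketch with assumed names):

    ```scala
    final case class SinkFileStatus(path: String, modificationTimeMs: Long)

    // During compaction, drop entries whose files are older than the configured
    // retention; the remaining entries are written into the compact batch.
    def applyRetention(
        entries: Seq[SinkFileStatus],
        retentionMs: Long,
        nowMs: Long = System.currentTimeMillis()): Seq[SinkFileStatus] =
      entries.filter(e => nowMs - e.modificationTimeMs <= retentionMs)
    ```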
    
    ### Why are the changes needed?
    
    The metadata log greatly helps to easily achieve exactly-once but given the output path is open to arbitrary readers, there's no way to compact the metadata log, which ends up growing the metadata file as query runs for long time, especially for compacted batch.
    
    Lots of end users have been reporting the issue: see comments in [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295) and [SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995), and [SPARK-30462](https://issues.apache.org/jira/browse/SPARK-30462).
    (There're some reports from end users which include their workarounds: SPARK-24295)
    
    ### Does this PR introduce any user-facing change?
    
    No, as the configuration is new and by default it is not applied.
    
    ### How was this patch tested?
    
    New UT.
    
    Closes #28363 from HeartSaVioR/SPARK-27188-v2.
    
    Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    HeartSaVioR and HeartSaVioR committed Dec 1, 2020
    52e5cc4
  8. [SPARK-33572][SQL] Datetime building should fail if the year, month, ..., second combination is invalid
    
    ### What changes were proposed in this pull request?
    Datetime building should fail if the year, month, ..., second combination is invalid when ANSI mode is enabled. This patch updates MakeDate, MakeTimestamp and MakeInterval accordingly.
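    For example, with ANSI mode enabled an out-of-range month is now rejected (exact error message may differ):

    ```scala
    spark.conf.set("spark.sql.ansi.enabled", "true")

    spark.sql("SELECT make_date(2020, 12, 1)").show()   // 2020-12-01
    spark.sql("SELECT make_date(2020, 13, 1)").show()   // now throws instead of returning NULL
    ```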
    
    ### Why are the changes needed?
    For ANSI mode.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added UT and Existing UT.
    
    Closes #30516 from waitinfuture/SPARK-33498.
    
    Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
    Co-authored-by: waitinfuture <waitinfuture@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    2 people authored and cloud-fan committed Dec 1, 2020
    1034815
  9. [SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API
    
    ### What changes were proposed in this pull request?
    Deprecated `KafkaConsumer.poll(long)` API calls may cause infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to reach the newly added functionality. The Structured Streaming migration guide contains more information what migration consideration must be done. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.
    
    The PR contains the following changes:
    * Added `AdminClient` based offset fetching
    * GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
    * GroupId override feature removed from driver but only in `AdminClient` based approach  (`AdminClient` doesn't need any GroupId)
    * Additional unit tests
    * Code comment changes
    * Minor bugfixes here and there
    * Removed Kafka auto topic creation feature but only in `AdminClient` based approach (please see doc for rationale). In short, it's super hidden, not sure anybody ever used in production + error prone.
    * Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`
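    Opting in to the `AdminClient`-based fetching is a configuration change, e.g. (the config can also go into spark-defaults):

    ```scala
    // Turn off the deprecated KafkaConsumer.poll(long)-based offset fetching
    // for Structured Streaming Kafka sources (the default is still true).
    spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false")

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "input-topic")
      .load()
    ```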
    
    ### Why are the changes needed?
    Driver may hang forever.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Existing + additional unit tests.
    Cluster test with simple Kafka topic to another topic query.
    Documentation:
    ```
    cd docs/
    SKIP_API=1 jekyll build
    ```
    Manual webpage check.
    
    Closes #29729 from gaborgsomogyi/SPARK-32032.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    gaborgsomogyi authored and HeartSaVioR committed Dec 1, 2020
    e5bb293
  10. [SPARK-32405][SQL][FOLLOWUP] Throw Exception if provider is specified in JDBCTableCatalog create table
    
    ### What changes were proposed in this pull request?
    Throw Exception if JDBC Table Catalog has provider in create table.
    
    ### Why are the changes needed?
    JDBC Table Catalog doesn't support a provider, so we should throw an exception. Previously the CREATE TABLE syntax forced people to specify a provider, so we had to add a `USING_`. Now that problem is fixed, and we throw an exception if a provider is specified.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. We throw Exception if a provider is specified in CREATE TABLE for JDBC Table catalog.
    
    ### How was this patch tested?
    Existing tests (remove `USING _`)
    
    Closes #30544 from huaxingao/followup.
    
    Authored-by: Huaxin Gao <huaxing@us.ibm.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    huaxingao authored and cloud-fan committed Dec 1, 2020
    d38883c
  11. [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and f…

    …ix StackOverflowError issue
    
    ### What changes were proposed in this pull request?
    Spark already supports the `LIKE ANY` syntax, but it will throw a `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for LIKE ANY to fix this issue.
    
    Why can the stack overflow happen in the current approach?
    The current approach uses reduceLeft to connect each `Like(e, p)`, which makes the call depth of the thread too large, causing `StackOverflowError` problems.
    
    Why does the fix in this PR avoid the error?
    This PR adds a built-in function for `LIKE ANY`, which avoids this issue.
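    
    For example, a query of the following shape (table and column names are hypothetical) could previously hit the stack overflow once the pattern list grew large enough:
    
    ```scala
    // Each pattern used to become a Like(...) node connected via reduceLeft, so a
    // very long pattern list produced a very deep expression tree.
    val patterns = (1 to 20000).map(i => s"'%p$i%'").mkString(", ")
    spark.sql(s"SELECT * FROM logs WHERE message LIKE ANY ($patterns)")
    ```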
    
    ### Why are the changes needed?
    1. Fix the `StackOverflowError` issue.
    2. Support the built-in function `like_any`.
    
    ### Does this PR introduce _any_ user-facing change?
    'No'.
    
    ### How was this patch tested?
    Jenkins test.
    
    Closes #30465 from beliefer/SPARK-33045-like_any-bak.
    
    Lead-authored-by: gengjiaan <gengjiaan@360.cn>
    Co-authored-by: beliefer <beliefer@163.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    2 people authored and cloud-fan committed Dec 1, 2020
    9273d42
  12. [SPARK-33503][SQL] Refactor SortOrder class to allow multiple children

    ### What changes were proposed in this pull request?
    This is a follow-up of #30302. As part of this PR, the sameOrderExpressions set is made part of the children of the SortOrder node, so that they don't need the special handling added in #30302.
    
    ### Why are the changes needed?
    sameOrderExpressions should get the same treatment as children. Making them part of children helps in transforming them easily.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UTs
    
    Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor.
    
    Authored-by: Prakhar Jain <prakharjain09@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    prakharjain09 authored and maropu committed Dec 1, 2020
    cf4ad21
  13. [SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPred…

    …icates
    
    ### What changes were proposed in this pull request?
    
    This PR adds logic to handle DELETE/UPDATE/MERGE plans in `PullupCorrelatedPredicates`.
    
    ### Why are the changes needed?
    
    Right now, `PullupCorrelatedPredicates` applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The PR adds 3 new test cases.
    
    Closes #30555 from aokolnychyi/spark-33608.
    
    Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    aokolnychyi authored and cloud-fan committed Dec 1, 2020
    478fb7f
  14. [SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer

    ### What changes were proposed in this pull request?
    
    This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.
    
    ### Why are the changes needed?
    
    Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The change is trivial and SPARK-23889 will depend on it.
    
    Closes #30558 from aokolnychyi/spark-33612.
    
    Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    aokolnychyi authored and dongjoon-hyun committed Dec 1, 2020
    c24f2b2
  15. [SPARK-33611][UI] Avoid encoding twice on the query parameter of rewr…

    …itten proxy URL
    
    ### What changes were proposed in this pull request?
    
    When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
    ```
    > val uri = "http://localhost:8081/test"
    > val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
    > val rewrittenURI = URI.create(uri.toString())
    
    > new URI(rewrittenURI.getScheme(),
          rewrittenURI.getAuthority(),
          rewrittenURI.getPath(),
          query,
          rewrittenURI.getFragment()).toString
    result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
    ```
    
    In Spark's stage page, the URL of "/taskTable" contains the query parameter order[0][dir]. After being encoded twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, a NullPointerException is thrown from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
    Other than that, other parameters may not work as expected after being encoded twice.
    
    This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
    ```
    > val uri = "http://localhost:8081/test"
    > val query = "order%5B0%5D%5Bcolumn%5D=0"
    > URI.create(s"$uri?$query").toString
    result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0
    
    > URI.create(s"$uri?$query").getQuery
    result: order[0][column]=0
    ```
    
    ### Why are the changes needed?
    
    Fix a potential bug when Spark's reverse proxy is enabled.
    The bug itself is similar to #29271.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Add a new unit test.
    Also, Manual UI testing for master, worker and app UI with an nginx proxy
    
    Spark config:
    ```
    spark.ui.port 8080
    spark.ui.reverseProxy=true
    spark.ui.reverseProxyUrl=/path/to/spark/
    ```
    nginx config:
    ```
    server {
        listen 9000;
        set $SPARK_MASTER http://127.0.0.1:8080;
        # split spark UI path into prefix and local path within master UI
        location ~ ^(/path/to/spark/) {
            # strip prefix when forwarding request
            rewrite /path/to/spark(/.*) $1  break;
            #rewrite /path/to/spark/ "/" ;
            # forward to spark master UI
            proxy_pass $SPARK_MASTER;
            proxy_intercept_errors on;
            error_page 301 302 307 = handle_redirects;
        }
        location handle_redirects {
            set $saved_redirect_location '$upstream_http_location';
            proxy_pass $saved_redirect_location;
        }
    }
    ```
    
    Closes #30552 from gengliangwang/decodeProxyRedirect.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    gengliangwang committed Dec 1, 2020
    5d0045e
  16. [SPARK-33622][R][ML] Add array_to_vector to SparkR

    ### What changes were proposed in this pull request?
    
    This PR adds `array_to_vector` to R API.
    
    ### Why are the changes needed?
    
    Feature parity.
    
    ### Does this PR introduce _any_ user-facing change?
    
    New function exposed in the public API.
    
    ### How was this patch tested?
    
    New unit test.
    Manual verification of the documentation examples.
    
    Closes #30561 from zero323/SPARK-33622.
    
    Authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    zero323 authored and dongjoon-hyun committed Dec 1, 2020
    5a1c5ac

Commits on Dec 2, 2020

  1. [SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the s…

    …ize of its children
    
    ### What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert a filter for not-null and size > 0 when using inner explode/inline. This is fine in most cases, but the extra filter is not needed if the explode is over a CreateArray that does not use Literals (Literals are already handled). In that case you know that the values aren't null and the array has a size; the empty array is already handled.
    
    The not-null check is already optimized out because CreateArray and CreateMap are not nullable, which leaves the size > 0 check. To handle that, this PR makes the size > 0 check fold in ConstantFolding to the number of children of the array or map. That makes it a literal, which is then ultimately optimized out.
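    
    A minimal sketch of the kind of query affected (assuming a SparkSession named `spark`):
    
    ```scala
    import org.apache.spark.sql.functions._
    
    // Inner explode over a CreateArray of non-null children: with this change the
    // inserted `size(array(id, id + 1)) > 0` filter constant-folds to `2 > 0` and
    // can then be removed entirely.
    val df = spark.range(5).select(explode(array(col("id"), col("id") + 1)).as("v"))
    df.explain(true)
    ```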
    
    ### Why are the changes needed?
    remove unneeded filter
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    Unit tests added and manually tested various cases
    
    Closes #30504 from tgravescs/SPARK-33544.
    
    Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
    Co-authored-by: Thomas Graves <tgraves@apache.org>
    Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    3 people committed Dec 2, 2020
    f71f345
  2. [SPARK-32863][SS] Full outer stream-stream join

    ### What changes were proposed in this pull request?
    
    This PR adds full outer stream-stream join; the implementation of full outer join is as follows (see the usage sketch after this list):
    * For left side input row, check if there's a match on right side state store.
      * if there's a match, output the joined row, o.w. output nothing. Put the row in left side state store.
    * For right side input row, check if there's a match on left side state store.
      * if there's a match, output the joined row, o.w. output nothing. Put the row in right side state store.
    * State store eviction: evict rows from left/right side state store below watermark, and output rows never matched before (a combination of left outer and right outer join).
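    
    A minimal usage sketch (source, column names, watermarks and the join condition are all made up) of what a full outer stream-stream join can look like:
    
    ```scala
    import org.apache.spark.sql.functions.{col, expr}
    
    val left = spark.readStream.format("rate").load()
      .select(col("timestamp").as("leftTime"), col("value").as("leftId"))
      .withWatermark("leftTime", "10 minutes")
    
    val right = spark.readStream.format("rate").load()
      .select(col("timestamp").as("rightTime"), col("value").as("rightId"))
      .withWatermark("rightTime", "10 minutes")
    
    // Full outer join with an event-time range condition so state can be evicted.
    val joined = left.join(
      right,
      expr("leftId = rightId AND rightTime BETWEEN leftTime - INTERVAL 5 minutes AND leftTime + INTERVAL 5 minutes"),
      "full_outer")
    ```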
    
    ### Why are the changes needed?
    
    Enable more use cases for spark stream-stream join.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.
    
    Closes #30395 from c21/stream-foj.
    
    Authored-by: Cheng Su <chengsu@fb.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    c21 authored and HeartSaVioR committed Dec 2, 2020
    51ebcd9
  3. [MINOR][SS] Rename auxiliary protected methods in StreamingJoinSuite

    ### What changes were proposed in this pull request?
    
    Per request from #30395 (comment), here we remove `Windowed` from methods names `setupWindowedJoinWithRangeCondition` and `setupWindowedSelfJoin` as they don't join on time window.
    
    ### Why are the changes needed?
    
    There's no such official name for `windowed join`, so this is to help avoid confusion for future developers.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit tests.
    
    Closes #30563 from c21/stream-minor.
    
    Authored-by: Cheng Su <chengsu@fb.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    c21 authored and HeartSaVioR committed Dec 2, 2020
    a4788ee
  4. [SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to…

    … make hadoop-aws work
    
    ### What changes were proposed in this pull request?
    
    This reverts commit SPARK-33212 (cb3fa6c) mostly with three exceptions:
    1. `SparkSubmitUtils` was updated recently by SPARK-33580
    2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
    3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.
    
    ### Why are the changes needed?
    
    According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following.
    
    **1. Spark distribution with `-Phadoop-cloud`**
    
    ```scala
    $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
    20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
          /_/
    
    Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
    20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
    +------+--------------+----------------+
    |  name|favorite_color|favorite_numbers|
    +------+--------------+----------------+
    |Alyssa|          null|  [3, 9, 15, 20]|
    |   Ben|           red|              []|
    +------+--------------+----------------+
    
    scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
    20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
    java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
    ```
    
    **2. Spark distribution without `-Phadoop-cloud`**
    ```scala
    $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
    ...
    java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
      at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CI.
    
    Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 2, 2020
    290aa02
  5. [SPARK-33557][CORE][MESOS][TEST] Ensure the relationship between STOR…

    …AGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT and NETWORK_TIMEOUT
    
    ### What changes were proposed in this pull request?
    As described in SPARK-33557, `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend` will always use `Network.NETWORK_TIMEOUT.defaultValueString` as the value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when we configure `NETWORK_TIMEOUT` without configuring `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`, which is different from the relationship described in `configuration.md`.
    
    To fix this problem, the main changes of this PR are as follows (see the sketch after this list):
    
    - Remove the explicit default value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`
    
    - Use the actual value of `NETWORK_TIMEOUT` as `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is not configured in `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend`
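    
    A small illustrative sketch (local mode, made-up values) of the expected fallback behavior after this change:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Only the generic network timeout is set; with this fix the block manager
    // heartbeat timeout (STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT) is expected to
    // fall back to the same 300s instead of NETWORK_TIMEOUT's default value.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("timeout-fallback-sketch")
      .config("spark.network.timeout", "300s")
      .getOrCreate()
    ```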
    
    ### Why are the changes needed?
    To ensure the relationship between `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is as described in `configuration.md`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    
    - Pass the Jenkins or GitHub Action
    
    - Manual test configure `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` locally
    
    Closes #30547 from LuciferYang/SPARK-33557.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    LuciferYang authored and HyukjinKwon committed Dec 2, 2020
    084d38b
  6. [SPARK-33504][CORE] The application log in the Spark history server c…

    …ontains sensitive attributes should be redacted
    
    ### What changes were proposed in this pull request?
    To make sure sensitive attributes are redacted in the history server log.
    
    ### Why are the changes needed?
    We found that sensitive attributes like passwords in SparkListenerJobStart and SparkListenerStageSubmitted events were not redacted, so sensitive attributes could be viewed directly.
    The screenshot can be viewed in the attachment of JIRA SPARK-33504.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    Manual tests work well, and I have also added a unit test case.
    
    Closes #30446 from akiyamaneko/eventlog_unredact.
    
    Authored-by: neko <echohlne@gmail.com>
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    echohlne authored and tgravescs committed Dec 2, 2020
    28dad1b
  7. [SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clar…

    …ify the documentation more
    
    ### What changes were proposed in this pull request?
    
    This PR is a followup of #30504. It proposes:
    
    - Rename `NoSideEffect` to `NoThrow`, and use `Expression.deterministic` together where it is used.
    - Clarify, in the docs in the expressions, that it means they don't throw exceptions
    
    ### Why are the changes needed?
    
    `NoSideEffect` virtually means that `Expression.eval` does not throw an exception, and the expressions are deterministic.
    It's best to be explicit, so `NoThrow` was proposed. I looked for a similar name to represent this concept and borrowed the name from [nothrow](https://clang.llvm.org/docs/AttributeReference.html#nothrow).
    For determinism, we already have a way to note it under `Expression.deterministic`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually ran the existing unittests written.
    
    Closes #30570 from HyukjinKwon/SPARK-33544.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    HyukjinKwon authored and cloud-fan committed Dec 2, 2020
    df8d3f1
  8. [SPARK-33619][SQL] Fix GetMapValueUtil code generation error

    ### What changes were proposed in this pull request?
    
    Codegen bug fix for an issue introduced by SPARK-33460.
    
    ```
    GetMapValueUtil
    
    s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""
    
    SHOULD BE
    
    s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""
    ```
    
    The reason why SPARK-33460 failed to detect this bug via UT is that `checkExceptionInExpression` did not work as expected: unlike `checkEvaluation`, it did not try to evaluate the expression in BOTH `CODEGEN_ONLY` and `NO_CODEGEN` modes. This PR also fixes that test bug.
    
    ### Why are the changes needed?
    
    Bug Fix.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Add UT and Existing UT.
    
    Closes #30560 from leanken/leanken-SPARK-33619.
    
    Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    leanken-zz authored and cloud-fan committed Dec 2, 2020
    58583f7
  9. [SPARK-33626][K8S][TEST] Allow k8s integration tests to assert both d…

    …river and executor logs for expected log(s)
    
    ### What changes were proposed in this pull request?
    
    Allow k8s integration tests to assert both driver and executor logs for expected log(s)
    
    ### Why are the changes needed?
    
    Some of the tests will be able to provide full coverage of the use case, by asserting both driver and executor logs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    TBD
    
    Closes #30568 from ScrapCodes/expectedDriverLogChanges.
    
    Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    ScrapCodes authored and dongjoon-hyun committed Dec 2, 2020
    91182d6
  10. [SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalP…

    …lan in join() to not break DetectAmbiguousSelfJoin
    
    ### What changes were proposed in this pull request?
    
    Currently, `join()` uses `withPlan(logicalPlan)` for convenience to call some Dataset functions. But this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset` (because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` to the new id of the `logicalPlan`). As a result, it breaks the rule `DetectAmbiguousSelfJoin`.
    
    In this PR, we propose to drop the usage of `withPlan` but use the `logicalPlan` directly so its `dataset_id` doesn't change.
    
    Besides, this PR also removes related metadata (`DATASET_ID_KEY`,  `COL_POS_KEY`) when an `Alias` tries to construct its own metadata. Because the `Alias` is no longer a reference column after converting to an `Attribute`.  To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.
    
    ### Why are the changes needed?
    
    The query below returns a wrong result while it should throw an ambiguous self-join exception instead:
    
    ```scala
    val emp1 = Seq[TestData](
      TestData(1, "sales"),
      TestData(2, "personnel"),
      TestData(3, "develop"),
      TestData(4, "IT")).toDS()
    val emp2 = Seq[TestData](
      TestData(1, "sales"),
      TestData(2, "personnel"),
      TestData(3, "develop")).toDS()
    val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
    emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
      .select(emp1.col("*"), emp3.col("key").as("e2")).show()
    
    // wrong result
    +---+---------+---+
    |key|    value| e2|
    +---+---------+---+
    |  1|    sales|  1|
    |  2|personnel|  2|
    |  3|  develop|  3|
    |  4|       IT|  4|
    +---+---------+---+
    ```
    This PR fixes the wrong behaviour.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users hit the exception instead of the wrong result after this PR.
    
    ### How was this patch tested?
    
    Added a new unit test.
    
    Closes #30488 from Ngone51/fix-self-join.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Ngone51 authored and cloud-fan committed Dec 2, 2020
    a082f46
  11. [SPARK-33627][SQL] Add new function UNIX_SECONDS, UNIX_MILLIS and UNI…

    …X_MICROS
    
    ### What changes were proposed in this pull request?
    
    As #28534 adds functions from [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) for converting numbers to timestamp, this PR is to add functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to numbers.
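    
    A quick sketch of the new functions (expected values follow their BigQuery-style definitions):
    
    ```scala
    spark.sql(
      "SELECT UNIX_SECONDS(TIMESTAMP '1970-01-01 00:00:01Z') AS s, " +
      "UNIX_MILLIS(TIMESTAMP '1970-01-01 00:00:01Z') AS ms, " +
      "UNIX_MICROS(TIMESTAMP '1970-01-01 00:00:01Z') AS us").show()
    // expected: s = 1, ms = 1000, us = 1000000
    ```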
    
    ### Why are the changes needed?
    
    1. Symmetry of the conversion functions
    2. Casting timestamp type to numeric types is disallowed in ANSI mode, we should provide functions for users to complete the conversion.
    
    ### Does this PR introduce _any_ user-facing change?
    
    3 new functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to long type.
    
    ### How was this patch tested?
    
    Unit tests.
    
    Closes #30566 from gengliangwang/timestampLong.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    gengliangwang authored and dongjoon-hyun committed Dec 2, 2020
    b76c6b7
  12. [SPARK-33631][DOCS][TEST] Clean up spark.core.connection.ack.wait.tim…

    …eout from configuration.md
    
    ### What changes were proposed in this pull request?
    SPARK-9767 removed `ConnectionManager` and related files. The configuration `spark.core.connection.ack.wait.timeout`, previously used by `ConnectionManager`, is no longer used by other Spark code, but it still exists in `configuration.md`.
    
    So this PR cleans up the useless configuration item `spark.core.connection.ack.wait.timeout` from `configuration.md`.
    
    ### Why are the changes needed?
    Clean up useless configuration from `configuration.md`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass the Jenkins or GitHub Action
    
    Closes #30569 from LuciferYang/SPARK-33631.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LuciferYang authored and dongjoon-hyun committed Dec 2, 2020
    92bfbcb

Commits on Dec 3, 2020

  1. [MINOR][INFRA] Use the latest image for GitHub Action jobs

    ### What changes were proposed in this pull request?
    
    Currently, GitHub Action is using two docker images.
    
    ```
    $ git grep dongjoon/apache-spark-github-action-image
    .github/workflows/build_and_test.yml:      image: dongjoon/apache-spark-github-action-image:20201015
    .github/workflows/build_and_test.yml:      image: dongjoon/apache-spark-github-action-image:20201025
    ```
    
    This PR aims to make it consistent by using the latest one.
    ```
    - image: dongjoon/apache-spark-github-action-image:20201015
    + image: dongjoon/apache-spark-github-action-image:20201025
    ```
    
    ### Why are the changes needed?
    
    This is for better maintainability. The image size is almost the same.
    ```
    $ docker images | grep 202010
    dongjoon/apache-spark-github-action-image                       20201025               37adfa3d226a   5 weeks ago     2.18GB
    dongjoon/apache-spark-github-action-image                       20201015               ff6fee8dc36d   6 weeks ago     2.16GB
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action.
    
    Closes #30578 from dongjoon-hyun/SPARK-MINOR.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 3, 2020
    f94cb53
  2. [SPARK-31953][SS] Add Spark Structured Streaming History Server Support

    ### What changes were proposed in this pull request?
    
    Add Spark Structured Streaming History Server Support.
    
    ### Why are the changes needed?
    
    Add a streaming query history server plugin.
    
    ![image](https://user-images.githubusercontent.com/7402327/84248291-d26cfe80-ab3b-11ea-86d2-98205fa2bcc4.png)
    ![image](https://user-images.githubusercontent.com/7402327/84248347-e44ea180-ab3b-11ea-81de-eefe207656f2.png)
    ![image](https://user-images.githubusercontent.com/7402327/84248396-f0d2fa00-ab3b-11ea-9b0d-e410115471b0.png)
    
    - Follow-ups
      - Query duration should not update in history UI.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Update UT.
    
    Closes #28781 from uncleGen/SPARK-31953.
    
    Lead-authored-by: uncleGen <hustyugm@gmail.com>
    Co-authored-by: Genmao Yu <hustyugm@gmail.com>
    Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    2 people authored and zsxwing committed Dec 3, 2020
    4f96670
  3. [SPARK-33610][ML] Imputer transform skip duplicate head() job

    ### What changes were proposed in this pull request?
    On each call of `transform`, a head() job will be triggered; this can be skipped by caching the result lazily.
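    
    A generic sketch of the idea (not the actual Imputer code; names are made up):
    
    ```scala
    // Compute the surrogate values once, lazily, instead of triggering a new
    // head()-style job on every transform call.
    class CachedSurrogates(compute: () => Array[Double]) {
      private lazy val surrogates: Array[Double] = compute()  // evaluated at most once
      def transform(): Array[Double] = surrogates             // later calls reuse the cached values
    }
    ```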
    
    ### Why are the changes needed?
    avoiding duplicate head() jobs
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    existing tests
    
    Closes #30550 from zhengruifeng/imputer_transform.
    
    Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    zhengruifeng committed Dec 3, 2020
    90d4d7d
  4. [SPARK-32896][SS][FOLLOW-UP] Rename the API to toTable

    ### What changes were proposed in this pull request?
    As discussed in #30521 (comment), rename the API to `toTable`.
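    
    A minimal sketch of the renamed API (rate source, checkpoint path and table name are placeholders):
    
    ```scala
    val df = spark.readStream.format("rate").load()
    
    val query = df.writeStream
      .option("checkpointLocation", "/tmp/ckpt/rate_to_table")  // placeholder path
      .toTable("rate_table")
    ```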
    
    ### Why are the changes needed?
    Rename the API for further extension and accuracy.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, it's an API change but the new API is not released yet.
    
    ### How was this patch tested?
    Existing UT.
    
    Closes #30571 from xuanyuanking/SPARK-32896-follow.
    
    Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    xuanyuanking authored and zsxwing committed Dec 3, 2020
    878cc0e
  5. [SPARK-22798][PYTHON][ML][FOLLOWUP] Add labelsArray to PySpark String…

    …Indexer
    
    ### What changes were proposed in this pull request?
    
    This is a followup to add missing `labelsArray` to PySpark `StringIndexer`.
    
    ### Why are the changes needed?
    
    `labelsArray` is for the multi-column case of `StringIndexer`. We should provide this accessor on the PySpark side too.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #30579 from viirya/SPARK-22798-followup.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Dec 3, 2020
    0880989
  6. [SPARK-33636][PYTHON][ML][FOLLOWUP] Update since tag of labelsArray i…

    …n StringIndexer
    
    ### What changes were proposed in this pull request?
    
    This is to update `labelsArray`'s since tag.
    
    ### Why are the changes needed?
    
    The original change was backported to branch-3.0 for the 3.0.2 release. So it is better to update the since tag to reflect that.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A. Just tag change.
    
    Closes #30582 from viirya/SPARK-33636-followup.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Dec 3, 2020
    3b2ff16
  7. [SPARK-20044][SQL] Add new function DATE_FROM_UNIX_DATE and UNIX_DATE

    ### What changes were proposed in this pull request?
    
    Add new functions DATE_FROM_UNIX_DATE and UNIX_DATE for conversion between Date type and Numeric types.
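    
    A quick sketch of the two functions (day 0 is 1970-01-01):
    
    ```scala
    spark.sql("SELECT DATE_FROM_UNIX_DATE(1) AS d, UNIX_DATE(DATE '1970-01-02') AS n").show()
    // expected: d = 1970-01-02, n = 1
    ```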
    
    ### Why are the changes needed?
    
    1. Explicit conversion between Date type and Numeric types is disallowed in ANSI mode. We need to provide new functions for users to complete the conversion.
    
    2. We have introduced new functions from Bigquery for conversion between Timestamp type and Numeric types: TIMESTAMP_SECONDS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS , UNIX_SECONDS, UNIX_MILLIS, and UNIX_MICROS. It makes sense to add functions for conversion between Date type and Numeric types as well.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, two new datetime functions are added.
    
    ### How was this patch tested?
    
    Unit tests
    
    Closes #30588 from gengliangwang/dateToNumber.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gengliangwang authored and cloud-fan committed Dec 3, 2020
    ff13f57
  8. [SPARK-26218][SQL][FOLLOW UP] Fix the corner case of codegen when cas…

    …ting float to Integer
    
    ### What changes were proposed in this pull request?
    This is a followup of [#27151](#27151). It fixes the same issue for the codegen path.
    
    ### Why are the changes needed?
    The result can be corrupted.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added Unit test.
    
    Closes #30585 from luluorta/SPARK-26218.
    
    Authored-by: luluorta <luluorta@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    luluorta authored and cloud-fan committed Dec 3, 2020
    512fb32
  9. [SPARK-30098][SQL] Add a configuration to use default datasource as p…

    …rovider for CREATE TABLE command
    
    ### What changes were proposed in this pull request?
    
    For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false.
    
    This is a retry after we unify the CREATE TABLE syntax. It partially reverts d2bec5e
    
    This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which is an unnecessary behavior different with hive tables.
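    
    A sketch of the behavior with the legacy config turned off (table name is a placeholder):
    
    ```scala
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
    spark.sql("CREATE TABLE t_sketch (id INT, name STRING)")              // no USING / STORED AS clause
    spark.sql("DESCRIBE EXTENDED t_sketch").show(100, truncate = false)   // Provider is expected to be parquet
    ```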
    
    ### Why are the changes needed?
    
    Changing from Hive text table to native Parquet table has many benefits:
    1. be consistent with `DataFrameWriter.saveAsTable`.
    2. better performance
    3. better support for nested types (Hive text table doesn't work well with nested types, e.g. `insert into t values struct(null)` actually inserts a null value not `struct(null)` if `t` is a Hive text table, which leads to wrong result)
    4. better interoperability as Parquet is a more popular open file format.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No by default. If the config is set, the behavior change is described below:
    
    Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that work for Hive tables also work for native Parquet tables, with two exceptions: `ALTER TABLE SET [SERDE | SERDEPROPERTIES]` and `LOAD DATA`.
    
    char/varchar behavior has been taken care by #30412, and there is no behavior difference between data source and hive tables.
    
    One potential issue is `CREATE TABLE ... LOCATION ...` while users want to directly access the files later. It's more like a corner case and the legacy config should be good enough.
    
    Another potential issue is users may use Spark to create the table and then use Hive to add partitions with different serde. This is not allowed for Spark native tables.
    
    ### How was this patch tested?
    
    Re-enable the tests
    
    Closes #30554 from cloud-fan/create-table.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Dec 3, 2020
    0706e64
  10. [SPARK-33629][PYTHON] Make spark.buffer.size configuration visible on…

    … driver side
    
    ### What changes were proposed in this pull request?
    `spark.buffer.size` was not applied on the driver side in PySpark. In this PR I've fixed this issue.
    
    ### Why are the changes needed?
    Apply the mentioned config on driver side.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Existing unit tests + manually.
    
    Added the following code temporarily:
    ```
    def local_connect_and_auth(port, auth_secret):
    ...
                sock.connect(sa)
                print("SPARK_BUFFER_SIZE: %d" % int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) <- This is the addition
                sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536)))
    ...
    ```
    
    Test:
    ```
    #Compile Spark
    
    echo "spark.buffer.size 10000" >> conf/spark-defaults.conf
    
    $ ./bin/pyspark
    Python 3.8.5 (default, Jul 21 2020, 10:48:26)
    [Clang 11.0.3 (clang-1103.0.32.62)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    20/12/03 13:38:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    20/12/03 13:38:14 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
          /_/
    
    Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
    Spark context Web UI available at http://192.168.0.189:4040
    Spark context available as 'sc' (master = local[*], app id = local-1606999094506).
    SparkSession available as 'spark'.
    >>> sc.setLogLevel("TRACE")
    >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
    ...
    SPARK_BUFFER_SIZE: 10000
    ...
    [[0], [2], [3], [4], [6]]
    >>>
    ```
    
    Closes #30592 from gaborgsomogyi/SPARK-33629.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gaborgsomogyi authored and HyukjinKwon committed Dec 3, 2020
    bd71186
  11. [SPARK-33623][SQL] Add canDeleteWhere to SupportsDelete

    ### What changes were proposed in this pull request?
    
    This PR provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time.
    
    ### Why are the changes needed?
    
    The only way to support delete statements right now is to implement ``SupportsDelete``. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. like deleting a complete partition in a Hive table).
    
    This PR actually provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time instead of just getting an exception during execution. In the future, we can use this functionality to decide whether Spark should rewrite this delete and execute a distributed query or it can just pass a set of filters.
    
    Consider an example of a partitioned Hive table. If we have a delete predicate like `part_col = '2020'`, we can just drop the matching partition to satisfy this delete. In this case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. I consider this as a delete without significant effort. At the same time, if we have a delete predicate like `id = 10`, Hive tables would not be able to execute this delete using a metadata only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere` and we should use a more sophisticated row-level API to find out which records should be removed (the API is yet to be discussed, but we need this PR as a basis).
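    
    A hedged sketch of the idea (not the exact connector interface; the filter classes are from `org.apache.spark.sql.sources`): a source that only accepts cheap, partition-level deletes would report anything else as unsupported at planning time.
    
    ```scala
    import org.apache.spark.sql.sources.{EqualTo, Filter}
    
    trait PartitionOnlyDeletes {
      // Called at planning time: only a simple partition predicate is cheap to delete.
      def canDeleteWhere(filters: Array[Filter]): Boolean =
        filters.forall {
          case EqualTo("part_col", _) => true   // drop the matching partition
          case _                      => false  // would need a row-level rewrite
        }
    
      def deleteWhere(filters: Array[Filter]): Unit = {
        // metadata-only delete of the matching partitions goes here
      }
    }
    ```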
    
    If we decide to support subqueries and all delete use cases by simply extending the existing API, this will mean all data sources will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that Spark will execute a plan to find which records must be deleted for data sources that return `false` from `canDeleteWhere`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes but it is backward compatible.
    
    ### How was this patch tested?
    
    This PR comes with a new test.
    
    Closes #30562 from aokolnychyi/spark-33623.
    
    Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    aokolnychyi authored and dongjoon-hyun committed Dec 3, 2020
    aa13e20
  12. [SPARK-33634][SQL][TESTS] Use Analyzer in PlanResolutionSuite

    ### What changes were proposed in this pull request?
    
    Instead of using several analyzer rules, this PR uses the actual analyzer to run tests in `PlanResolutionSuite`.
    
    ### Why are the changes needed?
    
    Make the test suite match reality.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    test-only
    
    Closes #30574 from cloud-fan/test.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    cloud-fan authored and dongjoon-hyun committed Dec 3, 2020
    63f9d47

Commits on Dec 4, 2020

  1. [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/One…

    …VsRest Reader/Writer support Python backend estimator/evaluator
    
    ### What changes were proposed in this pull request?
    make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model
    
    ### Why are the changes needed?
    Currently, PySpark supports third-party libraries that define Python-backend estimators/evaluators, i.e., estimators that inherit from `Estimator` instead of `JavaEstimator`, and can only be used in PySpark.
    
    CrossValidator and TrainValidateSplit support tuning these Python-backend estimators,
    but cannot support saving/loading, because the CrossValidator and TrainValidateSplit writer implementation uses JavaMLWriter, which requires converting nested estimators and evaluators into Java instances.
    
    OneVsRest saving/loading currently only supports Java-backend classifiers due to a similar issue.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test.
    
    Closes #30471 from WeichenXu123/support_pyio_tuning.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    WeichenXu123 committed Dec 4, 2020
    7e759b2
  2. [SPARK-33650][SQL] Fix the error from ALTER TABLE .. ADD/DROP PARTITI…

    …ON for non-supported partition management table
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to change the order of post-analysis checks for the `ALTER TABLE .. ADD/DROP PARTITION` command, and perform the general check (does the table support partition management at all) before specific checks.
    
    ### Why are the changes needed?
    The error message for the table which doesn't support partition management can mislead users:
    ```java
    PartitionSpecs are not resolved;;
    'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false
    +- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable5d3ff859
    ```
    because it says nothing about the root cause of the issue.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the change, the error message will be:
    ```
    Table ns1.ns2.tbl can not alter partitions
    ```
    
    ### How was this patch tested?
    By running the affected test suite `AlterTablePartitionV2SQLSuite`.
    
    Closes #30594 from MaxGekk/check-order-AlterTablePartition.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 4, 2020
    8594958
  3. [SPARK-33649][SQL][DOC] Improve the doc of spark.sql.ansi.enabled

    ### What changes were proposed in this pull request?
    
    Improve the documentation of SQL configuration `spark.sql.ansi.enabled`
    
    ### Why are the changes needed?
    
    As there are more and more new features under the SQL configuration `spark.sql.ansi.enabled`, we should make it more clear about:
    1. what exactly it is
    2. where can users find all the features of the ANSI mode
    3. whether all the features are exactly from the SQL standard
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    It's just doc change.
    
    Closes #30593 from gengliangwang/reviseAnsiDoc.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    gengliangwang committed Dec 4, 2020
    29e415d
  4. [SPARK-32405][SQL][FOLLOWUP] Remove USING _ in CREATE TABLE in JDBCTa…

    …bleCatalog docker tests
    
    ### What changes were proposed in this pull request?
    remove USING _ in CREATE TABLE in JDBCTableCatalog docker tests
    
    ### Why are the changes needed?
    Previously the CREATE TABLE syntax forced users to specify a provider, so we had to add a `USING _`. Now that the problem has been fixed, we can remove it.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing tests
    
    Closes #30599 from huaxingao/remove_USING.
    
    Authored-by: Huaxin Gao <huaxing@us.ibm.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    huaxingao authored and cloud-fan committed Dec 4, 2020
    e22ddb6
  5. [SPARK-33142][SPARK-33647][SQL] Store SQL text for SQL temp view

    ### What changes were proposed in this pull request?
    Currently, in Spark, a temp view is saved as its analyzed logical plan, while a permanent view
    is kept in the HMS with its original SQL text. As a result, permanent and temporary views have
    different behaviors in some cases. In this PR we store the SQL text for temporary views in order
    to unify the behavior between permanent and temporary views.
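    
    An illustration of the intended user-visible effect (table and view names are placeholders):
    
    ```scala
    spark.range(3).write.saveAsTable("base_tbl")
    spark.sql("CREATE TEMPORARY VIEW v_sketch AS SELECT * FROM base_tbl")
    
    // Recreate the underlying table; since the view is now stored as SQL text,
    // it is re-analyzed on reference and follows the new table definition.
    spark.sql("DROP TABLE base_tbl")
    spark.range(10).write.saveAsTable("base_tbl")
    spark.sql("SELECT count(*) FROM v_sketch").show()  // expected: 10 after this change
    ```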
    
    ### Why are the changes needed?
    to unify the behavior between permanent and temporary views
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, with this PR, the temporary view will be re-analyzed when it's referenced. So if the
    underlying data source changes, the view will also be updated.
    
    ### How was this patch tested?
    existing and newly added test cases
    
    Closes #30567 from linhongliu-db/SPARK-33142.
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    linhongliu-db authored and cloud-fan committed Dec 4, 2020
    e02324f
  6. [SPARK-33430][SQL] Support namespaces in JDBC v2 Table Catalog

    ### What changes were proposed in this pull request?
    Add namespace support in the JDBC v2 Table Catalog by making `JDBCTableCatalog` extend `SupportsNamespaces`.
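    
    An illustrative sketch of the kind of namespace commands this enables against a JDBC v2 catalog (catalog name, JDBC URL and driver are placeholders; treat the exact class and option names as illustrative):
    
    ```scala
    spark.conf.set("spark.sql.catalog.h2",
      "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
    spark.conf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")
    
    spark.sql("CREATE NAMESPACE h2.ns1")
    spark.sql("SHOW NAMESPACES IN h2").show()
    ```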
    
    ### Why are the changes needed?
    Make the v2 JDBC implementation complete.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Add the following to `JDBCTableCatalog`:
    
    - listNamespaces
    - listNamespaces(String[] namespace)
    - namespaceExists(String[] namespace)
    - loadNamespaceMetadata(String[] namespace)
    - createNamespace
    - alterNamespace
    - dropNamespace
    
    ### How was this patch tested?
    Add new docker tests
    
    Closes #30473 from huaxingao/name_space.
    
    Authored-by: Huaxin Gao <huaxing@us.ibm.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    huaxingao authored and cloud-fan committed Dec 4, 2020
    15579ba
  7. [SPARK-33658][SQL] Suggest using Datetime conversion functions for in…

    …valid ANSI casting
    
    ### What changes were proposed in this pull request?
    
    Suggest users using Datetime conversion functions in the error message of invalid ANSI explicit casting.
    
    ### Why are the changes needed?
    
    In ANSI mode, explicit cast between DateTime types and Numeric types is not allowed.
    As of now, we have introduced new functions `UNIX_SECONDS`/`UNIX_MILLIS`/`UNIX_MICROS`/`UNIX_DATE`/`DATE_FROM_UNIX_DATE`, we can show suggestions to users so that they can complete these type conversions precisely and easily in ANSI mode.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, better error messages
    
    ### How was this patch tested?
    
    Unit test
    
    Closes #30603 from gengliangwang/improveErrorMsgOfExplicitCast.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gengliangwang authored and HyukjinKwon committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    e838066 View commit details
    Browse the repository at this point in the history
  8. [SPARK-33571][SQL][DOCS] Add a ref to INT96 config from the doc for `…

    …spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read`
    
    ### What changes were proposed in this pull request?
    For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by:
    1. Explicitly documenting which Parquet types those configs influence
    2. Referring to the corresponding configs for `INT96`
    
    ### Why are the changes needed?
    To avoid user confusion like that reported in SPARK-33571, and to make the config descriptions more precise.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./dev/scalastyle`.
    
    Closes #30596 from MaxGekk/clarify-rebase-docs.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    94c144b View commit details
    Browse the repository at this point in the history
  9. [SPARK-33577][SS] Add support for V1Table in stream writer table API …

    …and create table if not exist by default
    
    ### What changes were proposed in this pull request?
    After SPARK-32896, we have table API for stream writer but only support DataSource v2 tables. Here we add the following enhancements:
    
    - Create non-existing tables by default
    - Support both managed and external V1Tables
    
    ### Why are the changes needed?
    Make the API cover more use cases, especially file-provider-based tables.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, new features added.
    
    ### How was this patch tested?
    Add new UTs.
    
    Closes #30521 from xuanyuanking/SPARK-33577.
    
    Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    xuanyuanking authored and HeartSaVioR committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    325abf7 View commit details
    Browse the repository at this point in the history
  10. [SPARK-33656][TESTS] Add option to keep container after tests finish …

    …for DockerJDBCIntegrationSuites for debug
    
    ### What changes were proposed in this pull request?
    
    This PR adds an option to keep the container after DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, PostgresIntegrationSuite) finish.
    We can use this option by setting the system property `spark.test.docker.keepContainer` to `true`.
    
    ### Why are the changes needed?
    
    If an error occurs during the tests, it is useful to keep the container for debugging.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed that the container is kept after the test by the following commands.
    ```
    # With sbt
    $ build/sbt -Dspark.test.docker.keepContainer=true -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
    
    # With Maven
    $ build/mvn -Dspark.test.docker.keepContainer=true -Pdocker-integration-tests -Phive -Phive-thriftserver -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite test
    
    $ docker container ls
    ```
    
    I also confirmed that there are no regression for all the subclasses of `DockerJDBCIntegrationSuite` with sbt/Maven.
    * MariaDBKrbIntegrationSuite
    * DB2KrbIntegrationSuite
    * PostgresKrbIntegrationSuite
    * MySQLIntegrationSuite
    * PostgresIntegrationSuite
    * DB2IntegrationSuite
    * MsSqlServerIntegrationSuite
    * OracleIntegrationSuite
    * v2.MySQLIntegrationSuite
    * v2.PostgresIntegrationSuite
    * v2.DB2IntegrationSuite
    * v2.MsSqlServerIntegrationSuite
    * v2.OracleIntegrationSuite
    
    NOTE: `DB2IntegrationSuite`, `v2.DB2IntegrationSuite` and `DB2KrbIntegrationSuite` can fail because the connection timeout is too short. It's a separate issue and I'll fix it in #30583
    
    Closes #30601 from sarutak/keepContainer.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 4, 2020
    Commit 91baab7
  11. [SPARK-33640][TESTS] Extend connection timeout to DB server for DB2In…

    …tegrationSuite and its variants
    
    ### What changes were proposed in this pull request?
    
    This PR extends the connection timeout to the DB server for DB2IntegrationSuite and its variants.
    
    The container image ibmcom/db2 creates a database when it starts up.
    The database creation can take over 2 minutes.
    
    DB2IntegrationSuite and its variants use the container image but the connection timeout is set to 2 minutes so these suites almost always fail.
    ### Why are the changes needed?
    
    To pass those suites.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed the suites pass with the following commands.
    ```
    $ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.DB2IntegrationSuite"
    $ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.v2.DB2IntegrationSuite"
    $ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite"
    ```

    Closes #30583 from sarutak/extend-timeout-for-db2.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 4, 2020
    Commit 976e897
  12. [SPARK-27237][SS] Introduce State schema validation among query restart

    ## What changes were proposed in this pull request?
    
    Please refer the description of [SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) to see rationalization of this patch.
    
    This patch proposes to introduce state schema validation: it stores the key schema and value schema to a `schema` file (on the first run) and verifies that the new key and value schemas for state are compatible with the existing ones. To be clear about the definition of "compatible": a state schema is "compatible" when the number of fields is the same and the data type of each field is the same; Spark has been allowing fields to be renamed.

    This patch will prevent running a query that has an incompatible state schema, which reduces the chance of nondeterministic behavior (renaming a field can also be a smell of semantic incompatibility, but end users could simply rename a field, so we cannot tell), and it also provides a more informative error message.
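
    A minimal sketch of the compatibility rule described above (a hypothetical helper, not the code added by this patch):

    ```scala
    import org.apache.spark.sql.types.StructType

    // Two state schemas are "compatible" when they have the same number of fields and
    // identical data types field-by-field; field names may differ because renames are allowed.
    def isCompatible(existing: StructType, current: StructType): Boolean =
      existing.length == current.length &&
        existing.fields.zip(current.fields).forall { case (e, c) => e.dataType == c.dataType }
    ```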
    
    ## How was this patch tested?
    
    Added UTs.
    
    Closes #24173 from HeartSaVioR/SPARK-27237.
    
    Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Dec 4, 2020
    Commit 233a849
  13. [SPARK-33615][K8S] Make 'spark.archives' work in Kubernetes

    ### What changes were proposed in this pull request?
    
    This PR proposes to make the `spark.archives` configuration work in Kubernetes.
    It works without a problem in a standalone cluster, but there seems to be a bug in Kubernetes:
    it fails to fetch the file on the driver side, as shown below:
    
    ```
    20/12/03 13:33:53 INFO SparkContext: Added JAR file:/tmp/spark-75004286-c83a-4369-b624-14c5d2d2a748/spark-examples_2.12-3.1.0-SNAPSHOT.jar at spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar with timestamp 1607002432558
    20/12/03 13:33:53 INFO SparkContext: Added archive file:///tmp/tmp4542734800151332666.txt.tar.gz#test_tar_gz at spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/files/tmp4542734800151332666.txt.tar.gz with timestamp 1607002432558
    20/12/03 13:33:53 INFO TransportClientFactory: Successfully created connection to spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc/172.17.0.4:7078 after 83 ms (47 ms spent in bootstraps)
    20/12/03 13:33:53 INFO Utils: Fetching spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/files/tmp4542734800151332666.txt.tar.gz to /tmp/spark-66573e24-27a3-427c-99f4-36f06d9e9cd5/fetchFileTemp2665785666227461849.tmp
    20/12/03 13:33:53 ERROR SparkContext: Error initializing SparkContext.
    java.lang.RuntimeException: Stream '/files/tmp4542734800151332666.txt.tar.gz' was not found.
    	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:242)
    	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
    	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    ```
    
    This is because `spark.archives` was not actually added on the driver side correctly. The changes here fix it by adding and resolving URIs correctly.
    
    ### Why are the changes needed?
    
    The `spark.archives` feature can be leveraged for many things, such as Conda support. We should make it work in Kubernetes as well.
    This is a bug fix too.
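
    A sketch of how the option is used once this fix is in place (archive path and alias are illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    // The archive is downloaded and extracted into the working directory
    // under the alias given after '#', e.g. ./environment.
    val spark = SparkSession.builder()
      .config("spark.archives", "hdfs:///tmp/conda_env.tar.gz#environment")
      .getOrCreate()
    ```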
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this feature is not out yet.
    
    ### How was this patch tested?
    
    I manually tested with Minikube 1.15.1. Due to an environment issue, I had to use a custom namespace, service account, and roles. The `default` service account does not work for me and complains that it doesn't have permission to get/list pods, etc.
    
    ```bash
    minikube delete
    minikube start --cpus 12 --memory 16384
    kubectl create namespace spark-integration-test
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: spark-integration-test
    EOF
    kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
    dev/make-distribution.sh --pip --tgz -Pkubernetes
    resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.1.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test
    ```
    
    Closes #30581 from HyukjinKwon/SPARK-33615.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Dec 4, 2020
    Commit 990bee9
  14. [SPARK-33141][SQL][FOLLOW-UP] Store the max nested view depth in Anal…

    …ysisContext
    
    ### What changes were proposed in this pull request?
    
    This is a followup of #30289. It removes the hack in `View.effectiveSQLConf` by putting the max nested view depth in `AnalysisContext`, so we no longer get the max nested view depth from the active SQLConf, which keeps changing during nested view resolution.
    
    ### Why are the changes needed?
    
    remove hacks.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    If I just remove the hack, `SimpleSQLViewSuite.restrict the nested level of a view` fails. With this fix, it passes again.
    
    Closes #30575 from cloud-fan/view.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Dec 4, 2020
    Commit acc211d
  15. [SPARK-33660][DOCS][SS] Fix Kafka Headers Documentation

    ### What changes were proposed in this pull request?
    
    Update the Kafka headers documentation: the type is no longer a map but an array.
    
    [jira](https://issues.apache.org/jira/browse/SPARK-33660)
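
    For context, a sketch of what the corrected documentation describes (broker and topic names are illustrative):

    ```scala
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("includeHeaders", "true")
      .load()

    // The headers column is ARRAY<STRUCT<key: STRING, value: BINARY>>, not a map,
    // because Kafka allows repeated header keys.
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers").printSchema()
    ```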
    
    ### Why are the changes needed?
    To help users
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    
    It is only documentation
    
    Closes #30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.
    
    Authored-by: german <germanschiavon@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Gschiavon authored and HeartSaVioR committed Dec 4, 2020
    Commit d671e05
  16. [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT

    ### What changes were proposed in this pull request?
    
    This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.
    
    ### Why are the changes needed?
    
    Start to prepare Apache Spark 3.2.0.
    
    ### Does this PR introduce _any_ user-facing change?
    
    N/A.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes #30606 from dongjoon-hyun/SPARK-3.2.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Dec 4, 2020
    Commit de9818f
  17. [SPARK-33141][SQL][FOLLOW-UP] Fix Scala 2.13 compilation

    ### What changes were proposed in this pull request?
    
    This PR aims to fix Scala 2.13 compilation.
    
    ### Why are the changes needed?
    
    To recover Scala 2.13.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GitHub Action Scala 2.13 build job.
    
    Closes #30611 from dongjoon-hyun/SPARK-33141.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Dec 4, 2020
    Commit b6b45bc
  18. [SPARK-33472][SQL][FOLLOW-UP] Update RemoveRedundantSorts comment

    ### What changes were proposed in this pull request?
    This PR is a follow-up for #30373 that updates the comment for RemoveRedundantSorts in QueryExecution.
    
    ### Why are the changes needed?
    To update an incorrect comment.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    N/A
    
    Closes #30584 from allisonwang-db/spark-33472-followup.
    
    Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    allisonwang-db authored and dongjoon-hyun committed Dec 4, 2020
    Commit 960d6af

Commits on Dec 5, 2020

  1. [SPARK-33651][SQL] Allow CREATE EXTERNAL TABLE with LOCATION for data…

    … source tables
    
    ### What changes were proposed in this pull request?
    
    This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that v2 session catalog can overwrite it.
    
    ### Why are the changes needed?
    
    It's an unnecessary behavior difference that a Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while data source tables don't allow `CREATE EXTERNAL TABLE` at all.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed.
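
    A sketch of the now-accepted statement (table name and path are illustrative):

    ```scala
    spark.sql("""
      CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING)
      USING parquet
      LOCATION '/tmp/ext_logs'
    """)
    ```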
    
    ### How was this patch tested?
    
    new tests
    
    Closes #30595 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    cloud-fan authored and dongjoon-hyun committed Dec 5, 2020
    Commit 1b4e35d

Commits on Dec 6, 2020

  1. [MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataC…

    …onsumer.scala
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix broken string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.
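
    For illustration only (not the patched code itself), the typical form of this bug is a message built without the `s` interpolator, so the placeholder is printed literally:

    ```scala
    val name = "t2"

    // Missing interpolator: the text "$name" appears verbatim in the log message.
    println("Exception when attempting to uncache $name")

    // Fixed: the variable is substituted.
    println(s"Exception when attempting to uncache $name")
    ```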
    
    ### Why are the changes needed?
    
    To fix a string interpolation bug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the string will be correctly constructed.
    
    ### How was this patch tested?
    
    Existing tests since they were used in exception/log messages.
    
    Closes #30609 from imback82/fix_cache_str_interporlation.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    imback82 authored and HyukjinKwon committed Dec 6, 2020
    Commit 154f604
  2. [SPARK-33668][K8S][TEST] Fix flaky test "Verify logging configuration…

    … is picked from the provided
    
    ### What changes were proposed in this pull request?
    Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties."
    The test is flaky, with multiple failed instances; the reason for the failure has been similar to:
    
    ```
    
    The code passed to eventually never returned normally. Attempted 109 times over 3.0079882413999997 minutes. Last failure message: Failure executing: GET at:
    https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false. Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).. (KubernetesSuite.scala:402)
    
    ```
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console
    From the above failures, it seems that the executor finishes too quickly and is removed by Spark before the test can complete.
    So, in order to mitigate this situation, one way is to turn on the flag
       "spark.kubernetes.executor.deleteOnTermination"
    
    ### Why are the changes needed?
    
    Fixes a flaky test.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests.
    A few runs of the Jenkins integration test may reveal whether the problem is resolved.
    
    Closes #30616 from ScrapCodes/SPARK-33668/fix-flaky-k8s-integration-test.
    
    Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    ScrapCodes authored and dongjoon-hyun committed Dec 6, 2020
    Commit 6317ba2
  3. [SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache

    ### What changes were proposed in this pull request?
    
    This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`.
    
    ### Why are the changes needed?
    
    Currently DSv2 delete from table doesn't refresh caches. This could lead to correctness issues if the stale cache is queried later.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Now delete from table in v2 also refreshes cache.
    
    ### How was this patch tested?
    
    Added a test case.
    
    Closes #30597 from sunchao/SPARK-33652.
    
    Authored-by: Chao Sun <sunchao@apple.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sunchao authored and dongjoon-hyun committed Dec 6, 2020
    Commit e857e06
  4. [SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentati…

    …on style
    
    ### What changes were proposed in this pull request?
    
    This PR adds a few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We all completed the migration to NumPy documentation style in SPARK-32085.
    
    Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work.
    
    ### Why are the changes needed?
    
    To tell developers that PySpark now follows NumPy documentation style.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it's a change in unreleased branches yet.
    
    ### How was this patch tested?
    
    Manually tested via `make clean html` under `python/docs`:
    
    ![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)
    
    Closes #30622 from HyukjinKwon/SPARK-33256.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    HyukjinKwon authored and dongjoon-hyun committed Dec 6, 2020
    Commit 5250841
  5. [SPARK-33667][SQL] Respect the spark.sql.caseSensitive config while…

    … resolving partition spec in v1 `SHOW PARTITIONS`
    
    ### What changes were proposed in this pull request?
    Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t the case sensitivity flag  **spark.sql.caseSensitive**.
    
    ### Why are the changes needed?
    V1 SHOW PARTITIONS is in fact case sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
    ```sql
    spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
             > USING parquet
             > PARTITIONED BY (year, month);
    spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
    spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
    Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS;
    ```
    The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the command above works as expected:
    ```sql
    spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
    year=2015/month=1
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    - `v1/ShowPartitionsSuite`
    - `v2/ShowPartitionsSuite`
    
    Closes #30615 from MaxGekk/show-partitions-case-sensitivity-test.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 6, 2020
    Commit 4829781
  6. [SPARK-33674][TEST] Show Slowpoke notifications in SBT tests

    ### What changes were proposed in this pull request?
    This PR is to show Slowpoke notifications in the log when running tests using SBT.
    
    For example, the test case "zero sized blocks" in ExternalShuffleServiceSuite enters an infinite loop. After this change, the log file will have a notification message every 5 minutes when a test case runs longer than two minutes. Below is an example message.
    
    ```
    [info] ExternalShuffleServiceSuite:
    [info] - groupByKey without compression (101 milliseconds)
    [info] - shuffle non-zero block size (3 seconds, 186 milliseconds)
    [info] - shuffle serializer (3 seconds, 189 milliseconds)
    [info] *** Test still running after 2 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
    [info] *** Test still running after 7 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
    [info] *** Test still running after 12 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
    [info] *** Test still running after 17 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
    ```
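
    The notifications come from ScalaTest's slowpoke detector; a minimal sbt sketch of enabling it is below (the 120-second delay and 300-second period are assumptions chosen to match the behavior described above, not necessarily Spark's exact settings):

    ```scala
    // build.sbt (sketch): print a note for any test running longer than 120 seconds,
    // and repeat the note every 300 seconds until the test finishes.
    Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-W", "120", "300")
    ```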
    
    ### Why are the changes needed?
    When the tests or code have a bug and enter an infinite loop, it is hard to tell from the log which test cases hit issues, especially when we are running the tests in parallel. It would be nice to show the Slowpoke notifications.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manual testing in my local dev environment.
    
    Closes #30621 from gatorsmile/addSlowpoke.
    
    Authored-by: Xiao Li <gatorsmile@gmail.com>
    Signed-off-by: Yuming Wang <yumwang@ebay.com>
    gatorsmile authored and wangyum committed Dec 6, 2020
    Commit b94ecf0

Commits on Dec 7, 2020

  1. [SPARK-33663][SQL] Uncaching should not be called on non-existing tem…

    …p views
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix a misleading log in the following scenario, where uncaching is called on a non-existing view:
    ```
    scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
    res0: org.apache.spark.sql.DataFrame = []
    
    scala> val df = spark.table("table")
    df: org.apache.spark.sql.DataFrame = [2: int]
    
    scala> df.createOrReplaceTempView("t2")
    20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name
    org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
    'UnresolvedRelation [t2], [], false
    
    	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
    	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
    	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
    	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
    	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
    	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
    	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
    	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
    	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
    	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
    	at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
    	at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
    	at org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
    	at org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
    	at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
    ```
    Since `t2` does not exist yet, it shouldn't try to uncache.
    
    ### Why are the changes needed?
    
    To fix misleading message.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the above message will not be displayed if the view doesn't exist yet.
    
    ### How was this patch tested?
    
    Manually tested since this is a log message printed.
    
    Closes #30608 from imback82/fix_cache_message.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    imback82 authored and HyukjinKwon committed Dec 7, 2020
    Commit 119539f
  2. [SPARK-33675][INFRA] Add GitHub Action job to publish snapshot

    ### What changes were proposed in this pull request?
    
    This PR aims to add a `GitHub Action` job to publish a daily snapshot for the **master** branch.
    - https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.2.0-SNAPSHOT/
    
    For the other branches, I'll make adjusted backports.
    - For `branch-3.1`, we can specify the checkout `ref` to `branch-3.1`.
    - For `branch-2.4` and `branch-3.0`, we can publish at every commit since the traffic is low.
      - #30630 (branch-3.0)
      - #30629 (branch-2.4 LTS)
    
    ### Why are the changes needed?
    
    After this series of jobs, we will permanently reduce our maintenance burden on AmpLab Jenkins by removing the following completely.
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/
    
    For now, AmpLab Jenkins doesn't have a job for `branch-3.1`. We can do it by ourselves by `GitHub Action`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The snapshot publishing is tested here via a PR trigger. Since this PR adds a scheduled job, we cannot test it in this PR.
    - https://github.com/dongjoon-hyun/spark/runs/1505792859
    
    Apache Infra team finished the setup here.
    - https://issues.apache.org/jira/browse/INFRA-21167
    
    Closes #30623 from dongjoon-hyun/SPARK-33675.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 7, 2020
    Commit e32de29
  3. [SPARK-33670][SQL] Verify the partition provider is Hive in v1 SHOW T…

    …ABLE EXTENDED
    
    ### What changes were proposed in this pull request?
    Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified.
    
    This PR is some kind of follow up #16373 and #15515.
    
    ### Why are the changes needed?
    To output a user-friendly error with a recommendation like
    **"
    ... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`
    "**
    instead of silently outputting an empty result.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    
    ### How was this patch tested?
    By running the affected test suites, in particular:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
    ```
    
    Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 7, 2020
    Commit 29096a8
  4. [SPARK-33683][INFRA] Remove -Djava.version=11 from Scala 2.13 build i…

    …n GitHub Actions
    
    ### What changes were proposed in this pull request?
    
    This PR removes `-Djava.version=11` from the build command for Scala 2.13 in the GitHub Actions' job.
    
    In the GitHub Actions' job, the build command for Scala 2.13 is defined as follows.
    ```
    ./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Djava.version=11 -Pscala-2.13 compile test:compile
    ```
    
    However, the Scala 2.13 build uses Java 8 rather than 11, so let's remove `-Djava.version=11`.
    
    ### Why are the changes needed?
    
    To build with consistent configuration.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Should be done by GitHub Actions' workflow.
    
    Closes #30633 from sarutak/scala-213-java11.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 7, 2020
    Commit e88f0d4
  5. [SPARK-33680][SQL][TESTS] Fix PrunePartitionSuiteBase/BucketedReadWit…

    …hHiveSupportSuite not to depend on the default conf
    
    ### What changes were proposed in this pull request?
    
    This PR updates `PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite` to set the required conf explicitly.
    
    ### Why are the changes needed?
    
    The unit test should not depend on the default configurations.
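
    A sketch of the pattern, using `withSQLConf` from Spark's SQL test helper traits; the conf shown is an assumption based on the related AQE default change:

    ```scala
    import org.apache.spark.sql.internal.SQLConf

    // Pin the conf the assertions rely on instead of assuming the session default.
    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
      // ... assertions that only hold when adaptive query execution is disabled ...
    }
    ```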
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    According to #30628, these seem to be the only ones.
    
    Pass the CIs.
    
    Closes #30631 from dongjoon-hyun/SPARK-CONF-AGNO.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Dec 7, 2020
    Commit 73412ff
  6. [SPARK-33684][BUILD] Upgrade httpclient from 4.5.6 to 4.5.13

    ### What changes were proposed in this pull request?
    
    This PR upgrades `commons.httpclient` from `4.5.6` to `4.5.13`.
    4.5.6 was released over 2 years ago, and now we can use the more stable `4.5.13`.
    https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
    
    ### Why are the changes needed?
    
    To follow the more stable release.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Should be done by the existing tests.
    
    Closes #30634 from sarutak/upgrade-httpclient.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 7, 2020
    Commit d48ef34
  7. [SPARK-33671][SQL] Remove VIEW checks from V1 table commands

    ### What changes were proposed in this pull request?
    Remove VIEW checks from the following V1 commands:
    - `SHOW PARTITIONS`
    - `TRUNCATE TABLE`
    - `LOAD DATA`
    
    The checks are performed earlier at:
    https://github.com/apache/spark/blob/acc211d2cf0e6ab94f6578e1eb488766fd20fa4e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L885-L889
    
    ### Why are the changes needed?
    To improve code maintenance and remove dead code.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By existing test suites like `v1/ShowPartitionsSuite`.
    
    1. LOAD DATA:
    https://github.com/apache/spark/blob/acc211d2cf0e6ab94f6578e1eb488766fd20fa4e/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala#L176-L179
    2. TRUNCATE TABLE:
    https://github.com/apache/spark/blob/acc211d2cf0e6ab94f6578e1eb488766fd20fa4e/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala#L180-L183
    3. SHOW PARTITIONS:
    - v1/ShowPartitionsSuite
    
    Closes #30620 from MaxGekk/show-table-check-view.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 7, 2020
    Commit 87c0560
  8. [SPARK-33676][SQL] Require exact matching of partition spec to the sc…

    …hema in V2 `ALTER TABLE .. ADD/DROP PARTITION`
    
    ### What changes were proposed in this pull request?
    Check that partitions specs passed to v2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match to the partition schema (all partition fields from the schema are specified in partition specs).
    
    ### Why are the changes needed?
    1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION`, which outputs the error:
    ```sql
    spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b);
    spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9');
    Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`';
    ```
    2. To prevent future errors caused by not fully specified partition specs.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` outputs the same error as the V1 commands.
    
    ### How was this patch tested?
    By running the test suite with new UT:
    ```
    $ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
    ```
    
    Closes #30624 from MaxGekk/add-partition-full-spec.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Dec 7, 2020
    Commit 26c0493
  9. [SPARK-33617][SQL] Add default parallelism configuration for Spark SQ…

    …L queries
    
    ### What changes were proposed in this pull request?
    
    This PR adds a default parallelism configuration (`spark.sql.default.parallelism`) for Spark SQL and makes it effective for `LocalTableScan`.
    
    ### Why are the changes needed?
    
    Avoid generating small files for INSERT INTO TABLE from VALUES, for example:
    ```sql
    CREATE TABLE t1(id int) USING parquet;
    INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8);
    ```
    
    Before this pr:
    ```
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root 421 Dec  1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
    -rw-r--r-- 1 root root   0 Dec  1 01:54 _SUCCESS
    ```
    
    After this PR, with `spark.sql.files.minPartitionNum` set to 1:
    ```
    -rw-r--r-- 1 root root 452 Dec  1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet
    -rw-r--r-- 1 root root   0 Dec  1 01:59 _SUCCESS
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #30559 from wangyum/SPARK-33617.
    
    Lead-authored-by: Yuming Wang <yumwang@ebay.com>
    Co-authored-by: Yuming Wang <yumwang@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Dec 7, 2020
    Commit 1e0c006
  10. [SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query

    ### What changes were proposed in this pull request?
    The analyzer rule `PreprocessTableCreation` preprocesses table-creation-related logical plans. But for
    CTAS, if the sub-query can't be resolved, preprocessing it will cause "Invalid call to toAttribute on unresolved
    object" (instead of a user-friendly error message: "table or view not found").
    This PR fixes this incorrect preprocessing for CTAS using the V2 catalog.
    
    ### Why are the changes needed?
    bug fix
    
    ### Does this PR introduce _any_ user-facing change?
    The error message for CTAS with a non-exists table changed from:
    `UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to
    `AnalysisException: Table or view not found: xxx`
    
    ### How was this patch tested?
    added test
    
    Closes #30637 from linhongliu-db/fix-ctas.
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    linhongliu-db authored and cloud-fan committed Dec 7, 2020
    Commit d730b6b
  11. [SPARK-33641][SQL] Invalidate new char/varchar types in public APIs t…

    …hat produce incorrect results
    
    ### What changes were proposed in this pull request?
    
    In this PR, we propose to narrow the use cases of the char/varchar data types, which are invalid now or will become invalid later.
    
    ### Why are the changes needed?
    1. udf
    ```scala
    scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))
    
    scala> spark.sql("select abcd()").show
    scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
      at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
      at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
      at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
      at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
      at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
      at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
      at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
      at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
      at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
      at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
      at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
      at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
      at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
      at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
      at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
      ... 47 elided
    ```
    
    2. spark.createDataframe
    
    ```
    scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
    +--------------------+
    |                   c|
    +--------------------+
    |      # Apache Spark|
    |                    |
    |Spark is a unifie...|
    |high-level APIs i...|
    |supports general ...|
    |rich set of highe...|
    |MLlib for machine...|
    |and Structured St...|
    |                    |
    |<https://spark.ap...|
    |                    |
    |[![Jenkins Build]...|
    |[![AppVeyor Build...|
    |[![PySpark Covera...|
    |                    |
    |                    |
    ```
    
    3. reader.schema
    
    ```
    scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
    +--------------------+
    |                   a|
    +--------------------+
    |      # Apache Spark|
    |                    |
    |Spark is a unifie...|
    |high-level APIs i...|
    |supports general ...|
    ```
    4. etc
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, we intend to avoid potential breaking changes.
    
    ### How was this patch tested?
    
    new tests
    
    Closes #30586 from yaooqinn/SPARK-33641.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    yaooqinn authored and cloud-fan committed Dec 7, 2020
    Commit da72b87
  12. [MINOR] Spelling sql not core

    ### What changes were proposed in this pull request?
    
    This PR intends to fix typos in the sub-modules:
    * `sql/catalyst`
    * `sql/hive-thriftserver`
    * `sql/hive`
    
    Split per srowen #30323 (comment)
    
    NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356
    
    ### Why are the changes needed?
    
    Misspelled words make it harder to read / understand content.
    
    ### Does this PR introduce _any_ user-facing change?
    
    There are various fixes to documentation, etc...
    
    ### How was this patch tested?
    
    No testing was performed
    
    Closes #30532 from jsoref/spelling-sql-not-core.
    
    Authored-by: Josh Soref <jsoref@users.noreply.github.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    jsoref authored and srowen committed Dec 7, 2020
    Commit c62b84a
  13. [SPARK-33693][SQL] deprecate spark.sql.hive.convertCTAS

    ### What changes were proposed in this pull request?
    
    This is a followup of #30554 . Now we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.
    
    ### Why are the changes needed?
    
    It's confusing to have two configs when one can completely cover the other.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it's a deprecation, not a removal.
    
    ### How was this patch tested?
    
    N/A
    
    Closes #30651 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    cloud-fan authored and dongjoon-hyun committed Dec 7, 2020
    Commit 6aff215
  14. [SPARK-33480][SQL][FOLLOWUP] do not expose user data in error message

    ### What changes were proposed in this pull request?
    
    This is a followup of #30412. This PR updates the error message of the char/varchar table insertion length check so that it does not expose user data.
    
    ### Why are the changes needed?
    
    It is risky to expose user data in the error message, especially string data, as it may contain sensitive information.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    updated tests
    
    Closes #30653 from cloud-fan/minor2.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    cloud-fan authored and dongjoon-hyun committed Dec 7, 2020
    Commit c0874ba
  15. [SPARK-33621][SQL] Add a way to inject data source rewrite rules

    ### What changes were proposed in this pull request?
    
    This PR adds a way to inject data source rewrite rules.
    
    ### Why are the changes needed?
    
    Right now `SparkSessionExtensions` allows us to inject optimization rules, but they are added to the operator optimization batch. There are cases where users need to run rules after the operator optimization batch (e.g., when a rule relies on the fact that expressions have been optimized). Currently, this is not possible.
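
    For context, a sketch of the general extension-injection pattern, using the existing `injectOptimizerRule` hook and a hypothetical no-op rule; the new injection point added by this PR is registered the same way but runs after the operator optimization batch:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical rule used only to show the wiring; it rewrites nothing.
    case class MyRewriteRule(spark: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder()
      .master("local[*]")
      .withExtensions(_.injectOptimizerRule(MyRewriteRule))
      .getOrCreate()
    ```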
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    This PR comes with a new test.
    
    Closes #30577 from aokolnychyi/spark-33621-v3.
    
    Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    aokolnychyi authored and dongjoon-hyun committed Dec 7, 2020
    Commit 02508b6

Commits on Dec 8, 2020

  1. [SPARK-32320][PYSPARK] Remove mutable default arguments

    This is bad practice, and might lead to unexpected behaviour:
    https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/
    
    ```
    fokkodriesprongFan spark % grep -R "={}" python | grep def
    
    python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
    python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
    python/pyspark/sql/functions.py:def to_json(col, options={}):
    python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
    python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
    python/pyspark/sql/functions.py:def to_csv(col, options={}):
    python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
    python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):
    ```
    
    ```
    fokkodriesprongFan spark % grep -R "=\[\]" python | grep def
    python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
    python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):
    ```
    
    ### What changes were proposed in this pull request?
    
    Removing the mutable default arguments.
    
    ### Why are the changes needed?
    
    Removing the mutable default arguments, and changing the signature to `Optional[...]`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No 👍
    
    ### How was this patch tested?
    
    Using the Flake8 bugbear code analysis plugin.
    
    Closes #29122 from Fokko/SPARK-32320.
    
    Authored-by: Fokko Driesprong <fokko@apache.org>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    Fokko authored and zhengruifeng committed Dec 8, 2020
    Commit e4d1c10
  2. [SPARK-33680][SQL][TESTS][FOLLOWUP] Fix more test suites to have expl…

    …icit confs
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up for SPARK-33680 to remove the assumption on the default value of `spark.sql.adaptive.enabled`.
    
    ### Why are the changes needed?
    
    According to the test result #30628 (comment), the [previous run](#30628 (comment)) didn't run all tests.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes #30655 from dongjoon-hyun/SPARK-33680.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Dec 8, 2020
    Commit b2a7930
  3. [SPARK-33609][ML] word2vec reduce broadcast size

    ### What changes were proposed in this pull request?
    1. Directly use float vectors instead of converting to double vectors; this is about 2x faster than using vec.axpy.
    2. Mark `wordList` and `wordVecNorms` lazy.
    3. Avoid slicing in the computation of `wordVecNorms`.
    
    ### Why are the changes needed?
    halve broadcast size
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    existing testsuites
    
    Closes #30548 from zhengruifeng/w2v_float32_transform.
    
    Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
    Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    zhengruifeng committed Dec 8, 2020
    Commit ebd8b93
  4. [SPARK-33698][BUILD][TESTS] Fix the build error of OracleIntegrationS…

    …uite for Scala 2.13
    
    ### What changes were proposed in this pull request?
    
    This PR fixes a build error of `OracleIntegrationSuite` with Scala 2.13.
    
    ### Why are the changes needed?
    
    Build should pass with Scala 2.13.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed that the build pass with the following command.
    ```
    $ build/sbt -Pdocker-integration-tests -Pscala-2.13  "docker-integration-tests/test:compile"
    ```
    
    Closes #30660 from sarutak/fix-docker-integration-tests-for-scala-2.13.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 8, 2020
    Commit 8bcebfa
  5. [SPARK-33664][SQL] Migrate ALTER TABLE ... RENAME TO to use Unresolve…

    …dTableOrView to resolve identifier
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to migrate `ALTER [TABLE|VIEW] ... RENAME TO` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
    
    ### Why are the changes needed?
    
    To use `UnresolvedTableOrView` for table/view resolution. Note that `AlterTableRenameCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Updated existing tests.
    
    Closes #30610 from imback82/rename_v2.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Dec 8, 2020
    Commit 5aefc49
  6. [MINOR][INFRA] Add -Pdocker-integration-tests to GitHub Action Scala …

    …2.13 build job
    
    ### What changes were proposed in this pull request?
    
    This aims to add `-Pdocker-integration-tests` to the GitHub Actions job for Scala 2.13 compilation.
    
    ### Why are the changes needed?
    
    We fixed Scala 2.13 compilation of this module in #30660. This PR will prevent accidental regressions in that module.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GitHub Action Scala 2.13 job.
    
    Closes #30661 from dongjoon-hyun/SPARK-DOCKER-IT.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    dongjoon-hyun authored and sarutak committed Dec 8, 2020
    Commit 3a6546d
  7. [SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default

    ### What changes were proposed in this pull request?
    
    This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark **3.2.0**.
    
    ### Why are the changes needed?
    
    By switching the default for Apache Spark 3.2, the whole community can focus more seriously on stabilizing this feature in various situations.
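
    A sketch for opting out per session if the previous behavior is needed:

    ```scala
    // The default is now enabled; it can still be turned off explicitly.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    ```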
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but this is an improvement and it is not expected to introduce any bugs.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes #30628 from dongjoon-hyun/SPARK-33679.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Dec 8, 2020
    Commit 031c5ef
  8. [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
    
    ### What changes were proposed in this pull request?
    The `LikeSimplification` rule does not work correctly in many cases where the pattern contains escape characters, for example:
    
    `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
    `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`
    
    For simplicity, this PR simply skips this rule if `pattern` contains any `escapeChar`.
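
    A hedged sketch of why simplification is unsafe here (the expected results are shown as comments; the literals are illustrative):

    ```
    -- With ESCAPE '%', the '%' in the pattern escapes the following 'a',
    -- so the pattern matches only the literal string 'maca'.
    SELECT 'maca'  LIKE 'm%aca' ESCAPE '%';  -- expected: true
    SELECT 'mxaca' LIKE 'm%aca' ESCAPE '%';  -- expected: false; a prefix/suffix
                                             -- simplification that treated '%'
                                             -- as a wildcard would wrongly
                                             -- return true
    ```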
    
    ### Why are the changes needed?
    Without this fix, the rule can produce incorrect results.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added a unit test.
    
    Closes #30625 from luluorta/SPARK-33677.
    
    Authored-by: luluorta <luluorta@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    luluorta authored and maropu committed Dec 8, 2020
    Commit 99613cd
  9. [SPARK-33688][SQL] Migrate SHOW TABLE EXTENDED to new resolution framework
    
    ### What changes were proposed in this pull request?
    1. Remove the old statement `ShowTableStatement`.
    2. Introduce the new command `ShowTableExtended` for `SHOW TABLE EXTENDED`.
    
    This PR is the first step of the new V2 implementation of `SHOW TABLE EXTENDED`; see SPARK-33393.
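
    For reference, a minimal sketch of the statement forms that now parse into `ShowTableExtended` (the database name, pattern, and partition spec are hypothetical):

    ```
    SHOW TABLE EXTENDED LIKE 'tbl*';
    SHOW TABLE EXTENDED IN db1 LIKE 'tbl*' PARTITION (p=1);
    ```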
    
    ### Why are the changes needed?
    This is part of the effort to make the relation lookup behavior consistent: SPARK-29900.
    
    ### Does this PR introduce _any_ user-facing change?
    The changes should not affect V1 tables. For V2 tables, Spark outputs the error:
    ```
    SHOW TABLE EXTENDED is not supported for v2 tables.
    ```
    
    ### How was this patch tested?
    By running `SHOW TABLE EXTENDED` tests:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
    ```
    
    Closes #30645 from MaxGekk/show-table-extended-statement.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Dec 8, 2020
    Commit 2b30dde
  10. [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier
    
    ### What changes were proposed in this pull request?
    
    This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier.
    
    This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
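
    For illustration, a minimal sketch of the affected statement (the view name is hypothetical):

    ```
    -- The identifier below is now resolved via UnresolvedView; a temporary
    -- view named v1 would be matched before a permanent view with that name.
    DROP VIEW IF EXISTS v1;
    ```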
    
    ### Why are the changes needed?
    
    To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Updated existing tests.
    
    Closes #30636 from imback82/drop_view_v2.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Dec 8, 2020
    Commit c05ee06
  11. [MINOR] Spelling sql/core

    ### What changes were proposed in this pull request?
    
    This PR intends to fix typos in the sub-modules:
    * `sql/core`
    
    Split per srowen #30323 (comment)
    
    NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356
    
    ### Why are the changes needed?
    
    Misspelled words make it harder to read / understand content.
    
    ### Does this PR introduce _any_ user-facing change?
    
    There are various fixes to documentation, etc...
    
    ### How was this patch tested?
    
    No testing was performed
    
    Closes #30531 from jsoref/spelling-sql-core.
    
    Authored-by: Josh Soref <jsoref@users.noreply.github.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    jsoref authored and srowen committed Dec 8, 2020
    Commit a093d6f