[SPARK-36496][SQL] Remove literals from grouping expressions when using the DataFrame withColumn API#33723

Closed
tanelk wants to merge 654 commits into apache:master from tanelk:SPARK-36496_remove_grouping_literals

Conversation

@tanelk
Contributor

@tanelk tanelk commented Aug 12, 2021

What changes were proposed in this pull request?

Move the `RemoveLiteralFromGroupExpressions` and `RemoveRepetitionFromGroupExpressions` rules from a separate batch into the `operatorOptimizationBatch`.

Why are the changes needed?

`RemoveLiteralFromGroupExpressions` does not work in some cases when it runs in a separate batch.
The added UT would fail with:

```
[info] - SPARK-36496: Remove literals from grouping expressions *** FAILED *** (2 seconds, 955 milliseconds)
[info]   == FAIL: Plans do not match ===
[info]   !Aggregate [*id#0L, null], [*id#0L, null AS a#0, count(1) AS count#0L]   Aggregate [*id#0L], [*id#0L, null AS a#0, count(1) AS count#0L]
[info]    +- Range (0, 100, step=1, splits=Some(2))                               +- Range (0, 100, step=1, splits=Some(2)) (PlanTest.scala:174)
```
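
For reference, a minimal sketch of the shape of query this targets, assuming an active `spark` session (the exact literal used in the UT may differ):

```scala
// A literal column added via withColumn ends up in the grouping expressions;
// RemoveLiteralFromGroupExpressions should drop it from the Aggregate keys.
import org.apache.spark.sql.functions.lit

val df = spark.range(100)
  .withColumn("a", lit(null))
  .groupBy("id", "a")
  .count()
```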

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UT

@github-actions github-actions bot added the SQL label Aug 12, 2021
@tanelk tanelk changed the title [SPARK-36496][SQL] Remove literals from grouping expressions when using the DataFrame API [SPARK-36496][SQL] Remove literals from grouping expressions when using the DataFrame withColumn API Aug 12, 2021
@SparkQA

SparkQA commented Aug 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46888/

@SparkQA

SparkQA commented Aug 12, 2021

Test build #142381 has finished for PR 33723 at commit c7dfb3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tanelk
Contributor Author

tanelk commented Aug 17, 2021

pinging @maropu

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47037/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47037/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142536 has finished for PR 33723 at commit 38d98ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SessionWindow(timeColumn: Expression, gapDuration: Expression) extends Expression
  • case class UnresolvedWith(
  • case class CTERelationDef(child: LogicalPlan, id: Long = CTERelationDef.newId) extends UnaryNode
  • case class CTERelationRef(
  • case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends LogicalPlan
  • class DefaultDateFormatter(
  • class DefaultTimestampFormatter(

@SparkQA

SparkQA commented Oct 19, 2021

Test build #144405 has finished for PR 33723 at commit d67a3f0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48878/

@SparkQA

SparkQA commented Oct 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48878/

huaxingao and others added 19 commits March 1, 2022 20:05
…ata filter, partition filter)

### What changes were proposed in this pull request?
Add test coverage for an OR filter that contains both a data filter and a partition filter,
e.g. where `p` is the partition column and `id` is a data column:
```
SELECT * FROM tmp WHERE (p = 0 AND id > 0) OR (p = 1 AND id = 2)
```
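
For context, a hedged sketch of how such a table and query could be set up (table and column names follow the example above; `spark` is an active session):

```scala
// `p` is the partition column, `id` is a data column; the OR predicate mixes both.
import spark.implicits._

spark.range(10)
  .withColumn("p", $"id" % 2)
  .write.mode("overwrite").partitionBy("p").saveAsTable("tmp")

spark.sql("SELECT * FROM tmp WHERE (p = 0 AND id > 0) OR (p = 1 AND id = 2)").show()
```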

### Why are the changes needed?
Test coverage

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT

Closes apache#35703 from huaxingao/spark-37593.

Authored-by: huaxingao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…annotations

### What changes were proposed in this pull request?

This PR aims to support the `APP_ID` and `EXECUTOR_ID` placeholders in K8s annotations in the same way we did for `EXECUTOR_JAVA_OPTIONS`.
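
A hedged illustration of how this could be used; the annotation names are made up for the example, while the `spark.kubernetes.driver.annotation.*` / `spark.kubernetes.executor.annotation.*` keys and the `{{APP_ID}}` / `{{EXECUTOR_ID}}` placeholders come from the existing configuration surface:

```scala
// Placeholder values in annotations get expanded per application / executor.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.annotation.spark-app", "{{APP_ID}}")
  .set("spark.kubernetes.executor.annotation.spark-exec", "{{EXECUTOR_ID}}")
```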

### Why are the changes needed?

Although Apache Spark provides `spark-app-id` already, some custom schedulers are not able to recognize it.

### Does this PR introduce _any_ user-facing change?

No because the pattern strings are very specific.

### How was this patch tested?

Pass the CIs and K8s IT.

This passed like the following on `Docker Desktop K8s`.
```
$ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube -Dspark.kubernetes.test.deployMode=docker-for-desktop "kubernetes-integration-tests/test"
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (8 seconds, 789 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (8 seconds, 903 milliseconds)
[info] - Run SparkPi with a very long application name. (8 seconds, 586 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (8 seconds, 409 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (8 seconds, 586 milliseconds)
[info] - Run SparkPi with an argument. (8 seconds, 708 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (8 seconds, 626 milliseconds)
[info] - All pods have the same service account by default (8 seconds, 595 milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 324 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 424 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (13 seconds, 42 milliseconds)
[info] - Run SparkPi with env and mount secrets. (16 seconds, 600 milliseconds)
[info] - Run PySpark on simple pi.py example (11 seconds, 479 milliseconds)
[info] - Run PySpark to test a pyfiles example (10 seconds, 669 milliseconds)
[info] - Run PySpark with memory customization (8 seconds, 604 milliseconds)
[info] - Run in client mode. (7 seconds, 349 milliseconds)
[info] - Start pod creation from template (8 seconds, 779 milliseconds)
[info] - Test basic decommissioning (42 seconds, 970 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (42 seconds, 650 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 41 seconds)
[info] - Test decommissioning timeouts (43 seconds, 340 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 6 seconds)
[info] - Run SparkR on simple dataframe.R example (11 seconds, 645 milliseconds)
```

Closes apache#35704 from dongjoon-hyun/SPARK-38383.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pr use `Ivy.retrieve(ModuleRevisionId, RetrieveOptions)` instead of  deprecated `Ivy.retrieve(ModuleRevisionId, String, RetrieveOptions)` to  clean up deprecation compilation warning.

The refactor way refer to the Implementation of  `RetrieveEngine#retrieve` method as follows:

```java
    @Deprecated
    public int retrieve(ModuleRevisionId mrid, String destFilePattern, RetrieveOptions options)
            throws IOException {
        RetrieveOptions retrieveOptions = new RetrieveOptions(options);
        retrieveOptions.setDestArtifactPattern(destFilePattern);

        RetrieveReport result = retrieve(mrid, retrieveOptions);
        return result.getNbrArtifactsCopied();
    }
```

### Why are the changes needed?
Clean up deprecated api usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes apache#35672 from LuciferYang/cleanup-deprecation-ivy.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
… `TIMESTAMPDIFF()`

### What changes were proposed in this pull request?
In the PR, I propose to add two aliases for the `TIMESTAMPDIFF()` function introduced by apache#35607:
- `DATEDIFF()`
- `DATE_DIFF()`

### Why are the changes needed?
1. To make the migration process from other systems to Spark SQL easier.
2. To achieve feature parity with other DBMSs.

### Does this PR introduce _any_ user-facing change?
No. The new aliases just extend Spark SQL API.

### How was this patch tested?
1. By running the existing test suites:
```
$ build/sbt "test:testOnly *SQLKeywordSuite"
```
2. and new checks:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z datetime-legacy.sql"
```

Closes apache#35709 from MaxGekk/datediff.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ver` suffix to drivers during IT

### What changes were proposed in this pull request?

There are two small proposals:
1) Prefix the names of the temporary K8s namespaces with `spark-` so that the output of `kubectl get ns` is clearer.
2) Unify the name of the driver pod in non-test code and IT tests to always use `-driver` as a suffix.

### Why are the changes needed?

At the moment the name of the temporary namespace is just a UUID without the `-`s. When one reads the result of `kubectl get ns`, it is a bit cryptic to see bare UUIDs.

The names of the driver pods in ITs do not indicate that they are drivers.
In non-test (i.e. production) use the driver pod names are suffixed with `-driver`. I propose the same for IT tests.
Executor pods always use `-exec-` in their pod names, both in non-test and IT runs.

### Does this PR introduce _any_ user-facing change?

Yes! Developers who debug IT tests will see clearer names now.

### How was this patch tested?

Manually with `kubectl get ns --watch` and `kubectl get po --watch`.

Closes apache#35711 from martin-g/k8s-test-names-improvement.

Authored-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

SPARK-31007 introduced auxiliary statistics to speed up computation in KMeans.

However, it needs an array of size `k * (k + 1) / 2`, which may cause overflow or OOM when `k` is too large.

So we should skip this optimization in this case.
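
A hedged back-of-the-envelope for the example value of k = 50,000 mentioned below:

```scala
// Size of the auxiliary k * (k + 1) / 2 statistics array at k = 50,000.
val k = 50000L
val entries = k * (k + 1) / 2            // 1,250,025,000 entries
val gib = entries * 8.0 / (1L << 30)     // ~9.3 GiB if each entry is a Double
// The entry count approaches the maximum JVM array length (Int.MaxValue),
// and Int-typed k * (k + 1) would overflow even earlier.
```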

### Why are the changes needed?

Avoid overflow or OOM when `k` is too large (e.g. 50,000).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes apache#35457 from zhengruifeng/kmean_k_limit.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>
…le-datasource

### What changes were proposed in this pull request?

Add more examples to sql-ref-syntax-ddl-create-table-datasource (a sketch of the first is shown after this list):
1. Create partitioned and bucketed table through CTAS.
2. Create bucketed table through CTAS and CTE
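
For illustration, a hedged sketch of the first kind of example (assuming a source table `student(id, name, age)` already exists):

```scala
// Create a partitioned and bucketed table through CTAS.
spark.sql("""
  CREATE TABLE student_partition_bucket
  USING parquet
  PARTITIONED BY (age)
  CLUSTERED BY (id) INTO 4 BUCKETS
  AS SELECT id, name, age FROM student
""")
```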

### Why are the changes needed?

Improve doc.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes apache#35712 from wangyum/sql-ref-syntax-ddl-create-table-datasource.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>
…lean up redundant type cast

### What changes were proposed in this pull request?
This PR aims to clean up redundant type casts in the Spark code.

### Why are the changes needed?
Code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA
- Manually build a client, check `org.apache.spark.examples.DriverSubmissionTest` and `org.apache.spark.examples.mllib.LDAExample` passed

Closes apache#35592 from LuciferYang/redundant-cast.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>
… needs to be initialized once in HadoopMapReduceCommitProtocol

### What changes were proposed in this pull request?

This PR follows up apache#35492 and tries to use a `stagingDir` constant instead of the `stagingDir` method in `HadoopMapReduceCommitProtocol`.

### Why are the changes needed?

As reported in apache#35492 (comment):

```
./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"
...
[info]   Cause: org.apache.spark.SparkException: Task not serializable
...
[info]   Cause: java.io.NotSerializableException: org.apache.hadoop.fs.Path
...

```
This is because `org.apache.hadoop.fs.Path` is serializable in Hadoop 3 but not in Hadoop 2. So we should make the `stagingDir` transient to avoid that.
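
A hedged illustration of the pattern with simplified names (not the actual Spark class):

```scala
// On Hadoop 2, org.apache.hadoop.fs.Path is not Serializable, so a Path held
// by a serialized commit protocol must be transient and rebuilt lazily on use.
import org.apache.hadoop.fs.Path

class CommitProtocolLike(jobId: String, path: String) extends Serializable {
  @transient private lazy val stagingDir: Path =
    new Path(path, ".spark-staging-" + jobId)
}
```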

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Passed `./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"`

Pass the CIs.

Closes apache#35693 from weixiuli/staging-directory.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?

Apache Spark has been supporting many K8s features via `spark.kubernetes.driver.podTemplateFile` and `spark.kubernetes.executor.podTemplateFile` in an extensible way. This PR aims to add an integration test case for `priorityClassName` pod spec.

In this test case, we use one of the K8s built-in priority classes because we want to run this test on heterogeneous K8s environments. In addition, a `schedule` test tag is added for some esoteric K8s environments that lack the `system-node-critical` priority class or define it with different values.
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical
```
$ k get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            4h19m
system-node-critical      2000001000   false            4h19m
```

### Why are the changes needed?

We don't need to enumerate every K8s spec via `spark.kubernetes.xxx` configurations; pod templates can do many things.
This example will also help future work on custom schedulers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the K8s IT. This is tested like the following.

```
$ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube -Dspark.kubernetes.test.deployMode=docker-for-desktop "kubernetes-integration-tests/test"
...
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (8 seconds, 866 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (10 seconds, 700 milliseconds)
[info] - Run SparkPi with a very long application name. (8 seconds, 634 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (8 seconds, 628 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (8 seconds, 626 milliseconds)
[info] - Run SparkPi with an argument. (8 seconds, 821 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 675 milliseconds)
[info] - All pods have the same service account by default (8 seconds, 692 milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 599 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 767 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (14 seconds, 140 milliseconds)
[info] - Run SparkPi with env and mount secrets. (19 seconds, 62 milliseconds)
[info] - Run PySpark on simple pi.py example (9 seconds, 821 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 713 milliseconds)
[info] - Run PySpark with memory customization (9 seconds, 630 milliseconds)
[info] - Run in client mode. (7 seconds, 289 milliseconds)
[info] - Start pod creation from template (8 seconds, 720 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (8 seconds, 728 milliseconds)
...
```

Closes apache#35716 from dongjoon-hyun/SPARK-38398.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… for Pandas API on Spark

### What changes were proposed in this pull request?

- Add the magic methods `__enter__` and `__exit__` to **the special_function list**

### Why are the changes needed?

- Improve the usage data accuracy for **with statements** so that external `__enter__` and `__exit__` calls are captured instead of internal calls

For example, for the code below:

```python
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
    )
 ```

The pandas-on-Spark usage logger records the internal call [self.spark.unpersist()](https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518) since the `__enter__` and `__exit__` methods of [CachedDataFrame](https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492) are not instrumented.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests

Closes apache#35687 from heyihong/SPARK-38353.

Authored-by: Yihong He <yihong.he@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…eparate Parser and Lexer files

### What changes were proposed in this pull request?

Separate the mixed grammar defined in `SqlBase.g4` into separate parser and lexer grammars.
* The parser grammar disallows literal definitions, so all literals are replaced with token names defined in the new lexer.
* The lexer and parser have to be provided in two files, as ANTLR only allows one grammar per file.

### Why are the changes needed?

This gives us a cleaner separation of parser and lexer in the original grammar.
It also enables us to use the full power of ANTLR parser and lexer grammars:
* Access to lexer-specific rules: lexer-specific features (e.g. [lexer modes](https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#lexical-modes)) can be used for new SQL features.
* Ability to reuse lexer rules: we can now use inheritance to have multiple lexers share a common set of lexical rules.
* Clear order of tokens: tokens are matched by the lexer in the order they appear in the lexer grammar.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. The refactoring should not break any tests.

Closes apache#35701 from zhenlineo/parser-lexer.

Authored-by: Zhen Li <zhen.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This change refactors the exceptions thrown in `GraphiteSink` to use the error class framework.

### Why are the changes needed?
This is to follow the error class framework.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Closes apache#35643 from bozhang2820/error-class.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Replace symbol literals like `'abc` with the more verbose `Symbol("abc")` in the test code.

### Why are the changes needed?

Building with Scala 2.13 produces a lot of warnings like the following ones:

```
[warn] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11: [deprecation   | origin= | version=2.13.0] symbol literal is deprecated; use Symbol("d") instead
[warn]           'd.cast("string"),
[warn]           ^
[warn] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11: [deprecation   | origin= | version=2.13.0] symbol literal is deprecated; use Symbol("e") instead
[warn]           'e.cast("string")).collect())
```

This should make it easier to upgrade to Scala 3 later.
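
For a concrete before/after, a hedged sketch (assuming a DataFrame `df` with a column `d`, and `spark.implicits._` in scope):

```scala
// Deprecated symbol literal vs. the explicit Symbol(...) form; both resolve
// to a Column through the same implicit conversion.
import spark.implicits._

df.select('d.cast("string"))             // deprecated in Scala 2.13
df.select(Symbol("d").cast("string"))    // equivalent, deprecation-free form
```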

### Does this PR introduce _any_ user-facing change?

No! The PR touches only test classes!

### How was this patch tested?

The build at CI must be green!

Closes apache#35560 from martin-g/dont-use-deprecate-symbol-api.

Authored-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Introduce SQL function ARRAY_SIZE.

ARRAY_SIZE works the same as SIZE when the input is an array except for:
- ARRAY_SIZE raises an exception for non-array input.
- ARRAY_SIZE always returns null for null input.

### Why are the changes needed?
Counting elements within an array is a common use case. ARRAY_SIZE ensures the input is an array and then returns its size.

Other DBMSs like Snowflake support it as well: [Snowflake ARRAY_SIZE](https://docs.snowflake.com/en/sql-reference/functions/array_size.html). Implementing it improves compatibility with those DBMSs and makes migration easier.

### Does this PR introduce _any_ user-facing change?
Yes. `array_size` is available now.

```
scala> spark.sql("select array_size(array(2, 1))").show()
+-----------------------+
|array_size(array(2, 1))|
+-----------------------+
|                      2|
+-----------------------+

scala> spark.sql("select array_size(map('a', 1, 'b', 2))").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'array_size(map('a', 1, 'b', 2))' due to data type mismatch: argument 1 requires array type, however, 'map('a', 1, 'b', 2)' is of map<string,int> type.; line 1 pos 7;
'Project [unresolvedalias(array_size(map(a, 1, b, 2), None), None)]
```

### How was this patch tested?
Unit tests.

Closes apache#35671 from xinrong-databricks/array_size.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…e expression by self way

### What changes were proposed in this pull request?
apache#35248 provides a new framework to represent Catalyst expressions in the DS V2 APIs.
Because the framework translates all Catalyst expressions into a unified SQL string and cannot keep compatibility between different JDBC databases, it does not work well.

This PR refactors the framework so that a JDBC dialect can compile expressions in its own way.
First, the framework translates Catalyst expressions into DS V2 expressions.
Second, the JDBC dialect compiles the DS V2 expressions into its own SQL syntax.

The Javadoc looks as shown below:
![image](https://user-images.githubusercontent.com/8486025/156579584-f56cafb5-641f-4c5b-a06e-38f4369051c3.png)
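
A hedged sketch of the intended shape, assuming the `JdbcDialect.compileExpression` hook and the DS V2 `Expression` type introduced by this work:

```scala
// A custom dialect can compile DS V2 expressions into its own SQL flavor
// instead of relying on one unified SQL string for all databases.
import org.apache.spark.sql.connector.expressions.Expression
import org.apache.spark.sql.jdbc.JdbcDialect

object MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

  override def compileExpression(expr: Expression): Option[String] = {
    // Delegate to (or replace) the default V2 expression-to-SQL builder here.
    super.compileExpression(expr)
  }
}
```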

### Why are the changes needed?
Make the framework more broadly usable.

### Does this PR introduce _any_ user-facing change?
No.
The feature is not released yet.

### How was this patch tested?
Existing tests.

Closes apache#35494 from beliefer/SPARK-37960_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…dencies` API

### What changes were proposed in this pull request?

This PR aims to remove `Experimental` from `RDD.cleanShuffleDependencies` API at Apache Spark 3.3.

### Why are the changes needed?

This API has been used since Apache Spark 3.1.0.
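
A minimal usage sketch (assuming an active `SparkContext` named `sc`):

```scala
// Eagerly remove the shuffle files of this RDD's shuffle dependencies.
val rdd = sc.parallelize(1 to 100).map(i => (i % 10, i)).reduceByKey(_ + _)
rdd.count()                                   // materialize the shuffle
rdd.cleanShuffleDependencies(blocking = true) // now a stable, non-experimental API
```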

### Does this PR introduce _any_ user-facing change?

No. This has been used for a long time in 3.1.1 ~ 3.2.1 since April 7, 2020.
- https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/rdd/RDD.html#cleanShuffleDependencies(blocking:Boolean):Unit

### How was this patch tested?

Manual review because this is a human-oriented doc change.

Closes apache#35736 from dongjoon-hyun/SPARK-38417.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…er API

### What changes were proposed in this pull request?

This PR aims to add `cleanShuffleDependencies` developer API to PySpark RDD like Scala.

### Why are the changes needed?

This API has been documented and used since Apache Spark 3.1.0 and we removed `Experimental` tag at Apache Spark 3.3.0 via SPARK-38417.
- https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html#cleanShuffleDependencies(blocking:Boolean):Unit

This is required for a feature parity in PySpark 3.3.0.

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new API addition.

### How was this patch tested?

Pass the CIs.

Closes apache#35737 from dongjoon-hyun/SPARK-38418.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun and others added 9 commits March 30, 2022 08:26
### What changes were proposed in this pull request?

This PR replaces `new Path(fileUri.getPath)` with `new Path(fileUri)`.
By using the `Path` constructor that takes a URI, we can preserve the file scheme.

### Why are the changes needed?

If we use the `Path` constructor that takes a `String`, the file scheme information is lost.
Although the original code has worked so far, it fails with Apache Hadoop 3.3.2 and breaks the dependency upload feature, which is covered by the K8s Minikube integration tests.

```scala
test("uploadFileUri") {
   val fileUri = org.apache.spark.util.Utils.resolveURI("/tmp/1.txt")
   assert(new Path(fileUri).toString == "file:/private/tmp/1.txt")
   assert(new Path(fileUri.getPath).toString == "/private/tmp/1.txt")
}
```

### Does this PR introduce _any_ user-facing change?

No, this will prevent a regression at Apache Spark 3.3.0 instead.

### How was this patch tested?

Pass the CIs.

In addition, this PR and apache#36009 will recover K8s IT `DepsTestsSuite`.

Closes apache#36010 from dongjoon-hyun/SPARK-38652.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
There are some code patterns in Spark Java UTs:

```java
@Test
  public void testAuthReplay() throws Exception {
    try {
      doSomeOperation();
      fail("Should have failed");
    } catch (Exception e) {
      assertTrue(doExceptionCheck(e));
    }
  }
```
or
```java
  @Test(expected = SomeException.class)
  public void testAuthReplay() throws Exception {
    try {
      doSomeOperation();
      fail("Should have failed");
    } catch (Exception e) {
      assertTrue(doExceptionCheck(e));
      throw e;
    }
  }
```
This PR uses the JUnit `assertThrows` API to simplify these patterns.
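
A hedged sketch of the simplified form (written in Scala for brevity; `doSomeOperation`, `doExceptionCheck`, and `SomeException` are the placeholders from the snippets above):

```scala
// JUnit 4.13's assertThrows returns the thrown exception for further checks.
import org.junit.Assert.{assertThrows, assertTrue}

val e = assertThrows(classOf[SomeException], () => doSomeOperation())
assertTrue(doExceptionCheck(e))
```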

### Why are the changes needed?
Simplify code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes apache#36008 from LuciferYang/SPARK-38694.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
…n command

### What changes were proposed in this pull request?

This PR refactors `CreateFunctionCommand` and `DropFunctionCommand` to use `FunctionIdentifier`.

### Why are the changes needed?

To make the code cleaner.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes apache#36016 from allisonwang-db/spark-38705-refactor-func-identifier.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR aims to use a URI in the `FallbackStorage.copy` method.

### Why are the changes needed?

Like the case of SPARK-38652, the current fallback feature is broken with `S3A` due to Hadoop 3.3.2's `org.apache.hadoop.fs.PathIOException`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually start one master and executor and decommission the executor.

```
spark.decommission.enabled                          true
spark.storage.decommission.enabled                  true
spark.storage.decommission.shuffleBlocks.enabled    true
spark.storage.decommission.fallbackStorage.path     s3a://spark/storage/
```

```
$ curl -v -X POST -d "host=hostname" http://hostname:8080/workers/kill/
```

Closes apache#36017 from williamhyun/fallbackstorage.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…iv/Remainder/Pmod

### What changes were proposed in this pull request?

Provide SQL query context in the following runtime error:

- Divide: divide by 0 error, including numeric types and ANSI interval types
- Integral Divide: divide by 0 error and overflow error
- Remainder: divide by 0 error
- Pmod: divide by 0 error

Example 1:
```
== SQL(line 1, position 7) ==
select smallint('100') / bigint('0')
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

Example 2:
```
== SQL(line 1, position 7) ==
select interval '2' year / 0
       ^^^^^^^^^^^^^^^^^^^^^
```
### Why are the changes needed?

Provide SQL query context of runtime errors to users, so that they can understand it better.

### Does this PR introduce _any_ user-facing change?

Yes, it improves the runtime error messages of Divide/Div/Remainder/Pmod.
### How was this patch tested?

UT

Closes apache#36013 from gengliangwang/divideError.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…ut length

### What changes were proposed in this pull request?

This PR improves the error messages for the char / varchar / character data types when they are written without a length. It also adds related test cases.

#### Details
We support char and varchar types, but when users write the type without a length, the message is confusing and not helpful at all:

```
> SELECT cast('a' as CHAR)

DataType char is not supported.(line 1, pos 19)

== SQL ==
SELECT cast('a' AS CHAR)
-------------------^^^
```
After this change, the message would be:
```
Datatype char requires a length parameter, for example char(10). Please specify the length.

== SQL ==
SELECT cast('a' AS CHAR)
-------------------^^^
```

### Why are the changes needed?
To improve error messages for better usability.

### Does this PR introduce _any_ user-facing change?
If error messages are considered as user-facing changes, then yes. It improves the messages as above.

### How was this patch tested?
It's tested by newly added unit tests.

Closes apache#35966 from anchovYu/better-msg-for-char.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ULT columns

### What changes were proposed in this pull request?

Extend INSERT INTO statements to support omitting columns that have default values, or referring to the defaults explicitly with the DEFAULT keyword, in which case the Spark analyzer will automatically insert the appropriate values in the right places.

Example:
```
CREATE TABLE T(a INT DEFAULT 4, b INT NOT NULL DEFAULT 5);
INSERT INTO T VALUES (1);
INSERT INTO T VALUES (1, DEFAULT);
INSERT INTO T VALUES (DEFAULT, 6);
SELECT * FROM T;
(1, 5)
(1, 5)
(4, 6)
```

### Why are the changes needed?

This helps users issue INSERT INTO statements with less effort, and helps people creating or updating tables to add custom optional columns for use in specific circumstances as desired.

### How was this patch tested?

This change is covered by new and existing unit test coverage as well as new INSERT INTO query test cases covering a variety of positive and negative scenarios.

Closes apache#35982 from dtenedor/default-columns-insert-into.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…elk/spark into SPARK-36496_remove_grouping_literals