[SPARK-43549][SQL] Convert _LEGACY_ERROR_TEMP_0036 to INVALID_SQL_SYNTAX #41214
panbingkun wants to merge 58 commits into apache:master
…ssions and missing columns
### What changes were proposed in this pull request?
In the PR, I propose to propagate all tags in a `Project` while resolving expressions and missing columns in `ColumnResolutionHelper.resolveExprsAndAddMissingAttrs()`.
### Why are the changes needed?
To fix the bug reproduced by the query below:
```sql
spark-sql (default)> WITH
> t1 AS (select key from values ('a') t(key)),
> t2 AS (select key from values ('a') t(key))
> SELECT t1.key
> FROM t1 FULL OUTER JOIN t2 USING (key)
> WHERE t1.key NOT LIKE 'bb.%';
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `t1`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7;
```
### Does this PR introduce _any_ user-facing change?
No. It fixes a bug, and outputs the expected result: `a`.
### How was this patch tested?
By new test added to `using-join.sql`:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z using-join.sql"
```
and the related test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.HiveContextCompatibilitySuite"
```
Closes apache#41204 from MaxGekk/fix-using-join.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ly jar

### What changes were proposed in this pull request?
Exclude `javax.activation:activation:jar:1.1.1` and `org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.20.0` from `spark-streaming-kafka-0-10-assembly_2.12-<version>.jar`.

### Why are the changes needed?
We should not include jars that already exist in the Spark binary artifact in the assembly jar of an optional component.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
build/mvn dependency:list -pl :spark-streaming-kafka-0-10-assembly_2.12
```
before
```
[INFO] --- maven-dependency-plugin:3.5.0:list (default-cli) spark-streaming-kafka-0-10-assembly_2.12 ---
[INFO]
[INFO] The following files have been resolved:
[INFO]    org.apache.spark:spark-streaming-kafka-0-10_2.12:jar:3.5.0-SNAPSHOT:compile
[INFO]    org.apache.spark:spark-token-provider-kafka-0-10_2.12:jar:3.5.0-SNAPSHOT:compile
[INFO]    org.apache.kafka:kafka-clients:jar:3.4.0:compile
[INFO]    org.apache.spark:spark-tags_2.12:jar:3.5.0-SNAPSHOT:compile
[INFO]    javax.activation:activation:jar:1.1.1:compile
[INFO]    org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.20.0:compile
[INFO]    org.spark-project.spark:unused:jar:1.0.0:compile
```
after
```
[INFO] --- maven-dependency-plugin:3.5.0:list (default-cli) spark-streaming-kafka-0-10-assembly_2.12 ---
[INFO]
[INFO] The following files have been resolved:
[INFO]    org.apache.spark:spark-streaming-kafka-0-10_2.12:jar:3.5.0-SNAPSHOT:compile
[INFO]    org.apache.spark:spark-token-provider-kafka-0-10_2.12:jar:3.5.0-SNAPSHOT:compile
[INFO]    org.apache.kafka:kafka-clients:jar:3.4.0:compile
[INFO]    org.apache.spark:spark-tags_2.12:jar:3.5.0-SNAPSHOT:compile
[INFO]    org.spark-project.spark:unused:jar:1.0.0:compile
```
Closes apache#41217 from pan3793/SPARK-43575.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
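Excluding transitive jars that the Spark distribution already ships is typically expressed with an `<exclusions>` block on the dependency in the assembly module's POM. The fragment below is an illustrative sketch only, not this PR's exact diff; the dependency the exclusions are attached to is an assumption:

```xml
<!-- Illustrative sketch: drop jars already shipped in the Spark binary
     distribution from the kafka-0-10 assembly (not the PR's actual diff). -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <exclusion>
      <groupId>javax.activation</groupId>
      <artifactId>activation</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-slf4j2-impl</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```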
…nd pandas function APIs in workers during runtime

### What changes were proposed in this pull request?
This PR proposes a new configuration `spark.sql.execution.pyspark.python` that sets the Python executable on worker nodes. Note that, if the Python executable is different from the one previously run, it creates new Python worker processes, because we reuse Python workers but they are keyed by both executable path and env variables:
https://github.com/apache/spark/blob/d7a8b852eaa6cc04df1eea0018a9b9de29b1c4fe/core/src/main/scala/org/apache/spark/SparkEnv.scala#L123-L124
This PR is also a basework for Spark Connect to support a different set of dependencies.

### Why are the changes needed?
This can be useful especially when you want to run your Python with a different set of dependencies during runtime (see also https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html).

### Does this PR introduce _any_ user-facing change?
No, this PR adds a configuration, but it is internal for now.

### How was this patch tested?
Manually tested as below:
```python
import sys
from pyspark.sql.functions import udf

spark.range(1).select(udf(lambda x: sys.executable)("id")).show(truncate=False)
spark.conf.set("spark.sql.execution.pyspark.python", "/.../miniconda3/envs/another-python/bin/python")
spark.range(1).select(udf(lambda x: sys.executable)("id")).show(truncate=False)
```
```
+------------------------------------------+
|<lambda>(id)                              |
+------------------------------------------+
|/.../miniconda3/envs/python3.9/bin/python3|
+------------------------------------------+

+----------------------------------------------+
|<lambda>(id)                                  |
+----------------------------------------------+
|/.../miniconda3/envs/another-python/bin/python|
+----------------------------------------------+
```
Closes apache#41215 from HyukjinKwon/SPARK-43574.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…lt_sites of get_preferred_mirrors

### What changes were proposed in this pull request?
This PR aims to add `https://dlcdn.apache.org/` to the default mirror site list during Python installation tests.

### Why are the changes needed?
This is a preferred mirror. So, even if `https://www.apache.org/dyn/closer.lua` is inaccessible, we will download from `https://dlcdn.apache.org/`.
```
$ curl https://www.apache.org/dyn/closer.lua\?preferred\=true
https://dlcdn.apache.org/
```
Although we try to get this programmatically, sometimes `https://www.apache.org/dyn/closer.lua` seems to fail.
https://github.com/apache/spark/blob/acad77d56112f2cab2ce5adca913b75ce659add5/python/pyspark/install.py#L169C2-L179

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.
Closes apache#41222 from dongjoon-hyun/SPARK-43580.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `kubernetes-client` from 6.6.1 to 6.6.2.

### Why are the changes needed?
Bring a fix, "RequestConfig is propagated to derived HttpClient instances": fabric8io/kubernetes-client@8cf4804

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.
Closes apache#41223 from dongjoon-hyun/SPARK-43581.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…t in tests

### What changes were proposed in this pull request?
This is a follow-up of apache#41064. `LocalRelation` is heavily used in tests, and it's better to not report row count in tests, to avoid the query being optimized too well, which may hurt test coverage. This PR updates `LocalRelation` to not report row count in tests, and adds a test-only config to still enable it in tests.

### Why are the changes needed?
Keep test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.
Closes apache#41216 from cloud-fan/follow.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Upgrade Parquet from 1.13.0 to 1.13.1.

### Why are the changes needed?
[Apache Parquet 1.13.1](https://parquet.apache.org/blog/2023/05/18/1.13.1/) is available; the release notes are:

> ### Version 1.13.1
>
> Release Notes - Parquet - Version 1.13.1
>
> #### Improvement
>
> * [PARQUET-2276](https://issues.apache.org/jira/browse/PARQUET-2276) - Bring back support for Hadoop 2.7.3
> * [PARQUET-2297](https://issues.apache.org/jira/browse/PARQUET-2297) - Skip delta problem check
> * [PARQUET-2292](https://issues.apache.org/jira/browse/PARQUET-2292) - Improve default SpecificRecord model selection for Avro `{Write,Read}`Support
> * [PARQUET-2290](https://issues.apache.org/jira/browse/PARQUET-2290) - Add CI for Hadoop 2
> * [PARQUET-2282](https://issues.apache.org/jira/browse/PARQUET-2282) - Don't initialize HadoopCodec
> * [PARQUET-2283](https://issues.apache.org/jira/browse/PARQUET-2283) - Remove Hadoop HiddenFileFilter
> * [PARQUET-2081](https://issues.apache.org/jira/browse/PARQUET-2081) - Fix support for rewriting files without ColumnIndexes

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
Closes apache#41178 from pan3793/SPARK-43519.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… thrift server

### What changes were proposed in this pull request?
Add a new test for scrollable result set support, which is not yet covered through JDBC APIs.

### Why are the changes needed?
Test improvement.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New test.
Closes apache#41213 from yaooqinn/SPARK-43572.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `sbt-pom-reader` from 2.2.0 to 2.4.0.

### Why are the changes needed?
Since v2.3.0, the organization has moved from `com.typesafe.sbt` to `com.github.sbt`:
- https://github.com/sbt/sbt-pom-reader/releases/tag/v2.4.0
- https://github.com/sbt/sbt-pom-reader/releases/tag/v2.3.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
Closes apache#41224 from panbingkun/SPARK-43582.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…edicated JVM

### What changes were proposed in this pull request?
This PR aims to run `HealthTrackerIntegrationSuite` in a dedicated JVM to mitigate a flaky test.

### Why are the changes needed?
`HealthTrackerIntegrationSuite` has been flaky, and SPARK-25400 and SPARK-37384 increased the timeout from 1s to 10s and from 10s to 20s, respectively. The usual suspect for this flakiness is some unknown side effect like GCs. In this PR, we aim to run this in a separate JVM instead of increasing the timeout further.
https://github.com/apache/spark/blob/abc140263303c409f8d4b9632645c5c6cbc11d20/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala#L56-L58

This is the recent failure:
- https://github.com/apache/spark/actions/runs/5020505360/jobs/9002039817
```
[info] HealthTrackerIntegrationSuite:
[info] - If preferred node is bad, without excludeOnFailure job will fail (92 milliseconds)
[info] - With default settings, job can succeed despite multiple bad executors on node (3 seconds, 163 milliseconds)
[info] - Bad node with multiple executors, job will still succeed with the right confs *** FAILED *** (20 seconds, 43 milliseconds)
[info]   java.util.concurrent.TimeoutException: Futures timed out after [20 seconds]
[info]   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
[info]   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
[info]   at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:355)
[info]   at org.apache.spark.scheduler.SchedulerIntegrationSuite.awaitJobTermination(SchedulerIntegrationSuite.scala:276)
[info]   at org.apache.spark.scheduler.HealthTrackerIntegrationSuite.$anonfun$new$9(HealthTrackerIntegrationSuite.scala:92)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.
Closes apache#41229 from dongjoon-hyun/SPARK-43587.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
| errorClass = "_LEGACY_ERROR_TEMP_0036",
| messageParameters = Map("ctx" -> ctx.getText),
| errorClass = "INVALID_SQL_SYNTAX",
| messageParameters = Map("inputString" -> s"${ctx.getText} must be ${toSQLStmt("NOSCAN")}"),
It would be better not to embed the error's text in source code. Could you leave the error class as a separate one (and assign a name), or at least make it a sub-class of INVALID_SQL_SYNTAX?
I have found many similar uses in `QueryParsingErrors`. Should we refactor it to classify `INVALID_SQL_SYNTAX` and avoid embedding the error's text in source code?
e.g.:
Because I think it is not good to assign a new name for the above scenario, since it is a syntax error. It seems more reasonable to implement it as a sub-class of `INVALID_SQL_SYNTAX`.
> Should we refactor it to classify INVALID_SQL_SYNTAX and avoid embedding the error's text in source code?

Yep. Let's refactor `INVALID_SQL_SYNTAX` first of all, and then come back to this PR.
Yes, I will continue the work on the PR right now.
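For reference, a sub-class entry in `error-classes.json` could look roughly like the fragment below. This is an illustrative sketch of the shape discussed above; the sub-class name and message text are assumptions, not the final entry:

```json
{
  "INVALID_SQL_SYNTAX" : {
    "message" : [ "Invalid SQL syntax:" ],
    "subClass" : {
      "ANALYZE_TABLE_UNEXPECTED_NOSCAN" : {
        "message" : [ "ANALYZE TABLE(S) ... COMPUTE STATISTICS ... <ctx> must be either NOSCAN or empty." ]
      }
    },
    "sqlState" : "42000"
  }
}
```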
…um serialization

### What changes were proposed in this pull request?
Follows up on the comment here: apache#41075 (comment)
Namely:
- updates `error-classes.json` and `sql-error-conditions.md` to have the updated error name.
- adds an additional test to assert that enum serialization with invalid enum values throws the correct exception.

### Why are the changes needed?
Improve documentation.

### Does this PR introduce _any_ user-facing change?
Yes, documentation.

### How was this patch tested?
Existing unit tests.
Closes apache#41188 from justaparth/parth/update-documentation-enum-error-message.
Authored-by: Parth Upadhyay <parth.upadhyay@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…R_TEMP_[0041|1206]

### What changes were proposed in this pull request?
This PR proposes to assign the proper names to the following `_LEGACY_ERROR_TEMP*` error classes:
* `_LEGACY_ERROR_TEMP_0041` -> `DUPLICATE_CLAUSES`
* `_LEGACY_ERROR_TEMP_1206` -> `COLUMN_NOT_DEFINED_IN_TABLE`

### Why are the changes needed?
Proper names improve user experience w/ Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes, the PR changes a user-facing error message.

### How was this patch tested?
By running modified test suites.
Closes apache#41020 from imback82/error_messages.
Authored-by: Terry Kim <terry.kim@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 9.5.

### Why are the changes needed?
xbean-asm9-shaded 4.25 was upgraded to use ASM 9.5, and ASM 9.5 supports Java 21:
- https://asm.ow2.io/versions.html
- https://issues.apache.org/jira/browse/XBEAN-339 | apache/geronimo-xbean#36

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.
Closes apache#41231 from LuciferYang/asm-9.5.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…o use `bytesToString`

### What changes were proposed in this pull request?
This PR aims to fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` instead of shift operations.

### Why are the changes needed?
To avoid user confusion by giving more accurate values. For example, `maxBroadcastTableBytes` is 1GB and `dataSize` is `2GB - 1 byte`.

**BEFORE**
```
Cannot broadcast the table that is larger than 1GB: 1 GB.
```

**AFTER**
```
Cannot broadcast the table that is larger than 1024.0 MiB: 2048.0 MiB.
```

### Does this PR introduce _any_ user-facing change?
Yes, but only the error message.

### How was this patch tested?
Pass the CIs with the newly added test case.
Closes apache#41232 from dongjoon-hyun/SPARK-43589.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
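The fix boils down to formatting the byte counts with a human-readable helper instead of an integer shift, which truncates (`(2GB - 1) >> 30 == 1`, hence the misleading "1 GB"). Below is a minimal Python sketch of the idea; it mirrors the behavior of Spark's `Utils.bytesToString` (pick the largest unit of which there are at least two), but it is an illustrative reimplementation, not Spark's code:

```python
def bytes_to_string(size: int) -> str:
    """Format a byte count, choosing the largest unit with at least 2 of it
    (illustrative sketch of the behavior of Spark's Utils.bytesToString)."""
    units = [("PiB", 1024**5), ("TiB", 1024**4), ("GiB", 1024**3),
             ("MiB", 1024**2), ("KiB", 1024), ("B", 1)]
    for unit, factor in units:
        if size >= 2 * factor or factor == 1:
            return f"{size / factor:.1f} {unit}"

# Integer shifting truncates: (2 * 1024**3 - 1) >> 30 == 1, i.e. "1 GB",
# while the helper reports the accurate value.
print(bytes_to_string(2 * 1024**3 - 1))  # 2048.0 MiB
print(bytes_to_string(1024**3))          # 1024.0 MiB
```

This reproduces the corrected message above: a 1 GiB limit prints as `1024.0 MiB` and a `2GB - 1 byte` table as `2048.0 MiB`.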
…-plugin` plugins

### What changes were proposed in this pull request?
The PR aims to update some sbt plugins to the newest version, including:
- sbt-assembly from 2.1.0 to 2.1.1
- sbt-mima-plugin from 1.1.0 to 1.1.2
- sbt-revolver from 0.9.1 to 0.10.0

### Why are the changes needed?
Routine upgrade.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
Closes apache#41226 from panbingkun/sbt_plugins_update.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…P_0003

### What changes were proposed in this pull request?
The PR aims to assign a name to the error class `_LEGACY_ERROR_TEMP_0003`.

### Why are the changes needed?
The changes improve the error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
Closes apache#41200 from panbingkun/SPARK-43539.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Fix nested MapType behavior in Pandas UDF (and Arrow-optimized Python UDF).
Previously, during Arrow-to-pandas conversion, only the outermost layer was converted to a dictionary; now nested MapType is converted to nested dictionaries.
That applies to Spark Connect as well.
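Arrow hands a MapType value to Python as a list of (key, value) pairs, so converting only the top level leaves nested maps as pair lists. The fix recurses on the value type. A minimal pure-Python sketch of that idea; the helper and the nested-type spec are hypothetical, not PySpark's actual converter:

```python
# Illustrative sketch (not PySpark's actual converter): recursively turn
# Arrow-style maps (lists of key/value pairs) into Python dicts, descending
# into nested map value types instead of stopping at the outermost layer.
def arrow_map_to_dict(value, value_type):
    """value_type is a nested spec: ("map", type_of_values) or "atom"."""
    if value is None:
        return None
    if isinstance(value_type, tuple) and value_type[0] == "map":
        inner = value_type[1]
        return {k: arrow_map_to_dict(v, inner) for k, v in value}
    return value

# Analogue of MapType(StringType(), MapType(StringType(), StringType())):
nested_spec = ("map", ("map", "atom"))
raw = [("personal", [("name", "John"), ("city", "New York")])]
print(arrow_map_to_dict(raw, nested_spec))
# {'personal': {'name': 'John', 'city': 'New York'}}
```

With only a top-level conversion, the inner value would have stayed as `[('name', 'John'), ('city', 'New York')]`, which is exactly the buggy output shown below.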
### Why are the changes needed?
Correctness and consistency (with `createDataFrame` and `toPandas` when Arrow is enabled).
### Does this PR introduce _any_ user-facing change?
Yes.
Nested MapType support is corrected in Pandas UDF:
```py
>>> schema = StructType([
... StructField("id", StringType(), True),
... StructField("attributes", MapType(StringType(), MapType(StringType(), StringType())), True)
... ])
>>>
>>> data = [
... ("1", {"personal": {"name": "John", "city": "New York"}}),
... ]
>>> df = spark.createDataFrame(data, schema)
>>> pandas_udf(StringType())
... def f(s: pd.Series) -> pd.Series:
... return s.astype(str)
...
>>> df.select(f(df.attributes)).show(truncate=False)
```
The result of `df.select(f(df.attributes)).show(truncate=False)` is corrected
**FROM**
```
+------------------------------------------------------+
|f(attributes) |
+------------------------------------------------------+
|{'personal': [('name', 'John'), ('city', 'New York')]}|
+------------------------------------------------------+
```
**TO**
```py
>>> df.select(f(df.attributes)).show(truncate=False)
+--------------------------------------------------+
|f(attributes) |
+--------------------------------------------------+
|{'personal': {'name': 'John', 'city': 'New York'}}|
+--------------------------------------------------+
```
**Another more obvious example:**
```py
>>> pandas_udf(StringType())
... def extract_name(s:pd.Series) -> pd.Series:
... return s.apply(lambda x: x['personal']['name'])
...
>>> df.select(extract_name(df.attributes)).show(truncate=False)
```
The result of `df.select(extract_name(df.attributes)).show(truncate=False)` is corrected
**FROM**
```
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
TypeError: list indices must be integers or slices, not str
```
**TO**
```py
+------------------------+
|extract_name(attributes)|
+------------------------+
|John |
+------------------------+
```
### How was this patch tested?
Unit tests.
Closes apache#41147 from xinrong-meng/nestedType.
Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
…re streaming query fails due to concurrent run of streaming query with same checkpoint

### What changes were proposed in this pull request?
We are migrating to a new error framework in order to surface errors in a friendlier way to customers. This PR defines a new error class specifically for when there are concurrent updates to the log for the same batch ID.

### Why are the changes needed?
This gives more information to customers, and allows them to filter in a better way.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
There is an existing test to check the error message upon this condition. Because we are only changing the error type, and not the error message, this test is sufficient.
Closes apache#41205 from ericm-db/SC-130782.
Authored-by: Eric Marnadi <eric.marnadi@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Deduplicate `scikit-learn` in the Dockerfile.

### Why are the changes needed?
The `pip` command has two `scikit-learn` items.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing CI.
Closes apache#41243 from zhengruifeng/infra_dev_sklearn.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The PR aims to update some Maven plugins to the newest version, including:
- exec-maven-plugin from 1.6.0 to 3.1.0
- scala-maven-plugin from 4.8.0 to 4.8.1
- maven-antrun-plugin from 1.8 to 3.1.0
- maven-enforcer-plugin from 3.2.1 to 3.3.0
- build-helper-maven-plugin from 3.3.0 to 3.4.0
- maven-surefire-plugin from 3.0.0 to 3.1.0
- maven-assembly-plugin from 3.1.0 to 3.6.0
- maven-install-plugin from 3.1.0 to 3.1.1
- maven-deploy-plugin from 3.1.0 to 3.1.1
- maven-checkstyle-plugin from 3.2.1 to 3.2.2

### Why are the changes needed?
Routine upgrade.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
Closes apache#41228 from panbingkun/maven_plugin_upgrade.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
…path if active hadoop-provided
### What changes were proposed in this pull request?
This PR adds `log4j-1.2-api` and `log4j-slf4j2-impl` to the classpath if `hadoop-provided` is active.
### Why are the changes needed?
To fix a logging issue.
How to reproduce this issue:
1. Build Spark:
```
./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz
```
2. Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars:
```
guava-14.0.1.jar
hadoop-client-api-3.3.5.jar
hadoop-client-runtime-3.3.5.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.5.jar
slf4j-api-2.0.7.jar
```
3. Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
```
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{yyyyMMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
```
4. Start Spark thriftserver: `sbin/start-thriftserver.sh`.
5. The log file is empty: `cat /tmp/spark/logs/spark.log`.
6. Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars:
```
log4j-1.2-api-2.20.0.jar
log4j-slf4j2-impl-2.20.0.jar
```
7. Restart Spark thriftserver: `sbin/start-thriftserver.sh`.
8. The log file is not empty: `cat /tmp/spark/logs/spark.log`.
This is because the Hadoop classpath does not contain these jars, so they are needed even if `hadoop-provided` is activated.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes apache#41195 from wangyum/SPARK-43534.
Lead-authored-by: Yuming Wang <wgyumg@gmail.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The PR aims to upgrade buf from 1.18.0 to 1.19.0.

### Why are the changes needed?
1. Release Notes: https://github.com/bufbuild/buf/releases, bugs fixed as follows:
   - Fix issue in buf build and buf generate where the use of type filtering (via --type flags) would cause the resulting image to have no source code info, even when --exclude-source-info was not specified. The main impact of the bug was that generated code would be missing comments.
   - Fix issue in buf curl when using --user or --netrc that would cause a malformed Authorization header to be sent.
2. bufbuild/buf@v1.18.0...v1.19.0
3. Manually tested `dev/connect-gen-protos.sh`; this upgrade will not change the generated files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and pass GA.
Closes apache#41246 from panbingkun/SPARK-43599.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to update the K8s doc to recommend K8s 1.24+.

### Why are the changes needed?
**1. Default K8s Version in Public Cloud environments**

As of today (2023 May), the default K8s versions of public cloud providers have moved on to K8s 1.24+ already.
- EKS: v1.26 (Default)
- GKE: v1.24 (Stable), v1.25 (Regular), v1.27 (Rapid)

**2. End Of Support**

In addition, K8s 1.23 and older are going to reach EOL when Apache Spark 3.5.0 arrives. K8s 1.24 also will reach EOL in some cloud providers.

| K8s  | AKS     | GKE     | EKS     |
| ---- | ------- | ------- | ------- |
| 1.23 | 2023-03 | 2023-07 | 2023-10 |
| 1.24 | 2023-07 | 2023-10 | 2024-01 |

- [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar)
- [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule)
- [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html)

### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only change about K8s versions.

### How was this patch tested?
Manual review.
Closes apache#41247 from dongjoon-hyun/SPARK-43600.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ements
### What changes were proposed in this pull request?
Remove the upper bound of `matplotlib` in requirements
### Why are the changes needed?
1. Actually, `matplotlib` is not pinned in CI;
2. `matplotlib<3.3.0` fails `pip install -U -r dev/requirements.txt` in some cases, e.g. on Ubuntu 18.04:
```
gcc -pthread -B /home/ruifeng.zheng/miniconda3/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ruifeng.zheng/miniconda3/include -fPIC -O2 -isystem /home/ruifeng.zheng/miniconda3/include -fPIC -DFREETYPE_BUILD_TYPE=system -DPY_ARRAY_UNIQUE_SYMBOL=MPL_matplotlib_ft2font_ARRAY_API -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -D__STDC_FORMAT_MACROS=1 -Iextern/agg24-svn/include -I/home/ruifeng.zheng/miniconda3/lib/python3.10/site-packages/numpy/core/include -I/home/ruifeng.zheng/miniconda3/include/python3.10 -c src/checkdep_freetype2.c -o build/temp.linux-x86_64-cpython-310/src/checkdep_freetype2.o
src/checkdep_freetype2.c:3:6: error: #error "FreeType version 2.3 or higher is required. You may set the MPLLOCALFREETYPE environment variable to 1 to let Matplotlib download it."
#error "FreeType version 2.3 or higher is required. \
^~~~~
src/checkdep_freetype2.c:10:10: error: #include expects "FILENAME" or <FILENAME>
#include FT_FREETYPE_H
^~~~~~~~~~~~~
src/checkdep_freetype2.c:15:9: note: #pragma message: Compiling with FreeType version FREETYPE_MAJOR.FREETYPE_MINOR.FREETYPE_PATCH.
#pragma message("Compiling with FreeType version " \
^~~~~~~
src/checkdep_freetype2.c:18:4: error: #error "FreeType version 2.3 or higher is required. You may set the MPLLOCALFREETYPE environment variable to 1 to let Matplotlib download it."
#error "FreeType version 2.3 or higher is required. \
^~~~~
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> matplotlib
```
### Does this PR introduce _any_ user-facing change?
no, dev-only
### How was this patch tested?
manually test
Closes apache#41248 from zhengruifeng/unpin_matplotlib_in_req.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
This PR aims to handle a corner case when the `row.excludedInStages` field is missing.

### Why are the changes needed?
To fix a type error when Spark loads some very old 2.4.x or 3.0.x logs. We have two places, and this PR protects both places.
```
$ git grep row.excludedInStages
core/src/main/resources/org/apache/spark/ui/static/executorspage.js:    if (typeof row.excludedInStages === "undefined" || row.excludedInStages.length == 0) {
core/src/main/resources/org/apache/spark/ui/static/executorspage.js:    return "Active (Excluded in Stages: [" + row.excludedInStages.join(", ") + "])";
```

### Does this PR introduce _any_ user-facing change?
No, this will remove the error case only.

### How was this patch tested?
Manual review.
Closes apache#41266 from dongjoon-hyun/SPARK-43719.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…perly

### What changes were proposed in this pull request?
This PR proposes to fix `SeriesDateTimeTests.test_quarter` to work properly.

### Why are the changes needed?
The test has not been testing properly.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Manually tested, and the existing CI should pass.
Closes apache#41274 from itholic/minor_quarter_test.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ith `SparkThrowableSuite`, the last line of the file should be an empty line

### What changes were proposed in this pull request?
The PR aims to generate a trailing blank line when formatting the `error-classes.json` file using `SparkThrowableSuite`.

### Why are the changes needed?
- When I format the `error-classes.json` file using `SparkThrowableSuite`, I found that the trailing newline of the file is erased, which does not comply with common code specifications, similar to Python's https://www.flake8rules.com/rules/W391.html
- Improve developer experience.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual testing.
Closes apache#41256 from panbingkun/SPARK-43714.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…d `Drop(columnName)`

### What changes were proposed in this pull request?
Document the difference between `Drop(column)` and `Drop(columnName)`.

### Why are the changes needed?
To better illustrate this difference.

### Does this PR introduce _any_ user-facing change?
Yes, new doc and example.

### How was this patch tested?
CI and added doctest.
Closes apache#41273 from zhengruifeng/doc_drop_api.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fix a typo in the test - it should check q2, not q1 twice.

### Why are the changes needed?
Fix typo.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.
Closes apache#41260 from juliuszsompolski/SPARK-43331-fixup.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…m auto-completion
### What changes were proposed in this pull request?
Move unsupported functions to `__getattr__`, except `getActiveSession` (this approach doesn't work for a `classmethod`).
### Why are the changes needed?
Hide unsupported functions from auto-completion.
before:
<img width="1464" alt="image" src="https://github.com/apache/spark/assets/7322292/6a3efc83-99ed-4b73-b681-13640b3de7a0">
after:
<img width="1311" alt="image" src="https://github.com/apache/spark/assets/7322292/79366a32-718b-4208-ae1f-3e749971d6d2">
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Manually checked in `ipython`.
Closes apache#41272 from zhengruifeng/session_unsupported.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
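The `__getattr__` trick above relies on a Python data-model detail: `__getattr__` is only invoked when normal attribute lookup fails, so names handled there never appear in `dir()` or tab-completion, yet accessing them can still raise a helpful error. A minimal sketch of the pattern (the class and method names here are illustrative, not the actual pyspark code):

```python
class Session:
    def sql(self, query: str) -> str:
        # A supported method: defined on the class, so it is
        # visible to dir() and tab-completion.
        return f"executed: {query}"

    def __getattr__(self, name: str):
        # Only reached when normal lookup fails, so these names stay
        # hidden from auto-completion but still fail with a clear message.
        unsupported = {"newSession", "sparkContext", "udtf"}
        if name in unsupported:
            raise NotImplementedError(f"{name}() is not supported in this client")
        raise AttributeError(name)
```

As the commit notes, this does not help for a `classmethod` like `getActiveSession`, since `__getattr__` defined on the class only intercepts instance-attribute lookups.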
…r class _LEGACY_ERROR_TEMP_240[1-3]
### What changes were proposed in this pull request?
The pr aims to assign a name to the error class `_LEGACY_ERROR_TEMP_240[1-3]`.
### Why are the changes needed?
Improve the error framework.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test cases.
Closes apache#41252 from beliefer/offset-limit-error-improve.
Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…-kerberos secured HS2)
### What changes were proposed in this pull request?
This PR ports [HIVE-12188](https://issues.apache.org/jira/browse/HIVE-12188) (DoAs does not work properly in non-kerberos secure HS2) to Spark.
### Why are the changes needed?
The following settings are valid but still do not work correctly in the current HS2 (copied from HIVE-12188's description):
```
hive.server2.authentication=NONE (or LDAP)
hive.server2.enable.doAs=true
hive.metastore.sasl.enabled=true (with HMS Kerberos enabled)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes apache#41276 from wangyum/SPARK-43743.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pr aims to upgrade `zstd-jni` from 1.5.5-2 to 1.5.5-3.
### Why are the changes needed?
The new version includes some improvements & bug fixes, e.g.
- luben/zstd-jni#258
- luben/zstd-jni#262
- luben/zstd-jni#253

Other changes as follows:
- luben/zstd-jni@v1.5.5-2...v1.5.5-3
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes apache#41269 from panbingkun/SPARK-43737.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ession.addArtifacts
### What changes were proposed in this pull request?
This PR proposes to add the support of pyfiles (`.zip`, `.py`, `.jar`, `.egg` files) in `SparkSession.addArtifacts`.
### Why are the changes needed?
So that end users can add dependencies in the Python Spark Connect client.
### Does this PR introduce _any_ user-facing change?
Yes, it adds the support of pyfiles (`.zip`, `.py`, `.jar`, `.egg` files) in `SparkSession.addArtifacts`.
### How was this patch tested?
Manually tested via `local-cluster`.
```bash
./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` --master "local-cluster[2,2,1024]"
./bin/pyspark --remote "sc://localhost:15002"
```
```python
import os
import shutil
import tempfile

from pyspark.sql.functions import udf

with tempfile.TemporaryDirectory() as d:
    package_path = os.path.join(d, "my_zipfile")
    os.mkdir(package_path)
    pyfile_path = os.path.join(package_path, "__init__.py")
    with open(pyfile_path, "w") as f:
        _ = f.write("my_func = lambda: 5")
    shutil.make_archive(package_path, "zip", d, "my_zipfile")

    @udf("long")
    def func(x):
        import my_zipfile
        return my_zipfile.my_func()

    spark.addArtifacts(f"{package_path}.zip", pyfile=True)
    spark.range(1).select(func("id")).show()
```
Also added a couple of unittests.
Closes apache#41278 from HyukjinKwon/SPARK-43747.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…unction
### What changes were proposed in this pull request?
The pr aims to add a max distance argument to the `levenshtein()` function.
### Why are the changes needed?
Currently, Spark's `levenshtein(str1, str2)` function can be very inefficient for long strings. Many other databases that support this type of built-in function also take a third argument signifying a maximum distance, after which it is okay to terminate the algorithm. For example, with something like `levenshtein(str1, str2[, max_distance])`, the function stops computing the distance once the max value is reached. See PostgreSQL for an example of a 3-argument [levenshtein](https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7).
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Add new UT & pass GA.
Closes apache#41169 from panbingkun/SPARK-43493.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
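The early-termination idea behind the max-distance argument can be shown with a small pure-Python sketch (this is not Spark's implementation; the `-1`-on-exceeded convention and the function name are assumptions for illustration). In the row-by-row dynamic program, once the minimum value of a row exceeds the bound, no completion of the alignment can come back under it, so the loop can stop:

```python
def bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
    # If the lengths differ by more than the bound, the distance
    # must already exceed it: bail out without any DP work.
    if abs(len(s1) - len(s2)) > max_distance:
        return -1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        row_min = i
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            v = min(prev[j] + 1,        # deletion
                    curr[j - 1] + 1,    # insertion
                    prev[j - 1] + cost) # substitution / match
            curr.append(v)
            row_min = min(row_min, v)
        if row_min > max_distance:
            return -1  # every path through this row exceeds the bound
        prev = curr
    return prev[-1] if prev[-1] <= max_distance else -1
```

For example, `bounded_levenshtein("kitten", "sitting", 10)` computes the full distance 3, while a bound of 2 lets the search stop without finishing.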
…ed error's text in source code
### What changes were proposed in this pull request?
The pr aims to refactor `INVALID_SQL_SYNTAX` to avoid embedding the error's text in source code.
### Why are the changes needed?
The changes improve the error framework.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Update UT.
- Pass GA.
Closes apache#41254 from panbingkun/SPARK-43604.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr aims to upgrade dropwizard metrics to 4.2.18.
### Why are the changes needed?
- 4.2.17 vs 4.2.18: dropwizard/metrics@v4.2.17...v4.2.18
- This version relies on jetty9 v9.4.51.v20230217 for compilation, and Spark is currently using that version as well: [Update jetty9.version to v9.4.51.v20230217](dropwizard/metrics@245c516)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes apache#41270 from panbingkun/SPARK-43738.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades `snappy-java` from 1.1.9.1 to 1.1.10.0.
### Why are the changes needed?
The new `snappy-java` version fixes a potential issue for Graviton support when used with old GLIBC versions. See xerial/snappy-java#417.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes apache#41285 from sunchao/snappy-java.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…lid path throws exception'
### What changes were proposed in this pull request?
The pr aims to fix a bug in `AvroSuite` for 'reading from invalid path throws exception'.
### Why are the changes needed?
- As discussed and analyzed in [41271](apache#41271 (comment)).
- There is a problem with this UT. Its original intention was to test that the read fails when the directory contains no files with the `.avro` extension. However, the error was triggered by `FileUtils.touch` instead of `spark.read.format("avro").load(dir.toString)`. The root cause of the failure is that the parent directory was not created: when `FileUtils.touch` is called in version 1.11.0, it just throws `java.io.FileNotFoundException`, which masks the intended error.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes apache#41289 from panbingkun/SPARK-43767.
Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
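The root cause generalizes beyond commons-io: touching a file whose parent directory does not exist fails with a file-not-found error, which can mask the assertion a test actually meant to make. A minimal Python analogue of the failure and the fix (the Spark test itself is Scala using `FileUtils.touch`; `safe_touch` is a hypothetical helper, not Spark code):

```python
import tempfile
from pathlib import Path


def safe_touch(path: Path) -> None:
    # Create the parent directories first, which avoids the
    # FileNotFoundError that masked the intended test failure.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "missing_dir" / "data.txt"
    try:
        target.touch()  # parent "missing_dir" does not exist yet
    except FileNotFoundError:
        pass  # this, not the Avro read, is what failed in the old test
    safe_touch(target)
    assert target.exists()
```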
### What changes were proposed in this pull request?
The pr aims to convert `_LEGACY_ERROR_TEMP_0036` to `INVALID_SQL_SYNTAX`.
### Why are the changes needed?
The changes improve the error framework.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA & update UT.