Kyspark 3.2.x 4.x qa merge #42648
Closed
zheniantoushipashi wants to merge 696 commits into
Conversation
…HUFFLE_ENABLED is set to false

### What changes were proposed in this pull request?
Only log a warning when `PUSH_BASED_SHUFFLE_ENABLED` is set to true and `canDoPushBasedShuffle` is false.

### Why are the changes needed?
Currently, this warning is printed out even when `PUSH_BASED_SHUFFLE_ENABLED` is set to false, which is unnecessary.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed existing UT.

Closes apache#33984 from rmcyang/SPARK-36705-follow-up.

Authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2d7dc7c)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
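For illustration, a minimal sketch of the guard this follow-up describes (the flag names are taken from the description above; the helper itself is ours, not Spark's actual code):

```scala
// Warn only when push-based shuffle was explicitly requested but the
// environment cannot support it; stay silent when it was never enabled.
def maybeWarnPushBasedShuffle(
    pushBasedShuffleEnabled: Boolean,
    canDoPushBasedShuffle: Boolean): Unit = {
  if (pushBasedShuffleEnabled && !canDoPushBasedShuffle) {
    println("WARN: push-based shuffle is enabled but cannot be used here")
  }
}
```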
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN`.
In this PR we add a wrapper for `OpenHashSet` that can handle `null`, `Double.NaN`, and `Float.NaN` together.
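To illustrate the idea, here is a minimal sketch using a plain Scala set as a stand-in for `OpenHashSet` (the class and method names are assumptions, not the PR's actual wrapper):

```scala
// Normalize every NaN bit pattern to the canonical Double.NaN before it
// reaches the backing set, and track null with a separate flag, so the
// set only ever sees one NaN representative and never sees null.
class NaNAwareDoubleSet {
  private val underlying = scala.collection.mutable.HashSet.empty[Double]
  private var hasNull = false

  private def normalize(v: Double): Double =
    if (v.isNaN) Double.NaN else v

  def add(v: java.lang.Double): Unit =
    if (v == null) hasNull = true else underlying += normalize(v)

  def contains(v: java.lang.Double): Boolean =
    if (v == null) hasNull else underlying.contains(normalize(v))
}
```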
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
Yes. `ArrayUnion` no longer returns duplicated `NaN` values.
### How was this patch tested?
Added UT
Closes apache#33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f71f377)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…m.xml

### What changes were proposed in this pull request?
This PR aims to fix the regex to avoid breaking `pom.xml`.

### Why are the changes needed?
**BEFORE**
```
$ dev/change-scala-version.sh 2.12
$ git diff | head -n10
diff --git a/core/pom.xml b/core/pom.xml
index dbde22f..6ed368353b 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,7 +35,7 @@
   </properties>
   <dependencies>
-    <!--<!--
```
**AFTER**
Since the default Scala version is `2.12`, the following `no-op` is the correct behavior which is consistent with the previous behavior.
```
$ dev/change-scala-version.sh 2.12
$ git diff
```

### Does this PR introduce _any_ user-facing change?
No. This is a dev only change.

### How was this patch tested?
Manually.

Closes apache#33996 from dongjoon-hyun/SPARK-36712.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d730ef2)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…and Float.NaN

### What changes were proposed in this pull request?
According to apache#33955 (comment), use normalized NaN.

### Why are the changes needed?
Use normalized NaN for duplicated NaN values.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes apache#34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6380859)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…and Float.NaN
### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns `false`, but it should return `true`.
This issue is caused by `scala.mutable.HashSet` not being able to handle `Double.NaN` and `Float.NaN`.
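For context, a small Scala illustration of why NaN trips up equality-based lookups (standard IEEE-754 semantics, nothing Spark-specific):

```scala
// Primitive == follows IEEE-754, where NaN is not equal to itself, while
// boxed java.lang.Double.equals deliberately treats two NaNs as equal.
val x = Double.NaN
println(x == x)                    // false
println(java.lang.Double.valueOf(x).equals(
  java.lang.Double.valueOf(x)))    // true
```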
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
Yes. `arrays_overlap` now treats `NaN` values as equal, so it returns `true` when both arrays contain `NaN`.
### How was this patch tested?
Added UT
Closes apache#34006 from AngersZhuuuu/SPARK-36755.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b665782)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix `Series.update` when the other `Series` comes from the same frame.
Also add a test for updating a `Series` from a different frame.
### Why are the changes needed?
`Series.update` currently fails when the other `Series` comes from the same frame, while pandas supports it.
Pandas behavior:
``` python
>>> pdf = pd.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> pdf
     a    b
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```
### Does this PR introduce _any_ user-facing change?
Before
```python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update
combined = combine_frames(self._psdf, other._psdf, how="leftouter")
File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames
assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```
After
```python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> psdf.a.update(psdf.b)
>>> psdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```
### How was this patch tested?
unit tests
Closes apache#33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c15072c)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Upgrade Apache Parquet to 1.12.1.

### Why are the changes needed?
Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests + a new test for the issue in SPARK-36696.

Closes apache#33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.

### Why are the changes needed?
Scala 2.12.15 improves compatibility with JDK 17 and 18: https://github.com/scala/scala/releases/tag/v2.12.15
- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer

### Does this PR introduce _any_ user-facing change?
Yes, this is a Scala version change.

### How was this patch tested?
Pass the CIs.

Closes apache#33999 from dongjoon-hyun/SPARK-36759.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 16f1f71)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?
Apache ORC 1.6.11 has the following fixes:
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes apache#33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c217797)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add documentation for the ANSI store assignment rules:
- the valid source/target type combinations
- the runtime error that happens on numeric overflow

### Why are the changes needed?
Better docs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Built the docs and previewed the rendered page.

Closes apache#34014 from gengliangwang/addStoreAssignDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ff7705a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
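For reference, a small sketch of the runtime-overflow behavior those docs describe (assumes a running `SparkSession` named `spark`; the table name is illustrative):

```scala
// Under ANSI store assignment, writing a BIGINT value that overflows an
// INT column raises a runtime error instead of silently wrapping around.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE t(i INT) USING parquet")
spark.sql("INSERT INTO t VALUES (2147483648L)") // 2^31 overflows INT => error
```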
…lease

### What changes were proposed in this pull request?
This PR aims to move Java 17 on GA from the early access release to the LTS release.

### Why are the changes needed?
Java 17 LTS was released a few days ago and it's available on GA.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA itself.

Closes apache#34017 from sarutak/ga-java17.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 89a9456)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
…nd doc
### What changes were proposed in this pull request?
This is a follow-up to fix the leftovers from switching the Scala version.
### Why are the changes needed?
This should be consistent.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is not tested by UT. We need to check manually that there are no more `2.12.14` references:
```
$ git grep 2.12.14
R/pkg/tests/fulltests/test_sparkSQL.R: c(as.Date("2012-12-14"), as.Date("2013-12-15"), as.Date("2014-12-16")))
data/mllib/ridge-data/lpsa.data:3.5307626,0.987291634724086 -0.36279314978779 -0.922212414640967 0.232904453212813 -0.522940888712441 1.79270085261407 0.342627053981254 1.26288870310799
sql/hive/src/test/resources/data/files/over10k:-3|454|65705|4294967468|62.12|14.32|true|mike white|2013-03-01 09:11:58.703087|40.18|joggying
```
Closes apache#34020 from dongjoon-hyun/SPARK-36759-2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit adbea25)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…nd use it in SparkSubmitSuite

### What changes were proposed in this pull request?
This PR refactors test code in order to improve the debuggability of `SparkSubmitSuite`.

The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`.

In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary.

### Why are the changes needed?
Previously, `SparkSubmitSuite` tests would fail with messages like:
```
[info] - launch simple application with spark-submit *** FAILED *** (1 second, 832 milliseconds)
[info]   Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```
which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command.

After this change, those tests will fail with detailed error messages that include the text of the failed command plus timestamped logs captured from the failed process:
```
[info] - launch simple application with spark-submit *** FAILED *** (2 seconds, 800 milliseconds)
[info]   spark-submit returned with exit code 101.
[info]   Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar'
[info]
[info]   2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[info]   2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I manually ran the affected test suites.

Closes apache#34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(cherry picked from commit 3ae6e67)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
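The core technique is easy to sketch in isolation (a hedged, simplified stand-in for `SparkSubmitTestUtils`, not its actual code):

```scala
// Run a command, capture stdout/stderr with timestamps, and return the
// exit code together with the captured lines so a failing test can print
// the full command output instead of pointing at log4j files.
import java.time.LocalTime
import scala.collection.mutable.ArrayBuffer
import scala.sys.process.{Process, ProcessLogger}

def runAndCapture(cmd: Seq[String]): (Int, Seq[String]) = {
  val lines = ArrayBuffer.empty[String]
  val logger = ProcessLogger(
    out => lines.synchronized { lines += s"${LocalTime.now} - stdout> $out" },
    err => lines.synchronized { lines += s"${LocalTime.now} - stderr> $err" })
  val exitCode = Process(cmd).!(logger)
  (exitCode, lines.toSeq)
}
```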
…terministic Project

### What changes were proposed in this pull request?
`ScanOperation` collects adjacent Projects and Filters. The caller side always assumes that the collected Filters run before the collected Projects, which means `ScanOperation` effectively pushes Filter through Project. Following `PushPredicateThroughNonJoin`, we should not push Filter through a nondeterministic Project. This PR fixes `ScanOperation` to follow this rule.

### Why are the changes needed?
Fix a bug that violates the semantics of nondeterministic expressions.

### Does this PR introduce _any_ user-facing change?
Most likely no change, but in some cases this is a correctness bug fix which changes the query result.

### How was this patch tested?
Existing tests.

Closes apache#34023 from cloud-fan/scan.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dfd5237)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
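A small illustration of why the rule matters (assumes a running `SparkSession` named `spark`):

```scala
// rand() is nondeterministic: the Filter must see exactly the r values the
// Project produced. Pushing the predicate below the Project would evaluate
// rand() again and could keep rows whose projected r is actually <= 0.5.
val df = spark.range(10)
  .selectExpr("id", "rand() AS r")
  .filter("r > 0.5")
df.show()
```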
…om warning to debug
### What changes were proposed in this pull request?
This PR suppresses the warnings for plans where AQE is not supported. Currently we print warnings such as:
```
org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324881]
```
for every plan where AQE is not supported.
### Why are the changes needed?
It's too noisy now. Below is an example of a `SortSuite` run:
```
14:51:40.675 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324881]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=true, sortOrder=List('a DESC NULLS FIRST) (785 milliseconds)
14:51:41.416 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324884 ASC NULLS FIRST], true
+- Scan ExistingRDD[a#324884]
.
14:51:41.467 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324884 ASC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324884]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS FIRST) (796 milliseconds)
14:51:42.210 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324887 ASC NULLS LAST], true
+- Scan ExistingRDD[a#324887]
.
14:51:42.259 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324887 ASC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324887]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS LAST) (797 milliseconds)
14:51:43.009 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324890 DESC NULLS LAST], true
+- Scan ExistingRDD[a#324890]
.
14:51:43.061 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324890 DESC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324890]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS LAST) (848 milliseconds)
14:51:43.857 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324893 DESC NULLS FIRST], true
+- Scan ExistingRDD[a#324893]
.
14:51:43.903 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324893 DESC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324893]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS FIRST) (827 milliseconds)
14:51:44.682 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324896 ASC NULLS FIRST], true
+- Scan ExistingRDD[a#324896]
.
14:51:44.748 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324896 ASC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324896]
.
[info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS FIRST) (565 milliseconds)
14:51:45.248 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324899 ASC NULLS LAST], true
+- Scan ExistingRDD[a#324899]
.
14:51:45.312 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324899 ASC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324899]
.
[info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS LAST) (591 milliseconds)
14:51:45.841 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324902 DESC NULLS LAST], true
+- Scan ExistingRDD[a#324902]
.
14:51:45.905 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324902 DESC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324902]
.
```
### Does this PR introduce _any_ user-facing change?
Yes, it will show fewer warnings to users. Note that AQE is enabled by default from Spark 3.2; see SPARK-33679.
### How was this patch tested?
Manually tested via existing unit tests.
Closes apache#34026 from HyukjinKwon/minor-log-level.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 917d7da)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…lder in array functions

### What changes were proposed in this pull request?
In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element.

### Why are the changes needed?
Fix a potential bug. Somehow we can hit this bug sometimes after apache#33955.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes apache#34029 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 4145498)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…at.NaN
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN`.
This PR fixes the issue based on apache#33955.
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
Yes. `ArrayDistinct` no longer returns duplicated `NaN` values.
### How was this patch tested?
Added UT
Closes apache#33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…am is being used" in KafkaContinuousTest

### What changes were proposed in this pull request?
The test "ensure continuous stream is being used" in KafkaContinuousTest quickly checks the actual type of the execution and stops the query. Stopping a streaming query in continuous mode is done by interrupting the query execution thread and joining on it with an indefinite timeout.

In parallel, the started streaming query is going to generate an execution plan, including running the optimizer. Some parts of SessionState can be built at that time, as they are defined as lazy. The problem is that some of them seem to "swallow" the InterruptedException and let the thread keep running, so the query can't notice the request to stop and won't stop.

This PR fixes the scenario by ensuring that the streaming query has started before the test stops it.

### Why are the changes needed?
The race condition could end up with the test hanging until the test framework marks it as timed out.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes apache#34004 from HeartSaVioR/SPARK-36764.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6099edc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to fix the incorrect schema of `union`.

### Why are the changes needed?
The current `union` result of nested struct columns is incorrect. By the definition of the `union` API, it should resolve columns by position, not by name. Right now, when determining the `output` (i.e. the schema) of the union plan, we use the `merge` API, which actually merges two structs (simply think of it as concatenating the non-overlapping fields from the two structs). The merging behavior doesn't match the `union` definition, so currently we get an incorrect schema even though the query result is correct. We should fix the incorrect schema.

### Does this PR introduce _any_ user-facing change?
Yes, fixing a bug of incorrect schema.

### How was this patch tested?
Added unit test.

Closes apache#34025 from viirya/SPARK-36673.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit cdd7ae9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
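A quick sketch of the by-position contract at issue (assumes a running `SparkSession` named `spark`; the field names are illustrative):

```scala
// union resolves columns by position, so the unioned schema should be the
// left side's (s: struct<a: bigint>) even though df2 names its nested
// field b. The rows were already correct; before this fix the reported
// schema could instead be a merge of both structs.
import org.apache.spark.sql.functions.{col, struct}
val df1 = spark.range(1).select(struct(col("id").as("a")).as("s"))
val df2 = spark.range(1).select(struct(col("id").as("b")).as("s"))
df1.union(df2).printSchema()
```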
…and Unit test

### What changes were proposed in this pull request?
Add a comment about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN, and add a unit test for this.

### Why are the changes needed?
Add a unit test.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes apache#34008 from AngersZhuuuu/SPARK-36740.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 69e006d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
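For reference, Spark SQL orders NaN as larger than any other double value, which is the behavior these comments and tests pin down (sketch assumes a running `SparkSession` named `spark`):

```scala
// NaN sorts greater than every other numeric value in Spark SQL, so
// array_max picks NaN when present while array_min does not.
spark.sql("SELECT array_max(array(cast('nan' AS double), 1d))").show() // NaN
spark.sql("SELECT array_min(array(cast('nan' AS double), 1d))").show() // 1.0
```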
### What changes were proposed in this pull request?
Java 17 has been officially released. This PR makes `dev/mima` run on Java 17.

### Why are the changes needed?
To make tests pass on Java 17.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes apache#34022 from RabbidHY/SPARK-36780.

Lead-authored-by: Yang He <stitch106hy@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5d0889b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…rationsSuite

### What changes were proposed in this pull request?
As a follow-up of apache#34025, remove the duplicate test.

### Why are the changes needed?
To remove the duplicate test.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test.

Closes apache#34032 from viirya/remove.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit f9644cc)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Fix `Series.isin` when the `Series` has NaN values.

### Why are the changes needed?
Fix `Series.isin` when the `Series` has NaN values:
```python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> pser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
>>> psser.isin([1, 3, 5, None])
0    None
1    True
2    None
3    True
4    None
5    True
6    None
7    None
8    None
dtype: object
```

### Does this PR introduce _any_ user-facing change?
After this PR:
```python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> psser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes apache#34005 from dgd-contributor/SPARK-36762_fix_series.isin_when_values_have_NaN.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 32b8512)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
…empt id not matching

### What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf, and pass it through SparkEnv instead.

### Why are the changes needed?
Push-based shuffle will fail if any attemptId is set in the SparkConf, as the attemptId is not set correctly in the Driver.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested within our Yarn cluster. Without this PR, the Driver fails to finalize the shuffle merge on all the mergers. After the patch, the Driver can successfully finalize the shuffle merge and push-based shuffle works fine. Also added a unit test to verify the attemptId is set in the BlockStoreClient in the Driver.

Closes apache#34018 from zhouyejoe/SPARK-36772.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit cabc36b)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN`.
This PR fixes the issue based on apache#33955.
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
Yes. `ArrayIntersect` no longer treats `NaN` values as matching, so equal `NaN` values are not returned in the intersection.
### How was this patch tested?
Added UT
Closes apache#33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to upgrade R from 3.6.3 to 4.0.4 in the K8s R Docker image.

### Why are the changes needed?
The `openjdk:11-jre-slim` image was upgraded to `Debian 11`:
```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```
This causes `R 3.5` installation failures in our K8s integration test environment:
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
 r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
               Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update && apt install -y -t buster-cran35 r-base r-base-dev && rm -rf
```

### Does this PR introduce _any_ user-facing change?
Yes, this will recover the installation.

### How was this patch tested?
Succeeded in building the SparkR docker image in the K8s integration test in Jenkins CI:
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```

Closes apache#34048 from dongjoon-hyun/SPARK-36806.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a178752)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fix `DataFrame.isin` when the `DataFrame` has NaN values.
### Why are the changes needed?
`DataFrame.isin` returns wrong results when the `DataFrame` has NaN values:
``` python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]
>>> psdf.isin(other)
      a     b     c
0  None  None  True
1  True  None  None
2  None  None  True
3  None  None  None
4  None  True  True
5  None  True  True
6  None  None  True
7  None  None  None
8  None  None  None
>>> psdf.to_pandas().isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```
### Does this PR introduce _any_ user-facing change?
After this PR
``` python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]
>>> psdf.isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```
### How was this patch tested?
Unit tests
Closes apache#34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit cc182fe)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
* KE-40433 add page index filter log
* KE-40433 update parquet version
…reader (Kyligence#623) (Kyligence#629) KE-41399 SPARK-42388 Avoid reading the Parquet footer twice in the vectorized reader
…netty (Kyligence#638)
* KE-42144 Upgrade tomcat-embed-core to 9.0.76
* KE-42145 Upgrade netty-handler to 4.1.94.Final
…Utils.unpack (Kyligence#643) (Kyligence#648)

### What changes were proposed in this pull request?
This PR proposes to use `FileUtil.unTarUsingJava`, a Java implementation for un-tarring `.tar` files. `unTarUsingJava` is not public, but it exists in all Hadoop versions from 2.1+; see HADOOP-9264. Reproducing the security issue requires a non-Windows platform and a non-gzipped TAR archive file name (the contents don't matter).

### Why are the changes needed?
There is a risk of arbitrary shell command injection via `Utils.unpack` when the filename is controlled by a malicious user. This is due to an issue in Hadoop's `unTar`, which does not properly escape the filename before passing it to a shell command: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java#L904

### Does this PR introduce _any_ user-facing change?
Yes, it prevents a security issue that previously allowed users to execute arbitrary shell commands.

### How was this patch tested?
Manually tested locally; existing test cases should cover it.

Closes apache#35946 from HyukjinKwon/SPARK-38631.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 057c051)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
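Since `unTarUsingJava` is package-private, the usual way to reach it is reflection. A hedged sketch of that approach (the helper name is ours; the private signature is as introduced by HADOOP-9264, so verify it against the Hadoop version in use):

```scala
// Invoke Hadoop's private FileUtil.unTarUsingJava(File, File, boolean)
// instead of shelling out to `tar`, so a crafted archive file name cannot
// inject shell commands.
import java.io.File
import org.apache.hadoop.fs.FileUtil

def unTarUsingJava(source: File, dest: File): Unit = {
  val m = classOf[FileUtil].getDeclaredMethod(
    "unTarUsingJava", classOf[File], classOf[File], classOf[Boolean])
  m.setAccessible(true)
  m.invoke(null, source, dest, java.lang.Boolean.FALSE) // false = not gzipped
}
```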
…ce#650)
* Revert "Revert "KE-41874 Optimize Calcite plan to convert spark logical plan"" (this reverts commit 59a9ba5)
* Revert "Revert "[SPARK-41732][SQL][SS] Apply tree-pattern based pruning for the rule SessionWindowing"" (this reverts commit b13a975)
* Revert "Revert "[SPARK-39441][SQL] Speed up DeduplicateRelations"" (this reverts commit 665dc0c)
* Columns that support special characters