…mp sequences

### What changes were proposed in this pull request?
Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence.

### Why are the changes needed?
`InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros which is not always entirely accurate for the date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and also occasionally underestimates the size of the result array.

`getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed.

`getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR). For example:
```
select sequence(
  timestamp'2022-03-13 00:00:00',
  timestamp'2022-03-14 00:00:00',
  interval 1 day) as x;
```
In the America/Los_Angeles time zone, this results in the following error:
```
java.lang.ArrayIndexOutOfBoundsException: 1
  at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
```
This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1.

However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array.

The new unit test includes examples of problematic date sequences.

This PR adds code to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit test.

Closes apache#37513 from bersprockets/date_sequence_array_size_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 3a1136a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
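A minimal sketch of the defensive check described above, with illustrative names rather than Spark's actual `InternalSequenceBase` internals:

```scala
// Hedged sketch: grow the pre-allocated result array by one element when the
// next write would overrun it (e.g. after a DST "spring forward" caused
// getSequenceLength to underestimate). Names are illustrative.
def ensureCapacity(arr: Array[Long], nextIndex: Int): Array[Long] = {
  if (nextIndex < arr.length) {
    arr // the estimate was sufficient
  } else {
    val grown = new Array[Long](arr.length + 1)
    Array.copy(arr, 0, grown, 0, arr.length)
    grown
  }
}
```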
…erV2.overwrite

### What changes were proposed in this pull request?
Fix `DataFrameWriterV2.overwrite()`, which fails to convert the condition parameter to Java, preventing the function from being called. It is caused by the following commit, which deleted the `_to_java_column` call instead of fixing it: apache@a1e459e

### Why are the changes needed?
`DataFrameWriterV2.overwrite()` cannot be called.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually checked whether the arguments are sent to the JVM or not.

Closes apache#37547 from looi/fix-overwrite.

Authored-by: Wenli Looi <wlooi@ucalgary.ca>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 4637986)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ibutes when traversing other children in RemoveRedundantAliases

### What changes were proposed in this pull request?
Do not exclude `Union`'s first child attributes when traversing other children in `RemoveRedundantAliases`.

### Why are the changes needed?
We don't need to exclude those attributes that `Union` inherits from its first child. See discussion here: apache#37496 (comment)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes apache#37534 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-follow-up.

Authored-by: Peter Toth <ptoth@cloudera.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e732232)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to initialize the projection so non-deterministic expressions can be evaluated with Python UDFs.

### Why are the changes needed?
To make Python UDFs work with non-deterministic expressions.

### Does this PR introduce _any_ user-facing change?
Yes.

```python
from pyspark.sql.functions import udf, rand
spark.range(10).select(udf(lambda x: x, "double")(rand())).show()
```

**Before**
```
java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
  at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
  at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
  at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
```

**After**
```
+----------------------------------+
|<lambda>rand(-2507211707257730645)|
+----------------------------------+
|                0.7691724424045242|
|               0.09602244075319044|
|                0.3006471278112862|
|                0.4182649571961977|
|               0.29349096650900974|
|                0.7987097908937618|
|                0.5324802583101007|
|                  0.72460930912789|
|                0.1367749768412846|
|               0.17277322931919348|
+----------------------------------+
```

### How was this patch tested?
Manually tested, and a unit test was added.

Closes apache#37552 from HyukjinKwon/SPARK-40121.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 336c9bc)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ssifier.setParams

### What changes were proposed in this pull request?
Restore rawPredictionCol to MultilayerPerceptronClassifier.setParams.

### Why are the changes needed?
This param was inadvertently removed in the refactoring in apache@40cdb6d#r81473316. Without it, using this param in the constructor fails.

### Does this PR introduce _any_ user-facing change?
Not aside from the bug fix.

### How was this patch tested?
Existing tests.

Closes apache#37561 from srowen/SPARK-40132.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 6768d9c)
Signed-off-by: Sean Owen <srowen@gmail.com>
This PR aims to update ORC to 1.7.6. This will bring the latest changes and bug fixes: https://github.com/apache/orc/releases/tag/v1.7.6

- ORC-1204: ORC MapReduce writer to flush when long arrays
- ORC-1205: `nextVector` should invoke `ensureSize` when reusing vectors
- ORC-1215: Remove a wrong `NotNull` annotation on `value` of `setAttribute`
- ORC-1222: Upgrade `tools.hadoop.version` to 2.10.2
- ORC-1227: Use `Constructor.newInstance` instead of `Class.newInstance`
- ORC-1228: Fix `setAttribute` to handle null value

No.

Pass the CIs.

Closes apache#37563 from williamhyun/ORC-176.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a1a049f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…arameters splitsArray, inputCols and outputCols can not be loaded after saving it

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

### What changes were proposed in this pull request?
Fix: a Bucketizer created for multiple columns with the parameters splitsArray, inputCols and outputCols can not be loaded after saving it.

### Why are the changes needed?
Bugfix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes apache#37568 from WeichenXu123/SPARK-35542.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
(cherry picked from commit 876ce6a)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
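A hedged sketch of the round-trip that previously failed; the path and split values are illustrative, not the PR's actual test:

```scala
// A multi-column Bucketizer configured via splitsArray/inputCols/outputCols,
// saved, then loaded. Loading failed before this fix.
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("b1", "b2"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 10.0, Double.PositiveInfinity)))

bucketizer.write.overwrite().save("/tmp/bucketizer_demo") // illustrative path
val loaded = Bucketizer.load("/tmp/bucketizer_demo")
```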
…ile as well

### What changes were proposed in this pull request?
This fixes a bug where the ConfigMap is not mounted on executors if they are under a non-default resource profile.

### Why are the changes needed?
When `spark.kubernetes.executor.disableConfigMap` is `false`, the expected behavior is that the ConfigMap is mounted regardless of the executor's resource profile. However, it is not mounted if the resource profile is non-default.

### Does this PR introduce _any_ user-facing change?
Executors with a non-default resource profile will have the ConfigMap mounted that was missing before, if `spark.kubernetes.executor.disableConfigMap` is `false` or default. If certain users need to keep that behavior for some reason, they would need to explicitly set `spark.kubernetes.executor.disableConfigMap` to `true`.

### How was this patch tested?
A new test case is added just below the existing ConfigMap test case.

Closes apache#37504 from nsuke/SPARK-40065.

Authored-by: Aki Sukegawa <nsuke@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 41ca629)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
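For reference, a minimal sketch of the configuration involved (illustrative only; this is the existing setting, not a new API):

```scala
// With the setting at its default (false), executor pods in every resource
// profile, not just the default one, should now get the ConfigMap mounted.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.executor.disableConfigMap", "false") // the default
// Setting it to "true" disables the ConfigMap for all executors.
```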
…ectness issue in the case of overlapping partition and data columns

### What changes were proposed in this pull request?
This PR fixes a correctness issue in the Parquet DSv1 FileFormat when the projection does not contain columns referenced in pushed filters. This typically happens when partition columns and data columns overlap.

This could result in an empty result when in fact there were records matching the predicate, as can be seen in the provided test case. The problem is especially visible with `count()` and `show()` reporting different results; for example, `show()` would return 1+ records while `count()` would return 0.

In Parquet, when the predicate is provided and column index is enabled, we would try to filter row ranges to figure out what the count should be. Unfortunately, there is an issue that if the projection is empty or is not in the set of filter columns, any checks on columns would fail and 0 rows are returned (`RowRanges.EMPTY`) even though there is data matching the filter.

Note that this is rather a mitigation, a quick fix. The actual fix needs to go into Parquet-MR: https://issues.apache.org/jira/browse/PARQUET-2170.

The fix is not required in DSv2 where the overlapping columns are removed in `FileScanBuilder::readDataSchema()`.

### Why are the changes needed?
Fixes a correctness issue when projection columns are not referenced by columns in pushed down filters, or the schema is empty, in Parquet DSv1.

Downsides: the Parquet column filter would be disabled if it had not been explicitly enabled, which could affect performance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I added a unit test that reproduces this behaviour. The test fails without the fix and passes with the fix.

Closes apache#37419 from sadikovi/SPARK-39833.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit cde71aa)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
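A hedged repro sketch of the shape of the problem; the data, path, and session setup are illustrative, not the PR's actual test:

```scala
// A pushed-down filter on a column that the projection does not select.
// Before the fix, DSv1 Parquet with the column index enabled could report
// 0 rows even though matches exist. Assumes a `spark` session (e.g. spark-shell).
import spark.implicits._

val path = "/tmp/parquet_filter_demo" // illustrative path
spark.range(100).selectExpr("id", "id % 10 AS b")
  .write.mode("overwrite").parquet(path)

val df = spark.read.parquet(path).filter($"b" === 3).select("id")
df.show()           // `b` is filtered on but absent from the projection
println(df.count()) // should agree with the rows shown above
```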
### What changes were proposed in this pull request?
Fix `split_part` codegen compilation issue:
```sql
SELECT split_part(str, delimiter, partNum) FROM VALUES ('11.12.13', '.', 3) AS v1(str, delimiter, partNum);
```
```
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: Expression "project_isNull_0 = false" is not a type
```
### Why are the changes needed?
Fix bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes apache#37589 from wangyum/SPARK-40152.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit cf1a80e)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This fixes https://issues.apache.org/jira/browse/SPARK-40089 where the prefix can overflow in some cases and the code assumes that the overflow is always on the negative side, not the positive side.

### Why are the changes needed?
This adds a check when the overflow does happen to know what is the proper prefix to return.

### Does this PR introduce _any_ user-facing change?
No, unless you consider getting the sort order correct a user-facing change.

### How was this patch tested?
I tested manually with the file in the JIRA and I added a small unit test.

Closes apache#37540 from revans2/fix_dec_sort.

Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8dfd3df)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
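The general technique, as a hedged sketch (illustrative code, not Spark's actual prefix computation): when a `Long` product overflows, clamp toward the extreme implied by the operands' signs rather than assuming the negative side.

```scala
// Saturating multiply: detect overflow, then pick the correct extreme based
// on the signs of the operands instead of always returning the minimum.
def saturatingMultiply(a: Long, b: Long): Long =
  try Math.multiplyExact(a, b)
  catch {
    case _: ArithmeticException =>
      if ((a > 0) == (b > 0)) Long.MaxValue // overflowed on the positive side
      else Long.MinValue                    // overflowed on the negative side
  }
```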
…ate/timestamp sequences the same

### What changes were proposed in this pull request?
Change how the length of the new result array is calculated in `InternalSequenceBase.eval` to match how the same is calculated in the generated code.

### Why are the changes needed?
This change brings the interpreted mode code in line with the generated code. Although I am not aware of any case where the current interpreted mode code fails, the generated code is more correct (it handles the case where the result array must grow more than once, whereas the current interpreted mode code does not).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes apache#37542 from bersprockets/date_sequence_array_size_issue_follow_up.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit d718867)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add tests for `SplitPart`.

### Why are the changes needed?
Improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A.

Closes apache#37626 from wangyum/SPARK-40152-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 4f525ee)
Signed-off-by: Sean Owen <srowen@gmail.com>
…eFileFormatSuite

### What changes were proposed in this pull request?
3 test cases in ImageFileFormatSuite became flaky in the GitHub Actions tests:
https://github.com/apache/spark/runs/7941765326?check_suite_focus=true
https://github.com/gengliangwang/spark/runs/7928658069

Before they are fixed (https://issues.apache.org/jira/browse/SPARK-40171), I suggest disabling them in OSS.

### Why are the changes needed?
Disable flaky tests before they are fixed. The test cases keep failing from time to time, while they always pass on a local env.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing CI

Closes apache#37605 from gengliangwang/disableFlakyTest.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 50f2f50)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…y tests

### What changes were proposed in this pull request?
This is a port of SPARK-40124 to Spark 3.3. Fix query 32 for TPCDS v1.4.

### Why are the changes needed?
The current q32.sql seems to be wrong: it just selects `1`. Reference for the query template: https://github.com/databricks/tpcds-kit/blob/eff5de2c30337b71cc0dc1976147742d2c65d378/query_templates/query32.tpl#L41

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test change only

Closes apache#37615 from mskapilks/change-q32-3.3.

Authored-by: Kapil Kumar Singh <kapilsingh@microsoft.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to support ASCII value conversion for Latin-1 Supplement characters.
### Why are the changes needed?
`ascii()` should be the inverse of `chr()`. But for a Latin-1 character, we get an incorrect ASCII value. For example:
```sql
select ascii('§') -- output: -62, expect: 167
select chr(167) -- output: '§'
```
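The root cause, sketched for illustration (not the PR's implementation): `§` is U+00A7, which UTF-8 encodes as two bytes, and reading only the first signed byte yields -62.

```scala
// "§" is U+00A7; UTF-8 encodes it as the two bytes 0xC2 0xA7. Interpreting
// just the first byte as a signed value gives -62, while the expected result
// is the code point value 167.
val s = "§"
println(s.getBytes("UTF-8").map(_.toInt).mkString(", ")) // -62, -89
println(s.codePointAt(0))                                // 167
```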
### Does this PR introduce _any_ user-facing change?
Yes, fixes the incorrect ASCII conversion for Latin-1 Supplement characters
### How was this patch tested?
UT
Closes apache#37651 from linhongliu-db/SPARK-40213.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c078523)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes a bug caused by apache#32022. Although we deprecate `GROUP BY ... GROUPING SETS ...`, it should still work if it worked before. apache#32022 made a mistake in that it didn't preserve the order of user-specified group by columns. Usually it's not a problem, as `GROUP BY a, b` is no different from `GROUP BY b, a`. However, the `grouping_id(...)` function requires the input to be exactly the same as the group by columns. This PR fixes the problem by preserving the order of user-specified group by columns.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
Yes, now a query that worked before 3.2 can work again.

### How was this patch tested?
New test.

Closes apache#37655 from cloud-fan/grouping.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1ed592e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
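A hedged illustration of the ordering requirement described above; the query is illustrative, not the PR's test:

```scala
// grouping_id(...) requires its arguments to match the user-specified
// GROUP BY column order, so that order must be preserved during analysis.
// Assumes a `spark` session (e.g. spark-shell).
spark.sql("""
  SELECT a, b, grouping_id(a, b)
  FROM VALUES (1, 2) AS t(a, b)
  GROUP BY a, b GROUPING SETS ((a), (a, b))
""").show()
```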
Move tests from SplitPart to elementAt in CollectionExpressionsSuite. Simplify test.

No.

N/A.

Closes apache#37637 from wangyum/SPARK-40152-3.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 06997d6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…st at ElementAt

This PR proposes to fix the test to pass with ANSI mode on. Currently the `elementAt` test fails when ANSI mode is on:

```
[info] - elementAt *** FAILED *** (309 milliseconds)
[info]   Exception evaluating element_at(stringsplitsql(11.12.13, .), 10, Some(), true) (ExpressionEvalHelper.scala:205)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.fail(Assertions.scala:949)
[info]   at org.scalatest.Assertions.fail$(Assertions.scala:945)
[info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:205)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199)
[info]   at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluationWithoutCodegen(CollectionExpressionsSuite.scala:39)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
[info]   at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluation(CollectionExpressionsSuite.scala:39)
[info]   at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.$anonfun$new$333(CollectionExpressionsSuite.scala:1546)
```

https://github.com/apache/spark/runs/8046961366?check_suite_focus=true

To recover the build with ANSI mode.

No, test-only.

Unittest fixed.

Closes apache#37684 from HyukjinKwon/SPARK-40152.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 4b0c3ba)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Correct the link.

### Why are the changes needed?
The existing link was wrong.

### Does this PR introduce _any_ user-facing change?
Yes, a link was updated.

### How was this patch tested?
Manually check.

Closes apache#37685 from zhengruifeng/doc_fix_udtf.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 8ffcecb)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
…te, short, or float

The `castPartValueToDesiredType` function now returns byte for ByteType and short for ShortType, rather than ints; also floats for FloatType rather than double.

Previously, attempting to read back in a file partitioned on one of these column types would result in a ClassCastException at runtime (for Byte, `java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte`). I can't think this is anything but a bug, as returning the correct data type prevents the crash.

Yes: it changes the observed behavior when reading in a byte/short/float-partitioned file.

Added unit test. Without the `castPartValueToDesiredType` updates, the test fails with the stated exception.

===

I'll note that I'm not familiar enough with the spark repo to know if this will have ripple effects elsewhere, but tests pass on my fork, and since the very similar https://github.com/apache/spark/pull/36344/files only needed to touch these two files I expect this change is self-contained as well.

Closes apache#37659 from BrennanStein/spark40212.

Authored-by: Brennan Stein <brennan.stein@ekata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 146f187)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
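A hedged repro sketch of the byte case; the path, data, and session setup are illustrative, not the PR's actual test:

```scala
// Reading back a byte-partitioned file with an explicit ByteType schema
// previously threw: java.lang.ClassCastException: java.lang.Integer cannot
// be cast to java.lang.Byte. Assumes a `spark` session (e.g. spark-shell).
import org.apache.spark.sql.types._
import spark.implicits._

val path = "/tmp/byte_partition_demo" // illustrative path
Seq((1.toByte, "a"), (2.toByte, "b")).toDF("p", "v")
  .write.mode("overwrite").partitionBy("p").parquet(path)

val schema = StructType(Seq(StructField("p", ByteType), StructField("v", StringType)))
spark.read.schema(schema).parquet(path).collect() // crashed before the fix
```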
### What changes were proposed in this pull request?
Spark's `BitSet` doesn't implement `equals()` and `hashCode()`, but it is used in `FileSourceScanExec` for bucket pruning.

### Why are the changes needed?
Without a proper equality check, reuse issues can occur.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes apache#37696 from peter-toth/SPARK-40247-fix-bitset-equals.

Authored-by: Peter Toth <ptoth@cloudera.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 527ddec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
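A minimal sketch of structural equality for a long-array-backed bitset; the class and field names are illustrative, not Spark's actual `BitSet`:

```scala
// Two bitsets with the same bits set must compare equal and hash alike;
// otherwise comparisons fall back to reference equality and reuse checks fail.
class SimpleBitSet(numBits: Int) {
  private val words = new Array[Long]((numBits + 63) / 64)

  def set(index: Int): Unit = words(index >> 6) |= (1L << (index & 63))
  def get(index: Int): Boolean = (words(index >> 6) & (1L << (index & 63))) != 0

  override def equals(other: Any): Boolean = other match {
    case that: SimpleBitSet => java.util.Arrays.equals(this.words, that.words)
    case _ => false
  }
  override def hashCode(): Int = java.util.Arrays.hashCode(words)
}
```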
…essage in 3.3

### What changes were proposed in this pull request?
This PR fixes the error message in branch-3.3. A different error message is thrown at the test added in apache@4b0c3ba.

### Why are the changes needed?
`branch-3.3` is broken because of the different error message being thrown (https://github.com/apache/spark/runs/8065373173?check_suite_focus=true).

```
[info] - elementAt *** FAILED *** (996 milliseconds)
[info]   (non-codegen mode) Expected error message is `The index 0 is invalid`, but `SQL array indices start at 1` found (ExpressionEvalHelper.scala:176)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
[info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
[info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkExceptionInExpression$1(ExpressionEvalHelper.scala:176)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info]   at org.scalatest.Assertions.withClue(Assertions.scala:1065)
[info]   at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
[info]   at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkException$1(ExpressionEvalHelper.scala:163)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression(ExpressionEvalHelper.scala:183)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression$(ExpressionEvalHelper.scala:156)
[info]   at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkExceptionInExpression(CollectionExpressionsSuite.scala:39)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression(ExpressionEvalHelper.scala:153)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression$(ExpressionEvalHelper.scala:150)
[info]   at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkExceptionInExpression(CollectionExpressionsSuite.scala:39)
[info]   at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.$anonfun$new$365(CollectionExpressionsSuite.scala:1555)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
```

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
CI in this PR should test it out.

Closes apache#37708 from HyukjinKwon/SPARK-40152-3.3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
…e.style
This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

To make the configuration work as expected.
Yes.
```python
import pyspark.pandas as ps
ps.set_option("compute.max_rows", None)
ps.get_option("compute.max_rows")
ps.range(1).style
```
**Before:**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
pdf = self.head(max_results + 1)._to_internal_pandas()
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
```
**After:**
```
<pandas.io.formats.style.Styler object at 0x7fdf78250430>
```
Manually tested, and a unit test was added.
Closes apache#37718 from HyukjinKwon/SPARK-40270.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0f0e8cc)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…literal of In/InSet downcast failed
### Why are the changes needed?
This PR aims to fix the case
```scala
sql("create table t1(a decimal(3, 0)) using parquet")
sql("insert into t1 values(100), (10), (1)")
sql("select * from t1 where a in(100000, 1.00)").show
```
```
java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
```
1. The rule `UnwrapCastInBinaryComparison` transforms the expression `In` into a disjunction of `EqualTo`s:
```
CAST(a as decimal(12,2)) IN (100000.00,1.00)
=>
OR(
  CAST(a as decimal(12,2)) = 100000.00,
  CAST(a as decimal(12,2)) = 1.00
)
```
2. Then it uses `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo`:
```
// Expression1
CAST(a as decimal(12,2)) = 100000.00 => CAST(a as decimal(12,2)) = 100000.00
// Expression2
CAST(a as decimal(12,2)) = 1.00 => a = 1
```
3. Finally, it returns the new unwrapped cast expression `In`:
```
a IN (100000.00, 1.00)
```
Before this PR:
the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when downcasting to a decimal type fails (`Expression1`), but returns the unwrapped expression when the downcast succeeds (`Expression2`). The two resulting expressions have different data types, which would break the structural integrity:
```
a IN (100000.00, 1.00)
| |
decimal(12, 2) |
decimal(3, 0)
```
After this PR:
the PR transforms the expression whose downcast failed into `falseIfNotNull(fromExp)`:
```
((isnull(a) AND null) OR a IN (1.00))
```
### Does this PR introduce _any_ user-facing change?
No, only bug fix.
### How was this patch tested?
Unit test.
Closes apache#37439 from cfmcgrady/SPARK-39896.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e62b93)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a section under [customized-kubernetes-schedulers-for-spark-on-kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes) to explain how to run Spark with Apache YuniKorn. This is based on the review comments from apache#35663.

### Why are the changes needed?
Explain how to run Spark with Apache YuniKorn.

### Does this PR introduce _any_ user-facing change?
No

Closes apache#37622 from yangwwei/SPARK-40187.

Authored-by: Weiwei Yang <wwei@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4b18773)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
This PR aims at the following:
1. Add a `YuniKornSuite` integration test suite which extends `KubernetesSuite` on the Apache YuniKorn scheduler.
2. Support a `--default-exclude-tags` command-line option to override `test.default.exclude.tags`.

To improve test coverage.

No. This is a test suite addition.

Since this requires an `Apache YuniKorn` installation, the test suite is disabled by default. So, the CI K8s integration test should pass without running this suite. In order to run the tests, we need to override `test.default.exclude.tags` like the following.

**SBT**
```
$ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \
  -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \
  -Dtest.exclude.tags=minikube,local \
  -Dtest.default.exclude.tags=
```

**MAVEN**
```
$ dev/dev-run-integration-tests.sh --deploy-mode docker-desktop \
  --exclude-tag minikube,local \
  --default-exclude-tags ''
```

Closes apache#37753 from dongjoon-hyun/SPARK-40302.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit b2e38e1)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to add a new test tag, `decomTestTag`, to the K8s Integration Test.

### Why are the changes needed?
Decommission-related tests took over 6 minutes (`363s`). It would be helpful if we could run them selectively.

```
[info] - Test basic decommissioning (44 seconds, 51 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (44 seconds, 450 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 43 seconds)
[info] - Test decommissioning timeouts (44 seconds, 389 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 8 seconds)
```

### Does this PR introduce _any_ user-facing change?
No, this is a test-only change.

### How was this patch tested?
Pass the CIs and test manually.

```
$ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \
  -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \
  -Dtest.exclude.tags=minikube,local,decom
...
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (12 seconds, 441 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 949 milliseconds)
[info] - Run SparkPi with a very long application name. (11 seconds, 999 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 846 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (11 seconds, 176 milliseconds)
[info] - Run SparkPi with an argument. (11 seconds, 868 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 858 milliseconds)
[info] - All pods have the same service account by default (11 seconds, 5 milliseconds)
[info] - Run extraJVMOptions check on driver (5 seconds, 757 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (12 seconds, 467 milliseconds)
[info] - Run SparkPi with env and mount secrets. (21 seconds, 119 milliseconds)
[info] - Run PySpark on simple pi.py example (13 seconds, 129 milliseconds)
[info] - Run PySpark to test a pyfiles example (14 seconds, 937 milliseconds)
[info] - Run PySpark with memory customization (12 seconds, 195 milliseconds)
[info] - Run in client mode. (11 seconds, 343 milliseconds)
[info] - Start pod creation from template (11 seconds, 975 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (11 seconds, 901 milliseconds)
[info] - Run SparkR on simple dataframe.R example (14 seconds, 305 milliseconds)
...
```

Closes apache#37755 from dongjoon-hyun/SPARK-40304.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit fd0498f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
This reverts commits 32d4a2b and 3aa4e11.

Closes apache#37729 from wangyum/SPARK-33861.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 43cbdc6)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
Upgrade `com.fasterxml.jackson.dataformat:jackson-dataformat-yaml` and `fasterxml.jackson.databind.version` from 2.13.3 to 2.13.4.

[CVE-2022-25857](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857)
[SNYK-JAVA-ORGYAML](https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360)

No.

Pass GA.

Closes apache#37796 from bjornjorgensen/upgrade-fasterxml.jackson-to-2.13.4.

Authored-by: Bjørn <bjornjorgensen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit a82a006)
Signed-off-by: Sean Owen <srowen@gmail.com>
Spark 3.3.2 fork with K8s / Iceberg support
Reduce size of 3.3.2 build without aws bundle