[SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases by czxm · Pull Request #39595 · apache/spark

czxm · 2023-01-16T05:43:37Z

What changes were proposed in this pull request?

This is based on @coalchan 's previous work of #35549

Why are the changes needed?

method listPartitions is order to get locations of partitions and compute custom partition locations(variable customPartitionLocations), but customPartitionLocations is used for overwriting static hive partitions and file commits as well.
However method listPartitions is much costly than listPartitionNames. For table without custom locations, we can use listPartitionNames to reduce requests to HMS.

Does this PR introduce any user-facing change?

Yes. Introduced a new flag "spark.sql.hasCustomPartitionLocations". By default it's true to use listPartitions. Setting to false indicating to custom partitions hence listPartitionNames can be used.

How was this patch tested?

created a nested partitioned table
run the insert/overwrite static/dynamic partition cases such as:
insert overwrite table p1 partition(p1="1",p2="1.1",p3="1.1.1") select 1
insert into table p1 partition(p1="1",p2="1.1",p3="1.1.1") select 1
insert overwrite table p1 partition(p1="1",p2,p3) select 1,"1.1","1.1.1"
insert into table p1 partition(p1="1",p2,p3) select 1,"1.1","1.1.1"

czxm · 2023-01-16T10:08:49Z

The initial commit caused some insertion cases failed, and they were due to dynamic partition which has custom locations. So a new flag "spark.sql.hasCustomPartitionLocations" is added to let users decide if listPartitionNames can be used here.

Sorry for missing this case. I also updated the PR description as well.

AmplabJenkins · 2023-01-17T23:34:31Z

Can one of the admins verify this patch?

roczei · 2023-01-27T19:20:40Z

Hi @czxm,

Thanks for this fix! I tested it with Cloudera Spark 3.3.0 where the Hive version is 3.1.3000 and it includes this Hive improvement:

Support hive.metastore.limit.partition.request for get_partitions_ps: HIVE-23556

Test case 1 (default setting: spark.sql.hasCustomPartitionLocations=true)

Created a hive-site.xml file with the following content:

<?xml version="1.0" encoding="UTF-8"?>

<configuration>
<property>
   <name>hive.metastore.limit.partition.request</name>
   <value>3</value>
</property>
</configuration>


./spark-shell --conf spark.sql.catalogImplementation=hive --jars ./hive-site.xml

spark.sql("create table student53 (student_name string) using parquet partitioned by (father_name string, percentage  float, section string)")
spark.sql("insert into student53 values('dikshant','ashok', 95, 'A')")
spark.sql("insert into student53 values('raj','ramesh', 90, 'B')")
spark.sql("insert into student53 values('alok','ajit', 85, 'C')")
spark.sql("insert into student53 values('rajendra','naveen', 70, 'D')")
spark.sql("create table student63 (student_name string) using parquet partitioned by (father_name string, percentage  float, section string)")
spark.sql("insert into student63 values('dikshant','ashok', 95, 'A')")
spark.sql("insert into student63 values('raj','ramesh', 90, 'B')")
spark.sql("insert into student63 values('alok','ajit', 85, 'C')")
spark.sql("insert into student63 values('rajendra','naveen', 70, 'D')")

Query 1)

spark.sql("insert overwrite table student63 partition (section, father_name, percentage) select * from student53 where section='A'")

failed with this error (this is expected):

Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Number of partitions scanned (=4) on table 'student31' exceeds limit (=3). This is controlled on the metastore server by metastore.limit.partition.request.

Query 2)

spark.sql("insert overwrite table student63 select * from student53 where section='C'")

failed with this error (this is expected):

Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Number of partitions scanned (=4) on table 'student31' exceeds limit (=3). This is controlled on the metastore server by metastore.limit.partition.request.

The number of partitions are 4 and it tries to do a full partition scan, therefore we have reached the hive.metastore.limit.partition.request limit which is currently 3.

Test case 2)

Added this extra parameter to the spark-shell what you have introduced in this pull request:

--conf spark.sql.hasCustomPartitionLocations=false

$ ./spark-shell  --conf spark.sql.catalogImplementation=hive --jars ./hive-site.xml --conf spark.sql.hasCustomPartitionLocations=false

Results:

scala> spark.sql("insert overwrite table student63 partition (section, father_name, percentage) select * from student53 where section='A'")
res10: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert overwrite table student63 select * from student53 where section='C'")
res11: org.apache.spark.sql.DataFrame = []
scala>

As you can see it works for me and it resolved the above Hive limit problem.

Before we involve the Spark committers, please create a unit test which includes this new configuration parameter: spark.sql.hasCustomPartitionLocations=false. For example I would add it to this test file:

sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala

Related document:

https://spark.apache.org/developer-tools.html#running-individual-tests

### What changes were proposed in this pull request? This PR follow-up for SPARK-43671, to refine functions to use `pyspark_column_op` util for clean-up the code. ### Why are the changes needed? To avoid `is_remote` in too many places for future maintenance. ### Does this PR introduce _any_ user-facing change? No, it's code cleanup ### How was this patch tested? The existing CI should pass Closes apache#41326 from itholic/categorical_followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…ataFrame from pandas DataFrame and toPandas ### What changes were proposed in this pull request? Support `UserDefinedType` in `createDataFrame` from pandas DataFrame and `toPandas`. For the following schema and pandas DataFrame: ```py schema = ( StructType() .add("point", ExamplePointUDT()) .add("struct", StructType().add("point", ExamplePointUDT())) .add("array", ArrayType(ExamplePointUDT())) .add("map", MapType(StringType(), ExamplePointUDT())) ) data = [ Row( ExamplePoint(1.0, 2.0), Row(ExamplePoint(3.0, 4.0)), [ExamplePoint(5.0, 6.0)], dict(point=ExamplePoint(7.0, 8.0)), ) ] df = spark.createDataFrame(data, schema) pdf = pd.DataFrame.from_records(data, columns=schema.names) ``` ##### `spark.createDataFrame()` For all, return the same results: ```py >>> spark.createDataFrame(pdf, schema).show(truncate=False) +----------+------------+------------+---------------------+ |point |struct |array |map | +----------+------------+------------+---------------------+ |(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}| +----------+------------+------------+---------------------+ ``` ##### `df.toPandas()` ```py >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row') >>> df.toPandas() point struct array map 0 (1.0,2.0) ((3.0,4.0),) [(5.0,6.0)] {'point': (7.0,8.0)} ``` ### Why are the changes needed? Currently `UserDefinedType` in `spark.createDataFrame()` with pandas DataFrame and `df.toPandas()` is not supported with Arrow enabled or in Spark Connect. ##### `spark.createDataFrame()` Works without Arrow: ```py >>> spark.createDataFrame(pdf, schema).show(truncate=False) +----------+------------+------------+---------------------+ |point |struct |array |map | +----------+------------+------------+---------------------+ |(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}| +----------+------------+------------+---------------------+ ``` , whereas: - With Arrow: Works with fallback: ```py >>> spark.createDataFrame(pdf, schema).show(truncate=False) /.../python/pyspark/sql/pandas/conversion.py:351: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not supported in conversion to Arrow. Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warn(msg) +----------+------------+------------+---------------------+ |point |struct |array |map | +----------+------------+------------+---------------------+ |(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}| +----------+------------+------------+---------------------+ ``` - Spark Connect ```py >>> spark.createDataFrame(pdf, schema).show(truncate=False) Traceback (most recent call last): ... pyspark.errors.exceptions.base.PySparkTypeError: [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not supported in conversion to Arrow. ``` ##### `df.toPandas()` Works without Arrow: ```py >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row') >>> df.toPandas() point struct array map 0 (1.0,2.0) ((3.0,4.0),) [(5.0,6.0)] {'point': (7.0,8.0)} ``` , whereas: - With Arrow Works with fallback: ```py >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row') >>> df.toPandas() /.../python/pyspark/sql/pandas/conversion.py:111: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not supported in conversion to Arrow. Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warn(msg) point struct array map 0 (1.0,2.0) ((3.0,4.0),) [(5.0,6.0)] {'point': (7.0,8.0)} ``` - Spark Connect Results with the internal type: ```py >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row') >>> df.toPandas() point struct array map 0 [1.0, 2.0] ([3.0, 4.0],) [[5.0, 6.0]] {'point': [7.0, 8.0]} ``` ### Does this PR introduce _any_ user-facing change? Users will be able to use `UserDefinedType`. ### How was this patch tested? Added the related tests. Closes apache#41333 from ueshin/issues/SPARK-43817/udt. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…ign names to the error class _LEGACY_ERROR_TEMP_241[1-7] ### What changes were proposed in this pull request? The pr aims to assign a name to the error class _LEGACY_ERROR_TEMP_241[1-7]. ### Why are the changes needed? Improve the error framework. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Exists test cases. Closes apache#41339 from beliefer/2411-2417. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

…P_103[1-2] ### What changes were proposed in this pull request? The pr aims to assign a name to the error class _LEGACY_ERROR_TEMP_103[1-2]. ### Why are the changes needed? The changes improve the error framework. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Update existed UT. - Pass GA. Closes apache#41346 from panbingkun/SPARK-43837. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

…solveDefaultColumns` ### What changes were proposed in this pull request? The pr aims to use error classes in the compilation errors of `ResolveDefaultColumns`. ### Why are the changes needed? The changes improve the error framework. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Update UT. - Pass GA. Closes apache#41345 from panbingkun/SPARK-43834. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

…FEATURE.TIME_TRAVEL` ### What changes were proposed in this pull request? The pr aims to convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL` and remove `_LEGACY_ERROR_TEMP_1335` ### Why are the changes needed? - The changes improve the error framework. - In the spark base code `_ LEGACY_ ERROR_ TEMP_ 1335` is no longer used anywhere. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Add new UT - Pass GA Closes apache#41349 from panbingkun/SPARK-43839. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request? This PR aims to make Scala 2.13 the default Scala version in Apache Spark 3.5. ### Why are the changes needed? The current releases of Scala community are `Scala 3.2.2` and `Scala 2.13.10`. - https://scala-lang.org/download/all.html Although the Apache Spark community has been using Scala 2.12 by default since Apache Spark 3.0 and Scala community will release Scala 2.12.18 for Java 21 support, we had better focus on `Scala 2.13+` more from Apache Spark 3.5 timeline to adopt Scala community's activity. Since SPARK-25075 added Scala 2.13 at Apache Spark 3.2.0, the Apache Spark community has been using it as a second Scala version. This PR aims to switch only the default Scala version from 2.12 to 2.13. Apache Spark will support both Scala 2.12 and 2.13 still. ### Does this PR introduce _any_ user-facing change? Yes, but we still have Scala 2.12. ### How was this patch tested? Pass the CIs. Closes apache#41344 from dongjoon-hyun/SPARK-43836. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? This PR refactors the default column value resolution so that we don't need an extra DS v2 API for external v2 sources. The general idea is to split the default column value resolution into two parts: 1. resolve the column "DEFAULT" to the column default expression. This applies to `Project`/`UnresolvedInlineTable` under `InsertIntoStatement`, and assignment expressions in `UpdateTable`/`MergeIntoTable`. 2. fill missing columns with column default values for the input query. This does not apply to UPDATE and non-INSERT action of MERGE as they use the column from the target table as the default value. The first part should be done for all the data sources, as it's part of column resolution. The second part should not be applied to v2 data sources with `ACCEPT_ANY_SCHEMA`, as they are free to define how to handle missing columns. More concretely, this PR: 1. put the column "DEFAULT" resolution logic in the rule `ResolveReferences`, with two new virtual rules. This is to follow apache#38888 2. put the missing column handling in `TableOutputResolver`, which is shared by both the v1 and v2 insertion resolution rule. External v2 data sources can add custom catalyst rules to deal with missing columns for themselves. 3. Remove the old rule `ResolveDefaultColumns`. Note that, with the refactor, we no long need to manually look up the table. We will deal with column default values after the target table of INSERT/UPDATE/MERGE is resolved. 4. Remove the rule `ResolveUserSpecifiedColumns` and merge it to `PreprocessTableInsertion`. These two rules are both to resolve v1 insertion, and it's tricky to reason about their interactions. It's clearer to resolve the insertion with one pass. ### Why are the changes needed? code cleanup and remove unneeded DS v2 API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? updated tests Closes apache#41262 from cloud-fan/def-val. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

… arrow UDF operations ### What changes were proposed in this pull request? Adds a new config that uses the `LargeUtf8` and `LargeBinary` arrow types for arrow-based UDF operations. These arrow types make arrow use `LargeVarCharVector` and `LargeVarBinaryVector` instead of the regular `VarCharVector` and `VarBinaryVector` respectively. This config is disabled by default to maintain the current behavior. ### Why are the changes needed? `VarCharVector` and `VarBinaryVector` have a size limit of 2 GiB for a single vector. This is because they use 4 byte integers to track the offsets of each value in the vector. During certain operations, it is possible to hit this limit. The most affected way that we've run into this is during a `applyInPandas` operation, since the entire group is sent as a single RecordBatch, and there is no way to chunk up any smaller than the entire group. However, other map and UDF operations can benefit as well without having to find the right records per batch to not hit this limit. The large vector types use an 8 byte long to track value offsets, removing the 2 GiB total size limit. ### Does this PR introduce _any_ user-facing change? Adds an option that can help users get around what currently results in `IndexOutOfBoundsException`, though this exception being raised is a bug that was fixed in Arrow and it should actually be a `OversizedAllocationException` in the next release which suggests using the large variable width types instead. ### How was this patch tested? A few new tests are added. I also enabled the setting by default for a full CI run and all existing tests passed. I can add more tests if needed. Closes apache#39572 from Kimahriman/large-binary-vector. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request? This PR aims to switch `scala-213` GitHub Action Job to `scala-212`. ### Why are the changes needed? Since SPARK-43836, Scala 2.13 is used by default. So, we need to test `Scala 2.12` instead of `Scala 2.13` in this additional test pipeline. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and check the `scala-212` pipeline instead of `scala-213` pipeline. https://github.com/dongjoon-hyun/spark/actions/runs/5105828839/jobs/9177657650 Closes apache#41351 from dongjoon-hyun/SPARK-43840. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? This PR aims to upgrade `gcs-connector` to 2.2.14. ### Why are the changes needed? - https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/tag/v2.2.14 - https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/tag/v2.2.13 - https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/tag/v2.2.12 ### Does this PR introduce _any_ user-facing change? Although this is a dependency change, there is no user-facing change. ### How was this patch tested? Pass the CIs and manual verification. **BUILD** ``` $ dev/make-distribution.sh -Phadoop-cloud ``` **RUN** ``` $ export KEYFILE=YOUR-credentials.json $ export EMAIL=$(jq -r '.client_email' < $KEYFILE) $ export PRIVATE_KEY_ID=$(jq -r '.private_key_id' < $KEYFILE) $ export PRIVATE_KEY="$(jq -r '.private_key' < $KEYFILE)" $ bin/spark-shell \ -c spark.hadoop.fs.gs.auth.service.account.email=$EMAIL \ -c spark.hadoop.fs.gs.auth.service.account.private.key.id=$PRIVATE_KEY_ID \ -c spark.hadoop.fs.gs.auth.service.account.private.key="$PRIVATE_KEY" Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.5.0-SNAPSHOT /_/ Using Scala version 2.13.8 (OpenJDK 64-Bit Server VM, Java 17.0.7) Type in expressions to have them evaluated. Type :help for more information. 23/05/28 14:43:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1685310212206). Spark session available as 'spark'. scala> spark.read.text("gs://apache-spark-bucket/README.md").count() val res0: Long = 124 scala> spark.read.orc("examples/src/main/resources/users.orc").write.mode("overwrite").orc("gs://apache-spark-bucket/users.orc") scala> spark.read.orc("gs://apache-spark-bucket/users.orc").show() +------+--------------+----------------+ | name|favorite_color|favorite_numbers| +------+--------------+----------------+ |Alyssa| NULL| [3, 9, 15, 20]| | Ben| red| []| +------+--------------+----------------+ ``` Closes apache#41352 from dongjoon-hyun/SPARK-43842. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…by reuse existent functions ### What changes were proposed in this pull request? apache#40615 added the `hll_union_agg` into Spark and Spark connect project. and apache#40120 added the `nth_value` into Spark connect project. But `hll_union_agg` and `nth_value` in connect functions API are defined one by one. In fact, we can simplify the `hll_union_agg` and `nth_value` by reuse functions. Another minor change is we should keep a consistent API order with functions API in SQL. ### Why are the changes needed? Simplify the `hll_union_agg` by reuse functions. ### Does this PR introduce _any_ user-facing change? 'No'. Just update the inner implementation. ### How was this patch tested? Exists test cases. Closes apache#41329 from beliefer/SPARK-16484_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…ps` for Spark Connect ### What changes were proposed in this pull request? This PR proposes to fix `BinaryOps` test for pandas API on Spark with Spark Connect. This includes SPARK-43666, SPARK-43667, SPARK-43668, SPARK-43669 at once, because they are all related similar modifications in single file. ### Why are the changes needed? To support all features for pandas API on Spark with Spark Connect. ### Does this PR introduce _any_ user-facing change? Yes, `BinaryOps.lt`, `BinaryOps.le`, `BinaryOps.ge`, `BinaryOps.gt` are now working as expected on Spark Connect. ### How was this patch tested? Uncomment the UTs, and tested manually. Closes apache#41305 from itholic/SPARK-43666-9. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…ation used in local checkpoint ### What changes were proposed in this pull request? Use the utils to get the switch for dynamic allocation used in local checkpoint ### Why are the changes needed? In RDD's local checkpoint, only through retrieve the value from configuration, but not adjudge the local master and testing for dynamic allocation which unified in Utils. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Used the existing UTs Closes apache#39998 from jiwq/SPARK-42421. Authored-by: Wanqiang Ji <jiwq@apache.org> Signed-off-by: Kent Yao <yao@apache.org>

### What changes were proposed in this pull request? This PR aims to setup Scala 2.12 Daily GitHub Action Job by converting Scala 2.13 Daily Job. - Rename `build_scala213.yml` file to `.github/workflows/build_scala212.yml`. - Rename job name ``` -name: "Build (master, Scala 2.13, Hadoop 3, JDK 8)" +name: "Build (master, Scala 2.12, Hadoop 3, JDK 8)" ``` - Switch SCALA_PROFILE from `scala2.13` to `scala2.12`. ### Why are the changes needed? To keep the Scala 2.12 test coverage ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a daily job. We need to check after merging. Closes apache#41354 from dongjoon-hyun/SPARK-43845. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

… IntegrationTestUtils clearer ### What changes were proposed in this pull request? The pr aims to make the prompt for `findJar` method in IntegrationTestUtils clearer. ### Why are the changes needed? Friendly prompts can help locate problems. When I am running tests in ClientE2ETestSuite, I often cannot locate them through error prompts when they fail, and I can only search for specific reasons through code - Before applying this patche, the error prompt is as follows: `Exception encountered when invoking run on a nested suite - Failed to find the jar inside folder: .../spark-community/connector/connect/server/target` - After applying this patche, The error prompt is as follows: `Exception encountered when invoking run on a nested suite - Failed to find the jar: spark-connect-assembly(.).jar or spark-connect(.)3.5.0-SNAPSHOT.jar inside folder: .../spark-community/connector/connect/server/target. This file can be generated by similar to the following command: build/sbt package|assembly` Improvement in two aspects - Prompt us what files are missing - How to generate the above file ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual check Closes apache#41336 from panbingkun/SPARK-43821. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…ingUtils#orderSuggestedIdentifiersBySimilarity` ### What changes were proposed in this pull request? In `StringUtils#orderSuggestedIdentifiersBySimilarity`, handle the case where the candidate attributes have a mix of empty and non-empty prefixes. ### Why are the changes needed? The following query throws a `StringIndexOutOfBoundsException`: ``` with v1 as ( select * from values (1, 2) as (c1, c2) ), v2 as ( select * from values (2, 3) as (c1, c2) ) select v1.c1, v1.c2, v2.c1, v2.c2, b from v1 full outer join v2 using (c1); ``` The query should fail anyway, since `b` refers to a non-existent column. But it should fail with a helpful error message, not with a `StringIndexOutOfBoundsException`. `StringUtils#orderSuggestedIdentifiersBySimilarity` assumes that a list of suggested attributes with a mix of prefixes will never have an attribute name with an empty prefix. But in this case it does (`c1` from the `coalesce` has no prefix, since it is not associated with any relation or subquery): ``` +- 'Project [c1#5, c2#6, c1#7, c2#8, 'b] +- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2) +- Join FullOuter, (c1#5 = c1#7) :- SubqueryAlias v1 : +- CTERelationRef 0, true, [c1#5, c2#6] +- SubqueryAlias v2 +- CTERelationRef 1, true, [c1#7, c2#8] ``` Because of this, `orderSuggestedIdentifiersBySimilarity` returns a sorted list of suggestions like this: ``` ArrayBuffer(.c1, v1.c2, v2.c2) ``` `UnresolvedAttribute.parseAttributeName` chokes on an attribute name that starts with a '.'. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes apache#41353 from bersprockets/unresolved_column_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

…ionCatalogSuite ### What changes were proposed in this pull request? The pr aims to use `checkError()` to check Exception in `SessionCatalogSuite`. ### Why are the changes needed? Migration on `checkError()` will make the tests independent from the text of error messages. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes apache#41355 from panbingkun/checkError_SessionCatalogSuite. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request? ParseToDate should load the EvalMode in main thread instead of loading it in a lazy val. ### Why are the changes needed? This is because it is sometimes hard to estimate when the lazy val is executed while the SQLConf where we load the EvalMode is thread local. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes apache#41298 from amaliujia/SPARK-43779. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…eOps` for Spark Connect ### What changes were proposed in this pull request? This PR proposes to fix `DatetimeOps` test for pandas API on Spark with Spark Connect. This includes SPARK-43676, SPARK-43677, SPARK-43678, SPARK-43679 at once, because they are all related similar modifications in single file. ### Why are the changes needed? To support all features for pandas API on Spark with Spark Connect. ### Does this PR introduce _any_ user-facing change? Yes, `DatetimeOps.lt`, `DatetimeOps.le`, `DatetimeOps.ge`, `DatetimeOps.gt` are now working as expected on Spark Connect. ### How was this patch tested? Uncomment the UTs, and tested manually. Closes apache#41306 from itholic/SPARK-43676-9. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…` for Spark Connect ### What changes were proposed in this pull request? This PR proposes to fix `NullOps` test for pandas API on Spark with Spark Connect. This includes SPARK-43680, SPARK-43681, SPARK-43682, SPARK-43683 at once, because they are all related similar modifications in single file. ### Why are the changes needed? To support all features for pandas API on Spark with Spark Connect. ### Does this PR introduce _any_ user-facing change? Yes, `NullOps.lt`, `NullOps.le`, `NullOps.ge`, `NullOps.gt` are now working as expected on Spark Connect. ### How was this patch tested? Uncomment the UTs, and tested manually. Closes apache#41361 from itholic/SPARK-43680-3. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

### What changes were proposed in this pull request? This pr aims to enable `unused imports` check for Scala 2.13. There is an exception here for `import scala.language.higherKinds`（Scala 2.13 does not need to import them , while Scala 2.12 must import them）, so this pr add two new suppression rules for it. ### Why are the changes needed? Enable `unused imports` check for Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass Github Actions Closes apache#41356 from LuciferYang/SPARK-43849. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

… as default ### What changes were proposed in this pull request? This pr aims to make the benchmark Github Action task use Scala 2.13 as default. ### Why are the changes needed? Spark 3.5.0 already change to use Scala 2.13 as default ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Manual checked: <img width="353" alt="image" src="https://github.com/apache/spark/assets/1475305/6bc9190d-0a84-40e7-8c97-7c1d5a6fdd7d"> Closes apache#41358 from LuciferYang/SPARK-43858. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request Similar to SPARK-41316, this pr adds `scala.annotation.tailrec` inspected by IDE (IntelliJ), these are new cases after Spark 3.4. ### Why are the changes needed? To improve performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GItHub Actions Closes apache#41360 from LuciferYang/SPARK-43860. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com>

…r run benchmark ### What changes were proposed in this pull request? This is a followup of apache#41358, this pr change to restore Scala version to 2.13 after running benchmark to avoid containing changed `pom.xml` in the results tarball. ### Why are the changes needed? After running benchmark, should restore Scala to default version to results tarball includes changed `pom.xml`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes apache#41371 from LuciferYang/SPARK-43858-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…f API ### What changes were proposed in this pull request? This updated Protobuf Pyspark API to allow passing binary FileDescriptorSet rather than a file name. This is a Python follow up to feature implemented in Scala in apache#41192. ### Why are the changes needed? - This allows flexibility for Pyspark users to provide binary descriptor set directly. - Even if users are using file path, Pyspark avoids passing file name to Scala and reads the descriptor file in Python. This avoids having to read the file in Scala. ### Does this PR introduce _any_ user-facing change? - This adds extra arg to `from_protobuf()` and `to_protobuf()` API. ### How was this patch tested? - Doc tests - Manual tests Closes apache#41343 from rangadi/py-proto-file-buffer. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request? Improve logistic regression model saving: Current master code, it saves the core pytorch model that only includes the "Linear" layer, to make the saved pytorch model easier to use solely without pyspark, I append a "softmax" layer to the torch model and then save it. ### Why are the changes needed? Improving the saved pytorch model in `LogisticRegressionModel.save` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Closes apache#41629 from WeichenXu123/improve-lor-model-save. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

### What changes were proposed in this pull request? This PR aims to upgrade Scala to 2.12.18 - https://www.scala-lang.org/news/2.12.18 ### Why are the changes needed? This release adds support for JDK 20 and 21: - scala/scala#10185 - scala/scala#10362 - scala/scala#10397 - scala/scala#10400 The full release notes as follows: - https://github.com/scala/scala/releases/tag/v2.12.18 ### Does this PR introduce _any_ user-facing change? Yes, this is a Scala version change. ### How was this patch tested? Existing Test Closes apache#41627 from LuciferYang/SPARK-43832. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? Move `DataFrameFunctionsSuite` to `sql - slow` ### Why are the changes needed? 1, recently `DataFrameFunctionsSuite` is frequently updated for the new functions, while `sql - other` is quite flaky and it's likely to get stuck or failed before `DataFrameFunctionsSuite` is actually tested (see apache@61a195d and apache@37f9bea), so it is hard for the reviewers to judge whether all related tests pass; 2, `sql - other` is actually much slower than `sql - slow`, so I think it is fine to move a suite to `sql - slow`; ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? updated GA Closes apache#41643 from zhengruifeng/sql_test_move_dfsuite. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? This PR aims to add `build_java21.yml` via Java 21-ea to monitor the progress of Java 21 test capability. ### Why are the changes needed? To claim Java 21 support, this GitHub Action should be green at least. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This should be monitored after merging. Closes apache#41645 from dongjoon-hyun/SPARK-43835. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? This PR aims to remove `jdk.incubator.foreign` usage in Java 21. ### Why are the changes needed? Java 21 moved `jdk.incubator.foreign` to `java.lang.foreign` and it causes a boot layer failure. https://bugs.openjdk.org/browse/JDK-8280527 ``` $ jshell -J--add-modules=jdk.incubator.vector,jdk.incubator.foreign Error occurred during initialization of boot layer java.lang.module.FindException: Module jdk.incubator.foreign not found ``` ``` $ build/sbt "unsafe/test" Using /Users/dongjoon/.jenv/versions/21-ea as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. ... Error occurred during initialization of boot layer java.lang.module.FindException: Module jdk.incubator.foreign not found ... [error] (unsafe / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 0 s, completed Jun 17, 2023, 9:53:26 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manually Java 21 testing. ``` $ java -version openjdk version "21-ea" 2023-09-19 OpenJDK Runtime Environment (build 21-ea+27-2343) OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing) $ build/sbt "unsafe/test" Using /Users/dongjoon/.jenv/versions/21-ea as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. Using SPARK_LOCAL_IP=localhost [info] welcome to sbt 1.9.0 (Oracle Corporation Java 21-ea) ... [info] UTF8StringPropertyCheckSuite: [info] - toString (27 milliseconds) [info] - numChars (2 milliseconds) [info] - startsWith (5 milliseconds) [info] - endsWith (4 milliseconds) [info] - toUpperCase (2 milliseconds) [info] - toLowerCase (1 millisecond) [info] - compare (3 milliseconds) [info] - substring (14 milliseconds) [info] - contains (10 milliseconds) [info] - trim, trimLeft, trimRight (4 milliseconds) [info] - reverse (1 millisecond) [info] - indexOf (7 milliseconds) [info] - repeat (2 milliseconds) [info] - lpad, rpad (4 milliseconds) [info] - concat (8 milliseconds) [info] - concatWs (6 milliseconds) [info] - split !!! IGNORED !!! [info] - levenshteinDistance (2 milliseconds) [info] - hashCode (1 millisecond) [info] - equals (1 millisecond) [info] Run completed in 582 milliseconds. [info] Total number of tests run: 19 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 19, failed 0, canceled 0, ignored 1, pending 0 [info] All tests passed. [success] Total time: 1 s, completed Jun 17, 2023, 9:51:55 PM ``` Closes apache#41646 from dongjoon-hyun/SPARK-44088. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? The pr aims to update some maven plugins to newest version. include: - versions-maven-plugin from 2.15.0 to 2.16.0 - maven-source-plugin from 3.2.1 to 3.3.0 - maven-surefire-plugin from 3.1.0 to 3.1.2 - maven-dependency-plugin from 3.5.0 to 3.6.0 ### Why are the changes needed? - versions-maven-plugin 1.Release Notes: https://github.com/mojohaus/versions/releases/tag/2.16.0 2.Bug Fix: Resolves: display-dependency-updates only shows updates from the most major allowed segment (mojohaus/versions#966) ajarmoniuk Resolves mojohaus/versions#931: Fixing problems with encoding in UseDepVersion and PomHelper (mojohaus/versions#932) ajarmoniuk Resolves mojohaus/versions#916: Partially reverted mojohaus/versions#799. (mojohaus/versions#924) ajarmoniuk Resolves mojohaus/versions#954: Excluded plexus-container-default (mojohaus/versions#955) ajarmoniuk Resolves mojohaus/versions#951: DefaultArtifactVersion::getVersion can be null (mojohaus/versions#952) ajarmoniuk BoundArtifactVersion.toString() to work with NumericVersionComparator (mojohaus/versions#930) ajarmoniuk Issue mojohaus/versions#925: Protect against an NPE if a dependency version is defined in dependencyManagement (mojohaus/versions#926) ajarmoniuk - maven-source-plugin v3.2.1 VS v3.3.0: apache/maven-source-plugin@maven-source-plugin-3.2.1...maven-source-plugin-3.3.0 - maven-surefire-plugin Release Notes: https://github.com/apache/maven-surefire/releases/tag/surefire-3.1.2 - maven-dependency-plugin v3.5.0 VS v3.6.0: apache/maven-dependency-plugin@maven-dependency-plugin-3.5.0...maven-dependency-plugin-3.6.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes apache#41641 from panbingkun/SPARK-44085. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…on - part 2 ### What changes were proposed in this pull request? Add following functions: - replace - split_part - substr - parse_url - printf - url_decode - url_encode - position - endswith - startswith to: - Scala API - Python API - Spark Connect Scala Client - Spark Connect Python Client ### Why are the changes needed? for parity ### Does this PR introduce _any_ user-facing change? Yes, new functions. ### How was this patch tested? - Add New UT. - Pass GA. Closes apache#41594 from panbingkun/SPARK-43944. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…act a single element ### What changes were proposed in this pull request? A minor code simplification, use `map` instead of `unzip` when `unzip` only used to extract a single element. ### Why are the changes needed? Code simplification ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes apache#41548 from LuciferYang/SPARK-44024. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>

…TableRenamePartitionSuite` ### What changes were proposed in this pull request? apache#41533 ignore `AlterTableRenamePartitionSuite` try to restore stability of `sql-others` test task, but it seems that it is not the root cause that affects stability, so this pr has removed the previously added `ignore` identifier to restore testing. ### Why are the changes needed? Resume testing of `AlterTableRenamePartitionSuite` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? should monitor ci Closes apache#41647 from LuciferYang/SPARK-44089. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request? This PR aims to make `catalyst` module passes in Java 21. ### Why are the changes needed? https://bugs.openjdk.org/browse/JDK-8267125 changes the error message at Java 18. **JAVA** ``` $ java -version openjdk version "21-ea" 2023-09-19 OpenJDK Runtime Environment (build 21-ea+27-2343) OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "catalyst/test" ... [info] *** 1 TEST FAILED *** [error] Failed: Total 7122, Failed 1, Errors 0, Passed 7121, Ignored 5, Canceled 1 [error] Failed tests: [error] org.apache.spark.sql.catalyst.expressions.ExpressionImplUtilsSuite [error] (catalyst / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 212 s (03:32), completed Jun 18, 2023, 1:11:17 AM ``` **AFTER** ``` $ build/sbt "catalyst/test" ... [info] All tests passed. [info] Passed: Total 7122, Failed 0, Errors 0, Passed 7122, Ignored 5, Canceled 1 [success] Total time: 213 s (03:33), completed Jun 18, 2023, 1:15:37 AM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual test on Java 21. Closes apache#41649 from dongjoon-hyun/SPARK-44093. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request? This PR aims to upgrade Apache Arrow to 12.0.1. ### Why are the changes needed? This brings the following bug fixes. - https://arrow.apache.org/release/12.0.1.html ### Does this PR introduce _any_ user-facing change? No. SPARK-43446 introduced Apache Arrow 12.0.0 for Apache Spark 3.5.0. ### How was this patch tested? Pass the CIs. Closes apache#41650 from dongjoon-hyun/SPARK-44094. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…des by new attributes ### What changes were proposed in this pull request? apache#41475 introduces a fix that we remove extra alias-only project which might cause same metrics mismatch over the query plan. However, to make it more robust, we need to replace the attributes if we need to drop the extra Project. ### Why are the changes needed? Enhance the fix to cover more test case. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes apache#41620 from amaliujia/fix_json_followup. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? The purpose of this pr is to add a new test tag `SlowSQLTest` to the sql module, and identified some Suites with test cases more than 3 seconds, and apply it to GA testing task to reduce the testing pressure of the `sql others` group. For branches-3.3 and branches-3.4, a tag that will not appear in the sql module was assigned to the new test group to avoid `java.lang.ClassNotFoundException` and make this group build only without running any tests. ### Why are the changes needed? For a long time, the sql module UTs has only two groups: `slow` and `others`. The test cases in group `slow` are fixed, while the number of test cases in group `others` continues to increase, which has had a certain impact on the testing duration and stability of group `others`. So this PR proposes to add a new testing group to share the testing pressure of `sql others` group, which has made the testing time of the three groups more average, and hope it can improve the stability of the GA task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Should monitor GA Closes apache#41638 from LuciferYang/SPARK-44034-2. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…ith Any constants ### What changes were proposed in this pull request? In the PR, I propose to change API of parameterized SQL, and replace type of argument values from `string` to `Any` in `sql_formatter`. Language API can accept `Any` objects from which it is possible to construct literal expressions. ### Why are the changes needed? To align the API to PySpark's `sql()`. And the current implementation the parameterized `sql()` requires arguments as string values parsed to SQL literal expressions that causes the following issues: 1. SQL comments are skipped while parsing, so, some fragments of input might be skipped. For example, `'Europe -- Amsterdam'`. In this case, `-- Amsterdam` is excluded from the input. 2. Special chars in string values must be escaped, for instance `'E\'Twaun Moore'` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the affected test suite: ``` $ python/run-tests --parallelism=1 --testnames 'pyspark.pandas.sql_formatter' ``` Closes apache#41644 from MaxGekk/fix-pandas-sql_formatter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…ference) ### What changes were proposed in this pull request? Rename `k_res` to `ps_res` and `ps` to `pser` in `test_combine.py`. There is no functional change; it's purely stylistic/for consistency. ### Why are the changes needed? As a reader, the variable names are confusing and inconsistent. I only thought that the `k_` prefix meant Koalas because I've contributed to Koalas in the past. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests should continue to pass; no new tests necessary Closes apache#41634 from deepyaman/patch-1. Authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…old Spark versions on Java 21 ### What changes were proposed in this pull request? This PR aims to make `HiveExternalCatalogVersionsSuite` skip old Spark versions when Java 21 is used for testing. ### Why are the changes needed? Old Apache Spark releases are unable to support Java 21. So, it causes a test failure at runtime. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual testing on Java 21. **BEFORE** ``` $ build/sbt "hive/testOnly *.HiveExternalCatalogVersionsSuite" -Phive ... [info] 2023-06-18 16:43:22.448 - stderr> Caused by: java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int) ... [info] *** 1 SUITE ABORTED *** [error] Error during tests: [error] org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite [error] (hive / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 33 s, completed Jun 18, 2023, 4:43:23 PM ``` **AFTER** ``` $ build/sbt "hive/testOnly *.HiveExternalCatalogVersionsSuite" -Phive ... [info] HiveExternalCatalogVersionsSuite: [info] - backward compatibility (8 milliseconds) [info] Run completed in 1 second, 26 milliseconds. [info] Total number of tests run: 1 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 14 s, completed Jun 18, 2023, 4:42:24 PM ``` Closes apache#41652 from dongjoon-hyun/SPARK-44095. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…` module pass with Java 21 ### What changes were proposed in this pull request? This PR aims to make `core` module tests succeed in Java 21. To do that, this PR - Adds a utility variable `Utils.isJavaVersionAtLeast21` because Apache Commons Lang3 `3.12.0` doesn't have a constant for Java 21 yet. - Fix `UtilsSuite` according to the Java behavior change of `Files.createDirectories` API. ### Why are the changes needed? Java 20+ changes the behavior. - openjdk/jdk@169a5d4 ``` 8294193: Files.createDirectories throws FileAlreadyExistsException for a symbolic link whose target is an existing directory ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests in Java 21. **JAVA** ``` $ java -version openjdk version "21-ea" 2023-09-19 OpenJDK Runtime Environment (build 21-ea+27-2343) OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing) ``` **BEFORE** ``` $ $ build/sbt "core/test" -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest ... [info] *** 1 TEST FAILED *** [error] Failed: Total 3451, Failed 1, Errors 0, Passed 3450, Ignored 10, Canceled 5 [error] Failed tests: [error] org.apache.spark.util.UtilsSuite [error] (core / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 1040 s (17:20), completed Jun 18, 2023, 12:27:59 AM ``` **AFTER** ``` $ build/sbt "core/testOnly org.apache.spark.util.UtilsSuite" ... [info] All tests passed. [success] Total time: 29 s, completed Jun 17, 2023, 11:16:23 PM ``` Closes apache#41648 from dongjoon-hyun/SPARK-44092. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

… Python and Connect API - part 1 ### What changes were proposed in this pull request? This PR want add date time functions to Scala, Python and Connect API. These functions show below. - dateadd - date_diff - date_from_unix_date - day The origin plan also contains the two function `date_part` and `datepart`. You can see this PR exclude them, since we can't get the data type for unresolved expressions. Please refer https://github.com/apache/spark/blob/b97ce8b9a99c570fc57dec967e7e9db3d115c1db/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2835 and https://github.com/apache/spark/blob/b97ce8b9a99c570fc57dec967e7e9db3d115c1db/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2922 ### Why are the changes needed? Add date time functions to Scala, Python and Connect API. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New test cases. Closes apache#41636 from beliefer/SPARK-43929. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

### What changes were proposed in this pull request? This PR proposes to add one newline into the example of `mapInPandas`. ### Why are the changes needed? You copy and paste the example: ![Screenshot 2023-06-19 at 4 23 25 PM](https://github.com/apache/spark/assets/6477701/56c4693d-63c8-4858-8324-6bfea6628ef7) and it does not work: ``` >>> def filter_func(iterator): ... for pdf in iterator: ... yield pdf[pdf.id == 1] ... df.mapInPandas(filter_func, df.schema).show() File "<stdin>", line 4 df.mapInPandas(filter_func, df.schema).show() ^ ``` We should add one more line to match ### Does this PR introduce _any_ user-facing change? Yes, it changes the example of `mapInPandas` copy-pastable ### How was this patch tested? Manually tested. Closes apache#41655 from HyukjinKwon/minor-doc. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…xpression ### What changes were proposed in this pull request? The `hashCode() `of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. Take for example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, it adopt the hash code of `name` and `children`'s reference and then combine them together as the `GeneralScalarExpression`'s hash code. In fact, we should adopt the hash code for each element in `children`. Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` missing `hashCode()`, this PR also want add them. This PR also improve the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using bool primitive comparison instead `Objects.equals`. Because the performance of bool primitive comparison better than `Objects.equals`. ### Why are the changes needed? Improve the hash code for some DS V2 Expression. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? N/A Closes apache#41543 from beliefer/SPARK-44018. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

This reverts commit 2fdc73e.

### What changes were proposed in this pull request? This PR aims to upgrade Scala to 2.13.11 - https://www.scala-lang.org/news/2.13.11 Additionally, this pr adds a new suppression rule for warning message: `Implicit definition should have explicit type`, this is a new compile check introduced by scala/scala#10083, we must fix it when we upgrading to use Scala 3, ### Why are the changes needed? This release improves collections, adds support for JDK 20 and 21, adds support for JDK 17 `sealed`: - scala/scala#10363 - scala/scala#10184 - scala/scala#10397 - scala/scala#10348 - scala/scala#10105 There are 2 known issues in this version: - scala/bug#12800 - scala/bug#12799 For the first one, there is no compilation warning messages related to `match may not be exhaustive` in Spark compile log, and for the second one, there is no case of `method.isAnnotationPresent(Deprecated.class)` in Spark code, there is just https://github.com/apache/spark/blob/8c84d2c9349d7b607db949c2e114df781f23e438/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L130 in Spark Code, and I checked `javax.annotation.Nonnull` no this issue. So I think These two issues will not affect Spark itself, but this doesn't mean it won't affect the code written by end users themselves The full release notes as follows: - https://github.com/scala/scala/releases/tag/v2.13.11 ### Does this PR introduce _any_ user-facing change? Yes, this is a Scala version change. ### How was this patch tested? - Existing Test - Checked Java 8/17 + Scala 2.13.11 using GA, all test passed Java 8 + Scala 2.13.11: https://github.com/LuciferYang/spark/runs/14337279564 Java 17 + Scala 2.13.11: https://github.com/LuciferYang/spark/runs/14343012195 Closes apache#41626 from LuciferYang/SPARK-40497. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…ewline in all modules ### What changes were proposed in this pull request? I found that there are many instances same as apache#41655. This PR aims to address all the examples in all components in PySpark. ### Why are the changes needed? See apache#41655. ### Does this PR introduce _any_ user-facing change? Yes, it changes the documentation and makes the example copy-pastable, see also apache#41655. ### How was this patch tested? CI in this PR should validate them. This is logically the same as apache#41655. I will also build the documentation locally and test. Closes apache#41657 from HyukjinKwon/minor-newlines. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>

… have custom partition locations

github-actions · 2023-09-28T00:17:30Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Jan 16, 2023

itholic and others added 26 commits June 12, 2023 09:41

WeichenXu123 and others added 25 commits June 19, 2023 17:50

Revert "Fix scalastyle failings"

7061c4d

This reverts commit 2fdc73e.

Fix compilation failure

62b33e7

replace listPartitions with listPartitionNames when target table does…

09c62c6

… have custom partition locations

github-actions bot added the MLLIB label Jun 19, 2023

Fix compilation error

4018db4

github-actions bot added the Stale label Sep 28, 2023

github-actions bot closed this Sep 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases#39595

[SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases#39595
czxm wants to merge 299 commits intoapache:masterfrom
czxm:spark-38230

czxm commented Jan 16, 2023 •

edited

Loading

Uh oh!

czxm commented Jan 16, 2023

Uh oh!

AmplabJenkins commented Jan 17, 2023

Uh oh!

roczei commented Jan 27, 2023

Uh oh!

github-actions bot commented Sep 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Comments

Conversation

czxm commented Jan 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

czxm commented Jan 16, 2023

Uh oh!

AmplabJenkins commented Jan 17, 2023

Uh oh!

roczei commented Jan 27, 2023

Uh oh!

github-actions bot commented Sep 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Comments

czxm commented Jan 16, 2023 •

edited

Loading