Commit 4f9c8b9

[Spark] Upgrade Spark dependency to 3.5.0 in Delta-Spark
The following changes were needed:

* PySpark 3.5 has dropped support for Python 3.7. This required updating the Delta test infra to install the appropriate Python version and packages. The `Dockerfile` used for running tests is also updated with the required Python version and packages, and now uses the same base image as the PySpark test infra in Apache Spark.
* The `StructType.toAttributes` and `StructType.fromAttributes` methods are moved into a utility class, `DataTypeUtils` (a sketch of the new call sites follows below).
* The `iceberg` module is disabled, as there is no released version of `iceberg` that works with Spark 3.5 yet.
* Remove the URI path hack used in `DMLWithDeletionVectorsHelper` to get around a bug in Spark 3.4.
* Remove the unrelated tutorial in `delta/examples/tutorials/saiseu19`.
* Test failure fixes:
  * `org.apache.spark.sql.delta.DeltaHistoryManagerSuite`: the error message has changed.
  * `org.apache.spark.sql.delta.DeltaOptionSuite`: the Parquet file name used with the LZ4 codec has changed due to a change in the `parquet-mr` dependency (apache/parquet-java#1000).
  * `org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite`: Parquet now generates a `row-index` whenever the `_metadata` column is selected; however, Spark 3.5 has a bug where a row group containing more than 2bn rows fails. For now, don't return any `row-index` column in `_metadata` by overriding `metadataSchemaFields: Seq[StructField]` in `DeltaParquetFileFormat`.
  * `org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQuerySuite`: a behavior change from apache/spark#40922; Spark plans now use a new function, `ToPrettyString`, instead of `cast(aggExpr AS STRING)` for `Dataset.show()`.
  * `org.apache.spark.sql.delta.DeltaCDCStreamDeletionVectorSuite` and `org.apache.spark.sql.delta.DeltaCDCStreamSuite`: a regression in the Spark 3.5 RC, fixed by apache/spark#42774 before the Spark 3.5 release.

Closes #1986

GitOrigin-RevId: b0e4a81b608a857e45ecba71b070309347616a30
1 parent: bcd1867
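For reference, a minimal Scala sketch of the `StructType.toAttributes` / `fromAttributes` move described in the commit message, assuming the Spark 3.5 helpers live in `org.apache.spark.sql.catalyst.types.DataTypeUtils` with the signatures shown; the object and schema below are made up for illustration:

    import org.apache.spark.sql.catalyst.expressions.AttributeReference
    import org.apache.spark.sql.catalyst.types.DataTypeUtils
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    object SchemaAttributeMigrationSketch {
      // A made-up schema, just to have something to convert.
      val schema: StructType = new StructType()
        .add(StructField("id", LongType))
        .add(StructField("name", StringType))

      // Spark 3.4 and earlier: conversions were methods on StructType / its companion.
      //   val attrs = schema.toAttributes
      //   val roundTripped = StructType.fromAttributes(attrs)

      // Spark 3.5: the same conversions are static helpers on DataTypeUtils.
      val attrs: Seq[AttributeReference] = DataTypeUtils.toAttributes(schema)
      val roundTripped: StructType = DataTypeUtils.fromAttributes(attrs)
    }

Delta code that previously called the `StructType` methods now goes through `DataTypeUtils` instead.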

58 files changed: +303 / -1846 lines

.github/workflows/spark_test.yaml

Lines changed: 9 additions & 5 deletions
@@ -46,17 +46,21 @@ jobs:
           export PATH="~/.pyenv/bin:$PATH"
           eval "$(pyenv init -)"
           eval "$(pyenv virtualenv-init -)"
-          pyenv install 3.7.4
-          pyenv global system 3.7.4
-          pipenv --python 3.7 install
-          pipenv run pip install pyspark==3.4.0
+          pyenv install 3.8.18
+          pyenv global system 3.8.18
+          pipenv --python 3.8 install
+          pipenv run pip install pyspark==3.5.0
           pipenv run pip install flake8==3.5.0 pypandoc==1.3.3
           pipenv run pip install importlib_metadata==3.10.0
-          pipenv run pip install mypy==0.910
+          pipenv run pip install mypy==0.982
           pipenv run pip install cryptography==37.0.4
           pipenv run pip install twine==4.0.1
           pipenv run pip install wheel==0.33.4
           pipenv run pip install setuptools==41.0.1
+          pipenv run pip install pydocstyle==3.0.0
+          pipenv run pip install pandas==1.0.5
+          pipenv run pip install pyarrow==8.0.0
+          pipenv run pip install numpy==1.20.3
         if: steps.git-diff.outputs.diff
       - name: Run Scala/Java and Python tests
         run: |

Dockerfile

Lines changed: 21 additions & 6 deletions
@@ -13,19 +13,34 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+FROM ubuntu:focal-20221019
 
-# Debian buster LTS is until June 30th 2024. [TODO] Upgrade to a newer version before then.
-FROM openjdk:8-jdk-buster
+ENV DEBIAN_FRONTEND noninteractive
+ENV DEBCONF_NONINTERACTIVE_SEEN true
+
+RUN apt-get update
+RUN apt-get install -y software-properties-common
+RUN apt-get install -y curl
+RUN apt-get install -y wget
+RUN apt-get install -y openjdk-8-jdk
+RUN apt-get install -y python3.8
+RUN apt-get install -y python3-pip
 
-# Install pip3
-RUN apt-get update && apt-get install -y python3-pip
 # Upgrade pip. This is needed to use prebuilt wheels for packages cffi (dep of cryptography) and
 # cryptography. Otherwise, building wheels for these packages fails.
 RUN pip3 install --upgrade pip
 
-RUN pip3 install pyspark==3.4.0
+RUN pip3 install pyspark==3.5.0
+
+RUN pip3 install mypy==0.982
+
+RUN pip3 install pydocstyle==3.0.0
+
+RUN pip3 install pandas==1.0.5
+
+RUN pip3 install pyarrow==8.0.0
 
-RUN pip3 install mypy==0.910
+RUN pip3 install numpy==1.20.3
 
 RUN pip3 install importlib_metadata==3.10.0
 

build.sbt

Lines changed: 7 additions & 3 deletions
@@ -35,7 +35,7 @@ val default_scala_version = settingKey[String]("Default Scala version")
 Global / default_scala_version := scala212
 
 // Dependent library versions
-val sparkVersion = "3.4.0"
+val sparkVersion = "3.5.0"
 val flinkVersion = "1.16.1"
 val hadoopVersion = "3.3.1"
 val scalaTestVersion = "3.2.15"
@@ -115,6 +115,8 @@ lazy val spark = (project in file("spark"))
       "org.apache.spark" %% "spark-sql" % sparkVersion % "test" classifier "tests",
       "org.apache.spark" %% "spark-hive" % sparkVersion % "test" classifier "tests",
     ),
+    // For adding staged Spark RC versions, Ex:
+    // resolvers += "Apche Spark 3.5.0 (RC1) Staging" at "https://repository.apache.org/content/repositories/orgapachespark-1444/",
     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
       listPythonFiles(baseDirectory.value.getParentFile / "python"),
 
@@ -322,6 +324,8 @@ val icebergSparkRuntimeArtifactName = {
   s"iceberg-spark-runtime-$expMaj.$expMin"
 }
 
+/**
+ * Need a Icebeg release version with support for Spark 3.5
 lazy val testDeltaIcebergJar = (project in file("testDeltaIcebergJar"))
   // delta-iceberg depends on delta-spark! So, we need to include it during our test.
   .dependsOn(spark % "test")
@@ -451,7 +455,7 @@ lazy val icebergShaded = (project in file("icebergShaded"))
   assemblyPackageScala / assembleArtifact := false,
   // Make the 'compile' invoke the 'assembly' task to generate the uber jar.
 )
-
+*/
 lazy val hive = (project in file("connectors/hive"))
   .dependsOn(standaloneCosmetic)
   .settings (
@@ -1081,7 +1085,7 @@ val createTargetClassesDir = taskKey[Unit]("create target classes dir")
 
 // Don't use these groups for any other projects
 lazy val sparkGroup = project
-  .aggregate(spark, contribs, storage, storageS3DynamoDB, iceberg)
+  .aggregate(spark, contribs, storage, storageS3DynamoDB)
   .settings(
     // crossScalaVersions must be set to Nil on the aggregating project
     crossScalaVersions := Nil,

dev/tox.ini

Lines changed: 1 addition & 1 deletion
@@ -19,4 +19,4 @@ ignore=E226,E241,E305,E402,E722,E731,E741,W503,W504
 max-line-length=100
 exclude=cloudpickle.py,heapq3.py,shared.py,python/docs/conf.py,work/*/*.py,python/.eggs/*,dist/*
 [pydocstyle]
-ignore=D100,D101,D102,D103,D104,D105,D106,D107,D200,D201,D202,D203,D204,D205,D206,D207,D208,D209,D210,D211,D212,D213,D214,D215,D300,D301,D302,D400,D401,D402,D403,D404,D405,D406,D407,D408,D409,D410,D411,D412,D413,D414
+ignore=D100,D101,D102,D103,D104,D105,D106,D107,D200,D201,D202,D203,D204,D205,D206,D207,D208,D209,D210,D211,D212,D213,D214,D215,D300,D301,D302,D400,D401,D402,D403,D404,D405,D406,D407,D408,D409,D410,D411,D412,D413,D414,D415,D417

docs/generate_api_docs.py

Lines changed: 1 addition & 1 deletion
@@ -142,7 +142,7 @@ def run_cmd(cmd, throw_on_error=True, env=None, stream_output=False, **kwargs):
         stderr = stderr.decode("UTF-8")
 
     exit_code = child.wait()
-    if throw_on_error and exit_code is not 0:
+    if throw_on_error and exit_code != 0:
         raise Exception(
             "Non-zero exitcode: %s\n\nSTDOUT:\n%s\n\nSTDERR:%s" %
             (exit_code, stdout, stderr))
