
Breaking change between 1.5.3 and 1.6.0 affecting RASTER functions java.lang.NoSuchMethodError: void org.geotools.coverage.grid.GridGeometry2D #1477

Closed
golfalot opened this issue Jun 13, 2024 · 5 comments


golfalot commented Jun 13, 2024

Expected behavior

return result rows/table

Actual behavior

crash with stack trace

java.lang.NoSuchMethodError: 'void org.geotools.coverage.grid.GridGeometry2D.<init>(org.opengis.coverage.grid.GridEnvelope, org.opengis.referencing.datum.PixelInCell, org.opengis.referencing.operation.MathTransform, org.opengis.referencing.crs.CoordinateReferenceSystem, org.geotools.util.factory.Hints)'

Steps to reproduce the problem

from sedona.spark import SedonaContext
config = SedonaContext.builder(). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.6.0,'
           'org.datasyslab:geotools-wrapper:1.6.0-28.2'). \
    getOrCreate()
sedona = SedonaContext.create(config)

from pyspark.sql import functions as f
df = sedona.read.format("binaryFile").load("/raw/GIS_Raster_Data/samples/test.nc")
df2 = df.withColumn("raster", f.expr("RS_FromNetCDF(content, 'O3')"))
df2.createOrReplaceTempView("raster_table")

# this command throws the error
sedona.sql("SELECT RS_Value(raster, 3, 4, 1) FROM raster_table").show()

Raster sourced from: https://github.com/apache/sedona/blob/master/spark/common/src/test/resources/raster/netcdf/test.nc

Settings

Sedona version = 1.6.0

Apache Spark version = 3.4

API type = Python

Scala version = 2.12.17

Java version = 11

Python version = 3.10

Environment = Azure Synapse Spark Pool

Additional background

We're using Azure Synapse with DEP (data exfiltration protection) enabled, which means no outbound internet access, so all packages must be obtained manually and uploaded as "Workspace packages", which can then be enabled on the Spark pools.
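
For context, a minimal sketch of what the session setup looks like in that situation, pointing Spark at locally staged jars via spark.jars instead of resolving spark.jars.packages from Maven. The paths are hypothetical placeholders; on Synapse the workspace packages are normally attached through the pool configuration rather than in code:

# Sketch for an air-gapped setup: reference locally staged jars directly
# instead of Maven coordinates. Paths are hypothetical placeholders.
from sedona.spark import SedonaContext

config = SedonaContext.builder(). \
    config('spark.jars',
           '/path/to/sedona-spark-shaded-3.4_2.12-1.6.0.jar,'
           '/path/to/geotools-wrapper-1.6.0-28.2.jar'). \
    getOrCreate()
sedona = SedonaContext.create(config)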

A configuration that works (no error)

Spark pool

  • Apache Spark version = 3.4
  • Scala version = 2.12.17
  • Java version = 11
  • Python version = 3.10

Java

Python

  • apache_sedona-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • shapely-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

A configuration that causes the error

Spark pool (identical to above)

  • Apache Spark version = 3.4
  • Scala version = 2.12.17
  • Java version = 11
  • Python version = 3.10

Packages

Java

Python

  • click_plugins-1.1.1-py2.py3-none-any.whl
  • affine-2.4.0-py3-none-any.whl
  • apache_sedona-1.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • cligj-0.7.2-py3-none-any.whl
  • rasterio-1.3.10-cp310-cp310-manylinux2014_x86_64.whl
  • shapely-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • snuggs-1.4.7-py3-none-any.whl

Stating the obvious: there are many packages listed in the failing scenario. See below for the convoluted steps needed to establish what packages are required for a baseline Synapse Spark pool.

How to establish Python package dependencies for Synapse Spark pool

Identify Operating System

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-34-runtime

=> Mariner 2.0

Create a VM and apply baseline configuration

https://github.com/microsoft/azurelinux/blob/2.0/toolkit/docs/quick_start/quickstart.md

Get conda

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
sudo bash Miniforge3-Linux-x86_64.sh -b -p /usr/lib/miniforge3
export PATH="/usr/lib/miniforge3/bin:$PATH"

Apply baseline Synapse configuration

sudo tdnf -y install gcc g++
wget https://raw.githubusercontent.com/Azure-Samples/Synapse/main/Spark/Python/Synapse-Python310-CPU.yml
conda env create -n synapse-env -f Synapse-Python310-CPU.yml
source activate synapse-env

Install pip packages and determine which packages are downloaded above and beyond the baseline packages

Create a requirements file (the commented-out line is the 1.5.3 variant):

# echo "apache-sedona==1.5.3" > input-user-req.txt
echo "apache-sedona==1.6.0" > input-user-req.txt

install apache-sedona and dependencies

pip install -r input-user-req.txt > pip_output.txt

list the packages that pip downloaded

cat pip_output.txt | grep Downloading

Use the above output to identify the .whl files to download and add to Synapse.
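
As a convenience, a small sketch that pulls out just the wheel names, assuming the pip_output.txt produced above (pip's "Downloading ..." line format can vary slightly between versions):

# Sketch: extract the wheel names pip reported downloading.
# Assumes pip_output.txt from the step above.
with open("pip_output.txt") as fh:
    wheels = [line.split()[1] for line in fh
              if line.strip().startswith("Downloading")]
print("\n".join(wheels))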

Full stack trace of error

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[19], line 1
----> 1 sedona.sql("SELECT RS_Value(raster, ST_Point(507573, 103477)) FROM raster_table").show()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py:899, in DataFrame.show(self, n, truncate, vertical)
    893     raise PySparkTypeError(
    894         error_class="NOT_BOOL",
    895         message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
    896     )
    898 if isinstance(truncate, bool) and truncate:
--> 899     print(self._jdf.showString(n, 20, vertical))
    900 else:
    901     try:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
    167 def deco(*a: Any, **kw: Any) -> Any:
    168     try:
--> 169         return f(*a, **kw)
    170     except Py4JJavaError as e:
    171         converted = convert_exception(e.java_exception)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o4346.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 5) (vm-32f63676 executor 1): java.lang.NoSuchMethodError: 'void org.geotools.coverage.grid.GridGeometry2D.<init>(org.opengis.coverage.grid.GridEnvelope, org.opengis.referencing.datum.PixelInCell, org.opengis.referencing.operation.MathTransform, org.opengis.referencing.crs.CoordinateReferenceSystem, org.geotools.util.factory.Hints)'
	at org.apache.sedona.common.raster.RasterConstructors.makeNonEmptyRaster(RasterConstructors.java:375)
	at org.apache.sedona.common.raster.netcdf.NetCdfReader.getRasterHelper(NetCdfReader.java:282)
	at org.apache.sedona.common.raster.netcdf.NetCdfReader.getRaster(NetCdfReader.java:77)
	at org.apache.sedona.common.raster.RasterConstructors.fromNetCDF(RasterConstructors.java:79)
	at org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromNetCDF$$anonfun$$lessinit$greater$17.apply(RasterConstructors.scala:196)
	at org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromNetCDF$$anonfun$$lessinit$greater$17.apply(RasterConstructors.scala:196)
	at org.apache.spark.sql.sedona_sql.expressions.InferrableFunctionConverter$.$anonfun$inferrableFunction2$2(InferrableFunctionConverter.scala:53)
	at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.evalWithoutSerialization(InferredExpression.scala:70)
	at org.apache.spark.sql.sedona_sql.expressions.raster.implicits$RasterInputExpressionEnhancer.toRaster(implicits.scala:32)
	at org.apache.spark.sql.sedona_sql.expressions.InferrableRasterTypes$.rasterExtractor(InferrableRasterTypes.scala:43)
	at org.apache.spark.sql.sedona_sql.expressions.InferredRasterExpression$.$anonfun$rasterExtractor$2(InferredRasterExpression.scala:48)
	at org.apache.spark.sql.sedona_sql.expressions.InferrableFunctionConverter$.$anonfun$inferrableFunction2$2(InferrableFunctionConverter.scala:50)
	at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.eval(InferredExpression.scala:69)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:425)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2799)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2735)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2734)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2734)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1218)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1218)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1218)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2998)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2937)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2926)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:977)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2418)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2439)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:566)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:519)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4203)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3174)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4193)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:642)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4191)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:214)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:67)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4191)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3174)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3395)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:297)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:336)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NoSuchMethodError: 'void org.geotools.coverage.grid.GridGeometry2D.<init>(org.opengis.coverage.grid.GridEnvelope, org.opengis.referencing.datum.PixelInCell, org.opengis.referencing.operation.MathTransform, org.opengis.referencing.crs.CoordinateReferenceSystem, org.geotools.util.factory.Hints)'
	at org.apache.sedona.common.raster.RasterConstructors.makeNonEmptyRaster(RasterConstructors.java:375)
	at org.apache.sedona.common.raster.netcdf.NetCdfReader.getRasterHelper(NetCdfReader.java:282)
	at org.apache.sedona.common.raster.netcdf.NetCdfReader.getRaster(NetCdfReader.java:77)
	at org.apache.sedona.common.raster.RasterConstructors.fromNetCDF(RasterConstructors.java:79)
	at org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromNetCDF$$anonfun$$lessinit$greater$17.apply(RasterConstructors.scala:196)
	at org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromNetCDF$$anonfun$$lessinit$greater$17.apply(RasterConstructors.scala:196)
	at org.apache.spark.sql.sedona_sql.expressions.InferrableFunctionConverter$.$anonfun$inferrableFunction2$2(InferrableFunctionConverter.scala:53)
	at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.evalWithoutSerialization(InferredExpression.scala:70)
	at org.apache.spark.sql.sedona_sql.expressions.raster.implicits$RasterInputExpressionEnhancer.toRaster(implicits.scala:32)
	at org.apache.spark.sql.sedona_sql.expressions.InferrableRasterTypes$.rasterExtractor(InferrableRasterTypes.scala:43)
	at org.apache.spark.sql.sedona_sql.expressions.InferredRasterExpression$.$anonfun$rasterExtractor$2(InferredRasterExpression.scala:48)
	at org.apache.spark.sql.sedona_sql.expressions.InferrableFunctionConverter$.$anonfun$inferrableFunction2$2(InferrableFunctionConverter.scala:50)
	at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.eval(InferredExpression.scala:69)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:425)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
@golfalot

I don't currently have the skills to debug this beyond this issue report. I hope the info provided is sufficiently comprehensive; please don't hesitate to request further info from me.

An observation that is possibly material: the geotools-wrapper 1.6.0-28.2 release excludes GeographicLib; see https://github.com/jiayuasu/geotools-wrapper/releases/tag/1.6.0-28.2

@golfalot

I also noted another change that could be material:

https://docs.geotools.org/stable/userguide/welcome/upgrade.html

GeoTools 30.x
Replace constructor Envelope2D(crs,x,y,width,height) with ReferencedEnvelope.rect(x,y,width,height,crs)

@Kontinuation

The problem cannot be reproduced by simply launching pyspark locally using pyspark --packages org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.6.0,org.datasyslab:geotools-wrapper:1.6.0-28.2. The java.lang.NoSuchMethodError exception raised by the Spark executor is really weird, since this constructor does exist in org.datasyslab:geotools-wrapper:1.6.0-28.2. BTW, Sedona 1.6.0 is still using GeoTools 28.2; we reverted the GeoTools version upgrade before the release.

After trying lots of possible misconfigurations, I finally reproduced this problem locally by putting both geotools-wrapper-1.6.0-28.2.jar and geotools-wrapper-1.6.0-31.0.jar on the classpath. Please make sure that only geotools-wrapper-1.6.0-28.2 is actually being used, and remove geotools-wrapper-1.6.0-31.0.jar from the cluster to avoid any misbehavior.
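
For anyone hitting this in the future, a quick driver-side sanity check, sketched assuming the `sedona` session from the repro above. Executors have their own classpath, so a clean result here doesn't fully rule out a stale jar on the workers, but a wrong jar showing up here is a strong signal:

# Sketch: ask the JVM which jar it actually loaded GridGeometry2D from.
# Assumes `sedona` is the SparkSession created via SedonaContext.create().
jvm = sedona._jvm
cls = jvm.java.lang.Class.forName("org.geotools.coverage.grid.GridGeometry2D")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())
# A path ending in geotools-wrapper-1.6.0-31.0.jar would explain the error.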

@golfalot

Yes! Thank you @Kontinuation, exactly right. geotools-wrapper-1.6.0-31.0.jar was present on the Synapse Spark pool; removing it and leaving only geotools-wrapper-1.6.0-28.2.jar fixes the issue.

  • It was probably naive of me to leave it there and assume that simply referencing the desired version in my config statement would suffice to disambiguate. I shall remember this important lesson!

  • Why was 1.6.0-31.0 there in the first place?
    This is my first time getting up and running with Sedona. Once I figured out what jars I needed, I trawled Maven to find the latest versions of the two jars for my target versions of Spark and Scala: https://mvnrepository.com/artifact/org.datasyslab/geotools-wrapper/1.6.0-31.0

  • How did I get the idea to downgrade to 1.6.0-28.2?
    When I headed over to https://github.com/jiayuasu/geotools-wrapper I saw that the latest release was in fact 1.6.0-28.2 and NOT 1.6.0-31.0.

  • I wonder if it would be helpful for other users if version 1.6.0-31.0 were completely deleted from Maven, as this version did not work for me.

Thank you once again for the great guidance, I appreciate the time you spent on this.


jiayuasu commented Jun 14, 2024

@golfalot In the first release candidate of Sedona 1.6.0, we wanted to upgrade Sedona to use GeoTools 31.0, and that's why geotools-wrapper 1.6.0-31.0 exists. However, given the current user base of Sedona, we decided to revert this breaking change in the final release candidate of Sedona 1.6.0 and fall back to geotools-wrapper 1.6.0-28.2.

The exact geotools-wrapper version to use is always mentioned here: https://sedona.apache.org/1.6.0/setup/maven-coordinates/

Unfortunately, Maven Central does not allow anyone to retract a published package, so we cannot delete geotools-wrapper 1.6.0-31.0.
