
[SEDONA-406] Raster deserializer for PySpark #1281

Merged 3 commits into apache:master from oss-pyspark-raster-deserializer on Mar 22, 2024

Conversation

Kontinuation
Member

@Kontinuation Kontinuation commented Mar 21, 2024

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

API changes

This PR adds a new class SedonaRaster to the sedona Python package. Raster objects in Sedona are converted to SedonaRaster objects in Python when collecting raster columns in PySpark:

```python
rows = df_rast.collect()
rast = rows[0]['rast']  # rast is a SedonaRaster object

# You can get the metadata of the raster by accessing the properties of the SedonaRaster object
print(rast.width, rast.height)
print(rast.affine_trans)
print(rast.crs_wkt)

# You can get the band data as a NumPy array
arr = rast.as_numpy()

# You can also get a rasterio DatasetReader object
ds = rast.as_rasterio()

# Please close the SedonaRaster after using it to free up resources allocated for the rasterio DatasetReader object
rast.close()
```
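
Before calling `rast.close()`, the DatasetReader returned by `as_rasterio()` can be used like any other rasterio dataset. A minimal sketch (`read(1)` is plain rasterio API, not Sedona-specific):

```python
ds = rast.as_rasterio()

# Read the first band as a 2-D NumPy array and inspect it
band1 = ds.read(1)
print(band1.shape, band1.dtype)

# Closing the SedonaRaster also frees the resources backing the DatasetReader
rast.close()
```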

Users can define Pandas UDFs that take raster objects as parameters. Use the deserialize function in the sedona.raster.raster_serde module to deserialize the bytes into a SedonaRaster object before processing it. Please note that this only works with Spark >= 3.4.0.

```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# A Python Pandas UDF that takes a raster as input and returns the sum of its band values
@pandas_udf(IntegerType())
def pandas_udf_raster_as_param(s: pd.Series) -> pd.Series:
    from sedona.raster import raster_serde

    def func(x):
        # x holds the serialized raster bytes; deserialize them into a SedonaRaster
        with raster_serde.deserialize(x) as raster:
            arr = raster.as_numpy()
            return int(np.sum(arr))

    return s.apply(func)

spark.udf.register("pandas_udf_raster_as_param", pandas_udf_raster_as_param)
```
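
Once registered, the UDF can be called from the DataFrame API or Spark SQL. A short usage sketch, reusing df_rast and its rast column from the example above (the view name rasters is hypothetical):

```python
# Call the registered Pandas UDF from the DataFrame API
df_rast.selectExpr("pandas_udf_raster_as_param(rast) AS band_sum").show()

# Or from Spark SQL via a temporary view
df_rast.createOrReplaceTempView("rasters")
spark.sql("SELECT pandas_udf_raster_as_param(rast) AS band_sum FROM rasters").show()
```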

Internal changes

  • Changed the serialization format of RasterUDT to a language-neutral format
    • Notably, the CRS is now serialized to WKT instead of using the Java serializer (see the sketch after this list).
    • This also significantly improved the performance of raster serialization/deserialization, since the new Kryo serializer is much faster than the Java serializer we used before.
  • Added a raster deserializer for PySpark
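
Because the CRS now travels as WKT, it can be inspected on the Python side without touching the JVM. A minimal sketch, assuming pyproj is installed (pyproj is not a Sedona dependency; it is used here only for illustration):

```python
from pyproj import CRS

# rast is a SedonaRaster collected as in the example above;
# crs_wkt is the WKT string carried by the new language-neutral serialization format
crs = CRS.from_wkt(rast.crs_wkt)
print(crs.to_epsg(), crs.name)
```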

How was this patch tested?

Added new tests

Did this PR include necessary documentation updates?

  • Yes. I've updated the documentation.

@Kontinuation
Member Author

Kontinuation commented Mar 21, 2024

The R test failure is unrelated to this PR; recent updates to sparklyr or dbplyr caused it. See sparklyr/sparklyr#3429.

@Kontinuation Kontinuation force-pushed the oss-pyspark-raster-deserializer branch from f903332 to 5aea3b1 Compare March 21, 2024 07:36
@Kontinuation Kontinuation changed the title [SEDONA-406] Raster deserializer for PySpark (#116) [SEDONA-406] Raster deserializer for PySpark Mar 21, 2024
@Kontinuation Kontinuation marked this pull request as ready for review March 21, 2024 09:11
2. Compile the Sedona Scala and Java code with `-Dgeotools` and then copy the ==sedona-spark-shaded-{{ sedona.current_version }}.jar== to ==SPARK_HOME/jars/== folder.
2. Put JAI jars to ==SPARK_HOME/jars/== folder.
```
export JAI_CORE_VERSION="1.1.3"
```
Member

Should we put these jars in geotools-wrapper?

Member Author

These jars are already in geotools-wrapper, so we can instead put the geotools-wrapper jar into the $SPARK_HOME/jars/ folder and build the spark-shaded jar without -Dgeotools. However, this won't allow testing dependency changes such as adding jiffle as a new dependency.

I can update the document to use geotools-wrapper instead of the JAI jars directly, since it is much easier (no need to rebuild with -Dgeotools for testing Sedona Python) and covers most cases.

Member

This is fine then. We don't need to update this.

@@ -583,6 +583,44 @@ SELECT RS_AsPNG(raster)

Please refer to [Raster writer docs](../../api/sql/Raster-writer) for more details.

## Collecting raster Dataframes and working with them locally in Python
Member

Can you add one more section explaining how to write a regular Python user-defined function (not a Pandas UDF) that works on the raster type? I understand that the UDF cannot return a raster type directly since we only have a Python deserializer, but with RS_MakeRaster() plus a NumPy array we can still construct the raster type. It is important to show this workflow. Maybe we can show this in a separate doc PR?
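
For reference, a minimal sketch of that workflow. It assumes the raster column reaches the regular UDF as serialized bytes (as in the Pandas UDF example above); the name band_values, the chosen return type, and the idea of feeding the returned array into a constructor such as RS_MakeRaster() are illustrative, not the documented API:

```python
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# A regular (non-Pandas) Python UDF operating on the raster type
@udf(ArrayType(DoubleType()))
def band_values(raster_bytes):
    from sedona.raster import raster_serde

    # Assumption: the column arrives as serialized raster bytes; if PySpark already
    # hands over a SedonaRaster object, the deserialize step can be dropped
    with raster_serde.deserialize(raster_bytes) as raster:
        arr = raster.as_numpy()
        # Return a flat list of band values; on the SQL side this array could be
        # turned back into a raster with a constructor such as RS_MakeRaster()
        return [float(v) for v in np.asarray(arr).ravel()]

spark.udf.register("band_values", band_values)
```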

Member Author

New section added.

@Kontinuation Kontinuation force-pushed the oss-pyspark-raster-deserializer branch from 5aea3b1 to b7c4881 Compare March 22, 2024 00:11
@jiayuasu jiayuasu merged commit 63a1de0 into apache:master Mar 22, 2024
37 of 49 checks passed