
[SEDONA-406] Raster deserializer for PySpark #1281

Merged 3 commits into apache:master from oss-pyspark-raster-deserializer on Mar 22, 2024

Conversation

Kontinuation
Member

@Kontinuation Kontinuation commented Mar 21, 2024

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

API changes

This PR adds a new class SedonaRaster to the sedona Python package. Raster objects in Sedona are converted to SedonaRaster objects in Python when collecting raster columns in PySpark:

```python
rows = df_rast.collect()
rast = rows[0]['rast']  # rast is a SedonaRaster object

# You can get the metadata of the raster by accessing the properties of the SedonaRaster object
print(rast.width, rast.height)
print(rast.affine_trans)
print(rast.crs_wkt)

# You can get the band data as a NumPy array
arr = rast.as_numpy()

# You can also get a rasterio DatasetReader object
ds = rast.as_rasterio()

# Please close the SedonaRaster after using it to free up resources allocated for the rasterio DatasetReader object
rast.close()
```
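
Before calling `rast.close()`, the DatasetReader returned by `as_rasterio()` can be used like any other rasterio dataset. A minimal sketch (`read(1)` is plain rasterio API, not Sedona-specific):

```python
ds = rast.as_rasterio()

# Read the first band as a 2-D NumPy array and inspect it
band1 = ds.read(1)
print(band1.shape, band1.dtype)

# Closing the SedonaRaster also frees the resources backing the DatasetReader
rast.close()
```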

Users can define Pandas UDFs that take raster objects as parameters. Use the deserialize function in the sedona.raster.raster_serde module to deserialize the bytes into a SedonaRaster object before processing it. Please note that this only works with Spark >= 3.4.0.

```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# A Python Pandas UDF that takes a raster as input and returns the sum of its band values
@pandas_udf(IntegerType())
def pandas_udf_raster_as_param(s: pd.Series) -> pd.Series:
    from sedona.raster import raster_serde

    def func(x):
        # x holds the serialized raster bytes; deserialize them into a SedonaRaster
        with raster_serde.deserialize(x) as raster:
            arr = raster.as_numpy()
            return int(np.sum(arr))

    return s.apply(func)

spark.udf.register("pandas_udf_raster_as_param", pandas_udf_raster_as_param)
```
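
Once registered, the UDF can be called from the DataFrame API or Spark SQL. A short usage sketch, reusing df_rast and its rast column from the example above (the view name rasters is hypothetical):

```python
# Call the registered Pandas UDF from the DataFrame API
df_rast.selectExpr("pandas_udf_raster_as_param(rast) AS band_sum").show()

# Or from Spark SQL via a temporary view
df_rast.createOrReplaceTempView("rasters")
spark.sql("SELECT pandas_udf_raster_as_param(rast) AS band_sum FROM rasters").show()
```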

Internal changes

  • Changed the serialization format of RasterUDT to a language-neutral format
    • Notably, the CRS is now serialized to WKT instead of using the Java serializer (see the sketch after this list).
    • This also significantly improved the performance of raster serialization/deserialization, since the new Kryo serializer is much faster than the Java serializer we used before.
  • Added a raster deserializer for PySpark
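
Because the CRS now travels as WKT, it can be inspected on the Python side without touching the JVM. A minimal sketch, assuming pyproj is installed (pyproj is not a Sedona dependency; it is used here only for illustration):

```python
from pyproj import CRS

# rast is a SedonaRaster collected as in the example above;
# crs_wkt is the WKT string carried by the new language-neutral serialization format
crs = CRS.from_wkt(rast.crs_wkt)
print(crs.to_epsg(), crs.name)
```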

How was this patch tested?

Added new tests

Did this PR include necessary documentation updates?

  • Yes. I've updated the documentation.

@Kontinuation
Member Author

Kontinuation commented Mar 21, 2024

The R test failure is unrelated to this PR; recent updates to sparklyr or dbplyr caused it. See sparklyr/sparklyr#3429.

@Kontinuation Kontinuation force-pushed the oss-pyspark-raster-deserializer branch from f903332 to 5aea3b1 Compare March 21, 2024 07:36
@Kontinuation Kontinuation changed the title [SEDONA-406] Raster deserializer for PySpark (#116) [SEDONA-406] Raster deserializer for PySpark Mar 21, 2024
@Kontinuation Kontinuation marked this pull request as ready for review March 21, 2024 09:11
2. Compile the Sedona Scala and Java code with `-Dgeotools` and then copy the ==sedona-spark-shaded-{{ sedona.current_version }}.jar== to ==SPARK_HOME/jars/== folder.
2. Put JAI jars to ==SPARK_HOME/jars/== folder.
```
export JAI_CORE_VERSION="1.1.3"
```
Member

Should we put these jars in geotools-wrapper?

Member Author

These jars are already in geotools-wrapper, so we can instead put the geotools-wrapper jar into the $SPARK_HOME/jars/ folder and build the spark-shaded jar without -Dgeotools. However, this won't allow testing dependency changes such as adding jiffle as a new dependency.

I can update the document to use geotools-wrapper instead of the JAI jars directly, since it is much easier (no need to rebuild with -Dgeotools for testing Sedona Python) and covers most cases.

Member

This is fine then. We don't need to update this.

@@ -583,6 +583,44 @@ SELECT RS_AsPNG(raster)

Please refer to [Raster writer docs](../../api/sql/Raster-writer) for more details.

## Collecting raster Dataframes and working with them locally in Python
Member

Can you add one more section explaining how to write a regular Python user-defined function (not a Pandas UDF) that works on the raster type? I understand that the UDF cannot return a raster type directly since we only have a Python deserializer, but with RS_MakeRaster() plus a NumPy array we can still construct the raster type. It is important to show this workflow. Maybe we can show this in a separate doc PR?
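
For reference, a minimal sketch of that workflow. It assumes the raster column reaches the regular UDF as serialized bytes (as in the Pandas UDF example above); the name band_values, the chosen return type, and the idea of feeding the returned array into a constructor such as RS_MakeRaster() are illustrative, not the documented API:

```python
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# A regular (non-Pandas) Python UDF operating on the raster type
@udf(ArrayType(DoubleType()))
def band_values(raster_bytes):
    from sedona.raster import raster_serde

    # Assumption: the column arrives as serialized raster bytes; if PySpark already
    # hands over a SedonaRaster object, the deserialize step can be dropped
    with raster_serde.deserialize(raster_bytes) as raster:
        arr = raster.as_numpy()
        # Return a flat list of band values; on the SQL side this array could be
        # turned back into a raster with a constructor such as RS_MakeRaster()
        return [float(v) for v in np.asarray(arr).ravel()]

spark.udf.register("band_values", band_values)
```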

Member Author

New section added.

@Kontinuation Kontinuation force-pushed the oss-pyspark-raster-deserializer branch from 5aea3b1 to b7c4881 Compare March 22, 2024 00:11
@jiayuasu jiayuasu merged commit 63a1de0 into apache:master Mar 22, 2024
37 of 49 checks passed