[SEDONA-406] Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields. #1043

Kontinuation · 2023-10-03T03:07:25Z

Did you read the Contributor Guide?

Yes, I have read Contributor Rules and Contributor Development Guide

Is this PR related to a JIRA ticket?

Yes. https://issues.apache.org/jira/browse/SEDONA-406

What changes were proposed in this PR?

RasterUDT does not have a corresponding PySpark implementation, which leads to errors when displaying PySpark dataframes that contain raster fields. For example, evaluating df in a Jupyter notebook to display such a dataframe raises an exception:

File /opt/spark/python/pyspark/sql/dataframe.py:2022, in DataFrame.dtypes(self)
   2001 @property
   2002 def dtypes(self) -> List[Tuple[str, str]]:
   2003     """Returns all column names and their data types as a list.
   2004 
   2005     .. versionadded:: 1.3.0
   (...)
   2020     [('age', 'bigint'), ('name', 'string')]
   2021     """
-> 2022     return [(str(f.name), f.dataType.simpleString()) for f in self.schema.fields]

File /opt/spark/python/pyspark/sql/dataframe.py:573, in DataFrame.schema(self)
    569         self._schema = cast(
    570             StructType, _parse_datatype_json_string(self._jdf.schema().json())
    571         )
    572     except Exception as e:
--> 573         raise ValueError("Unable to parse datatype from schema. %s" % e) from e
    574 return self._schema

ValueError: Unable to parse datatype from schema. No module named 'Non'

This is because RasterUDT does not have a PySpark implementation. This patch adds a dummy PySpark implementation for RasterUDT to suppress this error.
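The odd module name 'Non' in the error message can be explained with a small stand-alone sketch. It imitates, in simplified form, how PySpark appears to resolve the Python class of a UDT from the schema JSON: the `pyClass` value is converted to a string, split at the last dot, and the part before the dot is imported as a module. The function `resolve_py_udt`, the module name `demo_raster_types`, and the class `DummyRasterType` are hypothetical names used only for illustration; this is not Sedona's or PySpark's actual code.

```python
import importlib
import sys
import types


def resolve_py_udt(udt_json: dict):
    """Simplified imitation of resolving a UDT's Python class from schema JSON."""
    py_udt = str(udt_json["pyClass"])   # a missing class arrives as None -> "None"
    split = py_udt.rfind(".")           # -1 when the string contains no dot
    module_name = py_udt[:split]        # "None"[:-1] == "Non"
    class_name = py_udt[split + 1:]
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()


# Case 1: the JVM-side UDT declares no Python class.
try:
    resolve_py_udt({"type": "udt", "pyClass": None})
except ModuleNotFoundError as exc:
    missing_error = str(exc)
    print(missing_error)


# Case 2: a placeholder class exists, so the lookup succeeds.
class DummyRasterType:
    """Stand-in that only needs to be importable and constructible."""


# Register the placeholder under a hypothetical module name.
demo_module = types.ModuleType("demo_raster_types")
demo_module.DummyRasterType = DummyRasterType
sys.modules["demo_raster_types"] = demo_module

resolved = resolve_py_udt(
    {"type": "udt", "pyClass": "demo_raster_types.DummyRasterType"}
)
print(type(resolved).__name__)
```

Under these assumptions, a placeholder class that can be imported and constructed is enough for schema parsing to succeed, even though it cannot serialize or deserialize raster values.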

It is hard to implement a functioning RasterType in Python because some components of the rasters are serialized using the Java serializer. We'll switch to a customized Kryo serializer to make RasterType easier to implement.

How was this patch tested?

After applying this patch, evaluating df in a Jupyter notebook yields this output:

DataFrame[name: string, length: bigint, rast: udt]

We've also added a unit test for it.

Did this PR include necessary documentation updates?

No. This PR does not affect any public API, so the docs do not need to change.

Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields.

d4ace5f

Kontinuation changed the title ~~Add a dummy implementation of UDT RasterType to get rid of errors whe…~~ Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields. Oct 3, 2023

jiayuasu added bug sedona-spark labels Oct 3, 2023

jiayuasu added this to the sedona-1.5.0 milestone Oct 3, 2023

jiayuasu approved these changes Oct 3, 2023

jiayuasu merged commit 68e334f into apache:master Oct 3, 2023
40 checks passed