[SEDONA-406] Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields. #1043

Merged


Kontinuation
Member

@Kontinuation Kontinuation commented Oct 3, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

RasterUDT has no corresponding PySpark implementation, which causes errors when displaying PySpark DataFrames that contain raster columns. For example, evaluating df in a Jupyter notebook to display the DataFrame raises an exception:

File /opt/spark/python/pyspark/sql/dataframe.py:2022, in DataFrame.dtypes(self)
   2001 @property
   2002 def dtypes(self) -> List[Tuple[str, str]]:
   2003     """Returns all column names and their data types as a list.
   2004 
   2005     .. versionadded:: 1.3.0
   (...)
   2020     [('age', 'bigint'), ('name', 'string')]
   2021     """
-> 2022     return [(str(f.name), f.dataType.simpleString()) for f in self.schema.fields]

File /opt/spark/python/pyspark/sql/dataframe.py:573, in DataFrame.schema(self)
    569         self._schema = cast(
    570             StructType, _parse_datatype_json_string(self._jdf.schema().json())
    571         )
    572     except Exception as e:
--> 573         raise ValueError("Unable to parse datatype from schema. %s" % e) from e
    574 return self._schema

ValueError: Unable to parse datatype from schema. No module named 'Non'

This is because RasterUDT does not have a PySpark implementation. This patch adds a dummy PySpark implementation for RasterUDT to suppress this error.
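
The odd module name 'Non' in the message can be reproduced without Spark. The sketch below assumes PySpark derives the module and class names by splitting the UDT's Python class path at its last dot; when the JVM-side UDT declares no Python class, that path is None. The helper name split_py_class is hypothetical and only illustrates the string handling, it is not PySpark's own code.

```python
def split_py_class(py_class):
    """Split a dotted class path at its last dot (assumed PySpark behaviour)."""
    path = str(py_class)      # None becomes the string "None"
    split = path.rfind(".")   # -1 because "None" contains no dot
    # path[:-1] drops the last character, which turns "None" into "Non"
    return path[:split], path[split + 1:]


module_name, class_name = split_py_class(None)
print(module_name, class_name)

try:
    __import__(module_name)   # try to import a module called "Non"
except ImportError as e:
    message = str(e)
    print(message)
```

Under that assumption, the missing Python class surfaces as a failed import of 'Non' rather than as a clear "no Python UDT" error, which matches the ValueError above.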

Implementing a functioning RasterType in Python is hard because some components of the rasters are serialized using the Java serializer. We plan to switch to a customized Kryo serializer, which will make a real RasterType easier to implement.

How was this patch tested?

After applying this patch, evaluating df in jupyter notebook yields this output:

DataFrame[name: string, length: bigint, rast: udt]

We've also added a unit test for it.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API, so there is no need to change the docs.

@Kontinuation Kontinuation changed the title [Add a dummy implementation of UDT RasterType to get rid of errors whe… [Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields. Oct 3, 2023
@Kontinuation Kontinuation changed the title [Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields. [SEDONA-406] Add a dummy implementation of UDT RasterType to get rid of errors when representing pyspark dataframes containing raster fields. Oct 3, 2023
@jiayuasu jiayuasu added this to the sedona-1.5.0 milestone Oct 3, 2023
@jiayuasu jiayuasu merged commit 68e334f into apache:master Oct 3, 2023
40 checks passed
2 participants