
Unable to use sedona.global.charset in ShapefileReader #1345

Closed
adamaps opened this issue Apr 19, 2024 · 4 comments · Fixed by #1377

Comments

adamaps commented Apr 19, 2024

Expected behavior

ShapefileReader.readToGeometryRDD(sedona_context, shp_file) should use the sedona.global.charset configuration property set in the spark session when reading shapefiles containing non-ASCII characters.

E.g. A shapefile containing an attribute value "Ariñiz/Aríñez" should appear in a dataframe as "Ariñiz/Aríñez".

Actual behavior

ShapefileReader.readToGeometryRDD(sedona_context, shp_file) does not use the charset configuration property set in the Spark context.

E.g. a shapefile containing an attribute value "Ariñiz/Aríñez" appears garbled in the dataframe instead (each accented character is replaced by mojibake, e.g. "ñ" renders as "Ã±").
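As a plain-Python illustration of this kind of garbling (assuming the attribute bytes in the .dbf are UTF-8 but the reader decodes them with a single-byte charset such as ISO-8859-1):

```python
# Sketch of the mojibake effect: UTF-8 bytes decoded with the wrong
# single-byte charset (ISO-8859-1 here) turn each accented character
# into two garbage characters.
original = "Ariñiz/Aríñez"
raw_bytes = original.encode("utf-8")        # bytes as stored in the .dbf
garbled = raw_bytes.decode("iso-8859-1")    # what a wrong-charset reader produces

print(garbled)  # each "ñ" shows up as "Ã±"
```

The garbling is lossless in this direction, which is why forcing the correct charset (here via sedona.global.charset) recovers the original text.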

Steps to reproduce the problem

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.spark import SedonaContext
from sedona.utils.adapter import Adapter

conf = SparkConf()
conf.set("sedona.global.charset", "utf8")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

sedona = SedonaContext.create(spark)
sedona_context = sedona.sparkContext

shp_file = '[aws s3 path to shapefile]'
shp_rdd = ShapefileReader.readToGeometryRDD(sedona_context, shp_file)
shp_df = Adapter.toDf(shp_rdd, sedona)

I can confirm that ("sedona.global.charset", "utf8") appears in the configuration settings by using:

print(sedona_context.getConf().getAll())

I also tried setting the charset property after creating the sedona context as follows (although this appears to be an older solution):

sedona_context.setSystemProperty("sedona.global.charset", "utf8")

Please confirm how to set this configuration property correctly.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.3.0

API type = Python

Python version = 3.10

Environment = AWS Glue 4.0 using sedona-spark-shaded-3.0_2.12-1.5.1.jar and geotools-wrapper-1.5.1-28.2.jar

jiayuasu (Member) commented Apr 19, 2024

@adamaps If you are running Sedona in a cluster mode, this needs to be set via spark.executorEnv.[EnvironmentVariableName]: https://spark.apache.org/docs/latest/configuration.html

In your case, you might want to try this:

spark.executorEnv.sedona.global.charset utf8

spark.executorEnv is a runtime config that can be set after your SparkSession or SedonaContext has been initialized.

spark.conf.set("spark.executorEnv.sedona.global.charset","utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset","utf8")

adamaps (Author) commented Apr 22, 2024

Thank you for the quick response, @jiayuasu !

I tested the following in client mode before creating the Sedona SparkSession/SparkContext (via a local Docker container):

conf = SparkConf()
conf.set("sedona.global.charset", "utf8")  # I have other conf settings not shown here
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sedona = SedonaContext.create(spark)

And I tested both of the following in cluster mode after creating the Sedona SparkSession/SparkContext (via AWS Glue):

spark.conf.set("spark.executorEnv.sedona.global.charset","utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset","utf8")

Unfortunately I still see the same issue in both cases.
Are you able to replicate (or reject) the issue using the attached shapefile sample?

shapefile_sample.zip

Kontinuation (Member) commented

sedona.global.charset has to be set as a Java system property. You can try setting the following spark configurations:

spark.driver.extraJavaOptions  -Dsedona.global.charset=utf8
spark.executor.extraJavaOptions  -Dsedona.global.charset=utf8
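A sketch of passing these at JVM launch time via spark-submit (the --conf flags are standard Spark; the job script name is a placeholder). The same key/value pairs can also go in spark-defaults.conf:

```shell
# Sketch: the system property must reach both the driver and executor JVMs
# at launch, before any shapefile is read.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --conf "spark.executor.extraJavaOptions=-Dsedona.global.charset=utf8" \
  your_job.py
```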

The dataframe loaded from the sample shapefile:

+--------------------+--------------------+--------------------+--------------------+
|            geometry|                  ID|                Name|          Name_ASCII|
+--------------------+--------------------+--------------------+--------------------+
|MULTIPOLYGON (((-...|01015               |Ariñiz/Aríñez    ...|Ariniz/Arinez    ...|
+--------------------+--------------------+--------------------+--------------------+

adamaps (Author) commented Apr 30, 2024

Thank you, @Kontinuation! 🎉

I can confirm that setting the following configuration parameter in PySpark worked for my local setup. And thanks @jiayuasu for updating the docs.

conf.set("spark.driver.extraJavaOptions", "-Dsedona.global.charset=utf8")

Running on AWS Glue is still causing issues, but this seems specific to our setup.
