
Unable to use sedona.global.charset in ShapefileReader #1345

Closed
adamaps opened this issue Apr 19, 2024 · 4 comments · Fixed by #1377

Comments

adamaps commented Apr 19, 2024

Expected behavior

ShapefileReader.readToGeometryRDD(sedona_context, shp_file) should use the sedona.global.charset configuration property set in the spark session when reading shapefiles containing non-ASCII characters.

E.g. A shapefile containing an attribute value "Ariñiz/Aríñez" should appear in a dataframe as "Ariñiz/Aríñez".

Actual behavior

ShapefileReader.readToGeometryRDD(sedona_context, shp_file) does not use the charset configuration property set in the Spark context.

E.g. a shapefile containing an attribute value "Ariñiz/Aríñez" appears garbled in the dataframe instead (each accented character is replaced by mojibake, e.g. "ñ" renders as "Ã±").
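As a plain-Python illustration of this kind of garbling (assuming the attribute bytes in the .dbf are UTF-8 but the reader decodes them with a single-byte charset such as ISO-8859-1):

```python
# Sketch of the mojibake effect: UTF-8 bytes decoded with the wrong
# single-byte charset (ISO-8859-1 here) turn each accented character
# into two garbage characters.
original = "Ariñiz/Aríñez"
raw_bytes = original.encode("utf-8")        # bytes as stored in the .dbf
garbled = raw_bytes.decode("iso-8859-1")    # what a wrong-charset reader produces

print(garbled)  # each "ñ" shows up as "Ã±"
```

The garbling is lossless in this direction, which is why forcing the correct charset (here via sedona.global.charset) recovers the original text.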

Steps to reproduce the problem

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.spark import SedonaContext
from sedona.utils.adapter import Adapter

conf = SparkConf()
conf.set("sedona.global.charset", "utf8")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

sedona = SedonaContext.create(spark)
sedona_context = sedona.sparkContext

shp_file = '[aws s3 path to shapefile]'
shp_rdd = ShapefileReader.readToGeometryRDD(sedona_context, shp_file)
shp_df = Adapter.toDf(shp_rdd, sedona)

I can confirm that ("sedona.global.charset", "utf8") appears in the configuration settings by using:

print(sedona_context.getConf().getAll())

I also tried setting the charset property after creating the sedona context as follows (although this appears to be an older solution):

sedona_context.setSystemProperty("sedona.global.charset", "utf8")

Please confirm how to set this configuration property correctly.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.3.0

API type = Python

Python version = 3.10

Environment = AWS Glue 4.0 using sedona-spark-shaded-3.0_2.12-1.5.1.jar and geotools-wrapper-1.5.1-28.2.jar

jiayuasu (Member) commented Apr 19, 2024

@adamaps If you are running Sedona in a cluster mode, this needs to be set via spark.executorEnv.[EnvironmentVariableName]: https://spark.apache.org/docs/latest/configuration.html

In your case, you might want to try this:

spark.executorEnv.sedona.global.charset utf8

spark.executorEnv is a runtime config that can be set after your SparkSession or SedonaContext has been initialized.

spark.conf.set("spark.executorEnv.sedona.global.charset","utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset","utf8")

adamaps (Author) commented Apr 22, 2024

Thank you for the quick response, @jiayuasu !

I tested the following in client mode before creating the Sedona SparkSession/SparkContext (via a local Docker container):

conf = SparkConf()
conf.set("sedona.global.charset", "utf8")  # I have other conf settings not shown here
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sedona = SedonaContext.create(spark)

And I tested both of the following in cluster mode after creating the Sedona SparkSession/SparkContext (via AWS Glue):

spark.conf.set("spark.executorEnv.sedona.global.charset","utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset","utf8")

Unfortunately I still see the same issue in both cases.
Are you able to replicate (or reject) the issue using the attached shapefile sample?

shapefile_sample.zip

Kontinuation (Member) commented

sedona.global.charset has to be set as a Java system property. You can try setting the following spark configurations:

spark.driver.extraJavaOptions  -Dsedona.global.charset=utf8
spark.executor.extraJavaOptions  -Dsedona.global.charset=utf8
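A sketch of passing these at JVM launch time via spark-submit (the --conf flags are standard Spark; the job script name is a placeholder). The same key/value pairs can also go in spark-defaults.conf:

```shell
# Sketch: the system property must reach both the driver and executor JVMs
# at launch, before any shapefile is read.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --conf "spark.executor.extraJavaOptions=-Dsedona.global.charset=utf8" \
  your_job.py
```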

The dataframe loaded from the sample shapefile:

+--------------------+--------------------+--------------------+--------------------+
|            geometry|                  ID|                Name|          Name_ASCII|
+--------------------+--------------------+--------------------+--------------------+
|MULTIPOLYGON (((-...|01015               |Ariñiz/Aríñez    ...|Ariniz/Arinez    ...|
+--------------------+--------------------+--------------------+--------------------+

adamaps (Author) commented Apr 30, 2024

Thank you, @Kontinuation! 🎉

I can confirm that setting the following configuration parameter in PySpark worked for my local setup. And thanks @jiayuasu for updating the docs.

conf.set("spark.driver.extraJavaOptions", "-Dsedona.global.charset=utf8")

Running on AWS Glue is still causing issues, but this seems specific to our setup.
