
Need help with pyspark #408

Closed

ananbas opened this issue Dec 26, 2019 · 6 comments

Comments

ananbas commented Dec 26, 2019

Expected behavior

GeoSparkRegistrator.registerAll(spark) returns 0

Actual behavior

Got this error:

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, bddevdn0001, executor 4): java.lang.IllegalStateException: unread block data

Steps to reproduce the problem

This is my code:

import findspark
findspark.init('/opt/cloudera/parcels/CDH/lib/spark')

import pyspark
import geo_pyspark
from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator

spark = SparkSession \
    .builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.jars", "local:/home/anung/geospark/geo_wrapper.jar,local:/home/anung/geospark/geospark-1.2.0.jar,local:/home/anung/geospark/geospark-sql_2.3-1.2.0.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator") \
    .appName("anung-geospark-test") \
    .enableHiveSupport() \
    .getOrCreate()

GeoSparkRegistrator.registerAll(spark)

Settings

GeoSpark version = 1.2

Apache Spark version = 2.4.0-cdh6.2.1

JRE version = 1.8

API type = Python
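
A quick sketch to confirm these versions from inside the session (spark is the session created above; the _jvm gateway is a PySpark internal, used here only as a sanity check):

import sys

print(spark.version)  # expect 2.4.0-cdh6.2.1
print(sys.version)    # Python interpreter used by the driver
# JRE version seen on the JVM side (expect 1.8.x)
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))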

Imbruced (Member) commented

Can you check a few things?

  • Does the GeoSparkSQL Scala/Java API produce the same error?
  • Do you have the JAVA_HOME environment variable set?
  • Does any action on a Spark DataFrame work correctly without GeoSpark? (See the sketch after this list.)
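
A minimal sketch of the third check, using a plain Spark session with no GeoSpark jars or Kryo registrator configured (the app name is a placeholder):

from pyspark.sql import SparkSession

# Plain session: no GeoSpark jars, no Kryo registrator
spark = SparkSession.builder \
    .appName("plain-spark-sanity-check") \
    .getOrCreate()

# Any action that ships tasks to the executors will do; if this also
# fails with "unread block data", the problem is in the cluster setup
# rather than in GeoSpark.
spark.range(100).selectExpr("sum(id) AS total").show()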

ananbas (Author) commented Dec 29, 2019

Hi Imbruced,

  1. Using the Scala API there is no error; everything works and the example project runs.
  2. Yes, of course.
  3. Yes, we use pyspark for daily ETL.

Imbruced (Member) commented

Hi,
To me it looks like a dependency issue. The library was tested with native Spark (2.2, 2.3, 2.4) for Hadoop distribution 2.7 and Scala 2.11. I will try to reproduce the problem on a cdh6.2.1 environment. I have two possible solutions and need to test them (I need 2-3 days).

Imbruced (Member) commented Jan 2, 2020

Hi,
I was not able to reproduce your issue. I added changes which may solve it to my forked version:
https://github.com/Imbruced/GeoSpark/tree/master/python. Please let me know if it works for you.
The geo_wrapper jar is no longer needed; please remove it from the spark.jars option. The adjusted configuration is sketched below.
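
A minimal sketch of the adjusted builder from the original report, with geo_wrapper.jar dropped from spark.jars (the remaining jar paths are the ones from the report and are assumed unchanged):

from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator

# Same configuration as in the report, minus geo_wrapper.jar
spark = SparkSession \
    .builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.jars", "local:/home/anung/geospark/geospark-1.2.0.jar,local:/home/anung/geospark/geospark-sql_2.3-1.2.0.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator") \
    .appName("anung-geospark-test") \
    .enableHiveSupport() \
    .getOrCreate()

GeoSparkRegistrator.registerAll(spark)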

jiayuasu pushed a commit that referenced this issue Feb 3, 2020
* Fix Issue, unread block data (#408)

* Add GeoSpark core Python API, version beta.

* Fix issue with additional else statement.

* Add WkbReader to direct imports, Fix issue with version tests.

* Add geo_pyspark version 0.3.0.

* Add geo_pyspark version 0.3.0.

* Update wheel file for geo_pyspark version 0.3.0.

* Improve serialization process for GeoSpark Python.

* Fix Issue with Adapter import.

* Create example notebook for GeoPysparkSQL and GeoPysparkCore.

* Delete ShowCase Notebook.ipynb

* Update GeoSparkCore example notebook.

* Update code for DataBricks platform support.

* Add support for collect SpatialPartitionedRDD.

* Add persist possibility to indexedRDD.

* Add support for serializing rawSpatialRDD.

* Update wheel file for geo_pyspark version 0.3.0.
jiayuasu pushed a commit that referenced this issue Feb 3, 2020

jiayuasu pushed a commit that referenced this issue Feb 3, 2020
jiayuasu pushed a commit that referenced this issue Feb 13, 2020

* Add geo-pyspark on PyPi.

* Change name of the package from geo_pyspark to geospark.

* Add CI script for Python.

* Update documentation for geospark python.

* Update CI script, removing the -DskipTests attribute.

Bring back mvn clean install instead of mvn -q clean install -DskipTests, which was used to speed up tests.

* Fix issue with CI script.

The missing -q flag was causing too much verbosity.

* Fix issue with the amount of testing time.

Remove testing Spark 2.3 with Python; there are now tests only for Python 3.7 and Spark 2.4.

* Update jar files for previous GeoSpark SQL releases.

The update was caused by the package name change.
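
The package rename above also changes the Python imports; a minimal before/after sketch:

# before the rename (geo_pyspark 0.3.0 and earlier)
from geo_pyspark.register import GeoSparkRegistrator

# after the rename
from geospark.register import GeoSparkRegistrator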
jiayuasu pushed a commit that referenced this issue Feb 17, 2020

jiayuasu pushed a commit that referenced this issue Feb 17, 2020
Butterflyer043 commented

Hi, did you solve the problem? I have the same issue.

jiayuasu pushed a commit that referenced this issue May 13, 2020

* [New version release] Set GeoSpark version to 1.3.1

* Add functions object for GeoSpark functions.

* Replaced GeometrySerializer with the WKB API instead of ShapeSerde, which contains bugs (added a test case with a buggy multipolygon).

* Fixed a test that had passed by mistake before the WKB update (the intersection of non-intersecting polygons returned a multipolygon, which makes no sense).

* Change deserialization methodology to WKB.

* Update osgeo repo to use the new repository.

* Removed unused test "test serializing with user Data".

* Removed a duplicate test case, the failing "Passed St_GeomFromWKT" test, and the unused testWkb file.

* Removed the unused jar-copy step in .travis.yml.

* Remove old dependencies from travis script.

* Remove temporary files.

Co-authored-by: Pawel <pawel93kocinski@gmail.com>
jornfranke commented

I recommend putting the jar geospark_2.11-1.3.1.jar somewhere on HDFS on the cluster. The following example assumes that it is in the folder /jars on HDFS (you can put it anywhere on HDFS and adapt the URL below accordingly). Please replace "myhdfshost" with the HDFS URL of your cluster in the following fragment:


from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator
from geospark.utils.adapter import Adapter
from geospark.core.spatialOperator import KNNQuery
from geospark.core.formatMapper import GeoJsonReader
from geospark.utils.serde import GeoSparkKryoRegistrator, KryoSerializer
from shapely.geometry import Point

# Load geospark library
spark = SparkSession\
    .builder\
    .config("spark.jars", "hdfs://myhdfshost/jars/geospark_2.11-1.3.1.jar")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "3g")\
    .config("spark.serializer", KryoSerializer.getName)\
    .config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName)\
    .config("spark.kryoserializer.buffer.max", "1g")\
    .appName("Geospark-spatialjoin-PySpark")\
    .getOrCreate()
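
The fragment above creates the session but never registers the GeoSpark types; as in the original report, a registerAll call is still needed before spatial SQL functions resolve. A minimal follow-up sketch (ST_Point is a standard GeoSpark SQL constructor):

# Register GeoSpark SQL types and functions on the session
GeoSparkRegistrator.registerAll(spark)

# Quick smoke test: a spatial function should now resolve
spark.sql("SELECT ST_Point(1.0, 2.0)").show()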
jiayuasu pushed a commit that referenced this issue Jun 6, 2020

* Add support for partition number in spatialPartitioning.