
Need help with pyspark #408

Closed

ananbas opened this issue Dec 26, 2019 · 6 comments

Comments

ananbas commented Dec 26, 2019

Expected behavior

GeoSparkRegistrator.registerAll(spark) returns 0

Actual behavior

Got this error:

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, bddevdn0001, executor 4): java.lang.IllegalStateException: unread block data

Steps to reproduce the problem

This is my code:

import findspark
findspark.init('/opt/cloudera/parcels/CDH/lib/spark')

import pyspark
import geo_pyspark
from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator

spark = SparkSession \
    .builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.jars", "local:/home/anung/geospark/geo_wrapper.jar,local:/home/anung/geospark/geospark-1.2.0.jar,local:/home/anung/geospark/geospark-sql_2.3-1.2.0.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator") \
    .appName("anung-geospark-test") \
    .enableHiveSupport() \
    .getOrCreate()

GeoSparkRegistrator.registerAll(spark)

Settings

GeoSpark version = 1.2

Apache Spark version = 2.4.0-cdh6.2.1

JRE version = 1.8

API type = Python
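
A quick sketch to confirm these versions from inside the session (spark is the session created above; the _jvm gateway is a PySpark internal, used here only as a sanity check):

import sys

print(spark.version)  # expect 2.4.0-cdh6.2.1
print(sys.version)    # Python interpreter used by the driver
# JRE version seen on the JVM side (expect 1.8.x)
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))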

Imbruced (Member) commented

Can you check a few things?

  • Does the GeoSparkSQL Scala/Java API produce the same error?
  • Do you have the JAVA_HOME environment variable set?
  • Does any action on a Spark DataFrame work correctly without GeoSpark? (See the sketch after this list.)
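
A minimal sketch of the third check, using a plain Spark session with no GeoSpark jars or Kryo registrator configured (the app name is a placeholder):

from pyspark.sql import SparkSession

# Plain session: no GeoSpark jars, no Kryo registrator
spark = SparkSession.builder \
    .appName("plain-spark-sanity-check") \
    .getOrCreate()

# Any action that ships tasks to the executors will do; if this also
# fails with "unread block data", the problem is in the cluster setup
# rather than in GeoSpark.
spark.range(100).selectExpr("sum(id) AS total").show()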

ananbas (Author) commented Dec 29, 2019

Hi Imbruced,

  1. Using the Scala API there is no error; everything works and the example project runs.
  2. Yes, of course.
  3. Yes, we use pyspark for daily ETL.

Imbruced (Member) commented

Hi,
To me it looks like a dependency issue. The library was tested with native Spark (2.2, 2.3, 2.4) for Hadoop distribution 2.7 and Scala 2.11. I will try to reproduce the problem on a cdh6.2.1 environment. I have two possible solutions and need to test them (I need 2-3 days).

Imbruced (Member) commented Jan 2, 2020

Hi,
I was not able to reproduce your issue. I added changes which may solve it to my forked version:
https://github.com/Imbruced/GeoSpark/tree/master/python. Please let me know if it works for you.
The geo_wrapper jar is no longer needed; please remove it from the spark.jars option. The adjusted configuration is sketched below.
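
A minimal sketch of the adjusted builder from the original report, with geo_wrapper.jar dropped from spark.jars (the remaining jar paths are the ones from the report and are assumed unchanged):

from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator

# Same configuration as in the report, minus geo_wrapper.jar
spark = SparkSession \
    .builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.jars", "local:/home/anung/geospark/geospark-1.2.0.jar,local:/home/anung/geospark/geospark-sql_2.3-1.2.0.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator") \
    .appName("anung-geospark-test") \
    .enableHiveSupport() \
    .getOrCreate()

GeoSparkRegistrator.registerAll(spark)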

jiayuasu pushed a commit that referenced this issue Feb 3, 2020
* Fix Issue, unread block data (#408)

* Add GeoSpark core Python API, version beta.

* Fix issue with additional else statement.

* Add WkbReader to direct imports, Fix issue with version tests.

* Add geo_pyspark version 0.3.0.

* Add geo_pyspark version 0.3.0.

* Update wheel file for geo_pyspark version 0.3.0.

* Improve serialization process for GeoSpark Python.

* Fix Issue with Adapter import.

* Create example notebook for GeoPysparkSQL and GeoPysparkCore.

* Delete ShowCase Notebook.ipynb

* Update GeoSparkCore example notebook.

* Update code for DataBricks platform support.

* Add support for collect SpatialPartitionedRDD.

* Add persist possibility to indexedRDD.

* Add support for serializing rawSpatialRDD.

* Update wheel file for geo_pyspark version 0.3.0.
jiayuasu pushed a commit that referenced this issue Feb 3, 2020

jiayuasu pushed a commit that referenced this issue Feb 3, 2020
jiayuasu pushed a commit that referenced this issue Feb 13, 2020

* Add geo-pyspark on PyPi.

* Change name of the package from geo_pyspark to geospark.

* Add CI script for Python.

* Update documentation for geospark python.

* Update CI script, removing the -DskipTests attribute.

Bring back mvn clean install instead of mvn -q clean install -DskipTests, which was used to speed up tests.

* Fix issue with CI script.

The missing -q flag was causing too much verbosity.

* Fix issue with the amount of testing time.

Remove testing Spark 2.3 with Python; there are now tests only for Python 3.7 and Spark 2.4.

* Update jar files for previous GeoSpark SQL releases.

The update was caused by the package name change.
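
The package rename above also changes the Python imports; a minimal before/after sketch:

# before the rename (geo_pyspark 0.3.0 and earlier)
from geo_pyspark.register import GeoSparkRegistrator

# after the rename
from geospark.register import GeoSparkRegistrator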
jiayuasu pushed a commit that referenced this issue Feb 17, 2020

jiayuasu pushed a commit that referenced this issue Feb 17, 2020
Butterflyer043 commented

Hi, did you solve the problem? I have the same issue.

jiayuasu pushed a commit that referenced this issue May 13, 2020

* [New version release] Set GeoSpark version to 1.3.1

* Add functions object for GeoSpark functions.

* Replaced GeometrySerializer with the WKB API instead of ShapeSerde, which contains bugs (added a test case with a buggy multipolygon).

* Fixed a test that had passed by mistake before the WKB update (the intersection of non-intersecting polygons returned a multipolygon, which makes no sense).

* Change deserialization methodology to WKB.

* Update osgeo repo to use the new repository.

* Removed unused test "test serializing with user Data".

* Removed a duplicate test case, the failing "Passed St_GeomFromWKT" test, and the unused testWkb file.

* Removed the unused jar-copy step in .travis.yml.

* Remove old dependencies from travis script.

* Remove temporary files.

Co-authored-by: Pawel <pawel93kocinski@gmail.com>
jornfranke commented

I recommend putting the jar geospark_2.11-1.3.1.jar somewhere on HDFS on the cluster. The following example assumes that it is in the folder /jars on HDFS (you can put it anywhere on HDFS and adapt the URL below accordingly). Please replace "myhdfshost" with the HDFS URL of your cluster in the following fragment:


from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator
from geospark.utils.adapter import Adapter
from geospark.core.spatialOperator import KNNQuery
from geospark.core.formatMapper import GeoJsonReader
from geospark.utils.serde import GeoSparkKryoRegistrator, KryoSerializer
from shapely.geometry import Point

# Load geospark library
spark = SparkSession\
    .builder\
    .config("spark.jars", "hdfs://myhdfshost/jars/geospark_2.11-1.3.1.jar")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "3g")\
    .config("spark.serializer", KryoSerializer.getName)\
    .config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName)\
    .config("spark.kryoserializer.buffer.max", "1g")\
    .appName("Geospark-spatialjoin-PySpark")\
    .getOrCreate()
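
The fragment above creates the session but never registers the GeoSpark types; as in the original report, a registerAll call is still needed before spatial SQL functions resolve. A minimal follow-up sketch (ST_Point is a standard GeoSpark SQL constructor):

# Register GeoSpark SQL types and functions on the session
GeoSparkRegistrator.registerAll(spark)

# Quick smoke test: a spatial function should now resolve
spark.sql("SELECT ST_Point(1.0, 2.0)").show()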
jiayuasu pushed a commit that referenced this issue Jun 6, 2020

* Add support for partition number in spatialPartitioning.