[SEDONA-1] Move to jts 1.18, minimize the dependency on JTSplus #488
Conversation
Note that this PR depends on a JTS 1.16.2 fork: https://github.com/jiayuasu/jts
@Imbruced Could you please fix the UserData errors in Python accordingly? I have invited you to be a collaborator on my Sedona fork.
@jiayuasu I was able to fix that; I need to add some additional tests and will create a PR tomorrow.
…ethods except Quad-Tree and KDB-Tree, remove all HashSet dependency Signed-off-by: Jia Yu <jiayu198910@gmail.com>
Status
This PR is relevant to two JTS PRs made by me:
New changes
To completely eliminate HashSet in Sedona, I have removed the unnecessary spatial partitioning methods: Equal grids, R-Tree, Voronoi, and Hilbert curve. These partitioning methods (except EqualGrid) cannot use the advanced "Reference Point" technique to remove duplicates and instead have to rely on a HashSet and Geometry "equals" to eliminate duplicates. Based on our earlier experiments, these partitioning methods are slower than Quad-Tree and KDB-Tree partitioning, so we can remove them anyway.
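As a rough illustration of the "Reference Point" technique mentioned above (a simplified sketch with made-up helper functions and box-shaped geometries, not Sedona's actual implementation): a candidate pair that overlaps several partitions is emitted only by the partition containing a canonical corner of the pair's intersection, so no HashSet is needed to remove cross-partition duplicates.

```python
# Simplified sketch of reference-point deduplication (hypothetical helpers,
# not Sedona's code). Boxes and cells are (minx, miny, maxx, maxy) tuples.

def intersects(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains_point(cell, x, y):
    # half-open bounds, so a point on a shared border belongs to one cell only
    return cell[0] <= x < cell[2] and cell[1] <= y < cell[3]

def reference_point(a, b):
    # a canonical corner (min-x, min-y) of the intersection of the two boxes
    return (max(a[0], b[0]), max(a[1], b[1]))

def dedup_join(boxes_a, boxes_b, grid):
    results = []
    for cell in grid:                       # each cell acts as one partition
        for a in boxes_a:
            for b in boxes_b:
                if intersects(a, cell) and intersects(b, cell) and intersects(a, b):
                    x, y = reference_point(a, b)
                    if contains_point(cell, x, y):   # emit from exactly one cell
                        results.append((a, b))
    return results

grid = [(0, 0, 5, 10), (5, 0, 10, 10)]      # two partitions side by side
pairs = dedup_join([(4, 4, 6, 6)], [(4, 4, 6, 6)], grid)
print(len(pairs))   # 1: the pair overlaps both cells but is emitted once
```

A partitioning scheme can only use this trick if each geometry's cell membership is deterministic and the cells tile the space without overlap, which is why the removed methods fell back to HashSet-based deduplication.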
@jiayuasu Related to the partitioning: is it possible to configure a pre-determined grid to use when partitioning?
Signed-off-by: Jia Yu <jiayu198910@gmail.com>
Hi all @Imbruced @netanel246 @Sarwat, I managed to fix all the issues. In a nutshell:
Changes on JTS side
This PR is relevant to two JTS PRs made by me:
Changes on Sedona side
To-Dos
@jiayuasu I am working on that; it should be ready within 1-2 days.
This PR (locationtech/jts#634) has been accepted by JTS. Now waiting for JTS committers to publish JTS 1.18 to Maven Central.
Ad 2. The duplicate-preserving strategy in JoinQuery.SpatialJoinQuery/DistanceJoinQuery has been changed.
@Imbruced You made a good point. I think I can change the logic in JoinQuery.SpatialJoinQuery/DistanceJoinQuery so this query will not change any original duplicates in the raw data. Then users won't need to develop any workaround. Let me try it out today.
@Imbruced I have fixed this issue. SpatialJoinQuery/DistanceJoinQuery will now preserve all duplicates that were originally present in both input datasets. This is even better than the code we had prior to this PR.
@jiayuasu Do we need the countWithoutDuplicates method on SpatialRDD objects? It is inconsistent with the geometry equality checking used by spatial joins, and it is not suitable for huge RDDs: I can see a collect call and a HashSet instance within this function. IntelliJ also hints that this method is not used anywhere in the code.
@jiayuasu The Python version is fixed. I think no docs changes are required. I am starting to work on the Adapter speed-up in the Python version.
@Imbruced The countWithoutDuplicates code has been completely removed, because in this new PR there is no way to support counting both with and without duplicates. We only have the following public APIs for the user:
In all cases, duplicates introduced by spatial partitioning will be automatically removed, and duplicates in the original data will be automatically kept in the final result.
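The behavioral difference between a HashSet-backed result and a list-backed one can be shown in a few lines of Python (the data and query window here are made up for illustration; this is not Sedona's code):

```python
# A set collapses records that compare equal, while a list keeps every
# original duplicate. Hypothetical range-query example with tuple points.
points = [(1, 2), (1, 2), (3, 4)]   # input contains a genuine duplicate
window = (0, 0, 5, 5)               # query window (minx, miny, maxx, maxy)

def in_window(p, w):
    return w[0] <= p[0] <= w[2] and w[1] <= p[1] <= w[3]

as_set = {p for p in points if in_window(p, window)}
as_list = [p for p in points if in_window(p, window)]

print(len(as_set))   # 2: the duplicate point is lost
print(len(as_list))  # 3: duplicates in the raw data are preserved
```

This is the essence of why switching the join internals away from HashSet lets original duplicates survive into the final result.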
@Imbruced The Python API failed the test cases...
@jiayuasu To me it looks like an issue with Travis itself. On my local machine all tests pass.
@Imbruced The timeout always happens at python/core/test_core_spatial_relations.py. The join query in particular calls PythonAdapter.translateSpatialPairRDDWithListToPython. It looks like, due to the changes in this PR, SpatialJoinQuery may return more results, and the PythonAdapter becomes much slower on the Travis CI VM. Eventually the test environment crashes. Do you have any idea about this?
@jiayuasu Within the tests we are using the default Spark configuration (Python). Changing HashSet to a list increased the number of instances that need to be serialized from the JVM to Python. Also, this test uses huge files, so we probably crossed some memory-usage threshold. I will fix that. I have also almost finished the no-SerDe JVM-to-Python conversion functionality (RDD join result to DataFrame or saving to file). I think what we need within this test is to increase the number of partitions.
I will also measure how the HashSet -> list change impacted memory consumption during SerDe.
My build is getting stuck (it has been queued for more than 2 hours). The only thing that should be changed is point_rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=10)
@Imbruced I didn't get it. Do you want me to change it, or will you change it?
@jiayuasu I was sure my build was stuck forever, but somehow it managed to run. The build is no longer failing with this change. I created a PR to your fork.
…8-repartition Fix the out-of-memory test failure.
Is this PR related to a proposed Issue?
https://issues.apache.org/jira/browse/SEDONA-1
This PR is relevant to two JTS PRs made by me:
Add the check of userData in equals(object o): Add the check of userData in equals(object o) locationtech/jts#633
Change the access modifiers of tree indexes and add setter/getters: Support read and initialize internal structure of STRtree and Quadtree locationtech/jts#634
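To illustrate why the first JTS change above matters, here is a minimal Python sketch (the Geom class and its fields are hypothetical, not the JTS API): when equality also compares an attached userData payload, hash-based collections no longer merge geometries that share coordinates but carry different attributes, which is exactly the situation a spatial join must preserve.

```python
# Hypothetical sketch of an equality check that also compares userData,
# in the spirit of locationtech/jts#633. Not actual JTS code.
class Geom:
    def __init__(self, coords, user_data=None):
        self.coords = tuple(coords)     # geometry shape
        self.user_data = user_data      # attached attribute payload

    def __eq__(self, other):
        if not isinstance(other, Geom):
            return NotImplemented
        # coordinates AND userData must both match
        return self.coords == other.coords and self.user_data == other.user_data

    def __hash__(self):
        return hash((self.coords, self.user_data))

a = Geom([(0, 0), (1, 1)], user_data="row-1")
b = Geom([(0, 0), (1, 1)], user_data="row-2")
print(a == b)        # False: same shape, different userData
print(len({a, b}))   # 2: both records survive in a hash-based collection
```

Without the userData comparison, the two records above would compare equal and one would silently disappear from any set-based deduplication step.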
This PR can be accepted once both JTS PRs are accepted by the JTS committers.
What changes were proposed in this PR?
Minimize the dependency on JTSplus. Use a GeomUtils wrapper to handle some additional string functions. Remove some unnecessary null checks in the tests.
How was this patch tested?
All Scala and Java tests have passed. Python tests failed due to the lack of a null check on UserData.
Did this PR include necessary documentation updates?