Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-1] Move to jts 1.18, minimize the dependency on JTSplus #488

Merged
merged 26 commits into from Dec 3, 2020
Merged

[SEDONA-1] Move to jts 1.18, minimize the dependency on JTSplus #488

merged 26 commits into from Dec 3, 2020

Conversation

jiayuasu
Copy link
Member

@jiayuasu jiayuasu commented Nov 16, 2020

Is this PR related to a proposed Issue?

https://issues.apache.org/jira/browse/SEDONA-1

This PR is relevant to two JTS PRs made by me:

  1. Add the check of userData in equals(object o): Add the check of userData in equals(object o) locationtech/jts#633

  2. Change the access modifiers of tree indexes and add setter/getters: Support read and initialize internal structure of STRtree and Quadtree locationtech/jts#634

The acceptance of this PR will happen when both JTS PRs are accepted by JTS committers.

What changes were proposed in this PR?

Minimize the dependency on JTSplus. Use a GeomUtils wrapper to process some additional string functions. Remove some unnecessary null checks in the test.

How was this patch tested?

All Scala and Java tests have passed. Python tests failed due to the lack of null check on UserData

Did this PR include necessary documentation updates?

@jiayuasu jiayuasu added this to the sedona-1.0.0 milestone Nov 16, 2020
@jiayuasu
Copy link
Member Author

Note that, this PR depends on a JTS 1.16.2 fork: https://github.com/jiayuasu/jts

  1. Although we won't be able to use JTS 1.17 official release (due to the issues among JTStoGeoJSON dependency, GeoTools dependency and JTS) even if both two JTS PRs get accepted, this at least can help us to upgrade JTS 1.17 in the future without any API changes.

  2. If the first JTS PR Add the check of userData in equals(object o) locationtech/jts#633 is declined by JTS committers, we have to remove SpatialJoinQuery <Geom, HashSet> function or use a so-called custom HashStrategy. Because Java built-in HashSet has to Geom.equals(object) to compute the HashCode.

@Imbruced Could you please fix the UserData errors in Python accordingly? I have invited you to be the collaborator on my Sedona fork.

@Imbruced
Copy link
Member

@jiayuasu I was able to fix that, need to add some additional tests and tomorrow I create PR.

…ethods except Quad-Tree and KDB-Tree, remove all HashSet dependency

Signed-off-by: Jia Yu <jiayu198910@gmail.com>
@jiayuasu
Copy link
Member Author

Status

This PR is relevant to two JTS PRs made by me:

  1. Add the check of userData in equals(object o): Add the check of userData in equals(object o) locationtech/jts#633 : Check UserData in Geometry is totally removed now

  2. Change the access modifiers of tree indexes and add setter/getters: Support read and initialize internal structure of STRtree and Quadtree locationtech/jts#634

New changes

To complete eliminate HashSet in Sedona, I have removed the unnecessary spatial partitioning methods: Equal grids, R-Tree, Voronoi, and Hilbert curve. This is because these partitioning methods (except EqualGrid) cannot use the advanced "Reference Point" technique to remove duplicates and have to leverage "HashSet" and Geometry "equals" to eliminate duplicates.

Based on our experiments before, these partitoining methods are slower than Quad-Tree and KDB-Tree partitioning. So we can remove them anyway.

@jnh5y
Copy link

jnh5y commented Nov 20, 2020

@jiayuasu related to the partitioning, is it possible configure a pre-determined grid to use when partitioning?

@jiayuasu
Copy link
Member Author

Hi all @Imbruced @netanel246 @Sarwat

I managed to fix all the issues. In a nutshell:

Changes on JTS side

This PR is relevant to two JTS PRs made by me:

Changes on Sedona side

  1. The following spatial partitioning methods, Equal grids, R-Tree, Voronoi, and Hilbert curve, have been removed because they can no longer yield correct results due to the change in JTS. Only Quad-Tree and KDB-Tree are kept.

  2. JoinQuery.SpatialJoinQuery/DistanceJoinQuery now returns <Geometry, List> instead of <Geometry, HashSet> because we can no longer use HashSet in Sedona for duplicates removal.

  3. The duplicates preserving strategy in JoinQuery.SpatialJoinQuery/DistanceJoinQuery is changed.

    1. After this PR, duplicate geometries present in the input queryWindowRDD, regardless of their non-spatial attributes, will not be reflected in the join results. Duplicate geometries present in the input spatialRDD, regardless of their non-spatial attributes, will be reflected in the join results.
    2. Before this PR, duplicate geometries present in the input queryWindowRDD, if their non-spatial attributes are also same, will be reflected in the final result. Duplicate geometries present in the input SpatialRDD (if their non-spatial attributes are also same), will not be reflected in the final result.
  4. Now use Sedona Core GeomUtils to do Geometry print and equalityCheck. Adapter.scala in Sedona SQL has been updated accordingly but no API change needed.

  5. JoinQuery.SpatialJoinQueryFlat/DistanceJoinQueryPlat still have the same signature. No change in terms of API and behavior. Duplicate geometries present in both input datasets will be reflected in the output. The query correctness is still guaranteed.

  6. IndexSerializer of JTS Quad-Tree and R-Tree are now under org.locationtech.jts.index.quadtree and strtree. The purpose is to leverage "package private" to minimize the Sedona changes to JTS core.

  7. JTS version has been upgrade to JTS 1.18.0-SNAPSHOT is to get ready for the next JTS release which includes my JTS PR.

  8. GeoTools upgraded to 24.0 and Jts2GeoJSON upgraded to 1.4.3. Jts2GeoJSON 1.4.3 (the latest version) does not support JTS 1.17+. To overcome this, I copied "GeoJSONWRITER" from Jts2GeoJSON to Sedona "GeoJSONWRITERNew". This fixed the version conflict but we may need to list Jts2GeoJSON's MIT license somewhere in Sedona.

To-Dos

  1. Wait for JTS to accept the PR and release JTS 1.18 on Maven Central. Then I will change the JTS dependency in Sedona to its Maven coordinate. I expect this to be done in the coming days.
  2. @Imbruced Can you please fix the Python APIs based on the aforementioned Change 1, 2, 3, 4. I am sure that Python APIs cannot compile now. And some test cases may fail because 1 and 3. For Change 6,7,8, I am not sure whether Python APIs should be updated. Please check.

@jiayuasu jiayuasu changed the title [SEDONA-1] Move to jts 1.16.1, minimize the dependency on JTSplus [SEDONA-1] Move to jts 1.18, minimize the dependency on JTSplus Nov 23, 2020
@Imbruced
Copy link
Member

@jiayuasu I am working on that, should be ready within 1-2 days.

@jiayuasu
Copy link
Member Author

To-Dos

  1. Wait for JTS to accept the PR and release JTS 1.18 on Maven Central. Then I will change the JTS dependency in Sedona to its Maven coordinate. I expect this to be done in the coming days.

This PR (locationtech/jts#634) has been accepted by JTS. Now waiting for JTS committers to publish JTS 1.18 to Maven Central.

  1. @Imbruced Can you please fix the Python APIs based on the aforementioned Change 1, 2, 3, 4. I am sure that Python APIs cannot compile now. And some test cases may fail because 1 and 3. For Change 6,7,8, I am not sure whether Python APIs should be updated. Please check.

@Imbruced
Copy link
Member

Ad 2. The duplicates preserving strategy in JoinQuery.SpatialJoinQuery/DistanceJoinQuery is changed.
Don't you think that will be the most common question from the user perspective ? Actually we force users to pre aggregate the data to get correct result and then be able to join with given input id. In many cases users will have to find some workarounds to not lose their data. Join based on wkt after getting spatial result may be also confusing. WDYT ?

@jiayuasu
Copy link
Member Author

Ad 2. The duplicates preserving strategy in JoinQuery.SpatialJoinQuery/DistanceJoinQuery is changed.
Don't you think that will be the most common question from the user perspective ? Actually we force users to pre aggregate the data to get correct result and then be able to join with given input id. In many cases users will have to find some workarounds to not lose their data. Join based on wkt after getting spatial result may be also confusing. WDYT ?

@Imbruced You made a good point. I think I can change the logic in JoinQuery.SpatialJoinQuery/DistanceJoinQuery so this query will not change any original duplicates in the raw data. The users don't need to develop any workaround. Let me try it out today.

@jiayuasu
Copy link
Member Author

@Imbruced I have fixed this issue. So now SpatialJoinQuery/DistanceJoinQuery will preserve all duplicates that originally present in both input datasets. This is even better than the code we had prior to this PR.

@Imbruced
Copy link
Member

@jiayuasu do we need countWithoutDuplicates method within SpatialRDD objects ? It is inconsistent with spatial joins geometry equality checking. Also it is not suitable for huge rdd. I can see usege of collect method and hash set instance within this function. Inteliji hints that there is no usage of this method within the code.

@Imbruced
Copy link
Member

Imbruced commented Nov 30, 2020

@jiayuasu Python version is fixed. I think there is no required docs changes. I start to work on Adapter speed up within Python version.

@jiayuasu
Copy link
Member Author

jiayuasu commented Nov 30, 2020

@Imbruced countWithoutDuplicates code has been completely removed. Because, in this new PR, there is no way to support both count with / without dup. We only have the following public APIs for the user:

  1. SpatialJoinQuery/DistanceJoinQuery
  2. SpatialJoinQueryFlat/DistanceJoinQueryFlat

In all cases, duplicates introduced by spatial partitioning will be automatically removed, duplicates in the original data will be automatically kept in the final result.

@jiayuasu
Copy link
Member Author

jiayuasu commented Dec 2, 2020

@Imbruced The Python API failed the test cases...

@Imbruced
Copy link
Member

Imbruced commented Dec 2, 2020

@jiayuasu To me it looks like issue with travis itself. On my local machine all tests passed.

@jiayuasu
Copy link
Member Author

jiayuasu commented Dec 2, 2020

@Imbruced The timeout always happens at python/core/test_core_spatial_relations.py. The join query particularly calls the the PythoAdapter.translateSpatialPairRDDWithListToPython

It looks like, due to the changes in this PR, SpatialJoinQuery may return more results. The PythonAdapter becomes much slower on the Travis-CI VM. Eventually the test environment will crash. Do you have any idea about this?

@Imbruced
Copy link
Member

Imbruced commented Dec 2, 2020

@jiayuasu Within the tests we are using default spark configuration (Python). Changing hashset to list increased number of instances needed to serialize from jvm to Python. Also within this test I am using huge files. Probably we cross some threshold of memory usage. I will fix that, also I almost finished no SerDe jvm->python conversion functionality (rdd join result to df or saving to file). I think what we need within this test is to increase number of partitions.

@Imbruced
Copy link
Member

Imbruced commented Dec 2, 2020

Also I will make how hashset -> list impacted on memory consumption while SerDE.

@Imbruced
Copy link
Member

Imbruced commented Dec 2, 2020

My build getting stuck (it is queued more than 2 hours). Only one thing which should be changed
within python/tests/core/test_core_spatial_relations.py

  point_rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=10)

@jiayuasu
Copy link
Member Author

jiayuasu commented Dec 3, 2020

My build getting stuck (it is queued more than 2 hours). Only one thing which should be changed
within python/tests/core/test_core_spatial_relations.py

  point_rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=10)

@Imbruced I didn't get it. Do you want me to change it or you will change it?

@Imbruced
Copy link
Member

Imbruced commented Dec 3, 2020

@jiayuasu I was sure that my build get stuck forever, but somehow it manage to ran. Build is no longer failing with this change. I created PR to your fork.

…8-repartition

Fix the out of memory test failure.
@jiayuasu jiayuasu merged commit 4800676 into apache:master Dec 3, 2020
@jiayuasu jiayuasu deleted the improve-jts-dependency branch December 10, 2020 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants