[SEDONA-1] Move to jts 1.18, minimize the dependency on JTSplus #488
Conversation
Note that this PR depends on a JTS 1.16.2 fork: https://github.com/jiayuasu/jts
@Imbruced Could you please fix the UserData errors in Python accordingly? I have invited you to be a collaborator on my Sedona fork.
@jiayuasu I was able to fix that; I need to add some additional tests and will create a PR tomorrow.
…ethods except Quad-Tree and KDB-Tree, remove all HashSet dependency Signed-off-by: Jia Yu <jiayu198910@gmail.com>
Status
This PR is relevant to two JTS PRs made by me:
New changes
To completely eliminate HashSet in Sedona, I have removed the unnecessary spatial partitioning methods: Equal grids, R-Tree, Voronoi, and Hilbert curve. These partitioning methods (except EqualGrid) cannot use the advanced "Reference Point" technique to remove duplicates and instead have to rely on a HashSet and Geometry "equals" to eliminate duplicates. Based on our earlier experiments, these partitioning methods are slower than Quad-Tree and KDB-Tree partitioning, so we can remove them anyway.
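As a rough illustration of the "Reference Point" technique mentioned above (a simplified sketch with made-up helper functions and box-shaped geometries, not Sedona's actual implementation): a candidate pair that overlaps several partitions is emitted only by the partition containing a canonical corner of the pair's intersection, so no HashSet is needed to remove cross-partition duplicates.

```python
# Simplified sketch of reference-point deduplication (hypothetical helpers,
# not Sedona's code). Boxes and cells are (minx, miny, maxx, maxy) tuples.

def intersects(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains_point(cell, x, y):
    # half-open bounds, so a point on a shared border belongs to one cell only
    return cell[0] <= x < cell[2] and cell[1] <= y < cell[3]

def reference_point(a, b):
    # a canonical corner (min-x, min-y) of the intersection of the two boxes
    return (max(a[0], b[0]), max(a[1], b[1]))

def dedup_join(boxes_a, boxes_b, grid):
    results = []
    for cell in grid:                       # each cell acts as one partition
        for a in boxes_a:
            for b in boxes_b:
                if intersects(a, cell) and intersects(b, cell) and intersects(a, b):
                    x, y = reference_point(a, b)
                    if contains_point(cell, x, y):   # emit from exactly one cell
                        results.append((a, b))
    return results

grid = [(0, 0, 5, 10), (5, 0, 10, 10)]      # two partitions side by side
pairs = dedup_join([(4, 4, 6, 6)], [(4, 4, 6, 6)], grid)
print(len(pairs))   # 1: the pair overlaps both cells but is emitted once
```

A partitioning scheme can only use this trick if each geometry's cell membership is deterministic and the cells tile the space without overlap, which is why the removed methods fell back to HashSet-based deduplication.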
@jiayuasu Related to the partitioning: is it possible to configure a pre-determined grid to use when partitioning?
Signed-off-by: Jia Yu <jiayu198910@gmail.com>
Hi all @Imbruced @netanel246 @Sarwat, I managed to fix all the issues. In a nutshell:
Changes on JTS side
This PR is relevant to two JTS PRs made by me:
Changes on Sedona side
To-Dos
@jiayuasu I am working on that; it should be ready within 1-2 days.
This PR (locationtech/jts#634) has been accepted by JTS. Now waiting for JTS committers to publish JTS 1.18 to Maven Central.
Ad 2. The duplicate-preserving strategy in JoinQuery.SpatialJoinQuery/DistanceJoinQuery has been changed.
@Imbruced You made a good point. I think I can change the logic in JoinQuery.SpatialJoinQuery/DistanceJoinQuery so this query will not change any original duplicates in the raw data. Then users won't need to develop any workaround. Let me try it out today.
@Imbruced I have fixed this issue. SpatialJoinQuery/DistanceJoinQuery will now preserve all duplicates that were originally present in both input datasets. This is even better than the code we had prior to this PR.
@jiayuasu Do we need the countWithoutDuplicates method on SpatialRDD objects? It is inconsistent with the geometry equality checking used by spatial joins, and it is not suitable for huge RDDs: I can see a collect call and a HashSet instance within this function. IntelliJ also hints that this method is not used anywhere in the code.
@jiayuasu The Python version is fixed. I think no docs changes are required. I am starting to work on the Adapter speed-up in the Python version.
@Imbruced The countWithoutDuplicates code has been completely removed, because in this new PR there is no way to support counting both with and without duplicates. We only have the following public APIs for the user:
In all cases, duplicates introduced by spatial partitioning will be automatically removed, and duplicates in the original data will be automatically kept in the final result.
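The behavioral difference between a HashSet-backed result and a list-backed one can be shown in a few lines of Python (the data and query window here are made up for illustration; this is not Sedona's code):

```python
# A set collapses records that compare equal, while a list keeps every
# original duplicate. Hypothetical range-query example with tuple points.
points = [(1, 2), (1, 2), (3, 4)]   # input contains a genuine duplicate
window = (0, 0, 5, 5)               # query window (minx, miny, maxx, maxy)

def in_window(p, w):
    return w[0] <= p[0] <= w[2] and w[1] <= p[1] <= w[3]

as_set = {p for p in points if in_window(p, window)}
as_list = [p for p in points if in_window(p, window)]

print(len(as_set))   # 2: the duplicate point is lost
print(len(as_list))  # 3: duplicates in the raw data are preserved
```

This is the essence of why switching the join internals away from HashSet lets original duplicates survive into the final result.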
@Imbruced The Python API failed the test cases...
@jiayuasu To me it looks like an issue with Travis itself. On my local machine all tests pass.
@Imbruced The timeout always happens at python/core/test_core_spatial_relations.py. The join query in particular calls PythonAdapter.translateSpatialPairRDDWithListToPython. It looks like, due to the changes in this PR, SpatialJoinQuery may return more results, and the PythonAdapter becomes much slower on the Travis CI VM. Eventually the test environment crashes. Do you have any idea about this?
@jiayuasu Within the tests we are using the default Spark configuration (Python). Changing HashSet to a list increased the number of instances that need to be serialized from the JVM to Python. Also, this test uses huge files, so we probably crossed some memory-usage threshold. I will fix that. I have also almost finished the no-SerDe JVM-to-Python conversion functionality (RDD join result to DataFrame or saving to file). I think what we need within this test is to increase the number of partitions.
I will also measure how the HashSet -> list change impacted memory consumption during SerDe.
My build is getting stuck (it has been queued for more than 2 hours). The only thing that should be changed is point_rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=10)
@Imbruced I didn't get it. Do you want me to change it, or will you change it?
@jiayuasu I was sure my build was stuck forever, but somehow it managed to run. The build is no longer failing with this change. I created a PR to your fork.
…8-repartition Fix the out-of-memory test failure.
Is this PR related to a proposed Issue?
https://issues.apache.org/jira/browse/SEDONA-1
This PR is relevant to two JTS PRs made by me:
Add the check of userData in equals(object o): Add the check of userData in equals(object o) locationtech/jts#633
Change the access modifiers of tree indexes and add setter/getters: Support read and initialize internal structure of STRtree and Quadtree locationtech/jts#634
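To illustrate why the first JTS change above matters, here is a minimal Python sketch (the Geom class and its fields are hypothetical, not the JTS API): when equality also compares an attached userData payload, hash-based collections no longer merge geometries that share coordinates but carry different attributes, which is exactly the situation a spatial join must preserve.

```python
# Hypothetical sketch of an equality check that also compares userData,
# in the spirit of locationtech/jts#633. Not actual JTS code.
class Geom:
    def __init__(self, coords, user_data=None):
        self.coords = tuple(coords)     # geometry shape
        self.user_data = user_data      # attached attribute payload

    def __eq__(self, other):
        if not isinstance(other, Geom):
            return NotImplemented
        # coordinates AND userData must both match
        return self.coords == other.coords and self.user_data == other.user_data

    def __hash__(self):
        return hash((self.coords, self.user_data))

a = Geom([(0, 0), (1, 1)], user_data="row-1")
b = Geom([(0, 0), (1, 1)], user_data="row-2")
print(a == b)        # False: same shape, different userData
print(len({a, b}))   # 2: both records survive in a hash-based collection
```

Without the userData comparison, the two records above would compare equal and one would silently disappear from any set-based deduplication step.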
This PR can be accepted once both JTS PRs are accepted by the JTS committers.
What changes were proposed in this PR?
Minimize the dependency on JTSplus. Use a GeomUtils wrapper to handle some additional string functions. Remove some unnecessary null checks in the tests.
How was this patch tested?
All Scala and Java tests have passed. Python tests failed due to the lack of a null check on UserData.
Did this PR include necessary documentation updates?