Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-468] Add useSpheroid provision in ST_DWithin #1208

Merged
merged 32 commits into from
Jan 31, 2024

Conversation

iGN5117
Copy link
Contributor

@iGN5117 iGN5117 commented Jan 19, 2024

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

  • Add provision to useSpheroid parameter in ST_DWithin, which uses ST_DistanceSpheroid
  • Add optimized spatial and broadcast join support for the new variant

How was this patch tested?

  • Added new tests

Did this PR include necessary documentation updates?

 into develop_Nilesh_1.5.1

# Conflicts:
#	python/tests/sql/test_dataframe_api.py
#	spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala
#	spark/common/src/test/scala/org/apache/sedona/sql/dataFrameAPITestScala.scala
# Conflicts:
#	spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala
#	spark/common/src/test/scala/org/apache/sedona/sql/BroadcastIndexJoinSuite.scala
#	spark/common/src/test/scala/org/apache/sedona/sql/TestBaseScala.scala
#	spark/common/src/test/scala/org/apache/sedona/sql/predicateJoinTestScala.scala
@@ -132,6 +132,12 @@ class JoinQueryDetector(sparkSession: SparkSession) extends Strategy {
Some(JoinQueryDetection(left, right, leftShape, ST_Buffer(Seq(rightShape, distance)), SpatialPredicate.INTERSECTS, isGeography = false, condition, Some(extraCondition)))
case Some(And(extraCondition, ST_DWithin(Seq(leftShape, rightShape, distance)))) =>
Some(JoinQueryDetection(left, right, leftShape, ST_Buffer(Seq(rightShape, distance)), SpatialPredicate.INTERSECTS, isGeography = false, condition, Some(extraCondition)))
case Some(ST_DWithin(Seq(leftShape, rightShape, distance, useSpheroid))) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When useSpheroid is true , this should not use ST_Buffer because buffering a lon/lat geometry with meter distance does not make any sense and will lead to wrong result. When useSpheroid = true, we should make it equivalent to ST_DistanceSpheroid() < XXX.

I think we should totally get rid of ST_Buffer to avoid confusion. All ST_DWithin patterns, once get matched, should be translated to distance join, like what we did for ST_Distance and ST_DistanceSpheroid. Buffering a geometry with a spheroid distance is already taken care by the optimized distance join triggered ST_DistanceSpheroid() < XXX, which is quite complex.

	case Some(LessThanOrEqual(ST_Distance(Seq(leftShape, rightShape)), distance)) =>
	          Some(JoinQueryDetection(left, right, leftShape, rightShape, SpatialPredicate.INTERSECTS, false, condition, Some(distance)))
        // ST_DistanceSpheroid
        case Some(LessThanOrEqual(ST_DistanceSpheroid(Seq(leftShape, rightShape)), distance)) =>
          Some(JoinQueryDetection(left, right, leftShape, rightShape, SpatialPredicate.INTERSECTS, true, condition, Some(distance)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, done

@@ -60,9 +60,11 @@ true

## ST_DWithin

Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'. This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance.
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance' (in metres). This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change every Sphere to Spheroid in the doc and other code. In the geodesic distance world, Sphere distance means great circle distance while spheroid distance means geodesic distance. These are 2 different things.

In our case, we use spheroid not sphere.

In addition, it is worth noting that when we use spheroid distance, there is a current limitation in Sedona: we calculate the distance between the centroids of two geometries, not the shortest distance between two geometries. This tiny error usually does not matter much to end users because spheroid distances are mainly used for getting the distance between two objects across the globe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explanation should be

Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'.

If useSpheroid is passed true, ST_DWithin uses Sedona's ST_DistanceSpheroid to check the spheroid distance between the centroids of two geometries. The unit of the distance in this case is meter.

If useSpheroid is passed false, ST_DWithin uses Euclidean distance and the unit of the distance is the same as the CRS of the geometries. To obtain the correct result, please consider using ST_Transform to put data in an appropriate CRS.

If useSpheroid is not given, it defaults to false

@@ -60,9 +60,11 @@ true

## ST_DWithin

Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'. This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance.
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance' (in metres). This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explanation should be

Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'.

If useSpheroid is passed true, ST_DWithin uses Sedona's ST_DistanceSpheroid to check the spheroid distance between the centroids of two geometries. The unit of the distance in this case is meter.

If useSpheroid is passed false, ST_DWithin uses Euclidean distance and the unit of the distance is the same as the CRS of the geometries. To obtain the correct result, please consider using ST_Transform to put data in an appropriate CRS.

If useSpheroid is not given, it defaults to false


Format: `ST_DWithin (leftGeometry: Geometry, rightGeometry: Geometry, distance: Double)`
If `useSpheroid` is passed true, ST_DWithin uses Sedona's ST_DistanceSpheroid. If `useSpheroid` is passed false or not passed at all, DWithin uses Euclidean distance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the same explanation above.

@@ -95,6 +99,28 @@ trait TestBaseScala extends FunSpec with BeforeAndAfterAll {
sparkSession.read.format("binaryFile").load(path)
}


def generateTestData(): Seq[(Int, Double, Geometry)] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually generating test data is good. But please do not use randomized data for test because this will create uncertain behaviors in the test and it is really hard to debug. In addition, please make sure it takes a parameter that allows to generate any size of test data as desired.

In addition, when create test data for ST_DWithin join, please make sure you know what the exact number of rows should appear in the result. And you should always use test cases that produce non-zero results.

@jiayuasu
Copy link
Member

@Kontinuation Please take a look when you have time. Thanks!

@jiayuasu
Copy link
Member

@Kontinuation Would you please review again when you have time? Thanks!

@jiayuasu jiayuasu merged commit a24a9ff into apache:master Jan 31, 2024
47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants