Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-231] Redundant Serde Elimination #792

Merged
merged 24 commits into from Mar 15, 2023

Conversation

douglasdennis
Copy link
Contributor

@douglasdennis douglasdennis commented Mar 8, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

To avoid redundant serde during Spark execution, I propose the following changes:

  • Create a SerdeAware trait that has a single abstract method called evalWithoutSerialization. This method acts the normal eval methods except that it will not attempt to serialize its results into a Spark internal type.
  • Apply this trait to the null safe expression classes as well as whichever other function classes that it will work with. Implement the evalWithoutSerialization for those classes such that they do not attempt to serialize any Geometry object they return (and returning the actual Geometry object instead).
  • When a sedona function goes to extract the values from its input, it first checks to see if the input expression is SerdeAware. If the input expression is SerdeAware then the evalWithoutSerialization method is called and an actual Geometry object is returned, skipping the serialization and deserialization that would normally happen.

To facilitate this, some changes had to be made to a few functions. Specifically, any function that could return LinearRing objects had to be changed such that they guarantee not doing that. This allows for compliance with OGC to be maintained.

I have three points of concern, which is why I marked this PR as draft:

  1. I'm not sure if my approach is the most succinct or performant. If there are better ways to eliminate the redundant serde then I am open to comment. I can either implement the suggestions myself or I can close this PR and would be happy to see another implementation come out.
  2. I don't know how to test this. At one point I had print statements in the serializer (which you can see in the issue) that helped show that I had in fact eliminated the redundant serde, but I'm not sure how to write a test to automate that. Is there a way to mock the geometry serializer? Or maybe add debug logging that runs for the test but is disabled in some kind of release mode? I'm not sure what the convention is here.
  3. By enforcing LineStrings for certain functions, I have subtly changed a public API in flink. I'm not sure if that needs to be documented and, if so, where. Spark should not be affected since the serializer already does not allow for LinearRings.

How was this patch tested?

Unit tests pass. A few in flink had to be updated since certain functions no longer return LinearRing objects.

As for ensuring the redundant serde is eliminated, please see point # 2 above.

Did this PR include necessary documentation updates?

  • No, see above.

Copy link
Contributor

@Kimahriman Kimahriman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about java/scala testing frameworks, is there a way to like spy on the geometry serializer to see how many times it gets called as a means to add a test?

@jiayuasu
Copy link
Member

jiayuasu commented Mar 8, 2023

@douglasdennis To check the number of serialization called in each query, you can probably consider using SparkListener. Sedona already has a test listener for RDD join (https://github.com/apache/sedona/blob/master/core/src/main/scala/org/apache/sedona/core/monitoring/Listener.scala). It will accumulate Sedona Metric (https://github.com/apache/sedona/blob/master/core/src/main/scala/org/apache/sedona/core/monitoring/Metric.scala). You can probably refactor it a bit to incorporate this test.

In the test listener, you can do some sort of assertion to validate the result.

@jiayuasu
Copy link
Member

jiayuasu commented Mar 8, 2023

@Kontinuation Hi Kristin, any comments?

@Kontinuation
Copy link
Member

Kontinuation commented Mar 9, 2023

We can mock the static methods in GeometrySerializer and count the number of invocations. I've tried out Mockito and found it quite suitable for this case.

To verify that serialization calls were actually eliminated, we can construct a simple catalyst expression and eval it instead of running a full-fledged Spark job. We mock GeometrySerializer when evaluating the expression, then verify invocation counts after eval.

Here is an example of using org.mockito:mockito-core:5.1.1 to test this serde-aware mechanism:

class SerdeAwareFunctionSpec extends AnyFunSpec {

  describe("SerdeAwareFunction") {
    it("should save us some serialization and deserialization cost") {
      // Mock GeometrySerializer
      val factory = new GeometryFactory
      val stubGeom = factory.createPoint(new Coordinate(1, 2))
      val mocked = mockStatic(classOf[GeometrySerializer])
      mocked.when(() => GeometrySerializer.deserialize(any(classOf[Array[Byte]]))).thenReturn(stubGeom)
      mocked.when(() => GeometrySerializer.serialize(any(classOf[Geometry]))).thenReturn(Array[Byte](1, 2, 3))

      val expr = ST_Union(Seq(
        ST_Buffer(Seq(ST_GeomFromText(Seq(Literal("POINT (1 2)"), Literal(0))), Literal(1.0))),
        ST_Point(Seq(Literal(1.0), Literal(2.0), Literal(null)))
      ))

      try {
        // Evaluate an expression
        expr.eval(null)

        // Verify number of invocations
        mocked.verify(
          () => GeometrySerializer.deserialize(any(classOf[Array[Byte]])),
          atMost(0))
        mocked.verify(
          () => GeometrySerializer.serialize(any(classOf[Geometry])),
          atMost(1))
      } finally {
        // Undo the mock
        mocked.close()
      }
    }
  }
}

This test will pass on this branch, and will fail on the master branch:

Wanted at most 0 times but was 3
org.mockito.exceptions.verification.MoreThanAllowedActualInvocations: 
Wanted at most 0 times but was 3
	at org.apache.sedona.common.geometrySerde.GeometrySerializer.deserialize(GeometrySerializer.java:64)
	at org.apache.sedona.sql.functions.SerdeAwareFunctionSpec.$anonfun$new$5(SerdeAwareFunctionSpec.scala:52)
	at org.apache.sedona.sql.functions.SerdeAwareFunctionSpec.$anonfun$new$2(SerdeAwareFunctionSpec.scala:53)
...

Note that we've mocked the GeometrySerializer to give constant results, so the result of eval is incorrect. It does not matter since the purpose of this test is to collect information on the code path that actually got executed. Any stub values that make it go through should be fine.

@douglasdennis
Copy link
Contributor Author

Thank you, @Kontinuation ! I tried looking at a similar path but I wasn't sure how to get mockito to work. I'll get this added to the PR.

@douglasdennis douglasdennis marked this pull request as ready for review March 15, 2023 04:13
@douglasdennis
Copy link
Contributor Author

@Kontinuation Thank you again for the test. I wasn't able to use 5.1.1 of mockito, though. It didn't seem compatible with Java 8. I went with version 4.11.0 instead.

@jiayuasu This PR adds mockito as a testing dependency. I think this is also ready to go if everyone agrees.

@jiayuasu
Copy link
Member

@douglasdennis The Mockito dependency looks good to me. If @Kontinuation has no problem with this, I think we are good to go.

@Kontinuation
Copy link
Member

@douglasdennis The Mockito dependency looks good to me. If @Kontinuation has no problem with this, I think we are good to go.

LGTM.

@jiayuasu jiayuasu added this to the sedona-1.4.0 milestone Mar 15, 2023
@jiayuasu jiayuasu merged commit 7dfa2c3 into apache:master Mar 15, 2023
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants