SOLR-16292 : NVector for alternative great-circle distance calculations #940

danrosher · 2022-07-13T15:25:09Z

https://issues.apache.org/jira/browse/SOLR-16292

Description

N-Vector is a three-parameter representation of a position on the Earth's surface (a unit vector normal to the surface) that can be used to calculate the great-circle distance (assuming a spherical Earth).

It uses FastInvTrig, which compares well with Math.acos for small distances. NVector is a faster way of calculating the great-circle distance than Haversine.

Solution

Store N-Vectors in the Solr index via a CoordinateFieldType, with the three n-vector components held in single-valued double subfields; the Java Math class is used when indexing them.
Use a Maclaurin approximation of acos to calculate the great-circle distance at query time via a function query.
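
For readers unfamiliar with the representation, the index-time conversion can be sketched as below. This is a minimal illustration, not the PR's code: the class name `NVectorSketch` is hypothetical, and the axis convention (z pointing at the north pole, x at latitude 0 / longitude 0) is an assumption chosen to agree with the `asin(n[2])` inverse quoted later in the review.

```java
final class NVectorSketch {
  /**
   * Converts latitude/longitude in degrees to a unit n-vector {x, y, z}.
   * Assumed convention: z points at the north pole, x at (lat 0, lon 0).
   */
  static double[] latLongToNVector(double latDeg, double lonDeg) {
    double lat = Math.toRadians(latDeg);
    double lon = Math.toRadians(lonDeg);
    double cosLat = Math.cos(lat);
    return new double[] {
      cosLat * Math.cos(lon), // x
      cosLat * Math.sin(lon), // y
      Math.sin(lat)           // z
    };
  }
}
```

The trigonometry here runs once per document at index time, which is why the standard Math functions are acceptable for it.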

Tests

Tests for the FastInvTrig implementation that compare its acos with Math.acos.
Tests for NVectorUtil, NVectorDist, latLongToNVector and NVectorToLatLong.
Tests for indexing N-Vectors and calculating the great-circle distance via the function query.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

madrob

This looks interesting, and I don't quite understand the math behind all of it, but I left some comments about the code itself.

madrob · 2022-07-13T18:27:48Z

solr/core/src/java/org/apache/solr/schema/NVector.java

+                }
+                out[i] = externalVal.substring(start, end);
+                start = idx + 1;
+                end = externalVal.indexOf(',', start);


Will numbers always be stored as decimals with . as the decimal point and , as the separator? We need to make sure this doesn't break in other locales that use a comma as the decimal separator.

I've added a separator init param, and doubles are now parsed using NumberFormat with the default locale.
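
A rough sketch of what locale-aware parsing with a configurable separator can look like follows. The class and method names are hypothetical and this is not the code from the PR; it only uses `java.text.NumberFormat` from the JDK and takes the locale as a parameter so the behaviour is easy to see.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import java.util.regex.Pattern;

final class LocaleAwarePointParser {
  /** Splits externalVal on a literal separator and parses each part with the locale's number format. */
  static double[] parse(String externalVal, String separator, Locale locale) {
    NumberFormat format = NumberFormat.getInstance(locale);
    // Pattern.quote treats the separator as a literal, not as a regular expression.
    String[] parts = externalVal.split(Pattern.quote(separator));
    double[] out = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      try {
        out[i] = format.parse(parts[i].trim()).doubleValue();
      } catch (ParseException e) {
        throw new IllegalArgumentException("Cannot parse '" + parts[i] + "'", e);
      }
    }
    return out;
  }
}
```

With a German locale and `;` as the separator, `parse("52,5;-0,49", ";", Locale.GERMANY)` yields 52.5 and -0.49, which is why the separator has to differ from the locale's decimal mark.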

solr/core/src/test/org/apache/solr/util/NVectorUtilTest.java

solr/core/src/java/org/apache/solr/search/function/distance/NVector.java

madrob · 2022-07-13T18:39:03Z

solr/core/src/java/org/apache/solr/util/FastInvTrig.java

+
+package org.apache.solr.util;
+
+public class FastInvTrig {


Did you write this code? Is it borrowed from a library? Is it an implementation of something found in a text book?

Yes, I wrote this; the original is in my GitHub repo. It's a Maclaurin series expansion of acos. TABLE stores the coefficients so they can be reused for subsequent calculations, and the code also reuses x^2 when computing x^3, x^5, etc. Initially my implementation required a lot of terms for convergence, until I found https://stackoverflow.com/questions/20196000/own-asin-function-with-taylor-series-not-accurate which allows for faster convergence near -1 and 1. I have a benchmark in my repo showing it to be more performant than Math.acos or FastMath.acos, and accuracy appears OK in my tests. NVector with this acos is faster than Haversine.
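
To make the idea concrete, here is an independent sketch of a series-based acos with a precomputed coefficient table. It is not the FastInvTrig source: the class name, the term count constant and the range-reduction identity asin(x) = pi/2 - asin(sqrt(1 - x^2)) for |x| > 1/sqrt(2) are assumptions for illustration, and the identity may differ from what the linked answer or the real implementation uses.

```java
final class MaclaurinAcosSketch {
  private static final int TERMS = 17; // illustrative term count
  private static final double[] COEFF = new double[TERMS];

  static {
    // Maclaurin coefficients of asin: c_0 = 1, c_n = c_{n-1} * (2n-1)^2 / (2n * (2n+1)).
    COEFF[0] = 1.0;
    for (int n = 1; n < TERMS; n++) {
      double k = 2.0 * n;
      COEFF[n] = COEFF[n - 1] * (k - 1) * (k - 1) / (k * (k + 1));
    }
  }

  static double asin(double x) {
    double ax = Math.abs(x);
    // Near +-1 the series converges slowly, so evaluate it at sqrt(1 - x^2) instead.
    boolean reduce = ax > Math.sqrt(0.5);
    double y = reduce ? Math.sqrt((1 - ax) * (1 + ax)) : ax;
    double y2 = y * y;
    double power = y; // holds y^(2n+1), built up by reusing y^2
    double sum = 0.0;
    for (int n = 0; n < TERMS; n++) {
      sum += COEFF[n] * power;
      power *= y2;
    }
    double result = reduce ? Math.PI / 2 - sum : sum;
    return x < 0 ? -result : result;
  }

  static double acos(double x) {
    return Math.PI / 2 - asin(x);
  }
}
```

The range reduction keeps the series argument at or below 1/sqrt(2), which is what lets a fixed, small number of terms suffice across the whole [-1, 1] interval.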

solr/core/src/java/org/apache/solr/util/NVectorUtil.java

solr/core/src/test-files/solr/collection1/conf/schema-nvector.xml

solr/core/src/test/org/apache/solr/util/FastInvTrigTest.java

madrob · 2022-07-14T14:56:43Z

Please also run ./gradlew tidy to make sure that your code adheres to our formatting conventions. :)

madrob · 2022-07-21T00:30:33Z

I found your source FastInvTrig repo and ran the benchmarks from there, with a few tweaks to improve measurement accuracy and also added Lucene's SloppyMath into the competition.

My results are directionally the same as yours -

Benchmark                                    Mode  Cnt    Score    Error  Units
FastInvTrigBenchmark.acosBM                  avgt    3   17.576 ±  1.243  ns/op
FastInvTrigBenchmark.fastMathAcosBM          avgt    3   87.301 ±  4.303  ns/op
FastInvTrigBenchmark.haversineBM             avgt    3   95.062 ± 17.137  ns/op
FastInvTrigBenchmark.mathAcosBM              avgt    3  140.391 ±  5.508  ns/op
FastInvTrigBenchmark.sloppyHaversineMeters   avgt    3   46.647 ±  8.927  ns/op
FastInvTrigBenchmark.sloppyHaversineSortKey  avgt    3   33.637 ±  6.364  ns/op

One concern for comparison that I have would be that the current Lucene implementation (which is only 2x slower than yours) has an upper error bound of 40cm. I think we can get there with your series expansions by adding additional terms, but I don't remember my calculus well enough to know how many we need. If we do that, how much does performance suffer? Are we still competitive? You're currently at 10m, which I suspect might be too large of a delta to be useful for the applications that I am familiar with.

danrosher · 2022-07-21T14:00:25Z

Thanks for looking into this more @madrob .

I wasn't aware of SloppyMath. If we assume Math.acos is correct, my testing shows that FastInvTrig needs 17 terms to agree with it to within 40cm. Rerunning the benchmark with that many terms gives:

Benchmark                                   (num_points)  Mode  Cnt    Score   Error  Units
FastInvTrigBenchmark.acosBM10                    2000000  avgt    2   44.149          ms/op
FastInvTrigBenchmark.acosBM17                    2000000  avgt    2   67.977          ms/op
FastInvTrigBenchmark.fastMathAcosBM              2000000  avgt    2  252.578          ms/op
FastInvTrigBenchmark.haversineBM                 2000000  avgt    2  288.209          ms/op
FastInvTrigBenchmark.mathAcosBM                  2000000  avgt    2  300.509          ms/op
FastInvTrigBenchmark.mathAsin                    2000000  avgt    2  320.407          ms/op
FastInvTrigBenchmark.sloppyAsin                  2000000  avgt    2   50.152          ms/op
FastInvTrigBenchmark.sloppyHaversinMeters        2000000  avgt    2  134.309          ms/op
FastInvTrigBenchmark.sloppyHaversinSortKey       2000000  avgt    2   82.845          ms/op

So then:

SloppyMath.HaversinSortKey is 1.9x slower than FastInvTrig with 10 terms (as you found)
SloppyMath.HaversinSortKey is 1.2x slower than FastInvTrig with 17 terms

However, I also noticed that SloppyMath has an asin implementation, and acos(x) = pi/2 - asin(x):

https://www.wolframalpha.com/input?i2d=true&i=%5C%2840%29Divide%5Bpi%2C2%5D-asin%5C%2840%29x%5C%2841%29%5C%2841%29-acos%5C%2840%29x%5C%2841%29

After adding this to the benchmark, using NVector with this identity for acos:

FastInvTrig with 17 terms is 1.3x slower than SloppyMath.asin!

The caveat, I think, is that SloppyMath.asin uses more memory for its cached lookup tables.

So I'm now wondering whether to abandon the FastInvTrig series expansion and use SloppyMath.asin for NVector. What do you think?

madrob · 2022-07-21T17:11:39Z

I'm a little confused as to what the sloppyAsin results are...

From your tests, we should still prefer NVector with Maclaurin expansion at 17 terms over using Sloppy Haversine, right? Is Sloppy Asin using NVector with no series expansion for the trig and instead the lookup tables from Sloppy Math? Which gives us the 40cm accuracy at almost the original 10 term expansion performance level?

madrob · 2022-07-21T17:38:57Z

Can we do a similar trick to split the calculation to get an n-vector sort key and an n-vector meters and get even more speedup for the cases where we don't care about absolute distances?

danrosher · 2022-07-22T13:13:05Z

> I'm a little confused as to what the sloppyAsin results are...
>
> From your tests, we should still prefer NVector with Maclaurin expansion at 17 terms over using Sloppy Haversine, right? Is Sloppy Asin using NVector with no series expansion for the trig and instead the lookup tables from Sloppy Math? Which gives us the 40cm accuracy at almost the original 10 term expansion performance level?

We calculate the great-circle distance as d = R*acos(a.b), where d is the distance, R is the radius, and a and b are NVectors (a.b is the scalar dot product).

We also know acos(x) = pi/2 - asin(x).

So I compared FastInvTrig.acos (with 17 terms) with SloppyMath.asin, and found that FastInvTrig.acos with 17 terms is 1.3x slower than SloppyMath.asin.

This is what I meant by FastInvTrigBenchmark.sloppyAsin.

So SloppyMath.asin, at the required precision, is faster than FastInvTrig.acos, and I was thinking of abandoning FastInvTrig.acos in favour of SloppyMath.asin. What do you think?
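
The formula d = R*acos(a.b), combined with the asin identity, can be sketched as follows. This is an illustration only: the class name is hypothetical, `Math.asin` stands in for whichever asin implementation is chosen, and the radius constant is an assumed mean Earth radius in kilometres rather than a value taken from the PR.

```java
final class NVectorDistanceSketch {
  // Assumed mean Earth radius in km, for illustration only.
  static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

  /** Scalar (dot) product of two 3-component vectors. */
  static double dot(double[] a, double[] b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
  }

  /** Great-circle distance in km between two unit n-vectors. */
  static double distanceKm(double[] a, double[] b) {
    // Rounding can push the dot product of unit vectors slightly outside [-1, 1].
    double d = Math.max(-1.0, Math.min(1.0, dot(a, b)));
    // acos(d) written as pi/2 - asin(d), so a fast asin can be swapped in.
    return EARTH_MEAN_RADIUS_KM * (Math.PI / 2 - Math.asin(d));
  }
}
```

For example, the n-vectors {1, 0, 0} and {0, 1, 0} are a quarter of a great circle apart, so the result is R*pi/2.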

> Can we do a similar trick to split the calculation to get an n-vector sort key and an n-vector meters and get even more speedup for the cases where we don't care about absolute distances?

Yes! Looking at the acos plot ( https://www.wolframalpha.com/input?i2d=true&i=acos%5C%2840%29x%5C%2841%29 ), it's a one-to-one function (monotonically decreasing on [-1, 1]), so the dot product alone is well suited for comparison: a larger dot product means a smaller distance. I did a quick test which confirmed that the dot product is enough for comparing values (which is all we need to do, as we cache the n-vectors in the NVectorField). So we then have the following benchmark (NVectorSortKey is this comparison, and the fastest):

Benchmark                                   (num_points)  Mode  Cnt    Score   Error  Units
FastInvTrigBenchmark.NVectorSortKey              2000000  avgt    2   13.846          ms/op
FastInvTrigBenchmark.acosBM10                    2000000  avgt    2   45.620          ms/op
FastInvTrigBenchmark.acosBM17                    2000000  avgt    2   68.673          ms/op
FastInvTrigBenchmark.fastMathAcosBM              2000000  avgt    2  249.075          ms/op
FastInvTrigBenchmark.haversineBM                 2000000  avgt    2  289.814          ms/op
FastInvTrigBenchmark.mathAcosBM                  2000000  avgt    2  304.411          ms/op
FastInvTrigBenchmark.mathAsin                    2000000  avgt    2  323.126          ms/op
FastInvTrigBenchmark.sloppyAsin                  2000000  avgt    2   49.749          ms/op
FastInvTrigBenchmark.sloppyHaversinMeters        2000000  avgt    2  135.847          ms/op
FastInvTrigBenchmark.sloppyHaversinSortKey       2000000  avgt    2   82.149          ms/op

So in Solr perhaps we can use NVectorSortKey for sort comparisons then.
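
A small sketch of why the dot product works as a sort key: because acos is monotonically decreasing, ordering candidates by descending dot product gives the same order as ascending great-circle distance, with no trigonometry at all. The class and method names are hypothetical, and this is not how the Solr function query is wired up.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DotProductSortSketch {
  static double dot(double[] a, double[] b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
  }

  /** Returns the candidates ordered nearest-first relative to the query n-vector. */
  static double[][] sortByProximity(double[] query, double[][] candidates) {
    double[][] sorted = candidates.clone();
    // Negate the dot product so that the largest value (nearest point) sorts first.
    Arrays.sort(sorted, Comparator.comparingDouble((double[] c) -> -dot(query, c)));
    return sorted;
  }
}
```

The actual distance only needs to be computed for values that are returned to the user, not for every comparison made during sorting.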

madrob · 2022-07-22T14:44:57Z

Ok, I get it now. Yes, let's do the SloppyMath.asin approach, splitting out the sort key (I would call it dot product and add comments about why we can use it as sort key instead of calling the method SortKey)

danrosher · 2022-07-28T09:39:48Z

SloppyMath.asin now replaces FastInvTrig.acos.
Sorting in the function query is now on the dot product; values still return the distance.

madrob

I think the core of the idea is good here. Big thing missing is an update to the ref guide, probably function-queries.adoc or spatial-search.adoc?

Another thought I had was how exactly do we expect users to use this. If they're still going to be providing indexable data in lat/long and also expecting lat/long for output information, then will this really be faster than using haversine? Or does it move the computation to the indexing side when we only have to do it once, so over multiple queries the total time taken gets reduced...

solr/core/src/java/org/apache/solr/schema/NVectorField.java

madrob · 2022-07-22T14:54:40Z

solr/core/src/java/org/apache/solr/schema/NVectorField.java

+import java.util.Locale;
+import java.util.Map;
+
+public class NVectorField extends CoordinateFieldType {


should this extend from PointType instead? is there more that we get from it?

I think NVectorField is different from PointType. NVector replaces lat/lon and overrides most of the important methods in PointType anyway. NVectorField would also need its own getSpecializedRangeQuery (I'm not sure how to implement this for n-vectors yet).

Ok, fair enough about not extending PointType. As for range queries... does NVector efficiently support shape intersections? I'm under the impression that it doesn't - that's part of the way that we get such a speed up on distance...

solr/core/src/java/org/apache/solr/util/NVectorUtil.java

madrob · 2022-07-29T21:03:42Z

solr/core/src/java/org/apache/solr/schema/NVectorField.java

+     * @return An array of the values that make up the point (aka vector)
+     * @throws SolrException if the dimension specified does not match the number found
+     */
+    public static String[] parseCommaSeparatedList(String externalVal, int dimension, String separator)


Can we have some unit tests for this? Parsing is one of those things that seem easy, but there are always edge cases and it's easy to introduce regressions.

danrosher · 2022-08-08T11:16:33Z

> I think the core of the idea is good here. Big thing missing is an update to the ref guide, probably function-queries.adoc or spatial-search.adoc?

When I get a moment I'll add something to spatial-search.adoc, perhaps?

> Another thought I had was how exactly do we expect users to use this. If they're still going to be providing indexable data in lat/long and also expecting lat/long for output information, then will this really be faster than using haversine? Or does it move the computation to the indexing side when we only have to do it once, so over multiple queries the total time taken gets reduced...

This moves most of the calculation to the indexing side. At query time we then only need to calculate an n-vector for the input lat/lon, and sorting can be done on the dot product alone. Users can optionally index spatial data in multiple formats (e.g. LatLonPointSpatialField and NVectorField) should they find that gives a performance boost. N-Vector provides faster comparison (and great-circle distance calculation) than haversine, and does so without caveats or accuracy degradation for calculations at the poles, the equator, etc.

madrob

Few more comments, mainly about documentation of the feature.

madrob · 2022-08-08T15:59:22Z

solr/core/src/java/org/apache/solr/schema/NVectorField.java

+import java.util.Locale;
+import java.util.Map;
+
+public class NVectorField extends CoordinateFieldType {


Ok, fair enough about not extending PointType. As for range queries... does NVector efficiently support shape intersections? I'm under the impression that it doesn't - that's part of the way that we get such a speed up on distance...

madrob · 2022-08-08T16:00:51Z

solr/core/src/java/org/apache/solr/util/NVectorUtil.java

+    };
+  }
+
+  public static double[] NVectorToLatLong(double[] n) {


nit: method names should start with lowercase (here and others)

madrob · 2022-08-08T16:01:04Z

solr/core/src/java/org/apache/solr/util/NVectorUtil.java

+
+import static org.locationtech.spatial4j.distance.DistanceUtils.EARTH_MEAN_RADIUS_KM;
+
+public class NVectorUtil {


can we add javadoc for the class and methods?

madrob · 2022-08-08T16:06:56Z

solr/core/src/java/org/apache/solr/util/NVectorUtil.java

+
+  public static double[] NVectorToLatLong(double[] n) {
+    return new double[] {
+      Math.asin(n[2]) * (180 / Math.PI), Math.atan(n[1] / n[0]) * (180 / Math.PI)


Math.toDegrees
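
As an illustration of this suggestion, the inverse conversion can be written with `Math.toDegrees`. The sketch below is not the PR's code and uses a hypothetical class name; it also uses `Math.atan2(y, x)` rather than `Math.atan(y / x)`, since plain atan only returns angles between -90 and 90 degrees and divides by zero when x is 0.

```java
final class NVectorToLatLongSketch {
  /** Converts a unit n-vector {x, y, z} back to {latitude, longitude} in degrees. */
  static double[] nVectorToLatLong(double[] n) {
    double lat = Math.asin(n[2]);
    // atan2 keeps the quadrant, so longitudes beyond +-90 degrees are recovered correctly.
    double lon = Math.atan2(n[1], n[0]);
    return new double[] {Math.toDegrees(lat), Math.toDegrees(lon)};
  }
}
```

For instance, the n-vector {-1, 0, 0} maps to latitude 0 and longitude 180 (within floating-point rounding), which the atan(y / x) form would report as longitude 0.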

madrob · 2022-08-08T16:09:24Z

solr/core/src/test/org/apache/solr/search/function/distance/NVectorDistTest.java

+        req(
+            "defType", "lucene",
+            "q", "*:*",
+            "nvd", "nvdist(52.01966071979866, -0.4983083573742952,nvector)",


I missed where 'nvdist' is defined as a function. Do we need to add to ValueSourceParser static init block?

added to ValueSourceParser

github-actions · 2024-02-19T00:00:45Z

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!

github-actions · 2024-10-09T00:00:36Z

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

Dan Rosher added 4 commits July 13, 2022 16:15

SOLR-16292 : NVector for alternative great-circle distance calculations

b9ae73a

SOLR-16292 : NVector for alternative great-circle distance calculations

f2f2625

SOLR-16292 : NVector for alternative great-circle distance calculations

493af2b

SOLR-16292 : NVector for alternative great-circle distance calculations

fa3a33d

madrob reviewed Jul 13, 2022

Dan Rosher added 7 commits July 15, 2022 16:44

use test random(),fix EARTH_MEAN_RADIUS_KM, rename NVector -> Nvector…

f7ec79e

…Field and NVectorFunction to avoid confusion, use better asserts for tests

use better asserts for tests

cf4b4ff

reduce schema size, fix assert values for adjusted earth radius

2e4069e

gradle tidy

4a24223

add separator and use default locale to parse doubles

acd5d10

tidy up

4ace4cc

remove unused import

90a9173

Dan Rosher added 4 commits July 28, 2022 10:01

remove FastInvTrig as SloppyMath.asin faster at 40cm resolution

7168e56

use custom sorting on dotproduct as enough for comparison

50c486f

Add dotproduct calculation, use SloppyMath.asin instead of FastInvTri…

0389ca4

…g.acos, faster at 40cm resolution

more testing for sorting

7b325ce

madrob reviewed Jul 29, 2022

psf for DEFAULT_SEPARATOR.

0119d31

use Math for radian calc

madrob reviewed Aug 8, 2022

Dan Rosher added 2 commits August 9, 2022 14:44

lc method names

6e745ec

- javadoc

d9b7fe9

- lc method names - use Math static method where appropriate

Dan Rosher added 2 commits August 9, 2022 14:45

- alow tolerance

83aae86

- lc method names

- add nvdist parser

f9952bb

github-actions bot added the stale PR not updated in 60 days label Feb 19, 2024

github-actions bot added the closed-stale Closed after being stale for 60 days label Oct 9, 2024

github-actions bot closed this Oct 9, 2024

