Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[similarity] add similarity distance #1176

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

barendgehrels
Copy link
Collaborator

@barendgehrels barendgehrels commented Jul 12, 2023

This is a new similarity distance, comparable to Hausdorff or Frechet.

It calculates the area between two lines and divides them by the line length. This is an average distance, measuring quite well how similar two lines are. It is less sensible to outliers (such as Hausdorff/Frechet) and not discrete.

For example this case (west1/west2):

image

They are quite similar. Distance impressions are:
Frechet: 53.4426 Hausdorff: 53.4426 Average: 5.61308

How it works in detail:

  • it projects all points on the other line
  • and vice versa
  • it adds their own points too. Then both collections have an equal amount of points, and because of the projections, their locations are comparable (especially if they are already similar)
  • it checks if one of the lines should be reversed
  • then it calculates areas of all four-corner rings and sums them
  • and it divides by the shortest length

(See the comment below for an enhanced description)

Zoomed in:
image

Apart from this, this PR also adds an experimental geojson_writer

If you like it, I'll also add a version for rings. For polygons/multi it is a bit more complex (especially if the number of objects are unequal), to be worked out.

The geojson_writer is really useful for visualizing results (or debug steps). But it is not completely finished either.

Complete output of the test program:

Simplex cases: 
Frechet: 0.1 Hausdorff: 0.1 Average: 0.1
Frechet: 0.509902 Hausdorff: 0.509902 Average: 0.1
Frechet: 1.58114 Hausdorff: 1.58114 Average: 0.114

Real cases: 
Frechet: 53.4426 Hausdorff: 53.4426 Average: 5.61308
Frechet: 81.0897 Hausdorff: 81.0897 Average: 3.98684

Reversed:
Frechet: 990.622 Hausdorff: 53.4426 Average: 5.61308
Frechet: 1199.04 Hausdorff: 81.0897 Average: 3.98684

Simplified (west): 
Frechet: 494.575 Hausdorff: 494.575 Average: 61.4674
Frechet: 475.107 Hausdorff: 475.107 Average: 24.5124
Frechet: 54.7065 Hausdorff: 54.7065 Average: 0.733346
Frechet: 23.9904 Hausdorff: 23.9904 Average: 0

Simplified (east): 
Frechet: 520.245 Hausdorff: 520.245 Average: 124.301
Frechet: 74.5731 Hausdorff: 74.5731 Average: 7.53191
Frechet: 74.5731 Hausdorff: 74.5731 Average: 0.511786
Frechet: 0 Hausdorff: 0 Average: 0

Not similar: 
Frechet: 8760.18 Hausdorff: 7737.14 Average: 4925.39
Frechet: 8771.21 Hausdorff: 7746.79 Average: 4848.01

Parts (TODO):
Frechet: 831.744 Hausdorff: 831.744 Average: 7.25765
Frechet: 1094.28 Hausdorff: 1094.28 Average: 199.363

Identical: 
Frechet: 0 Hausdorff: 0 Average: 0
Frechet: 0 Hausdorff: 0 Average: 0

Performance:
Frechet: 105 ms
Hausdorff: 78 ms
Average: 498 ms

Showing that:

  • hausdorff was not symmetric (I created an issue for that)
  • frechet gives different results for reversed cases
  • frechet and hausdorff are often, but not always, the same
  • the new method is much slower (we can use an index to enhance this)
  • because frechet/hausdorff are discrete, its behavior is not good when there are points in between (see simplex 2, or the simplified cases)
  • because frechet/hausdorff use a max distance, a spike large influence (simplex 3)

I also tried to give a "partial" result, which happens if one line is partly identical to the other line. This is possible, but not yet finished. It uses the projections and then recursively finds the best division point. But this is not part of the PR.

@barendgehrels barendgehrels added the experimental Experimental PR - might need work or discussion label Jul 12, 2023
@barendgehrels barendgehrels self-assigned this Jul 12, 2023
@barendgehrels
Copy link
Collaborator Author

barendgehrels commented Jul 13, 2023

A rephrased description, with a bit of AI help:

To compute the Average Similarity Distance (ASD) between two linestrings, let's consider linestring P and linestring Q. The ASD is derived by comparing the geometric characteristics of the two linestrings.

  1. Projection: Each point of linestring Q is projected onto linestring P, identifying the shortest distance for projection. This projection process creates a modified version of linestring P, denoted as P'. Similarly, each point of linestring P is projected onto linestring Q, resulting in a modified version of linestring Q called Q'.
  2. Augmentation: All points of linestring P are added to P', maintaining their original position, with an offset value of 0 and increasing segment distance. Similarly, all points of linestring Q are added to Q'.
  3. Alignment: P' and Q' are sorted based on their segment index and offset values in increasing order. This sorting ensures that both collections are comparable.
  4. Reversal: If necessary, one of the linestrings is reversed to achieve better alignment between corresponding points in P' and Q'. This step ensures consistency in the comparison.
  5. Area Calculation: The areas of all quadrilaterals, formed by connecting each point with its subsequent point in both P' and Q', are calculated. The sum of these areas is computed.
  6. Normalization: The total sum of areas is divided by the shortest length of the two linestrings P and Q to normalize the ASD value.

The resulting ASD value provides an average distance measure that captures the geometric similarity between the linestrings P and Q. By incorporating projection, augmentation, alignment, and area calculation steps, the ASD offers an enhanced understanding of the similarity between the two linestrings.

Copy link
Member

@vissarion vissarion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!! That's a new algorithm right? There is a similar work where they compute the area between two linestrings but in a slightly different way https://link.springer.com/article/10.1007/s12289-018-1421-8 (ends up in a different area).

A did a first parse and put some comments in the code.

Is there cases where other measures (Fréchet or Hausdorff) is better than the new one? How the new method perform when the inputs differ a lot between them?

auto get_projection(std::size_t index, Point const& point, Geometry const& target)
{
using coor_t = typename coordinate_type<Point>::type;
using closest_strategy = geometry::strategy::closest_points::detail::compute_closest_point_to_segment<coor_t>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should passed as a function parameter following a similar mechanism as in all other algorithms.

auto const& current = target[i];
auto const& next = target[i + 1];
projection<Point> other;
auto const pp = closest_strategy::apply(point, current, next);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very similar to how closest points are computed for a linestring. I guess you are not using that algorithm because you want to keep track also the index. Maybe the closest_points procedure can be enhanced with this extra information (e.g. providing the index) and reused here?

From performance point of view there is some duplication in the computation of the closest point here with the distance computed in two lines below. Also, you may consider using the comparable distance instead of the real distance.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at it. It seems that some code duplication is also there in the detail::closest_feature::point_to_point_range function, where the closest segment is determined based on (comparable) distance. Which, internally, already finds the projection - but it is discarded (in the strategy itself).
Then the closest point is calculated.

Here I do it the other way round, calling first closest point strategy (which already does some calculations to find it - but does not calculate the distance yet) and then calculate its distance (I could use comparable instead - changed it right now).

I'm not sure if it is worth the effort to harmonize this code for this reason only, it will make it more complex.

What could be useful is a strategy which gives all the internal information to avoid code duplication in other places. Something like that is already in extensions (calculate_distance_info) but never made it. So maybe also that is not really worth the effort...

auto const sum_area = get_areal_sum(boost::begin(p_enriched), boost::end(p_enriched),
boost::begin(q_enriched), boost::end(q_enriched), ring, visitor);

auto const length = (std::min)(geometry::length(p), geometry::length(q));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least for the reversed linestring the length is already computed (maybe this can be reused here)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that, unless you mean the distance calculation done for reversal.

int const side_p0_q = side_strategy::apply(q0.point, q1.point, p0.point);
int const side_p1_q = side_strategy::apply(q0.point, q1.point, p1.point);

if (side_q0_p == 0 && side_q1_p == 0 && side_p0_q == 0 && side_p1_q == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't side_q0_p == 0 && side_q1_p == 0 be enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should indeed be enough, at least for the cases I tested. However, I'm not sure about FP-precision where there are sometimes weird results (such as 3 times 0 and one non-zero). And the sides have to be calculated later (used later) anyway.
But with those FP-precision errors the area is probably minimal - so it probably won't make a real difference.

I will think about what my final preference is here.

}

template <typename Projection, typename Ring>
bool fill_quadrilateral(Projection const& p0, Projection const& p1, Projection const& q0, Projection const& q1, Ring& ring)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it always a quadrileteral? In some cases could be a triangle or even degenerated to a segment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it can. But sorting that out is probably not worth it, because the area will be the same.

@awulkiew
Copy link
Member

Thanks, that's interesting!

AFAIU the CS-specific operations are closest points, side and area so technically this distance should work in all coordinate systems correct?

Do I understand correctly that you wanted to show a prototype to start a discussion and you plan to implement it with Boost.Geometry structure (strategy, dispatch, etc.) if we agree that this is a good addition to the library?

Is ASD an official name for this measure?
Would giving the function a name more specific than similarity_distance be a good idea?
Or do you think that this should be baseline, default similarity measure in the library?

#include <algorithm>
#include <iterator>
#include <utility>
#include <limits>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall that this order of headers is prefered in general. In our case this would be Boost.Geometry headers, then other Boost headers, then standard headers. Do we want to make a guideline about this?

I'd add one thing. The headers are not sorted alphabetically. They probably should.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that in other file the order is different so I'm curious why were they moved here?

Comment on lines 13 to 18
#include <cstdint>
#include <vector>

#include <boost/geometry/core/point_type.hpp>
#include <boost/geometry/algorithms/distance.hpp>
#include <boost/geometry/algorithms/length.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here standard headers are before Boost.Geometry headers though.

@barendgehrels
Copy link
Collaborator Author

Thank you both for your comments! Cool!
Yes, I think similarity_distance would be a good name, it also crossed my mind.
I will come back to your comments in more detail next week on Thursday, probably.

@awulkiew
Copy link
Member

To clarify, I'm considering whether or not giving it such general name is a good idea. There are various measures of similarity after all. And AFAIU they are not interchangeable. Unless it would be possible to define some kind of standard of similarity allowing to generate comparable results between various algorithms? So the question is which one if any should be considered as the default one in the library? If it's not possible to answer this question then maybe this algorithm should have a specific name like the other ones.

@barendgehrels
Copy link
Collaborator Author

AFAIU the CS-specific operations are closest points, side and area so technically this distance should work in all coordinate systems correct?

Yes, correct, I added a call for geographic coordinate system as well.

Do I understand correctly that you wanted to show a prototype to start a discussion and you plan to implement it with Boost.Geometry structure (strategy, dispatch, etc.) if we agree that this is a good addition to the library?

Yes, indeed

Is ASD an official name for this measure? Would giving the function a name more specific than similarity_distance be a good idea? Or do you think that this should be baseline, default similarity measure in the library?

Okay, I now have average_distance - which is, in fact, what it is. And I think it's a better measure for similarity than hausdorff or frechet, because that is more a "max distance" which might have a large value while all other points are close or equal, apart from the problems because it is discrete. But this might be biased and have to be researched a bit more.

@barendgehrels barendgehrels reopened this Jul 27, 2023
@awulkiew
Copy link
Member

Okay, I now have average_distance - which is, in fact, what it is.

Would it be possible to calculate the similarity of areal geometries (rings or polygons) this way, i.e. pass them into this algorithm and internally calculate the similarity of their boundaries? If it is then this name would not work for them because the average distance of two intersecting polygons would always be 0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental Experimental PR - might need work or discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants