Incorrect clustering of records #3973

Closed
gbif-portal opened this issue Mar 10, 2022 · 17 comments

@gbif-portal
Collaborator

Cluster, an experimental feature

We encountered an interesting new clustering feature in one of the Dutch databases. Two occurrences are clustered, but probably because one occurrence from Germany lacks data on date & location. See https://www.gbif.org/occurrence/3128544409/cluster
Why are these specimens clustered, and where can I find more information about this feature?


User: See in registry
System: Chrome 99.0.4844 / Windows 10.0.0
Referer: https://www.gbif.org/occurrence/3128544409/cluster
Window size: width 1536 - height 722
API log
Site log
System health at time of feedback: OPERATIONAL

@ManonGros

In this case, it is because the catalogue number of the SMF record (746) overlaps with the "other catalogue numbers" of the NMR record and the lack of country, date and coordinates in the SMF record makes it non-conflicting.

@timrobertson100 is there a way to specify that two records shouldn't be clustered?
Should the algorithm take into account the collection and institution codes/identifiers when provided? Or would it be an issue to detect herbarium duplicates across institutions?
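
The behaviour described above can be sketched in a few lines. This is a minimal illustration, not GBIF's implementation: the function names, the dictionary layout and every field value of the NMR record are assumptions made for the example. Only the SMF catalogue number 746 comes from the thread.

```python
from typing import Optional


def non_conflicting(a: Optional[str], b: Optional[str]) -> bool:
    """Return True when two values do not contradict each other.

    A missing value on either side cannot conflict with anything.
    """
    if not a or not b:
        return True
    return a == b


def identifiers(record: dict) -> set:
    """Collect the catalogue number and any other catalogue numbers."""
    ids = set(record.get("otherCatalogNumbers", []))
    if record.get("catalogNumber"):
        ids.add(record["catalogNumber"])
    return ids


def linked_by_old_rule(a: dict, b: dict) -> bool:
    """Identifier overlap plus non-conflicting country, date and coordinates."""
    if not identifiers(a) & identifiers(b):
        return False
    fields = ("countryCode", "eventDate", "coordinates")
    return all(non_conflicting(a.get(f), b.get(f)) for f in fields)


# Sparse record: only a catalogue number, as in the SMF case.
smf = {"catalogNumber": "746"}
# Hypothetical values standing in for the NMR record.
nmr = {
    "catalogNumber": "NMR-EXAMPLE-1",
    "otherCatalogNumbers": ["746"],
    "countryCode": "NL",
    "eventDate": "1950-01-01",
    "coordinates": "52.0,4.5",
}
print(linked_by_old_rule(smf, nmr))  # True
```

Because the sparse record has nothing to compare, every field check passes and the identifier overlap alone decides the link.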

@timrobertson100
Member

Thanks @ManonGros

It's tricky with sparsely populated records like this.

There's currently no way to mark a specific record pair to be ignored, but we could tighten the rules a bit, e.g. by requiring a match on either date or location rather than accepting non_conflicting for both. That would mean removing this rule.

What do you think?
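
The tightening proposed above can be sketched as a three-way comparison. This is only an illustration under stated assumptions: the `Comparison` enum, the function names and the choice to reject any explicit conflict are hypothetical and are not taken from GBIF's code.

```python
from enum import Enum
from typing import Optional


class Comparison(Enum):
    MATCH = "match"
    NON_CONFLICTING = "non_conflicting"  # at least one side is missing
    CONFLICT = "conflict"


def compare(a: Optional[str], b: Optional[str]) -> Comparison:
    """Classify two field values as matching, non-conflicting or conflicting."""
    if not a or not b:
        return Comparison.NON_CONFLICTING
    return Comparison.MATCH if a == b else Comparison.CONFLICT


def linked_by_tightened_rule(
    ids_overlap: bool, date: Comparison, location: Comparison
) -> bool:
    """Require identifier overlap and a real match on date or location."""
    if Comparison.CONFLICT in (date, location):
        return False  # assumption: an explicit conflict always blocks the link
    return ids_overlap and Comparison.MATCH in (date, location)


# A record with no date and no location is non-conflicting on both fields,
# so identifier overlap alone is no longer enough.
date = compare(None, "1950-01-01")
location = compare(None, "52.0,4.5")
print(linked_by_tightened_rule(True, date, location))  # False
```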

@ManonGros

ManonGros commented Mar 10, 2022

Thanks @timrobertson100, yes, it might be a good idea to remove the non-conflicting date and location rule. Now that I see this example, I think this rule doesn't give enough evidence to infer the cluster.
Let me know if/when you remove it (I will update the blogpost accordingly).

@timrobertson100
Member

This has now been run in production and the records no longer cluster. This shows on the detail pages for the occurrences now, but it will take a little time before they disappear from the "is in cluster" search option. I've initiated a recrawl/reprocess for the two datasets involved, and that needs to complete for the search option to disappear.

Thanks for offering to update the blogpost @ManonGros - can you please do that and then close this issue?
I'm not sure if we'd want to state within the text that it's changed today, but if so, we could refer to this release note?

@timrobertson100 timrobertson100 changed the title from "Cluster, an experimental feature" to "Incorrect clustering of records" on Mar 11, 2022
@ManonGros

Thanks Tim! I updated the blogpost (I originally put the link to this issue but I just changed it to point to the release note)

@abubelinha

abubelinha commented Sep 1, 2022

Hi. Would you prefer that we reuse this issue for other incorrectly clustered records we find, or open separate issues? I am posting them here for now (please mention me if you open a new issue, so I can track it).

  • Plant occurrence 180425491 is matched to an animal (api link).
    For some reason, cluster info says "same accepted species".

  • Occurrence 2821298442, contains "isInCluster":true (in the api link)
    But for some reason, I see nothing in the related occurrences web and api pages.

If those are expected behaviours I would like to understand the reasons.

Thanks a lot
@abubelinha

@ManonGros

@timrobertson100 could you take a look?
I cannot explain why the occurrence 180425491 is in this cluster.

For the other issue mentioned above, perhaps it is time to run an update?

@ManonGros ManonGros reopened this Sep 2, 2022
@timrobertson100
Member

Mmmm, this is an unusual bug

The API call for that page is this:
https://api.gbif.org/v1/occurrence/180425491/experimental/related

On there you can see the current record has a "gbifId": 1804254910. For some reason this is showing the cluster of 1804254910 and not 180425491 (extra 0 at the end).
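
A small sanity check for this kind of mismatch might look like the sketch below. The thread does not describe the root cause in detail, so this only illustrates the symptom: the helper name and the one-field record are assumptions, and the real response contains many more fields.

```python
def current_record_matches(requested_id: int, current_record: dict) -> bool:
    """Compare the requested ID with the returned gbifId as whole integers."""
    return int(current_record["gbifId"]) == int(requested_id)


requested = 180425491
# Fragment of the response as quoted in the thread; other fields omitted.
returned = {"gbifId": 1804254910}

print(current_record_matches(requested, returned))  # False
# The requested ID is a textual prefix of the returned one ("extra 0 at the
# end"), so any comparison on partial strings would confuse the two records.
print(str(returned["gbifId"]).startswith(str(requested)))  # True
```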

@timrobertson100
Member

timrobertson100 commented Sep 2, 2022

This is all fixed in code now.

Plant occurrence 180425491 is matched to an animal (api link).

That record no longer shows a cluster (it showed the cluster for 1804254910 not 180425491)

Occurrence 2821298442, contains "isInCluster":true (in the api link)

This is fixed in code, and being released and deployed in production data pipelines now. After which we'll reprocess the dataset to clear the mistake in the search index. It was the same bug as the one above but applied to how we build the search index.

@abubelinha - thank you for raising this. Due to the way we hash records it would only appear occasionally, so went unnoticed before.

I'll close this knowing it's addressed in code and that the data will shortly be updated.

@abubelinha

Thank you @ManonGros and @timrobertson100 !

Occurrence 2821298442, contains "isInCluster":true (in the api link)

This is fixed in code, and being released and deployed in production data pipelines now. After which we'll reprocess the dataset to clear the mistake in the search index. It was the same bug as the one above but applied to how we build the search index.

I understand the bug above was the "other catalogue numbers" overlap + "lack of data in certain other fields". Correct?
And as you have fixed the clustering algorithm, that's the reason for occurrence 2821298442 no longer being in that cluster. Correct?

So I couldn't actually see the original cluster ... but I searched for the taxon+location+date combination and I am pretty sure this was it: Occurrence 29606717 cluster (which still contains 8 occurrences).

I can confirm it is a good cluster (I know these all are copies of the same herbarium specimen, shared in exchange to several institutions).
But the cluster was better before: I mean it was also correct to have occurrence 2821298442 in that cluster, as it was before the fix.
So, I would like to dive a little bit more into this:

  1. What now makes 2821298442 different from the other occurrences, so that it is no longer kept in that cluster? I mean, which dwc information is missing that the data provider should add to make it match again?
    I don't think that dwc:otherCatalogNumbers had played any role here, because that field was not provided in 2821298442 at any time.

  2. After your fix ... are dwc:otherCatalogNumbers still somehow useful to match records from different datasets?
    It would be great if you provide guidelines on how to do that, so it takes priority over other dwc fields.

I am particularly interested in this situation:
Suppose a clustered occurrence is revised, and its dwc:scientificName no longer matches the other occurrences in that original cluster.
Is there any way data curators may use dwc:otherCatalogNumbers to keep the occurrence in that cluster? (i.e., pointing it to one or more of the other occurrences in the cluster).
This would help a lot to propagate taxonomic revisions between datasets (as long as their curators take care of tracking gbif clusters).

Thanks a lot for this useful feature anyway!

@timrobertson100
Member

timrobertson100 commented Sep 5, 2022

This thread is getting a little difficult to read, but I will do my best to answer.

The bug spotted didn't need a change in the clustering algorithm (how we detect related records) or e.g. the use of otherCatalogNumbers as I think you note. It was simply how we used the output of the clustering where we were misreading the IDs and 1) incorrectly showing a different cluster on the record in a few cases, and 2) incorrectly setting the "isInCluster" flag (same bug) in the search index. That is now fixed, but we haven't changed any of the actual clusters.

The SANT:SANT:44553-A record was never in this cluster which I can confirm by looking at the backend database that holds the links. So, let's understand why that is...

The blog post has this table that summarises the conditions that can trigger a detection:
[Image: table from the blog post summarising the conditions that trigger a detection]

Looking at the Sant record and the cluster we can see they have the same accepted species, date, but coordinates that are 875m apart and differently formatted collector names which could be improved as logged here.

In this case, if the collector name were made identical ("M. Campos" instead of "M.Campos"), the second-to-last column would be satisfied (within 2km, same recorder, same species and equal dates).
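
The rule just described (within 2km, same recorder, same species, equal dates) can be sketched as follows. This is an illustration under assumptions, not GBIF's code: the record layout, the coordinates, the species key and the date are hypothetical, and the name normalisation is only one possible improvement of the kind discussed in the thread.

```python
import math
import re

EARTH_RADIUS_M = 6_371_000  # mean radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def normalize_collector(name: str) -> str:
    """Drop spaces and dots and ignore case: 'M. Campos' and 'M.Campos' agree."""
    return re.sub(r"[\s.]+", "", name).casefold()


def satisfies_rule(a: dict, b: dict, max_distance_m: float = 2000) -> bool:
    """Same species, equal dates, same recorder and coordinates within 2 km."""
    return (
        a["speciesKey"] == b["speciesKey"]
        and a["eventDate"] == b["eventDate"]
        and normalize_collector(a["recordedBy"]) == normalize_collector(b["recordedBy"])
        and haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m
    )


# Hypothetical records: only the two collector spellings come from the thread.
rec_a = {"speciesKey": 1, "eventDate": "1990-05-01", "recordedBy": "M. Campos",
         "lat": 42.88, "lon": -8.54}
rec_b = {"speciesKey": 1, "eventDate": "1990-05-01", "recordedBy": "M.Campos",
         "lat": 42.885, "lon": -8.548}
print(satisfies_rule(rec_a, rec_b))
```

Stripping all spaces and dots is deliberately crude and could merge distinct names, so a production normaliser would need more care.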

Please remember we currently run clustering frequently, but not automatically, so any publication would require us to rerun clustering before it appeared.

I am particularly interested in this situation:
Suppose a clustered occurrence is revised, and its dwc:scientificName no longer matches the other occurrences in that original cluster.
Is there any way data curators may use dwc:otherCatalogNumbers to keep the occurrence in that cluster? (i.e., pointing it to one or more of the other occurrences in the cluster).
This would help a lot to propagate taxonomic revisions between datasets (as long as their curators take care of tracking gbif clusters).

I think we would need to define new rules for that. Currently, unless they are related to type specimens they all rely on the same accepted species, so reidentifications are not captured. I think we'd need quite a tight ruleset including identifiers overlap, collector overlap and similar date/location, and perhaps some higher order taxon to avoid too many false positives.

Does this help with understanding what you have observed please, @abubelinha? Thanks again for the feedback

@abubelinha

Thanks @timrobertson100 for your clear explanation.

As that observation had been in a cluster and was then removed from it, I was blindly assuming it had to be that cluster because it was the good one. My bad.

As for the coordinates difference, I am afraid all occurrences in the cluster have been rounded to 0.01 Lat/Lon degrees precision. The cluster-excluded observation hasn't been rounded, which explains the discrepancy.

I agree that normalizing some characters and removing spaces could improve clusters a lot.

Regarding my suggestion of using dwc:otherCatalogNumbers based clusters, I am not sure you understood me: I was suggesting a rule based on that field alone, like the first two assertions at the top of the table.

But as you say this thread is getting difficult to read, I opened a new issue about this.

@timrobertson100
Member

As for the coordinates difference, I am afraid all occurrences in the cluster have been rounded to 0.01 Lat/Lon degrees precision. The cluster-excluded observation hasn't been rounded, which explains the discrepancy.

Thanks - I had suspected something like that had happened and your record actually had the more accurate georeference. I chose the limits of 200m and 2km to accommodate 3 and 4 decimal place rounding globally, but here we are at 2 decimal places.
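
The effect of coordinate rounding on distance can be estimated with a short calculation. The sketch below assumes a spherical Earth (about 111,195 m per degree of latitude) and round-to-nearest behaviour; the function name is hypothetical and the figures are approximations, not values used by GBIF.

```python
import math

METERS_PER_DEGREE = 111_195  # approximate, spherical Earth


def max_rounding_offset_m(decimals: int, latitude_deg: float = 0.0) -> float:
    """Worst-case shift of one point when both coordinates are rounded.

    Rounding to `decimals` places moves each coordinate by at most half a
    step; the worst case is the diagonal of that half-step box.
    """
    half_step_deg = 0.5 * 10 ** -decimals
    north_m = half_step_deg * METERS_PER_DEGREE
    east_m = north_m * math.cos(math.radians(latitude_deg))
    return math.hypot(north_m, east_m)


for decimals in (2, 3, 4):
    print(decimals, round(max_rounding_offset_m(decimals)))
```

At the equator this gives roughly 786 m, 79 m and 8 m for 2, 3 and 4 decimal places. Two records rounded independently can differ by up to twice those values, and truncating instead of rounding would allow a full step per axis rather than half.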

I think we've arrived at a good point in understanding how things work and fixed the bug identified (i.e. that it wasn't working as expected). Let's continue the discussion on improving the normalisation of collector names and improvements relating to otherCatalogNumbers on those threads.

Thanks for your interest in this!

@abubelinha

abubelinha commented Sep 25, 2022

I found this pair of occurrences, and I can't figure out why they are not in a cluster.

I think they should be for a number of reasons (same taxon, typification relationship, same coordinates, location, collector and date).
But I can't see the "cluster" link at the top:

https://www.gbif.org/occurrence/1936346601
https://www.gbif.org/occurrence/1936158404

@timrobertson100
Member

I found this pair of occurrences, and I can't figure out why they are not in a cluster.

They are in the same dataset. We only look for links between records across datasets to help e.g. transfer knowledge between institutions.

There are also many datasets in which nearly everything would cluster (e.g. gut analysis), which raised a technical concern about cardinalities and about whether we could actually calculate these in a timely manner.
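
The cross-dataset restriction can be sketched as a filter on candidate pairs. This is an illustration only: the `datasetKey` values and the third record are hypothetical, and real candidate generation would not enumerate all pairs like this.

```python
from itertools import combinations
from typing import Iterable, Iterator, Tuple


def cross_dataset_pairs(records: Iterable[dict]) -> Iterator[Tuple[int, int]]:
    """Yield pairs of gbifIds whose records come from different datasets."""
    for a, b in combinations(records, 2):
        if a["datasetKey"] != b["datasetKey"]:
            yield a["gbifId"], b["gbifId"]


records = [
    # The two occurrences from the thread, published in the same dataset.
    {"gbifId": 1936346601, "datasetKey": "dataset-A"},
    {"gbifId": 1936158404, "datasetKey": "dataset-A"},
    # Hypothetical record from another dataset.
    {"gbifId": 1, "datasetKey": "dataset-B"},
]
pairs = list(cross_dataset_pairs(records))
print(pairs)  # [(1936346601, 1), (1936158404, 1)]
```

The cardinality concern follows from simple counting: n mutually similar records produce n(n-1)/2 pairs, so a single dataset with 10,000 such records would already yield 49,995,000 links.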

@abubelinha

abubelinha commented Sep 25, 2022

Ah OK. You mean there are datasets which contain lots of repetitive occurrences, don't you?

I should have figured this out. Thanks
