Improve handling of records declaring absence data #268

Closed
timrobertson100 opened this issue May 6, 2020 · 40 comments

@timrobertson100
Member

timrobertson100 commented May 6, 2020

Some datasets provide evidence of species absences. This is a difficult area to accommodate properly, because modeling effort and confidence are required, but there is a lot we can do to improve the current situation, where consumers carry the burden of interpreting the data shared. In some cases, consumers will not even have enough information to detect absences and will use absence records as presence records.

I propose we introduce the following:

  1. Introduce a search filter for occurrenceStatus in the occurrence search and download API, then expose it on the website. We should review the data to determine whether the current vocabulary is reasonable for the observed use in data. Where individualCount is 0 and occurrenceStatus is NULL, we should set occurrenceStatus = ABSENT and set the flag OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT to true. If occurrenceStatus is otherwise NULL, we set it to PRESENT as a sensible default.
  2. Add a flag for INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS setting it to true when the count is zero but the status declares it exists (could be several values) or when the count is >0 and it is declared as absent.
  3. Add flags for INDIVIDUAL_COUNT_UNPARSABLE and OCCURRENCE_STATUS_UNPARSABLE setting them appropriately when data cannot be parsed.
@ahahn-gbif

Thanks - I also appreciate the conflict flags, as we are bound to have a few false positives caused by database default values and the like.
A consideration on defaults (going back to earlier discussions): should true absence records be an opt-in for data users, i.e. filtered from view by default and activated only on explicit request, similar to coordinates with known errors? I would expect that to be the most user-friendly option, on the assumption that the majority of users would be looking for occurrences, not absences.

@MattBlissett
Member

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

@MattBlissett
Member

It might be useful to have a present or hasPresence or similar filter, in the same way we have hasCoordinate, which summarizes individualCount and occurrenceStatus.

@timrobertson100
Member Author

timrobertson100 commented May 6, 2020

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is important. If we are to refine this for occurrence data only, removing the terms that target checklist use and possibly adding new ones, we need to:

  1. Create a new enumeration in code
  2. Create a new vocabulary XML in rs.gbif.org
  3. Modify the occurrence and event core schemas to reference the new vocabulary

@qgroom

qgroom commented May 6, 2020

We have an actual vocabulary for occurrenceStatus:
http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people that there is an unbounded absence? Otherwise, the record sort of means it is absent everywhere and/or for all time.

@MattBlissett
Member

MattBlissett commented May 6, 2020

There was previous discussion about absence records in these issues:

Those are good points, Quentin. Is there a term for recording abundance? I can't see one. The vast majority of data gives present/absent, but there is some giving abundance.

These are the verbatim values we have for occurrenceStatus with frequency > 1000:

| occurrenceStatus | count |
| --- | --- |
| \N | 1047300464 |
| present | 190140103 |
| Present | 88470094 |
| Présent | 69978620 |
| absent | 10433299 |
| P | 1091847 |
| Q | 774046 |
| Ne Sait Pas | 321113 |
| confirmed breeding | 284635 |
| established | 256481 |
| Presente | 223434 |
| stocked | 215235 |
| unknown | 79623 |
| complet | 75906 |
| presence | 69363 |
| Rare 1-4 | 56083 |
| Presence | 49337 |
| probable breeding | 48533 |
| incomplet | 43683 |
| NA | 41735 |
| possible breeding | 31631 |
| Common 5-19 | 28557 |
| Absent | 26417 |
| Confirmed Present | 24652 |
| Confirmed Breeding | 20403 |
| Abundant 20-99 | 20175 |
| doubtful | 19980 |
| Possibly Breeding | 17800 |
| Común | 17170 |
| Probably Breeding | 16677 |
| 1 | 15406 |
| irregular | 12661 |
| Common | 12554 |
| rare | 11010 |
| Very abundant 100-499 | 7913 |
| Occasional | 7602 |
| Abundant | 7198 |
| Rare (p < 1%) | 6525 |
| collected | 6492 |
| probably breeding | 6402 |
| possibly breeding | 5863 |
| Rare | 5803 |
| Present (1% <= p < 5%) | 4940 |
| Average Cover: 1-5% Maximum Cover: 1-5% | 4127 |
| unclear breeding certaint | 3453 |
| Very very abundant > 500 | 2813 |
| Non observé | 2710 |
| Песня, голос | 2378 |
| Average Cover: 1-5% Maximum Cover: 6-25% | 2015 |
| Common (5% <= p < 10%) | 1960 |
| NT | 1847 |
| Dominant (20% <= p) | 1274 |
| Observed in Breeding Season | 1266 |
| Abundant (10% <= p < 20%) | 1220 |
| Average Cover: 76-95% Maximum Cover: 96-100% | 1135 |
| Reported | 1084 |
| Ausente | 1055 |
| Damaged | 1051 |
| Визуально | 1003 |
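
To illustrate what parsing these verbatim values could involve, here is a minimal Python sketch that folds case and diacritics before a lookup. The function name `parse_occurrence_status` and the two lookup sets are hypothetical; they cover only a handful of values from the list above, and reading "Non observé" as an absence is an assumption that would need review.

```python
import unicodedata
from typing import Optional

# Hypothetical lookups built from a few of the verbatim values listed above.
# Treating "Non observé" as an absence is an assumption, not a confirmed rule.
_PRESENT = {"present", "presente", "presence", "confirmed present"}
_ABSENT = {"absent", "ausente", "non observe"}


def _fold(value: str) -> str:
    """Trim, lower-case, and strip diacritics so 'Présent' matches 'present'."""
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_occurrence_status(verbatim: Optional[str]) -> Optional[str]:
    """Return 'PRESENT' or 'ABSENT', or None if the value is missing or unrecognized."""
    if verbatim is None or verbatim.strip() in {"", r"\N"}:
        return None  # missing value (\N is the NULL marker in the dump above)
    key = _fold(verbatim)
    if key in _PRESENT:
        return "PRESENT"
    if key in _ABSENT:
        return "ABSENT"
    return None  # e.g. abundance classes or breeding codes


print(parse_occurrence_status("Présent"))   # PRESENT
print(parse_occurrence_status("Rare 1-4"))  # None
```

Values such as "Rare 1-4" or "confirmed breeding" deliberately fall through as unrecognized here; whether they should imply presence is exactly the vocabulary question discussed in this thread.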

@timrobertson100
Member Author

It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

Thanks for raising this. If you look at the data you'll also find attempts to convey things like invasive, threatened etc which would be better elsewhere too.

Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people that there is an unbounded absence? Otherwise, the record sort of means it is absent everywhere and/or for all time.

The suggestion to add a flag for UNBOUNDED_ABSENCE seems sensible and pragmatic. I'm mindful that modeling absence can become more complex (e.g. quantifying likelihood of observation) which shouldn't be a restriction to improving usage of presence data.
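
A rough sketch of what an UNBOUNDED_ABSENCE check might look like, assuming an interpreted record is available as a plain mapping. The function name and the choice of fields counted as a spatial bound (`decimalLatitude`, `country`, `locality`) are assumptions for illustration, not an agreed rule.

```python
from typing import Mapping


def _has_value(record: Mapping[str, object], field: str) -> bool:
    """True when the field is present and not an empty string."""
    return record.get(field) not in (None, "")


def is_unbounded_absence(record: Mapping[str, object]) -> bool:
    """True when an ABSENT record lacks a temporal bound or any spatial bound."""
    if record.get("occurrenceStatus") != "ABSENT":
        return False
    has_time = _has_value(record, "eventDate")
    # Assumption: any one of these fields is enough to bound the absence spatially.
    has_place = any(
        _has_value(record, f) for f in ("decimalLatitude", "country", "locality")
    )
    return not (has_time and has_place)


print(is_unbounded_absence({"occurrenceStatus": "ABSENT", "country": "BE"}))  # True
```

The check only says whether some bound exists; it makes no attempt to judge whether the bound is precise enough to be useful.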

Is there a term for recording abundance?

individualCount, organismQuantity and organismQuantityType?

@MortenHofft
Member

It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

@qgroom I understand present and absent, but what do doubtful and excluded mean for an individual occurrence?

@albenson-usgs

Exciting! I hope it will be very clear to users that absence data ARE available. Sounds like it will be but just want to make sure. The P and Q are me (from a time before I was officially in charge of OBIS-USA), I'll make sure to get those corrected.

@peterdesmet
Member

peterdesmet commented May 6, 2020

Completely agree with what @timrobertson100 (how to parse it + flags) and @ahahn-gbif (exclude absences from views by default) suggest. Some notes:

  1. Some datasets provide organismQuantity and not individualCount. Will this be rolled into individualCount before assessment of individualCount = 0?
  2. Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?
  3. To allow differentiation of INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS and OCCURRENCE_STATUS_UNPARSABLE you will probably need to process the most occurring occurrenceStatus values that exist in the wild to either ABSENT, PRESENT (or not able to parse)?

@ahahn-gbif

ahahn-gbif commented May 6, 2020

+1 for 1. - we should indeed look at organismQuantity as well

  1. Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?
  • If the individualCount is not 0, but NULL, occurrenceStatus = ABSENT is plausible
    We also have the opposite case, where
  • individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL
  • If the individualCount is not 0, but an actual (positive) value, and the occurrenceStatus = ABSENT, the flag would indeed make good sense - we will want to resolve that with publishers

@albenson-usgs

  • individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL

For the datasets I work with, this would not be a good assumption to make. Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity so if individualCount = 0 but occurrenceStatus = PRESENT it means something went wrong in the code to create occurrenceStatus.

@ahahn-gbif

ahahn-gbif commented May 6, 2020

Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity

Thanks, good point, I hadn't considered that. In that case, I agree it should get the same conflict flag as @peterdesmet suggested under point 3

@peterdesmet
Member

peterdesmet commented May 6, 2020

I agree, I would also prioritize individualCount over occurrenceStatus. Trying to summarize:

| individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
| --- | --- | --- | --- |
| NULL | NULL | PRESENT | |
| NULL | present* | PRESENT | |
| NULL | absent* | ABSENT | |
| NULL | rubbish | PRESENT | OCCURRENCE_STATUS_UNPARSABLE |
| >0 | NULL | PRESENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
| >0 | present* | PRESENT | |
| >0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |
| >0 | rubbish | PRESENT | OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
| 0 | NULL | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
| 0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |
| 0 | absent* | ABSENT | |
| 0 | rubbish | ABSENT | OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
| rubbish | NULL | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE |
| rubbish | present* | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE |
| rubbish | absent* | ABSENT | INDIVIDUAL_COUNT_UNPARSABLE |
| rubbish | rubbish | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE, OCCURRENCE_STATUS_UNPARSABLE |

*= or similar values
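
The table above can be read as a small decision procedure. The following Python sketch is one way to express it, not the code that went into the pipelines: the function name `interpret`, the two tiny value sets, and the choice to treat negative counts as unparsable are all assumptions made for illustration.

```python
from typing import Optional, Set, Tuple

PRESENT, ABSENT = "PRESENT", "ABSENT"

# Assumed, deliberately tiny vocabularies; real data needs a far larger mapping.
_PRESENT_VALUES = {"present", "presence"}
_ABSENT_VALUES = {"absent"}


def interpret(individual_count: Optional[str],
              occurrence_status: Optional[str]) -> Tuple[str, Set[str]]:
    """Return (interpreted occurrenceStatus, flags) following the table above."""
    flags: Set[str] = set()

    # Parse individualCount; None means "not supplied or not usable".
    count: Optional[int] = None
    if individual_count is not None and individual_count.strip():
        try:
            parsed = int(individual_count.strip())
        except ValueError:
            parsed = -1
        if parsed < 0:  # treating negatives as unparsable is an assumption
            flags.add("INDIVIDUAL_COUNT_UNPARSABLE")
        else:
            count = parsed

    # Parse occurrenceStatus against the assumed vocabularies.
    status: Optional[str] = None
    if occurrence_status is not None and occurrence_status.strip():
        key = occurrence_status.strip().lower()
        if key in _PRESENT_VALUES:
            status = PRESENT
        elif key in _ABSENT_VALUES:
            status = ABSENT
        else:
            flags.add("OCCURRENCE_STATUS_UNPARSABLE")

    if status is not None:
        # A supplied status is kept; disagreement with the count is only flagged.
        if count is not None:
            zero_but_present = count == 0 and status == PRESENT
            positive_but_absent = count > 0 and status == ABSENT
            if zero_but_present or positive_but_absent:
                flags.add("INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS")
        return status, flags

    if count is not None:
        # No usable status: fall back to the count and say so.
        flags.add("OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT")
        return (ABSENT if count == 0 else PRESENT), flags

    return PRESENT, flags  # default when nothing usable was supplied


print(interpret("0", None))
print(interpret("3", "absent"))
```

Each row of the table corresponds to one path through the function, so the table doubles as a checklist of test cases.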

@albenson-usgs

@peterdesmet I'm not understanding why this one would be flagged:

0 | absent* | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

I would think that one shouldn't get a flag since things are all in agreement?

@peterdesmet
Member

@albenson-usgs it's a choice 🤷‍♂️: behind the scenes I would always infer from individualCount if that is available and not rubbish, but you could opt not to indicate it as such (OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT) if everything is in agreement.

@timrobertson100
Member Author

I'm afraid I'd disagree.

I would propose only inferring if required; otherwise, what is the point of the field? This would be similar to how we handle others, e.g. decimalLatitude, decimalLongitude and country, where country is only inferred if it is null or needs to be changed, in which case other information is added explaining why.

Therefore I'd suggest:

| individualCount | occurrenceStatus | interpreted occurrenceStatus | flag |
| --- | --- | --- | --- |
| 0 | absent* | ABSENT | |

@peterdesmet
Member

That's fine by me (have adapted in table above). But we still choose individualCount over occurrenceStatus if those disagree?

@timrobertson100
Member Author

But we still choose individualCount over occurrenceStatus if those disagree?

Do you mean choose how to populate the interpreted occurrenceStatus field? I would suggest:

  • interpret the value supplied
  • infer ABSENT if it is NULL but other fields (individualCount or organismQuantity) imply that, or assume PRESENT (adding a flag to state as much)
  • flag the record if there is conflicting information (e.g. individualCount=666 occurrenceStatus=ABSENT)

@MortenHofft
Member

MortenHofft commented May 11, 2020

But we still choose individualCount over occurrenceStatus if those disagree?

In support of Tim above (I think)

Normally we respect values that are there, but flag them as odd if they are in conflict with other values. E.g. coordinates: [in Paraguay] and country: Brazil would keep both country and coordinates but get an issue flag. If the country was missing it would be filled as Paraguay.

I would argue we do the same for individualCount, organismQuantity and occurrenceStatus. We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.

@peterdesmet
Member

@MortenHofft that makes sense, but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)?

@MortenHofft
Member

MortenHofft commented May 11, 2020

but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)

Yes. Just like we show null island, despite it probably being faulty data. If we consider it particularly critical we can add an extra warning like we do on maps.

[Image: Screenshot 2020-05-11 at 14 41 38]

It isn't that I want to make it difficult for users, I just think we will be in more trouble if we start to rewrite data. In time I'd rather that

  • Publishers fix data with conflicts
  • We update/improve/fix issues with the data validator (for pre publishing reports)
  • We add default values (or custom overwrites) for dataset on a case-by-case basis
  • We allow negations on issue filters
  • Make quality filters/reports a more prominent feature/filter in the UI
  • Allow community annotation/flagging
  • We provide clearer guidance on how the fields are to be used

It is a lot more work though :)

Here and now I like what MattBlissett mentions: adding something similar to has_coordinate + has_geospatial_issue that filters out the cases we consider critical. For the UI that might be the best option? But those are GBIF-specific flags for easy filtering, without changing incoming data.
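
As a sketch of that kind of convenience filter, the snippet below hides absences unless they are requested and can optionally hide records that carry a critical issue, all without touching the records themselves. The function name `filter_records`, its parameters, and the contents of `CRITICAL_ISSUES` are hypothetical; which issues count as critical is a policy choice this thread does not settle.

```python
from typing import Iterable, Iterator, Mapping

# Assumption: which issues count as "critical" is a policy choice, not settled here.
CRITICAL_ISSUES = frozenset({"INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS"})


def filter_records(records: Iterable[Mapping],
                   include_absences: bool = False,
                   include_critical_issues: bool = True) -> Iterator[Mapping]:
    """Yield records for display; the records themselves are never modified."""
    for record in records:
        if not include_absences and record.get("occurrenceStatus") == "ABSENT":
            continue  # absences are opt-in
        if (not include_critical_issues
                and CRITICAL_ISSUES.intersection(record.get("issues", ()))):
            continue  # hide records carrying a critical issue
        yield record


records = [
    {"id": 1, "occurrenceStatus": "PRESENT", "issues": []},
    {"id": 2, "occurrenceStatus": "ABSENT", "issues": []},
    {"id": 3, "occurrenceStatus": "PRESENT",
     "issues": ["INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS"]},
]
print([r["id"] for r in filter_records(records)])  # [1, 3]
```

Keeping the filter separate from interpretation matches the principle above: the stored values stay as published, and only the view changes.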

@mdoering
Member

mdoering commented May 11, 2020

We have an actual vocabulary for occurrenceStatus:
http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

I would argue the current occurrenceStatus vocabulary is more of an abundance vocabulary than a simple boolean. More like ACFOR: https://en.wikipedia.org/wiki/Abundance_(ecology), but including doubtful, absent & excluded.

I would prefer to create a new distribution status vocabulary to be used for species distribution checklists and shrink the existing occurrenceStatus one to be just present and absent, as DwC suggests. It's probably also safer to change the distribution extension to point to a new vocabulary than to change the occurrence core to point to a new one.

@albenson-usgs

albenson-usgs commented May 11, 2020

Quick note just to say that occurrenceStatus is a required term for OBIS and only present or absent are accepted so this falls in line with what's been outlined here.

From the OBIS Manual: occurrenceStatus (required term) is a statement about the presence or absence of a taxon at a location. It is an important term, because it allows us to distinguish between presence and absence records. It is a required term and should be filled in with either present or absent.

@peterdesmet
Member

#268 (comment):

We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.

Ok, that is clearer (even though individualCount might have more reliable information, see #268 (comment)). I have updated my table at #268 (comment) (in italic) to reflect this decision.

@MattBlissett
Member

I've made the relevant changes in the GBIF schema sandbox, I think exactly as @mdoering suggests.

Is that reasonable for everyone?

muttcg added a commit that referenced this issue Jul 15, 2020
@muttcg
Member

muttcg commented Jul 15, 2020

blocked by #325

muttcg added a commit to gbif/gbif-api that referenced this issue Jul 20, 2020
muttcg added a commit to gbif/occurrence that referenced this issue Jul 20, 2020
Added new values for occurrence status search/downloads
muttcg added a commit to gbif/occurrence that referenced this issue Jul 20, 2020
muttcg added a commit to gbif/gbif-api that referenced this issue Jul 20, 2020
muttcg added a commit to gbif/occurrence that referenced this issue Jul 20, 2020
muttcg added a commit to gbif/occurrence that referenced this issue Jul 20, 2020
muttcg added a commit to gbif/occurrence that referenced this issue Jul 21, 2020
muttcg added a commit to gbif/occurrence that referenced this issue Jul 21, 2020
muttcg added a commit to gbif/occurrence that referenced this issue Jul 21, 2020
@timrobertson100
Member Author

In reviewing the code I spotted an error in the table.

| individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
| --- | --- | --- | --- |
| 0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |

Should read

| individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
| --- | --- | --- | --- |
| 0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |

We are not inferring presence from an individualCount of 0, but we do want to raise that there is conflict.

@MattBlissett
Member

I think there's a second, similar error in the table:

| individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
| --- | --- | --- | --- |
| >0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |

Should instead be:

| individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
| --- | --- | --- | --- |
| >0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |

muttcg added a commit that referenced this issue Jul 22, 2020
* #268 Added occurrence status field, interpretation, converter and updated ES schema for it
@peterdesmet
Member

Thanks for noticing, I have updated the table

@muttcg
Member

muttcg commented Sep 3, 2020

API and interpretation in production
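
For readers who want to try the filter, here is a small sketch that only builds a search URL and does not send a request. The base URL and the `occurrenceStatus` parameter name are assumptions based on the public GBIF API rather than anything stated in this thread, and `occurrence_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public occurrence search API; not stated in this thread.
BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(occurrence_status: str, **extra_params: object) -> str:
    """Build a search URL filtered by occurrenceStatus (PRESENT or ABSENT)."""
    params = {"occurrenceStatus": occurrence_status, **extra_params}
    return f"{BASE_URL}?{urlencode(params)}"


print(occurrence_search_url("ABSENT", limit=5))
```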
