query by modified #219

Open
thomasstjerne opened this issue Dec 8, 2020 · 17 comments
Assignees
Labels
documentation, enhancement

Comments

@thomasstjerne

I understand that modified is already in ES.
Please allow searching on it through API v1, like this:
https://api.gbif.org/v1/occurrence/search?modified=2020-12-01,*
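A minimal sketch of how a client might build such a request, assuming the requested `modified` parameter were accepted with the same `from,to` range form (and `*` as an open end) that the URL above uses. The parameter is what this issue asks for, not a confirmed part of the API, and `build_modified_query` is a hypothetical helper name.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def build_modified_query(since: date,
                         until: Optional[date] = None,
                         dataset_key: Optional[str] = None) -> str:
    """Build a search URL using the requested (hypothetical) `modified` range filter."""
    upper = until.isoformat() if until else "*"  # "*" stands for an open upper bound
    params = {"modified": f"{since.isoformat()},{upper}"}
    if dataset_key:
        params["datasetKey"] = dataset_key
    # Keep "," and "*" unescaped so the URL matches the form used in this issue.
    return f"{BASE_URL}?{urlencode(params, safe=',*')}"


print(build_modified_query(date(2020, 12, 1)))
```

The function only builds the URL; it does not call the API, so it says nothing about whether the server honours the filter.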

@timrobertson100
Member

timrobertson100 commented Dec 8, 2020

We need to be very clear what this means.

The intuitive reading of this will be the date the record changed in the GBIF index, which could be due to a source change, a new vocabulary, a new backbone, a bug fix, etc.

If it is intended to be the date the publisher declares that the record changes, I suggest we use a term like dateModifiedAtSource or so to be clear on the semantics.

@MattBlissett
Member

That is a bit awkward, as the field name is already "modified" on the Occurrence response.

@timrobertson100
Member

I suggest we leave that null and introduce a new term. Would one not expect a modified field on an occurrence record to indicate when that record changed, rather than the time an upstream object changed?

@djtfmartin

djtfmartin commented Dec 8, 2020

Just my 2 cents, but I would have expected modified to be when the record was last changed at source.
I would have thought other variants of modified to do with gbif mechanisms would be less interesting to users of the API.

@MattBlissett
Member

I suggest we leave that null and introduce a new term.

It's already not null for millions of records: https://api.gbif.org/v1/occurrence/2831340301

@mdoering
Member

mdoering commented Dec 8, 2020

I agree with @djtfmartin that I would have expected modified on the API to be as close to the source as possible - not influenced by reinterpretation and the like

@MattBlissett
Member

I see both points, but I think it's best to stick with the existing behaviour.

We have "lastInterpreted" to cover (potential) changes due to GBIF's processing, and "modified" for changes from the publisher.

https://gbif.github.io/gbif-api/apidocs/org/gbif/api/model/occurrence/Occurrence.html#getModified--

https://www.gbif.org/developer/occurrence#p_lastInterpreted

@thomasstjerne
Author

And we should have proper documentation in the API docs on the differences between lastInterpreted and modified

@MortenHofft
Member

MortenHofft commented Dec 9, 2020

I think it is wonderful to have all fields exposed in the API, but I wouldn't want to expose it in the UI without many caveats. But that isn't the use case that prompted this anyway.

Intuitively I too would expect it to be either of

  • when it was changed at source (core and/or relevant extensions included)
  • when it was changed after interpretation (the API response has changed)

But it is neither - in most cases it probably isn't even there.

Perfect naming would be wonderful, but documentation on what individual fields mean would probably be needed anyway to fully explain it.

@michaelrangstrup

Since the modified date (from source) is the only way we have to extract from the API just those records that have changed in huge datasets, this feature is still extremely important to us. Currently, if we want to get just the delta records (say 1000 records) for a large dataset, we have no way to do this. The only option we have is to get the entire dataset again (tens of millions of records). Not all datasets may offer this value, but the ones that are important to us do, so we would probably decrease our use of your API drastically if we could get this feature. I wonder if there is any chance this will be supported?

@MortenHofft added the documentation and enhancement labels May 5, 2021
@MortenHofft
Member

MortenHofft commented May 5, 2021

@michaelrangstrup in case it isn't clear from above

modified
The current field modified is under the publisher's control and is part of the Darwin Core standard. It might or might not be used, and it might or might not be updated by the publisher. If it is used, then it is probably correct, just like the remaining fields hopefully are. But it might not be useful for this use case - even if it was exposed in the API as a filter.

lastInterpreted
Then there is the field lastInterpreted, which tells us when this record was last interpreted. That can happen without any change in the source or in the interpreted output.

Non-existing fields that might be useful to introduce
If I understand correctly then something else is needed. Perhaps one of these currently non-existing fields:

  • sourceModified that would indicate when this record was last changed at the source (either its core record or any of its associated extensions I suppose?).
  • interpretationModified that would tell us when the interpreted version had changed (what you would get in a simple download?).
  • recordModified that would indicate that either source or interpretation has changed - I guess this one would reflect whether the DwC-A download has changed for those records.

Example

  • The first time a record is published, sourceModified and interpretationModified are both updated.
  • The publisher provided an illegal elevation and changes that field. But it is once again invalid. The source has changed, but the interpreted record remains the same.
  • The publisher updates the elevation again, and this time it is valid. Both the source and the interpreted record change.
  • GBIF updates its backbone taxonomy, and the classification changes. The source is the same, but the interpreted version has changed.
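The four events above can be walked through with a small sketch. It assumes change is detected by hashing the source record and the interpreted record separately; that mechanism, the `Tracked` class, the `ingest` function and all field values are illustrative assumptions, not how GBIF processes records.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def digest(record: dict) -> str:
    """Stable content hash, independent of key order."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Tracked:
    source_hash: str = ""
    interpreted_hash: str = ""
    source_modified: Optional[str] = None
    interpretation_modified: Optional[str] = None

    @property
    def record_modified(self) -> Optional[str]:
        # Latest of the two timestamps; ISO 8601 strings sort chronologically.
        stamps = [s for s in (self.source_modified, self.interpretation_modified) if s]
        return max(stamps) if stamps else None


def ingest(state: Tracked, source: dict, interpreted: dict, now: str) -> None:
    """Update a timestamp only for the side whose content actually changed."""
    source_hash, interpreted_hash = digest(source), digest(interpreted)
    if source_hash != state.source_hash:
        state.source_hash, state.source_modified = source_hash, now
    if interpreted_hash != state.interpreted_hash:
        state.interpreted_hash, state.interpretation_modified = interpreted_hash, now


# Made-up values, one ingest per event in the list above.
state = Tracked()
src = {"scientificName": "Example name", "elevation": "abc"}
out = {"family": "Family A", "elevation": None}

ingest(state, src, out, "2021-01-01")   # first publication: both change
src = {**src, "elevation": "xyz"}       # new value, still invalid
ingest(state, src, out, "2021-02-01")   # only the source changes
after_step_2 = (state.source_modified, state.interpretation_modified)
src = {**src, "elevation": "120"}       # valid this time
out = {**out, "elevation": 120.0}
ingest(state, src, out, "2021-03-01")   # both change
out = {**out, "family": "Family B"}     # backbone update
ingest(state, src, out, "2021-04-01")   # only the interpretation changes
```

In this sketch recordModified is derived as the later of the two timestamps rather than stored separately.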

@michaelrangstrup

michaelrangstrup commented May 5, 2021

@MortenHofft thanks Morten. I believe I understand the current status.

I do see the modified field in the occurrence object I retrieve via the API and for the main datasets I care about, the value in this field is exactly how it should be.

But as long as I cannot filter by this field in your API, it does not help me much. I need to be able to ask the API for all occurrences for a specific dataset where the sourceModified date is > a filter date I specify. Without this I would have to download all occurrences (40 million+) each time to find out which ones are modified at source after a specific date. This is not viable on a recurring basis.

I initially used the lastInterpreted date as filter, but this date is not useful as it is reset each time you ingest a dataset from a DWCArchive regardless of whether any data of the occurrence has changed.

I need to contact your API weekly in order to find and sync new and modified records within specific datasets to my solution, but currently there is no way for me to filter out only those that are new or modified.

The optimal solution would be a filter like recordModified that reflects both changes in GBIF's interpretation and changes made at the source. It must not be reset on each ingestion (as lastInterpreted is now), because then it is not a true representation of whether the occurrence in question actually has any changes that I need to sync.

@timrobertson100
Member

timrobertson100 commented May 5, 2021

Thanks for elaborating on your use case @michaelrangstrup.

I'm sure much of what I write below is obvious, but I'll voice it here in case it isn't for all readers.

Relying only on modified, where that is the date of change in the upstream publishing system, would result in drift over time unless you also run a periodic sync job. This is because records in GBIF will vary based on the following scenarios (there may be more):

  1. The publisher changes the record
  2. The schema in GBIF is modified (e.g. adding a field, adding an extension in DwC-A)
  3. A bug is fixed to existing interpretation routines
  4. An external reference used in data processing changes, including:
    1. The backbone taxonomy
    2. A vocabulary such as LifeStage
    3. Geospatial references such as GADM.org
    4. Changes in the collections catalogue

Since the GBIF system does not provide the ideal solution of "date record has changed in the index" the burden currently lies with the consumer. Note that only having a search on modified will not detect deleted records. We will give consideration to what we can do, but it's not a super easy fix for us.

Given you are aiming for a weekly synchronisation, and it sounds like 40M records is a burden, one approach to consider would be:

  1. Establish a key-value store that maps gbifID -> {recordHash, verified, modified, deleted}. With only a few tens of millions of records, any database will do here, e.g. Berkeley DB, MapDB, MariaDB, PostgreSQL, etc.
  2. Develop a script to download the data of interest from GBIF periodically
  3. Iterate over the download and, for each record:
    1. Calculate the hash
    2. Compare it with the existing hash for the record
      a. If it has changed, update the verified and modified timestamps and feed the record into your system
      b. Otherwise, update the verified timestamp only
  4. Once finished, scan the lookup and any records where verified is before the start of this process are considered deleted and you should prune them from your system and update the deleted timestamp
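The steps above can be sketched as follows. This is a minimal illustration, not a GBIF tool: a plain dict stands in for the key-value store, the download is any iterable of records carrying a `gbifID`, and `record_hash` and `sync` are hypothetical names. Treating a record that reappears after being marked deleted as changed is an assumption added here.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple


def record_hash(record: dict) -> str:
    """Stable content hash, independent of key order."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def sync(store: Dict[str, dict], download: Iterable[dict],
         run_started: str) -> Tuple[List[dict], List[str]]:
    """One sync pass. Returns (records to feed downstream, gbifIDs to prune)."""
    changed: List[dict] = []
    for record in download:
        key = str(record["gbifID"])
        new_hash = record_hash(record)
        entry = store.get(key)
        # New, changed, or previously deleted (assumption): refresh everything.
        if entry is None or entry["recordHash"] != new_hash or entry["deleted"] is not None:
            store[key] = {"recordHash": new_hash, "verified": run_started,
                          "modified": run_started, "deleted": None}
            changed.append(record)
        else:
            entry["verified"] = run_started  # seen again, content unchanged

    # Anything not verified during this run was absent from the download.
    deleted: List[str] = []
    for key, entry in store.items():
        if entry["deleted"] is None and entry["verified"] < run_started:
            entry["deleted"] = run_started
            deleted.append(key)
    return changed, deleted


# Made-up data: record 2 disappears and record 3 appears in the second week.
store: Dict[str, dict] = {}
week1 = [{"gbifID": 1, "scientificName": "A"}, {"gbifID": 2, "scientificName": "B"}]
week2 = [{"gbifID": 1, "scientificName": "A"}, {"gbifID": 3, "scientificName": "C"}]
changed1, deleted1 = sync(store, week1, "2021-05-05T00:00:00Z")
changed2, deleted2 = sync(store, week2, "2021-05-12T00:00:00Z")
```

Timestamps are ISO 8601 strings in a single time zone so that plain string comparison orders them correctly. A real store would persist between runs, and the full download would still be needed each time.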

Would a process like this be manageable for your situation do you think?

One option could be for GBIF to provide a tool that simplifies this process, or to implement something similar in our own processing.

@ManonGros

ManonGros commented May 5, 2021

I think the idea is that they wanted to avoid re-downloading all the records.

@mdoering
Member

mdoering commented May 5, 2021

It sounds to me as if a new GBIF controlled field modifiedInGBIF would be useful that indicated the time the record has last changed in the GBIF index - for any of the reasons @timrobertson100 listed above. That could drive any sync and also allow users to sort by the latest changes.

@michaelrangstrup

Thanks for the detailed proposal @timrobertson100
I assume your idea is technically feasible, but for now it is outside our financial scope, as our current implementation relies fully on the GBIF API and the filter options it offers. Getting access to filter by modified date was our quick fix for the lastInterpreted date working differently than we expected. I will talk with the project team to find out how to proceed from here.

@timrobertson100
Member

Thanks @michaelrangstrup - please feel free to arrange a call with me if it helps.

a new GBIF controlled field modifiedInGBIF would be useful that indicated the time the record has last changed in the GBIF index ... could drive any sync and also allow users to sort by the latest changes.

Yes, but just to re-emphasize that it is insufficient alone as you need an additional check for what has been removed.
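To illustrate why a change feed alone is not enough, here is a small sketch. It assumes a hypothetical modifiedInGBIF filter that yields the records changed since the last sync (`delta`), plus a separately obtained set of all gbifIDs currently in GBIF (`current_keys`). Neither exists as described; `apply_delta` is an illustrative name and the data is made up.

```python
from typing import Dict, Iterable, Set, Tuple


def apply_delta(local: Dict[str, dict], delta: Iterable[dict],
                current_keys: Set[str]) -> Tuple[int, Set[str]]:
    """Upsert changed records, then drop local records that GBIF no longer has."""
    upserted = 0
    for record in delta:
        local[str(record["gbifID"])] = record
        upserted += 1
    # Removed records never show up in a change feed, so compare key sets.
    removed = set(local) - current_keys
    for key in removed:
        del local[key]
    return upserted, removed


local = {"1": {"gbifID": 1, "v": "a"}, "2": {"gbifID": 2, "v": "b"}, "3": {"gbifID": 3, "v": "c"}}
delta = [{"gbifID": 2, "v": "b2"}]   # only record 2 changed
current_keys = {"1", "2"}            # record 3 was removed upstream
upserted, removed = apply_delta(local, delta, current_keys)
```

Without the key comparison, record 3 would stay in the local copy indefinitely, because nothing in `delta` mentions it.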

9 participants