
Support multi-value search for existing fields #665

Closed · timrobertson100 opened this issue Feb 10, 2022 · 7 comments

@timrobertson100 (Member) commented Feb 10, 2022

The fields listed below are in ES and treated as single-value strings.
Without breaking any public APIs, we can provide better search by treating them as multi-value fields.

For example, consider a record arriving with recordedBy: Morten Hoefft | Tim Robertson.
It is not possible today to search for Tim Robertson and discover this record, along with others having this value. See also #178.
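
For illustration only, here is a minimal sketch (not the actual pipelines code) of how a pipe-delimited verbatim value could be parsed into an ordered multi-value list before indexing; the MultiValueSplitter class name and the delimiter handling are assumptions:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not the actual pipelines implementation.
public class MultiValueSplitter {

  /**
   * Splits a pipe-delimited Darwin Core value, e.g.
   * "Morten Hoefft | Tim Robertson" -> ["Morten Hoefft", "Tim Robertson"],
   * preserving the original order of the values.
   */
  public static List<String> split(String rawValue) {
    if (rawValue == null || rawValue.trim().isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.stream(rawValue.split("\\|"))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(split("Morten Hoefft | Tim Robertson"));
    // prints: [Morten Hoefft, Tim Robertson]
  }
}
```

Indexing each element as a separate value (ES fields accept arrays natively) would then make an exact match on "Tim Robertson" possible without changing the verbatim value.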

Terms that currently support auto-suggest are noted, which may bring additional considerations.

| Term                | Auto-suggest      |
| ------------------- | ----------------- |
| datasetID           |                   |
| datasetName         | yes (to be added) |
| otherCatalogNumbers | yes (to be added) |
| typeStatus          |                   |
| recordedBy          | yes               |
| identifiedBy        | yes               |
| preparations        |                   |
| samplingProtocol    | yes               |

This issue is intended to focus only on fields that already exist in ES, or are already being added in work in progress, and not to propose additional fields.

@MortenHofft (Member)

I believe it would be an appreciated feature if ordering could be retained (when the values are serialized back into a string).

Perhaps related: gbif/portal-feedback#3292 (comment), about the desire to keep ordering in the multimedia array.
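
As a sketch of how ordering could be retained, assuming the values are kept in a List (not a Set) so insertion order survives, serializing back to the original string is a simple join (hypothetical helper, not the actual implementation):

```java
import java.util.List;

// Hypothetical helper, not the actual implementation.
public class MultiValueJoiner {

  /** Joins the ordered values back into the verbatim pipe-delimited form. */
  public static String join(List<String> values) {
    return String.join(" | ", values);
  }

  public static void main(String[] args) {
    List<String> recordedBy = List.of("Morten Hoefft", "Tim Robertson");
    System.out.println(join(recordedBy)); // prints: Morten Hoefft | Tim Robertson
  }
}
```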

muttcg added a commit that referenced this issue Feb 14, 2022
…herCatalogNumbers_preparations

#662 #664 #665 #667 multivalue fields
marcos-lg added a commit that referenced this issue Feb 16, 2022
* adapted ALA pipelines to new fields added to basic record

* run ALA ITs

* Turn on ITs

* Revert "Turn on ITs"

This reverts commit acc264c.

* turn off tests ALA

Co-authored-by: Nikolay Volik <nvolik@gbif.org>
@marcos-lg (Contributor)

Deployed to PROD

@sylvain-morin (Contributor)

Hi,

I'm really wondering why datasetID and datasetName are multi-value. (I just noticed this change since I'm migrating to the new ALA pipelines.)

How can an occurrence be part of two datasets?

Thanks

@timrobertson100 (Member, Author)

> I'm really wondering why datasetID and datasetName are multi-value?

The request came from the GBIF node community, where (if I remember correctly) they use datasetID and datasetName to encode the various projects a record is associated with, and also when aggregating data with multiple origins into a single dataset for GBIF.

@sylvain-morin (Contributor)

Thanks for the explanation!

We need a multi-value projectID field! (There is another, similar topic about projectID in the metadata. :-)

@timrobertson100 (Member, Author)

Yes.

It is indeed projectIDs they're encoding, as you can see in this discussion.

Projects aren't covered by DwC terms, which is why I think they used the dataset terms (I guess assuming the dataset is the one created by a project), but we do have projects and programmes in the GBIF API. Those are used for the projects and programmes the GBIF organisation itself runs (BID, BIFA, etc.), though, so it might become overloaded for us to introduce that.

@dagendresen

I think the issue is more deeply rooted in the data model structure: the simple Darwin Core Occurrence model denormalizes real-world objects, such as a collection specimen or a monitored organism, into the Occurrence view, where in practice they are identified only by the occurrenceID, which ultimately represents a simple DwC data record and not the real-world entities of actual interest.

The "data records" can thus take part in multiple "datasets". There is more than one way to organize the data records into sets, including sets of data records (which represent denormalized real-world entities) for different real-world projects. The more technical dataset model used for publishing these data records to GBIF is not the main concern here, but rather how to group records belonging to different "projects".

One important rationale is to group records produced or updated with different project funding, similar to how the GBIF BID, BIFA, and CESP projects list the datasets produced with that funding. However, we often see project funding for georeferencing or taxonomic validation, together with a desire to "tag" the data records (or, ultimately, the actual real-world collection specimens) that were georeferenced under a specific project funding, in order to credit the funder and track fulfilment of the promise made to the funder, e.g. georeferencing 10,000 collection specimens.
