Questions about an unusable dataset #4522

Open
Mesibov opened this issue Jan 14, 2023 · 4 comments

Comments

@Mesibov

Mesibov commented Jan 14, 2023

In version 1.0 of a recently uploaded Darwin Core archive, every occurrence record in the dataset was unusable, because 10 of the 21 fields in event.txt had been shuffled or deleted by the data publisher:

samplingProtocol entries had been shifted to the samplingEffort field
samplingEffort entries to the year field
eventDate entries to the geodeticDatum field
month entries to the eventDate field
day entries to the month field
year entries to the day field
decimalLongitude entries to the coordinateUncertaintyInMeters field
geodeticDatum entries to the decimalLongitude field (with an incremental fill-down error, "WGS84" to "WGS149")
coordinateUncertaintyInMeters entries to the footprintWKT field
footprintWKT entries deleted
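
The fill-down error noted in the list ("WGS84" to "WGS149") is the kind of pattern a simple script can catch. The following is a minimal sketch, not anything GBIF is known to run: the function name `looks_like_fill_down` and the `min_run` threshold are hypothetical, and the sample values are made up.

```python
import re

# Split a value into a text prefix and a trailing integer, e.g. "WGS84" -> ("WGS", 84)
TRAILING_INT = re.compile(r"^(.*?)(\d+)$")

def looks_like_fill_down(values, min_run=3):
    """Return True if at least `min_run` consecutive values share a prefix
    while their trailing integer rises by exactly 1 each row."""
    run = 1
    prev = None
    for value in values:
        m = TRAILING_INT.match(value)
        cur = (m.group(1), int(m.group(2))) if m else None
        if prev and cur and cur[0] == prev[0] and cur[1] == prev[1] + 1:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1
        prev = cur
    return False

# Made-up column values imitating a spreadsheet fill-down on geodeticDatum
print(looks_like_fill_down(["WGS84", "WGS85", "WGS86", "WGS87"]))  # True
print(looks_like_fill_down(["WGS84", "WGS84", "WGS84", "WGS84"]))  # False
```

A check like this only makes sense for fields expected to hold a constant or controlled value, such as geodeticDatum. Genuinely sequential identifiers would trigger it too.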

GBIF indexed the records and made the dataset publicly available, flagging only "Coordinate invalid", "Recorded date invalid" and "footprintWKT invalid" for every record it assembled from event.txt and occurrence.txt, and also "Taxon match fuzzy" for a minor spelling error in scientificName in occurrence.txt.

Another 6 records in occurrence.txt were ignored by GBIF because eventID in occurrence.txt was not present in event.txt (referential integrity error). This issue was not flagged.
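
A referential integrity check of this kind can be sketched in a few lines. This is only an illustration, assuming tab-delimited files with a header row and `eventID` and `occurrenceID` columns; the function name `orphaned_event_ids` is hypothetical and the sample rows are made up.

```python
import csv
import io

def orphaned_event_ids(event_file, occurrence_file, delimiter="\t"):
    """Return (occurrenceID, eventID) pairs whose eventID is absent from the event file."""
    event_ids = {row["eventID"] for row in csv.DictReader(event_file, delimiter=delimiter)}
    orphans = []
    for row in csv.DictReader(occurrence_file, delimiter=delimiter):
        if row["eventID"] not in event_ids:
            orphans.append((row["occurrenceID"], row["eventID"]))
    return orphans

# Tiny made-up samples standing in for event.txt and occurrence.txt
events = io.StringIO("eventID\teventDate\nE1\t2020-01-05\nE2\t2020-01-06\n")
occurrences = io.StringIO(
    "occurrenceID\teventID\tscientificName\n"
    "O1\tE1\tAus bus\n"
    "O2\tE3\tAus bus\n"
)
print(orphaned_event_ids(events, occurrences))  # [('O2', 'E3')]
```

With real files, the two `io.StringIO` objects would be replaced by `open("event.txt", newline="", encoding="utf-8")` and the equivalent for occurrence.txt.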

I don't want to embarrass the data publisher by linking to the defective dataset, but GBIF staff are welcome to email me for the dataset ID.

My questions are:

(1) How defective does a dataset have to be before GBIF will refuse to make it publicly available?

(2) In a case like this, with every record defective, does GBIF notify the data publisher directly and ask that the dataset be fixed?

@CecSve

CecSve commented Feb 28, 2023

To answer your first question:

Unless the dataset breaks the interpretation, it will not be stopped from becoming publicly available on GBIF.org. We flag the issues we can detect programmatically, but because of limited resources we very rarely review data quality issues manually unless the publisher or users ask us to. We manually check datasets that come in from GBIF-handled publishing grants (for example BID and BIFA projects), but whether all issues are fixed is up to the resources available in the data publishing organization.

Suggestions for new issues and flags to include in data ingestion are always welcome, and we are continuously updating and documenting the issues and flags so that data publishers are aware of how they can increase the quality of their datasets prior to (re-)publishing to GBIF.

To expand a bit on the answer to the second question you got from the publisher:

No, we do not automatically notify the data publisher, other than through the automated warnings the publisher would see in the IPT, if they use an IPT to publish their data. As stated above, we only contact the data publisher directly if extra resources are provided in the context of GBIF-handled publishing grants, or if users or publishers contact us directly through helpdesk and ask for support.

@Mesibov
Author

Mesibov commented Feb 28, 2023

@CecSve, many thanks for your detailed answers. I appreciate that GBIF has limited resources, so the question for GBIF is "Can we sandbox a seriously defective dataset so that it does not become publicly available until the publisher fixes its many mistakes?"

I would have thought the answer to that question is in two parts:
Yes, it is technically possible for GBIF to sandbox such datasets, which would greatly limit the number of datasets that GBIF staff would manually check as time and resources allow. But no, GBIF cannot compel publishers to fix their mistakes, so the dataset would remain in the sandbox and would not become publicly available.

The alternative to sandboxing is to allow seriously defective datasets into the public domain through the GBIF "gateway", and to hope that either the publisher pays attention to GBIF's flags or that a third-party individual or service contacts the publisher (as in this case). This is not very effective quality control, as noted and discussed 10 years ago here.

@debpaul

debpaul commented Mar 8, 2023

Hm, I wonder about doing "data validation" on a dataset before ingestion? Datasets like these, where data has been shifted into the wrong fields, would fail any data validation on some of those fields because the values would not match the expected data types. Is there still a data validation tool (other than the IPT) where a data provider can check their DwC-A file before providing it to GBIF for ingestion? I'm guessing not all data files come from the IPT?

Would we want the IPT to fail to build a DwC-A in the above case? Until the shift is fixed?
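
To make the idea of failing on expected data types concrete, here is a rough sketch. It is not how the IPT or GBIF validates data; the field checks, the deliberately simplified date pattern (plain `YYYY`, `YYYY-MM` or `YYYY-MM-DD` only), and the names `CHECKS`, `failure_rates` and `suspect_fields` are all assumptions for illustration, and the sample rows are invented to imitate the shift described in this issue.

```python
import re

# Simplified date pattern; real eventDate values may also be ranges or include times
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def _is_int_in(lo, hi):
    def check(value):
        try:
            return lo <= int(value) <= hi
        except ValueError:
            return False
    return check

def _is_float_in(lo, hi):
    def check(value):
        try:
            return lo <= float(value) <= hi
        except ValueError:
            return False
    return check

# Assumed expectations for a handful of Darwin Core fields
CHECKS = {
    "year": _is_int_in(1000, 2100),
    "month": _is_int_in(1, 12),
    "day": _is_int_in(1, 31),
    "decimalLongitude": _is_float_in(-180, 180),
    "eventDate": lambda v: bool(ISO_DATE.match(v)),
}

def failure_rates(rows):
    """Per field, the fraction of non-empty values that fail the check."""
    rates = {}
    for field, check in CHECKS.items():
        values = [r[field] for r in rows if r.get(field)]
        if values:
            rates[field] = sum(not check(v) for v in values) / len(values)
    return rates

def suspect_fields(rows):
    """Fields where every non-empty value fails, hinting at a column shift."""
    return sorted(f for f, rate in failure_rates(rows).items() if rate == 1.0)

# Invented rows imitating the shifted columns
shifted = [
    {"year": "2 hours", "month": "17", "day": "2019", "eventDate": "6", "decimalLongitude": "WGS84"},
    {"year": "2 hours", "month": "18", "day": "2019", "eventDate": "6", "decimalLongitude": "WGS85"},
]
print(suspect_fields(shifted))
```

A field that fails for every row is a much stronger signal of a structural problem than scattered failures, so a tool could reasonably refuse to build the archive on the former and only warn on the latter.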

@Mesibov
Author

Mesibov commented Mar 8, 2023

@debpaul, GBIF has an excellent online Data Validator tool here that can be used by data providers. It doesn't flag some classes of data errors, but it should pick up major snafus. I assume it's either the same as, or similar to, the data validation done by GBIF.

GBIF might have stats on how many providers use the Data Validator, but probably does not have stats on how many providers edit their datasets based on a Validator report. The performance (efficacy) of the publicly available Data Validator is probably not measured in this way, so as a quality control mechanism it's more hopeful than proven.

I don't think anything stops a provider from getting a failing Data Validator report and then sharing the failed dataset with GBIF anyway, which returns the problem to GBIF and to what it does after its own validations.
