Adapt all download formats and exports to use the newly added multivalue fields in pipelines #283

marcos-lg · 2022-02-15T12:51:31Z

Issue gbif/pipelines#665 introduced some new interpreted fields and changed typeStatus from a string to an array.

Some of the newly added fields were previously used as strings because they were carried over from the verbatim values. They are now interpreted fields in the basic record.

You can see the changes made to the Avro schemas here.

All the download formats and cloud exports need to be adapted to these changes, either by using arrays or by converting the arrays into strings.

The changes for ES search and the DwC and CSV downloads are here, but they should be reviewed too.
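For text-based formats, the "convert the arrays into strings" option could look roughly like the sketch below. It is only an illustration, not the code used in the occurrence downloads: the helper name `join_multivalue` and the `|` delimiter are assumptions made for this example.

```python
from typing import Optional, Sequence


def join_multivalue(values: Optional[Sequence[str]], delimiter: str = "|") -> Optional[str]:
    """Flatten an array field into a single string for text-based formats.

    Hypothetical helper: the name and the "|" delimiter are assumptions,
    not taken from the GBIF code base.
    """
    if not values:
        return None
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [v.strip() for v in values if v and v.strip()]
    return delimiter.join(cleaned) if cleaned else None


# An interpreted record where multivalue fields are arrays.
record = {"typeStatus": ["HOLOTYPE", "ISOTYPE"], "recordedBy": ["A. Smith"]}

# A row for a CSV-like export where every field must be a plain string.
row = {name: join_multivalue(value) for name, value in record.items()}
```

Formats that support arrays natively (for example Avro) would keep the list as-is and skip this step.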

marcos-lg · 2022-03-03T13:41:36Z

@dshorthouse we are changing some fields to be arrays instead of strings (see above) and some of these fields are included in the bionomia downloads. I changed them to be arrays too, you can see the changes here.

Is this OK with you? You can also test it in UAT if you want. It's not in production yet.

dshorthouse · 2022-03-03T13:53:46Z

Thanks, @marcos-lg. I'm not sure what the implications are here, but it sounds like you have introduced a mechanism to explode a string into an array for recordedBy and identifiedBy, and that these will be expressed as arrays in the Avro exports. Correct? If so, this will be a severely breaking change for the Bionomia download format, which expects these to be verbatim strings, unless the arrays can be concatenated to be precisely the same as what the publisher sent. Instead of making use of these arrays, I'd be far more comfortable using the verbatim fields. Exploding recordedBy or identifiedBy into an array is more complicated than it is for other fields in DwC. See https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/constants.rb#L130.
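The concern about concatenation can be illustrated with a small sketch. The splitter below is deliberately naive and hypothetical; it is neither the pipelines interpretation nor the dwc_agent logic, and the sample value is made up. It only shows that splitting and re-joining does not generally reproduce the publisher's original string.

```python
import re
from typing import List


def naive_split(verbatim: str) -> List[str]:
    # Hypothetical splitter: breaks on "|", ";" or "&" and trims whitespace.
    return [part.strip() for part in re.split(r"[|;&]", verbatim) if part.strip()]


def rejoin(values: List[str], delimiter: str = "|") -> str:
    # Concatenate the array back into one string with a single delimiter.
    return delimiter.join(values)


# Made-up verbatim value that mixes two separators.
verbatim = "Smith, J. & Jones, A.; Brown"

parts = naive_split(verbatim)
round_trip = rejoin(parts)

# The original separators and spacing are lost, so the strings differ.
lossless = round_trip == verbatim
```

Here `lossless` ends up `False`, which is why a consumer that needs the exact publisher string has to read the verbatim field rather than a re-joined array.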

marcos-lg · 2022-03-03T14:00:35Z

Yes, @dshorthouse. We now interpret those fields and have converted them into arrays because they sometimes contain more than one value. This lets us improve search in our portal and in downloads.

But that's OK, I'll change the Bionomia download to use the verbatim fields for recordedBy and identifiedBy. This way you shouldn't notice any difference.

dshorthouse · 2022-03-03T14:02:45Z

I just took a closer look at how @MattBlissett had made the queries at https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L89 and it looks like he uses v_recordedBy and v_identifiedBy (the verbatim equivalents), so I'm not sure the above changes will affect the Bionomia download at all.

marcos-lg · 2022-03-03T14:06:36Z

Right. Then we just need to remove the recordedBy and identifiedBy and leave the verbatim ones only. Until now the verbatim and the interpreted fields were the same so it seems that they were redundant.
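As a rough sketch of that clean-up, assuming a row is represented as a plain dictionary (the real download is produced by a Hive query, so this is only an illustration, and the function name is hypothetical):

```python
from typing import Any, Dict

# Interpreted agent fields that become arrays; their verbatim twins stay strings.
INTERPRETED_AGENT_FIELDS = {"recordedBy", "identifiedBy"}


def keep_verbatim_agents(row: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the interpreted agent columns and leave v_recordedBy / v_identifiedBy."""
    return {name: value for name, value in row.items()
            if name not in INTERPRETED_AGENT_FIELDS}


# Made-up example row containing both interpreted and verbatim columns.
row = {
    "gbifID": 1,
    "recordedBy": ["Smith, J.", "Jones, A."],
    "v_recordedBy": "Smith, J. & Jones, A.",
    "identifiedBy": ["Brown"],
    "v_identifiedBy": "Brown",
}

bionomia_row = keep_verbatim_agents(row)
```

After this step only the verbatim strings remain for the two agent fields, so the consumer sees the same values as before the change.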

dshorthouse · 2022-03-03T14:10:54Z

Aha - I drop those two columns in the spark queries at my end and use v_recordedBy and v_identifiedBy anyway so it's unlikely that your changes above will matter to the processing of the Bionomia download format.

That said, we might one day work on an Elasticsearch plugin to properly contend with material in recordedBy or identifiedBy.

marcos-lg · 2022-04-28T13:10:12Z

All the download formats have been adapted and are in PROD now.

marcos-lg assigned timrobertson100 and MattBlissett Feb 15, 2022

marcos-lg self-assigned this Mar 3, 2022

MattBlissett closed this as completed May 27, 2022
