Problem: Dataverse tabular derivatives no longer correctly represented in Archivematica METS due to hardcoding of format names in convert_dataverse_structure.py #1057

@gehurley

Description

Expected behaviour

As part of its ingest workflow, Dataverse produces derivative versions of submitted original tabular data files in TSV (Tab Separated Values text format, referred to as “tab” files from here on out). For example, if someone uploads an Excel file, Dataverse will attempt to create a derivative in tab format on ingest. The tab file is presented to users for download via the Dataverse interface and is also used for Dataverse’s data visualization and analysis functions. Tab files are delivered as part of the package of a Dataverse dataset’s files to Archivematica when a Dataverse transfer is initiated. As part of the Dataverse transfer process, Archivematica generates a Dataverse METS.xml file from information contained in the dataset’s dataset.json file that describes the contents of the dataset as received from Dataverse, including describing the derivative files. These details eventually end up in the final Archivematica METS to document the provenance of the derivatives and their relationships with the originally-submitted files.

The main expected behaviour for tab files (and any other Dataverse-generated derivatives) when it comes to their representation in an Archivematica METS file is:

  • A METS fileGrp section labeled "derivative," which lists all Dataverse-created derivative files. These files are associated with their originals (listed in the ‘original’ section of the fileGrp) and linked via a shared group ID.
  • PREMIS events labelled "derivation" recording the creation of the derivative files, with the Dataverse instance linked as an agent
  • premis:relationship entries under premis:object for the relevant derivatives, linking the originals and derivatives and the associated PREMIS events. For example, an original Excel file is listed under premis:relationshipSubType as "is source of" its derivative tabular file; the tabular file’s relationshipSubType would link the Excel file as "has source."
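To make the expected pairing concrete, here is a minimal sketch of the two relationship records described above. The `Relationship` class, the `derivation_relationships` helper, and the file names are hypothetical and only illustrate the data; this is not how Archivematica builds its METS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Relationship:
    subject: str   # file whose premis:object carries the relationship
    sub_type: str  # value expected in premis:relationshipSubType
    related: str   # file on the other end of the relationship
    event: str     # type of the linked PREMIS event


def derivation_relationships(original: str, derivative: str) -> List[Relationship]:
    """Return the pair of records expected for one original/derivative pair."""
    return [
        Relationship(original, "is source of", derivative, "derivation"),
        Relationship(derivative, "has source", original, "derivation"),
    ]


# Hypothetical file names: an uploaded Excel file and its Dataverse tab derivative.
rels = derivation_relationships("survey.xlsx", "survey.tab")
for r in rels:
    print(f"{r.subject}: {r.sub_type} {r.related} (event: {r.event})")
```

Both records point at the same derivation event, which is what lets a reader of the METS walk from the original to the derivative and back.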

Current behaviour

Currently, derivatives in tab format are not recorded as such in the Archivematica METS file with one exception: when the original source of a derivative is a CSV file. Aside from this exception:

  • Tab files appear under ‘original’ with their own group ID in the METS fileGrp
  • There are no derivation events and relationship links recorded

In the Dataverse METS, the original file is also not listed in the fileSec at all (neither under "original" nor under "derivative"). The tab file is listed twice in the structMap, and the original is not listed.

Note as above that transfers function as expected with respect to representing tab derivative files in METS when the original is a CSV file.
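One rough way to spot this behaviour in a finished METS file is to look for .tab files sitting in the "original" fileGrp. The sketch below uses only the Python standard library; the sample XML and file paths are made up for illustration, and a hit is only a hint, since a .tab file could in principle be a genuine original.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def tab_files_listed_as_original(mets_xml: str) -> list:
    """Return hrefs of .tab files found in any fileGrp with USE="original"."""
    root = ET.fromstring(mets_xml)
    hits = []
    for grp in root.iter(f"{{{METS_NS}}}fileGrp"):
        if grp.get("USE") != "original":
            continue
        for flocat in grp.iter(f"{{{METS_NS}}}FLocat"):
            href = flocat.get(XLINK_HREF, "")
            if href.lower().endswith(".tab"):
                hits.append(href)
    return hits


# Made-up METS fragment mimicking the reported behaviour: the tab file
# appears under "original" with its own group ID.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="f1" GROUPID="g1">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="objects/survey.sav"/>
      </mets:file>
      <mets:file ID="f2" GROUPID="g2">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="objects/survey.tab"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

print(tab_files_listed_as_original(SAMPLE))
```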

The expected behaviour for tab files was present in Archivematica 1.9.2 paired with version 4.10 of Dataverse. In October 2019, we updated our Dataverse instances to v. 4.17.5. In comparing a prior successful transfer from January 2019 and the current behaviour, there are several key differences noted:

  • One difference present in dataset.json is that Dataverse now represents the names of file formats slightly differently in the field originalFormatLabel. For example, the prior version of the transfer records the original SPSS file as "SPSS SAV" while the newer version records it as "SPSS Binary."
  • There have also been some added lines, such as "fileAccessRequest" in the header and "creationDate" (for each file). However, these lines do not seem to impact the transfer since a CSV-based transfer still functions correctly.

It looks like the source of the issue is that format names are hardcoded into convert_dataverse_structure.py and either need to be altered to match the new names or be more hospitable to any changes to the way Dataverse generates the names for dataset.json.

One potential route would be to use the originalFileFormat field instead, which might be more stable since it appears to hold the MIME type of the original file. It is unclear where the format names in originalFormatLabel are derived from; I don’t see any pull requests in Dataverse referencing the change. However, we can reach out to the Dataverse folks if required to retrieve an updated list.
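As a rough illustration of the difference between the two approaches, the sketch below compares a label-based lookup with a check on originalFileFormat. Everything here is an assumption made for the example: the `HARDCODED_LABELS` set only stands in for whatever convert_dataverse_structure.py actually hardcodes, the JSON fragment is a simplified guess at the shape of dataset.json, and the MIME type value is illustrative.

```python
import json

# Simplified, made-up fragment of dataset.json (real files are nested deeper
# and carry many more fields).
DATASET_JSON_FRAGMENT = """
{
  "files": [
    {"dataFile": {"filename": "survey.tab",
                  "contentType": "text/tab-separated-values",
                  "originalFileFormat": "application/x-spss-sav",
                  "originalFormatLabel": "SPSS Binary"}},
    {"dataFile": {"filename": "readme.txt",
                  "contentType": "text/plain"}}
  ]
}
"""

# Hypothetical stand-in for a hardcoded list of known format labels.
HARDCODED_LABELS = {"SPSS SAV"}


def derivatives_by_label(files):
    """Label-based detection: breaks when Dataverse renames a label."""
    return [f["dataFile"]["filename"] for f in files
            if f["dataFile"].get("originalFormatLabel") in HARDCODED_LABELS]


def derivatives_by_mime(files):
    """Treat any file that records an original format as a derivative."""
    return [f["dataFile"]["filename"] for f in files
            if f["dataFile"].get("originalFileFormat")]


files = json.loads(DATASET_JSON_FRAGMENT)["files"]
print(derivatives_by_label(files))  # [] because "SPSS Binary" is not in the set
print(derivatives_by_mime(files))   # ['survey.tab']
```

Checking only for the presence of originalFileFormat avoids maintaining any list of names, though it would still need confirming against real dataset.json output from the Dataverse versions in question.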

Steps to reproduce

Your environment (version of Archivematica, operating system, other relevant details)

Tested in Archivematica 1.9.2 and 1.10.1 with Dataverse v. 4.17.5

Many thanks!

For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged (if applicable)
  • Details about this issue have been added to the release notes (if applicable)

Metadata

    Labels

    • OCUL: AM-Dataverse
    • Status: refining (the issue needs additional details to ensure that requirements are clear)
    • Type: regression (a bug that is a regression of supported functionality from previous releases)
    • Ⓜ️ mets/premis (METS/PREMIS issues)
