Problem: mets-reader-writer (metsrw) cannot handle comments in third-party XML #1277

Closed
5 tasks
ross-spencer opened this issue Aug 9, 2020 · 1 comment
Labels
Ⓜ️ mets/premis METS/PREMIS issues Picturae Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result.

@ross-spencer
Contributor

Expected behaviour

Given a METS file produced by Archivematica, mets-reader-writer (metsrw) should be able to consume that file in its entirety so that the library can be used externally for any intended purpose.

Current behaviour

Given a METS file produced by Archivematica, if the external tool output in the objectCharacteristicsExtension contains XML comments, then metsrw raises an AttributeError.

Traceback (most recent call last):
  File "mets.py", line 131, in <module>
    main()
  File "mets.py", line 102, in main
    for premis_object in aipFile.get_premis_objects():
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/fsentry.py", line 499, in get_premis_objects
    self.PREMIS_OBJECT, self.premis_object_class
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/fsentry.py", line 491, in get_subsections_of_type
    for ss in self.amdsecs[0].subsections
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/fsentry.py", line 492, in <listcomp>
    if ss.contents.mdtype == mdtype
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 250, in fromtree
    return cls(data=premis_to_data(tree))
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 728, in premis_to_data
    return _lxml_el_to_data(premis_lxml_el, "premis", nsmap)
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 687, in _lxml_el_to_data
    ret.append(_lxml_el_to_data(sub_el, ns, nsmap, snake=snake))
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 687, in _lxml_el_to_data
    ret.append(_lxml_el_to_data(sub_el, ns, nsmap, snake=snake))
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 687, in _lxml_el_to_data
    ret.append(_lxml_el_to_data(sub_el, ns, nsmap, snake=snake))
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 679, in _lxml_el_to_data
    tag_name = _to_colon_ns(lxml_el.tag, default_ns=ns, nsmap=nsmap)
  File "/home/ross-spencer/Desktop/Artefactual/Clients/Wellcome/mets/venv/lib/python3.6/site-packages/metsrw/plugins/premisrw/premis.py", line 647, in _to_colon_ns
    parts = [x.strip("{") for x in bracket_ns.split("}")]
AttributeError: 'cython_function_or_method' object has no attribute 'split'
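
The last frame suggests that the `tag` of a comment node is a function rather than a string, which is why `.split` fails. A minimal sketch of that behaviour follows. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example needs no third-party install; lxml similarly uses its `etree.Comment` factory as the tag of comment nodes). The helper names `parse_keeping_comments` and `tag_kinds` are hypothetical, and Python 3.8+ is assumed for `insert_comments`.

```python
import xml.etree.ElementTree as ET

XML = """<ext>
  <!-- This is output from a third-party tool -->
  <foo><bar>foobar</bar></foo>
</ext>"""


def parse_keeping_comments(text):
    """Parse XML while keeping comment nodes in the tree (Python 3.8+)."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(text, parser=parser)


def tag_kinds(root):
    """Return (kind, tag-is-a-string) for each direct child of root."""
    kinds = []
    for child in root:
        # Comment nodes use the ET.Comment factory function as their tag.
        is_comment = child.tag is ET.Comment
        kinds.append(
            ("comment" if is_comment else "element", isinstance(child.tag, str))
        )
    return kinds


root = parse_keeping_comments(XML)
print(tag_kinds(root))
```

Any code that treats every `tag` as a namespaced string, as `_to_colon_ns` appears to do in the traceback, will fail on the first entry.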

Steps to reproduce

Using a default Archivematica installation, create a transfer from the Disk Image sample transfer and download the METS file. The tool output should come from fiwalk, I believe, and it will contain a number of <!-- XML comments -->.

Next, create a sample metsrw script.

Something like:

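import metsrw

# Placeholder path: point this at the METS file downloaded from Archivematica.
mets_path = "METS.xml"
mets = metsrw.METSDocument.fromfile(mets_path)
for aip_file in mets.all_files():
    # The AttributeError is raised while the PREMIS objects are parsed.
    for premis_object in aip_file.get_premis_objects():
        pass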

Et voilà! Running the script should produce the AttributeError shown above.

Example cut-down METS

<?xml version='1.0' encoding='UTF-8'?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version1121/mets.xsd">
  <mets:metsHdr CREATEDATE="2020-08-09T18:48:32"/>
  <mets:dmdSec ID="dmdSec_1">
    <mets:mdWrap MDTYPE="PREMIS:OBJECT">
      <mets:xmlData>
        <premis:object xmlns:premis="http://www.loc.gov/premis/v3" xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
          <premis:objectIdentifier>
            <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
            <premis:objectIdentifierValue>f9eb51e2-603b-40ac-b444-23a314fb9b80</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:originalName>tiny_mets-f9eb51e2-603b-40ac-b444-23a314fb9b80</premis:originalName>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:amdSec ID="amdSec_1">
    <mets:techMD ID="techMD_1">
      <mets:mdWrap MDTYPE="PREMIS:OBJECT">
        <mets:xmlData>
          <premis:object xmlns:premis="http://www.loc.gov/premis/v3" xsi:type="premis:file" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
            <premis:objectIdentifier>
              <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
              <premis:objectIdentifierValue>102c60d9-55c6-42d7-94dd-bc67aa9cde10</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:objectCharacteristics>
              <premis:compositionLevel>0</premis:compositionLevel>
              <premis:fixity>
                <premis:messageDigestAlgorithm>sha256</premis:messageDigestAlgorithm>
                <premis:messageDigest>637f6e5a93b50765196411fd8b0c816901f6d4eba5b8c8f41c36b17a9729f295</premis:messageDigest>
              </premis:fixity>
              <premis:size>11</premis:size>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>Unknown</premis:formatName>
                </premis:formatDesignation>
              </premis:format>
              <premis:creatingApplication>
                <premis:dateCreatedByApplication>2020-08-09</premis:dateCreatedByApplication>
              </premis:creatingApplication>
              <premis:objectCharacteristicsExtension>
                <!-- This is output from a third-party tool -->
                <foo>
                  <bar>foobar</bar>
                </foo>
              </premis:objectCharacteristicsExtension>
            </premis:objectCharacteristics>
            <premis:originalName>%transferDirectory%objects/helloworld.txt</premis:originalName>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
  </mets:amdSec>
  <mets:amdSec ID="amdSec_2">
    <mets:techMD ID="techMD_2">
      <mets:mdWrap MDTYPE="PREMIS:OBJECT">
        <mets:xmlData>
          <premis:object xmlns:premis="http://www.loc.gov/premis/v3" xsi:type="premis:file" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
            <premis:objectIdentifier>
              <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
              <premis:objectIdentifierValue>bb94253e-0d62-4c9e-a267-2703ee1b9a41</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:objectCharacteristics>
              <premis:compositionLevel>0</premis:compositionLevel>
              <premis:fixity>
                <premis:messageDigestAlgorithm>sha256</premis:messageDigestAlgorithm>
                <premis:messageDigest>4be8389f3bd491438b43480c4efcb7a226d15d640382e4edbb9dc1841125e32f</premis:messageDigest>
              </premis:fixity>
              <premis:size>10467</premis:size>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>XML</premis:formatName>
                  <premis:formatVersion>1.0</premis:formatVersion>
                </premis:formatDesignation>
                <premis:formatRegistry>
                  <premis:formatRegistryName>PRONOM</premis:formatRegistryName>
                  <premis:formatRegistryKey>fmt/101</premis:formatRegistryKey>
                </premis:formatRegistry>
              </premis:format>
              <premis:creatingApplication>
                <premis:dateCreatedByApplication>2020-08-09</premis:dateCreatedByApplication>
              </premis:creatingApplication>
            </premis:objectCharacteristics>
            <premis:originalName>%SIPDirectory%objects/submissionDocumentation/transfer-tiny_mets-37b9bc03-4cdb-4a95-abc2-21f61ea16c98/METS.xml</premis:originalName>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file GROUPID="Group-102c60d9-55c6-42d7-94dd-bc67aa9cde10" ID="file-102c60d9-55c6-42d7-94dd-bc67aa9cde10" ADMID="amdSec_1">
        <mets:FLocat xlink:href="objects/helloworld.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="submissionDocumentation">
      <mets:file GROUPID="Group-bb94253e-0d62-4c9e-a267-2703ee1b9a41" ID="file-bb94253e-0d62-4c9e-a267-2703ee1b9a41" ADMID="amdSec_2">
        <mets:FLocat xlink:href="objects/submissionDocumentation/transfer-tiny_mets-37b9bc03-4cdb-4a95-abc2-21f61ea16c98/METS.xml" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">
    <mets:div LABEL="tiny_mets-f9eb51e2-603b-40ac-b444-23a314fb9b80" TYPE="Directory" DMDID="dmdSec_1">
      <mets:div LABEL="objects" TYPE="Directory">
        <mets:div LABEL="helloworld.txt" TYPE="Item">
          <mets:fptr FILEID="file-102c60d9-55c6-42d7-94dd-bc67aa9cde10"/>
        </mets:div>
        <mets:div LABEL="submissionDocumentation" TYPE="Directory">
          <mets:div LABEL="transfer-tiny_mets-37b9bc03-4cdb-4a95-abc2-21f61ea16c98" TYPE="Directory">
            <mets:div LABEL="METS.xml" TYPE="Item">
              <mets:fptr FILEID="file-bb94253e-0d62-4c9e-a267-2703ee1b9a41"/>
            </mets:div>
          </mets:div>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

Your environment (version of Archivematica, operating system, other relevant details)

mets-reader-writer 0.3.15.

Additional info

A Stack Overflow answer describing one way to avoid this: https://stackoverflow.com/a/18313932
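
One possible shape for a fix, sketched under assumptions: skip any child whose `tag` is not a string while walking the tree. This is not taken from the linked answer or from the actual metsrw change; `el_to_data` is a hypothetical helper and not metsrw's `_lxml_el_to_data`. The standard library's `xml.etree.ElementTree` again stands in for lxml. With lxml itself, another option would be to build the parser with `etree.XMLParser(remove_comments=True)` so that comments never enter the tree.

```python
import xml.etree.ElementTree as ET


def el_to_data(el):
    """Convert an element to a nested tuple, skipping comments and PIs.

    Hypothetical helper for illustration; not metsrw's _lxml_el_to_data.
    """
    ret = [el.tag]
    for child in el:
        if not isinstance(child.tag, str):
            # Comments and processing instructions carry a factory
            # function as their tag, so there is no name to convert.
            continue
        ret.append(el_to_data(child))
    # Leaf elements keep their stripped text content.
    if len(ret) == 1 and el.text and el.text.strip():
        ret.append(el.text.strip())
    return tuple(ret)


# Keep comments in the tree so the skip logic is actually exercised.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(
    "<ext><!-- third-party comment --><foo><bar>foobar</bar></foo></ext>",
    parser=parser,
)
print(el_to_data(root))
```

Skipping at traversal time leaves the original document untouched, whereas stripping comments in the parser changes what would be written back out; which trade-off suits metsrw is a design decision for the maintainers.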


For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged (if applicable)
  • Details about this issue have been added to the release notes (if applicable)
@ross-spencer ross-spencer added the Ⓜ️ mets/premis METS/PREMIS issues label Aug 9, 2020
@ross-spencer ross-spencer changed the title Problem: Mets-reader-writer (metsrw) cannot handle comments in third-party XML Problem: mets-reader-writer (metsrw) cannot handle comments in third-party XML Aug 9, 2020
ross-spencer added a commit to artefactual-labs/AIPscan that referenced this issue Aug 10, 2020
ross-spencer added a commit to artefactual-labs/AIPscan that referenced this issue Aug 17, 2020
@sromkey sromkey added Status: refining The issue needs additional details to ensure that requirements are clear. Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result. labels Aug 25, 2020
@replaceafill replaceafill added this to the 1.14.0 milestone Apr 18, 2023
@replaceafill replaceafill added Status: review The issue's code has been merged and is ready for testing/review. and removed Status: refining The issue needs additional details to ensure that requirements are clear. labels Apr 24, 2023
@replaceafill
Member

Verified this by first installing metsrw==0.3.22 in a Python 3.10 virtual environment and running the snippet against the METS file provided in the issue, which raised the AttributeError. After updating to metsrw==0.3.23, the snippet succeeds.

@replaceafill replaceafill removed the Status: review The issue's code has been merged and is ready for testing/review. label May 24, 2023