Manuel Arturo Izquierdo edited this page Jun 13, 2016 · 1 revision

UTF-8 encoding issue in EULFedora DigitalObject fields

The following issue occurs with EULFedora 1.5.2 (master branch), running on Linux under Python 3. During a data migration task, we found that EULFedora fails when non-ASCII characters are used in DigitalObject fields, for instance obj.label or the members of the obj.dc.content Dublin Core metadata. For us this is a serious issue, as most of our collection contains metadata with accented characters, already encoded in UTF-8.

The following example code reproduces the problem:

from eulfedora.models import DigitalObject, FileDatastream
from eulfedora.server import Repository

# A simple Object Model
class FCModel(DigitalObject):
    FILE_CONTENT_MODEL = 'info:fedora/genrepo:File-1.0'
    CONTENT_MODELS = [ FILE_CONTENT_MODEL ]
    
    a_datastream = FileDatastream("A_DATASTREAM", "An example datastream", defaults={
            'versionable': True,
    })

# Open the Fedora repo
repository = Repository(
    "http://fedora_server:8080/fedora/",
    "fcAdmin",
    "*******")

# creates an object
obj = repository.get_object(type=FCModel)

# Inserts a UTF-8 encoded string in the object label
obj.label = 'é'

obj.pid = 'ppp:123-123'
obj.save()

When run, this code produces the following error message:

Traceback (most recent call last):
  File "problem.py", line 30, in <module>
    obj.save()
  File "/home/aizquier/problem_example/eulfedora/eulfedora/models.py", line 1534, in save
    self._ingest(logMessage)
  File "/home/aizquier/problem_example/eulfedora/eulfedora/models.py", line 1613, in _ingest
    r = self.api.ingest(foxml.decode('utf-8'), logMessage)
  File "/home/aizquier/problem_example/eulfedora/eulfedora/api.py", line 572, in ingest
    return self.post(url, data=text, params=http_args, headers=headers)
  File "/home/aizquier/problem_example/eulfedora/eulfedora/api.py", line 146, in post
    return self._make_request(self.session.post, *args, **kwargs)
  File "/home/aizquier/problem_example/eulfedora/eulfedora/api.py", line 133, in _make_request
    raise RequestFailed(response)
eulfedora.util.RequestFailed: 400 <?xml version="1.0" encoding="UTF-8"?><management:validation  xmlns:management="http://www.fedora.info/definitions/1/0/management/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.fedora.info/definitions/1/0/management/ http://www.fedora.info/definitions/1/0/validation.xsd" pid="unknown"  valid="true">
  <management:contentModels>
  </management:contentModels>
  <management:problems>
    <management:problem>Schematron validation failed:org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 277; Invalid byte 2 of 3-byte UTF-8 sequence.</management:problem>
  </management:problems>
  <management:datastreamProblems>
  </management:datastreamProblems>
</management:validation>

This is EULFedora's reaction to an HTTP 400 error returned by Fedora Commons. Notice that the error message sent by Fedora indicates that it threw an exception caused by malformed UTF-8 input:

<management:problems>
    <management:problem>Schematron validation failed:org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 277; Invalid byte 2 of 3-byte UTF-8 sequence.</management:problem>
</management:problems>

This suggests that EULFedora is sending an invalid UTF-8 stream to Fedora. Indeed, when the traffic between EULFedora and Fedora is analyzed, the character 'é' in the label field turns out to be badly encoded:

$ less FOXML_dump.xml

POST /fedora/objects/new HTTP/1.1
Host: xxx.xxx.xxx.xxx:8080
Accept-Encoding: identity
verify: True
User-Agent: eulfedora/1.5.2 CPython/3.5.1 Linux/3.19.0-32-generic
Content-Type: text/xml
Content-Length: 1332
Authorization: Basic JhQWRtaW4xdxddxd=

...

<foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="<E9>"/>

...

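The value 0xE9 is the ISO-8859-1 (Latin-1) encoding of é; in UTF-8 the character should instead be encoded as the two bytes 0xC3 0xA9. Read as UTF-8, 0xE9 is the lead byte of a three-byte sequence, and the closing quote that follows it is not a valid continuation byte, which is exactly what Fedora's "Invalid byte 2 of 3-byte UTF-8 sequence" message reports.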
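
The byte-level difference can be checked in a Python 3 shell. The snippet below is a minimal sketch using only the standard library; the `fragment` value is a shortened, hypothetical stand-in for the FOXML attribute shown in the dump above.

```python
label = "\u00e9"  # the character é

# The same character produces different bytes under each encoding.
latin1 = label.encode("iso-8859-1")
utf8 = label.encode("utf-8")
print(latin1.hex(), utf8.hex())  # prints: e9 c3a9

# A Latin-1 encoded value embedded in a document declared as UTF-8
# cannot be decoded: 0xE9 announces a 3-byte sequence, but the next
# byte is the closing quote rather than a continuation byte.
fragment = b'VALUE="' + latin1 + b'"/>'
try:
    fragment.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("not valid UTF-8:", err.reason)
```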

Why does this value turn into Latin-1 when it is set as UTF-8 in the source code? Tracing the flow of information inside EULFedora, the strings are UTF-8 all the time, but as soon as the information is passed to the Requests communications library (which handles the connection to Fedora), the encoding changes to ISO-8859-1. According to Requests' documentation, this is not a bug but a feature, since ISO-8859-1 is the default encoding of the HTTP protocol when no other encoding is indicated. After some trial and error, we figured out that when EULFedora passes object field values to Requests as string (str) variables, even if they hold UTF-8 text, Requests re-encodes them as ISO-8859-1, which clashes with Fedora Commons.
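
The effect can be reproduced without a Fedora server. The sketch below uses the standard library's `http.client`, on the assumption that it is the layer that finally serializes the request body in the setup described here; other versions of the HTTP stack may behave differently. `CapturingSocket` and `wire_payload` are hypothetical helpers written for this illustration: the fake socket only records what would have been sent, so no network access is needed.

```python
import http.client


class CapturingSocket:
    """Hypothetical stand-in for a real socket: records the bytes sent."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        self.sent += bytes(data)


def wire_payload(body):
    """Return the body bytes http.client would put on the wire for `body`."""
    conn = http.client.HTTPConnection("fedora.invalid")  # never contacted
    conn.sock = CapturingSocket()  # skip connect(); capture the output instead
    conn.request("POST", "/fedora/objects/new", body=body,
                 headers={"Content-Type": "text/xml"})
    # Everything after the blank line that ends the headers is the body.
    _headers, _sep, payload = conn.sock.sent.partition(b"\r\n\r\n")
    return payload


as_str = wire_payload("\u00e9")                    # str body
as_bytes = wire_payload("\u00e9".encode("utf-8"))  # bytes body
print(as_str.hex(), as_bytes.hex())
```

With a str body the library chooses the encoding itself (ISO-8859-1), whereas a bytes body is sent exactly as given.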

Under Python 3, a solution is to change the type of the field value variables from str to bytes:

text = text.encode('utf-8')  # str.encode() already returns bytes

This prevents Requests from performing any transformation on the data, and the resulting XML is accepted by Fedora.
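
If the conversion is needed in more than one place, it could be wrapped in a small helper that is applied just before the payload is handed to Requests. This is only a sketch: `ensure_utf8_bytes` is a hypothetical name, not part of EULFedora or Requests, and the `foxml` value is a shortened stand-in for a real FOXML document.

```python
def ensure_utf8_bytes(text):
    """Return `text` as UTF-8 bytes; bytes input is returned unchanged."""
    if isinstance(text, str):
        return text.encode("utf-8")
    return text


# Shortened, hypothetical FOXML fragment with a non-ASCII label.
foxml = ('<foxml:property NAME="info:fedora/fedora-system:def/model#label" '
         'VALUE="\u00e9"/>')
payload = ensure_utf8_bytes(foxml)
```

Passing `payload` as the `data` argument of the POST would then send the two-byte sequence 0xC3 0xA9 for é. Calling the helper on a value that is already bytes is harmless, so it can be applied defensively.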
