Problem: Transcribe job may hang the MCPClient worker queue #1751

@replaceafill

Description

Expected behaviour

The MCPClient worker queue handles errors raised by the Transcribe job.

Current behaviour

There are a few key aspects to this problem:

  • Task batching. The MCPServer creates transcription task batches by iterating through:
    1. Files listed in the database, then
    2. Files found in the filesystem under the objects directory.
  • Transcription output. The Transcribe job writes tesseract output to the objects/metadata/OCRfiles directory.
  • Argument parsing. The Transcribe job expects two arguments: a task ID and a file UUID. Both are validated as UUIDs during argument parsing.

When processing a transfer with a large number of transcribable files, the MCPServer may submit transcription tasks to Gearman before it finishes iterating through all files. This was discovered in a client transfer with about 1,700 files, using the default batch size of 128 in the MCPServer settings.

As a result, the initial set of tasks begins creating output in objects/metadata/OCRfiles while the MCPServer is still iterating. These newly created files are picked up during the filesystem iteration, even though they didn't exist at the start of the batching process. Since these files don't have a corresponding UUID (they are output files, not input transfer files), the MCPServer assigns None as the UUID.
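The sequence above can be reproduced in miniature. The following is only a simulation under stated assumptions, not MCPServer code: `build_tasks`, `run_batch` and `BATCH_SIZE` are hypothetical names, the batch size is shrunk to 2, and "submitting" a batch runs it synchronously so the result is deterministic.

```python
import os
import tempfile
import uuid

BATCH_SIZE = 2  # small stand-in for the default of 128


def run_batch(objects_dir, batch):
    """Pretend to transcribe: write one output file per task that has a UUID."""
    out_dir = os.path.join(objects_dir, "metadata", "OCRfiles")
    os.makedirs(out_dir, exist_ok=True)
    for path, file_uuid in batch:
        if file_uuid is None:
            continue
        name = os.path.basename(path) + ".txt"
        with open(os.path.join(out_dir, name), "w") as f:
            f.write("ocr text")


def build_tasks(objects_dir, db_files):
    """Return every (path, uuid) task created, submitting full batches eagerly."""
    tasks, batch, seen = [], [], set()

    def add(path, file_uuid):
        tasks.append((path, file_uuid))
        batch.append((path, file_uuid))
        seen.add(path)
        if len(batch) == BATCH_SIZE:
            # Submitted before the iteration below has finished.
            run_batch(objects_dir, list(batch))
            batch.clear()

    # 1. Files listed in the (simulated) database.
    for path, file_uuid in db_files.items():
        add(path, file_uuid)
    # 2. Files found on disk under the objects directory.
    for root, _dirs, files in os.walk(objects_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            if path not in seen:
                add(path, None)  # no database row, so no UUID
    if batch:
        run_batch(objects_dir, list(batch))
    return tasks


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = os.path.join(tmp, "objects")
    os.makedirs(objects_dir)
    db_files = {}
    for i in range(3):
        path = os.path.join(objects_dir, f"page{i}.tif")
        open(path, "w").close()
        db_files[path] = str(uuid.uuid4())
    tasks = build_tasks(objects_dir, db_files)

# Tasks whose UUID is None are the OCR outputs written by the first batch.
orphan_tasks = [path for path, file_uuid in tasks if file_uuid is None]
print(len(tasks), len(orphan_tasks))
```

With three input files and a batch size of 2, the first batch writes two output files before the filesystem walk starts, so two extra tasks are created with None as the UUID.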

When the transcription task later runs on these new files, the argument parser raises SystemExit because of the invalid arguments. The MCPClient worker queue does not handle this exception, so it enters a broken state and spawns zombie processes. Over time, this increases memory usage and stalls the job.

This is what the MCPClient logs show when the problem occurs:

Jun 02 18:00:47 server python[3280065]: usage: archivematicaClient.py [-h] task_uuid file_uuid
Jun 02 18:00:47 server python[3280065]: archivematicaClient.py: error: argument file_uuid: invalid UUID value: 'None'
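The log lines above match what Python's argparse prints when a positional argument declared with `type=uuid.UUID` receives the string 'None'. The sketch below assumes the Transcribe job declares its arguments that way; `parse_args` and `run_guarded` are hypothetical names, not the actual job or worker code. It also shows that a handler written as `except Exception` does not catch the resulting SystemExit.

```python
import argparse
import uuid


def parse_args(argv):
    # Assumed argument layout, inferred from the usage line in the log.
    parser = argparse.ArgumentParser(prog="archivematicaClient.py")
    parser.add_argument("task_uuid", type=uuid.UUID)
    parser.add_argument("file_uuid", type=uuid.UUID)
    return parser.parse_args(argv)


def run_guarded(argv):
    """Mimic a caller that only guards against Exception subclasses."""
    try:
        parse_args(argv)
        return "ok"
    except Exception:
        return "handled"


try:
    # argparse prints the usage and error message to stderr, then exits.
    outcome = run_guarded([str(uuid.uuid4()), "None"])
except SystemExit as exc:
    outcome = f"escaped with exit code {exc.code}"

print(outcome)
```

SystemExit derives from BaseException rather than Exception, which is why it passes through the `except Exception` clause in this sketch.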

Your environment (version of Archivematica, operating system, other relevant details)

Archivematica 1.17.0.

In Archivematica 1.16.0, the Transcribe job does not validate the file UUID parameter. As a result, it later fails when attempting to look up the file in the database, producing the following error:

['“None” is not a valid UUID.']
Traceback (most recent call last):
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/fields/__init__.py", line 2688, in to_python
    return uuid.UUID(**{input_form: value})
  File "/pyenv/data/versions/3.9.22/lib/python3.9/uuid.py", line 177, in __init__
    raise ValueError('badly formed hexadecimal UUID string')
ValueError: badly formed hexadecimal UUID string

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/src/MCPClient/lib/client/job.py", line 142, in JobContext
    yield
  File "/src/src/MCPClient/lib/clientScripts/transcribe_file.py", line 182, in call
    job.set_status(main(job, task_uuid, file_uuid))
  File "/src/src/MCPClient/lib/clientScripts/transcribe_file.py", line 108, in main
    file_ = File.objects.get(uuid=file_uuid)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/query.py", line 623, in get
    clone = self._chain() if self.query.combinator else self.filter(*args, **kwargs)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/query.py", line 1436, in filter
    return self._filter_or_exclude(False, args, kwargs)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/query.py", line 1454, in _filter_or_exclude
    clone._filter_or_exclude_inplace(negate, args, kwargs)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/query.py", line 1461, in _filter_or_exclude_inplace
    self._query.add_q(Q(*args, **kwargs))
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/sql/query.py", line 1546, in add_q
    clause, _ = self._add_q(q_object, self.used_aliases)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/sql/query.py", line 1577, in _add_q
    child_clause, needed_inner = self.build_filter(
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/sql/query.py", line 1492, in build_filter
    condition = self.build_lookup(lookups, col, value)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/sql/query.py", line 1319, in build_lookup
    lookup = lookup_class(lhs, rhs)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/lookups.py", line 27, in __init__
    self.rhs = self.get_prep_lookup()
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/lookups.py", line 341, in get_prep_lookup
    return super().get_prep_lookup()
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/lookups.py", line 85, in get_prep_lookup
    return self.lhs.output_field.get_prep_value(self.rhs)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/fields/__init__.py", line 2672, in get_prep_value
    return self.to_python(value)
  File "/pyenv/data/versions/3.9.22/lib/python3.9/site-packages/django/db/models/fields/__init__.py", line 2690, in to_python
    raise exceptions.ValidationError(
django.core.exceptions.ValidationError: ['“None” is not a valid UUID.']

This ValidationError is handled correctly by the MCPClient worker queue.
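One way a worker could treat both failure modes uniformly is to convert SystemExit into an ordinary failed result alongside regular exceptions. This is a minimal sketch of that idea, not the MCPClient implementation: `run_job` is a hypothetical wrapper, and `FakeValidationError` stands in for Django's ValidationError so the example runs without Django.

```python
import sys


class FakeValidationError(Exception):
    """Stand-in for django.core.exceptions.ValidationError."""


def run_job(job_callable):
    """Run one job and always return (exit_code, message) instead of raising."""
    try:
        job_callable()
        return 0, ""
    except SystemExit as exc:
        # sys.exit() may carry an int, a message string, or None (success).
        if isinstance(exc.code, int):
            code = exc.code
        else:
            code = 0 if exc.code is None else 1
        return code, "job called sys.exit()"
    except Exception as exc:
        return 1, f"{type(exc).__name__}: {exc}"


def bad_arguments():
    sys.exit(2)  # what argparse does after reporting invalid arguments


def bad_lookup():
    raise FakeValidationError("“None” is not a valid UUID.")


results = [run_job(bad_arguments), run_job(bad_lookup), run_job(lambda: None)]
for exit_code, message in results:
    print(exit_code, message)
```

Under this design, the 1.17.0 argument error and the 1.16.0 lookup error both surface as failed tasks with a non-zero exit code, and the worker keeps running.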


For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged (if applicable)
  • Details about this issue have been added to the release notes (if applicable)

Metadata

    Labels

    Status: review (the issue's code has been merged and is ready for testing/review)
