Problem: validate preservation derivatives hangs #44
Comments
I've also been able to reproduce this issue simply with https://github.com/artefactual/archivematica-sampledata/tree/master/SampleTransfers/Multimedia, with ten transfers running simultaneously.
I also reproduced this by running the entire SampleTransfers directory in 1.7.x with no other processes running. I had previously run the Multimedia directory by itself and it went fine. What makes this bug worse is that it also blocks other transfers from succeeding and locks up the entire system, so the system has to be restarted.
Exactly.
This commit ensures that we only have one instance of MediaConch at a time to avoid the problem described in the issue linked below. This is achieved by removing the `concurrent_instances` attribute from the client script which forces the code in it to be executed inline by MCPClient. Connects to archivematica/Issues#44.
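The mechanism the commit message describes can be sketched as follows. This is a hypothetical illustration, not Archivematica's actual dispatcher code: it assumes MCPClient checks the client-script module for a `concurrent_instances` callable and, when the attribute is absent, falls back to inline serial execution, so only one MediaConch process exists at a time.

```python
# Hypothetical sketch of the dispatch pattern described above; the
# function names are illustrative, not Archivematica's real API.
from concurrent.futures import ThreadPoolExecutor


def dispatch(module, jobs):
    workers = getattr(module, "concurrent_instances", None)
    if workers is None:
        # Attribute removed/absent: execute inline, one job at a time.
        return [module.call(job) for job in jobs]
    # Attribute present: fan out across `workers()` parallel instances.
    with ThreadPoolExecutor(max_workers=workers()) as pool:
        return list(pool.map(module.call, jobs))
```

Under this reading, deleting the attribute from `validate_file.py` is enough to serialize every MediaConch invocation without touching the dispatcher itself.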
Thanks @JeromeMartinez, that's useful. We can't run MediaConch as a service yet, but I can make sure that we only have one instance running (artefactual/archivematica#1255). After reading "Temporary Files Used By SQLite" I've started wondering if we could solve the issue by having the environment string
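One shape such an environment-based workaround could take is giving each worker its own `HOME`, so every MediaConch process writes its SQLite database (and `-wal` file) to a private directory instead of a shared one. This is a sketch only: it assumes MediaConch derives the `~/.local/share/MediaConch/MediaConch.db` path from `$HOME`, as the strace output in the issue body suggests, and the helper names are hypothetical.

```python
# Sketch of per-process isolation via the environment. Assumes MediaConch
# locates its database under $HOME; helper names are hypothetical.
import os
import subprocess
import tempfile


def isolated_env(base=None):
    """Return a copy of `base` (default: os.environ) with a fresh HOME."""
    env = dict(os.environ if base is None else base)
    env["HOME"] = tempfile.mkdtemp(prefix="mediaconch-home-")
    return env


def run_mediaconch(target):
    # Each call gets its own HOME, so no two processes can contend on
    # the same MediaConch.db-wal file.
    return subprocess.run(
        ["mediaconch", target],
        env=isolated_env(),
        capture_output=True,
        text=True,
    )
```

The trade-off is that any caching MediaConch keeps in its database would no longer be shared between workers.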
Feels great to have this one passing again! 🐚
Expected behaviour
During the normalize microservice group, after 'normalize for preservation', the preservation derivatives are validated. When the preservation derivatives are mkv files, mediaconch is run by the validate_file.py client script. All mkv files should be tested by mediaconch and should either pass or fail. Processing should move on to the next step in the workflow.
Current behaviour
When the 'Validate preservation derivatives' job is running, there is a point in the processing at which log entries other than debug-level messages no longer appear in the MCPClient logs. We saw this:
That exact output (with '263 threads' reported once per hour, and 23 jobs running) repeated without change for about 40 hours. At that point some kind of timeout seems to have been hit, perhaps in Gearman, because the MCPClient logs then showed many
No output, or file specified
entries (maybe 100?), the job showed as failed in the UI, and the SIP failed. During that 40-hour period, while the job showed 'executing commands' in the UI but nothing was actually happening, we observed a couple of things:
We observed a mediaconch process running in the mcp client container. Doing an strace on the mediaconch process showed:
access("/home/archivematica/.local/share/MediaConch/MediaConch.db-wal", F_OK) = -1 ENOENT (No such file or directory)
We speculate that MediaConch is using SQLite (the db-wal extension suggests a SQLite write-ahead log). We further speculate that validate_file is running four MediaConch processes at a time (concurrent_instances() is set to the CPU count, and in this environment the MCPClient container has 4 CPU cores), and that a deadlock occurs when the four MediaConch processes all attempt to lock the same db-wal file at the same time.
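The suspected failure mode can be demonstrated in miniature with two writers on the same SQLite database in WAL mode: the second writer cannot acquire the write lock while the first holds it, and fails with "database is locked" once its busy timeout expires. This models the hypothesis only; it does not use MediaConch's real schema or file paths.

```python
# Two writers contending on one WAL-mode SQLite database (hypothesis demo).
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "MediaConch.db")

# isolation_level=None: no implicit transactions, we manage BEGIN ourselves.
first = sqlite3.connect(path, isolation_level=None)
first.execute("PRAGMA journal_mode=WAL")      # creates the -wal file on write
first.execute("CREATE TABLE reports (x)")
first.execute("BEGIN IMMEDIATE")              # take the write lock...
first.execute("INSERT INTO reports VALUES (1)")  # ...and hold it, uncommitted

second = sqlite3.connect(path, timeout=0.2, isolation_level=None)
try:
    second.execute("BEGIN IMMEDIATE")         # waits 200 ms, then gives up
except sqlite3.OperationalError as exc:
    print(exc)                                # database is locked
```

Real processes hang rather than fail here because SQLite's busy handler keeps retrying for the configured timeout; with four workers repeatedly colliding, each MediaConch instance can spend most of its time waiting on the shared `db-wal` file.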
Steps to reproduce
Create a large transfer that contains many mkv files; for example, use the createtransfers.py script from the Archivematica sample data repository: https://github.com/artefactual/archivematica-sampledata
Process the transfer and make sure to choose both 'normalize for preservation' and 'validate '.
Your environment (version of Archivematica, OS version, etc)
qa/1.x in docker-compose (rdss-archivematica, but it should be the same in am.git)