
Problem: validate preservation derivatives hangs #44

Closed
jhsimpson opened this issue Jul 27, 2018 · 5 comments

Comments

@jhsimpson (Member) commented Jul 27, 2018

Expected behaviour
During the normalize microservice group, after 'normalize for preservation', the preservation derivatives are validated. When the preservation derivatives are mkv files, mediaconch is run by the validate_file.py client script. All mkv files should be tested by mediaconch and should either pass or fail. Processing should move on to the next step in the workflow.

Current behaviour
When the 'Validate preservation derivatives' job is running, there is a point in the processing where log entries no longer appear in the MCP client logs, other than debug-level entries. We saw this:

DEBUG     2018-07-27 21:56:45  archivematica.mcp.server:taskGroupRunner:_monitor_running_jobs:241:  0 jobs pending; 23 jobs running; 23 known task groups
DEBUG     2018-07-27 21:56:51  archivematica.mcp.server:archivematicaMCP:debugMonitor:233:  Debug monitor: datetime: 2018-07-27 21:56:51.911555+00:00
DEBUG     2018-07-27 21:56:51  archivematica.mcp.server:archivematicaMCP:debugMonitor:234:  Debug monitor: thread count: 263

That exact output (with 263 threads reported once per hour, and 23 jobs running) repeated without change for about 40 hours. At that point some kind of timeout seems to have been hit, perhaps in gearman: the MCP client logs then showed many 'No output, or file specified' entries (roughly 100), the job showed as failed in the UI, and the SIP failed.

During that 40 hour period, where the job showed 'executing commands' in the UI but nothing was actually happening, we observed a couple of things:

$ ps auxf | grep Client
artefac+  4016       \_ grep --color=auto Client
archive+ 15344   \_ /bin/sh -c /src/MCPClient/lib/archivematicaClient.py
archive+ 15391       \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17253           \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17264           |   \_ python2 /src/MCPClient/lib/fork_runner.py clientScripts.validate_file
archive+ 17254           \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17255           \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17256           \_ python2 /src/MCPClient/lib/archivematicaClient.py

We observed a mediaconch process running in the mcp client container. Doing an strace on the mediaconch process showed:

access("/home/archivematica/.local/share/MediaConch/MediaConch.db-wal", F_OK) = -1 ENOENT (No such file or directory)
We speculate that MediaConch is using sqlite (the db-wal file extension is a sqlite write ahead log probably). We further speculate that validate_file is running 4 mediaconch processes at a time (concurrent_instances() is set to the cpu count, and in this environmen the mcp client container has 4 cpu cores), and that there is a deadlock going on, with 4 mediaconch processes all attempting to lock the same db-wal file at the same time.
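
To make the suspected failure mode concrete, here is a minimal repro sketch (not Archivematica or MediaConch code; the database path and table are made up) of several processes contending for the same SQLite database in WAL mode, which is what we believe the four mediaconch processes are doing with MediaConch.db:

import multiprocessing
import sqlite3

DB_PATH = "/tmp/MediaConch-repro.db"  # stand-in for ~/.local/share/MediaConch/MediaConch.db

def worker(i):
    # Each process opens its own connection; WAL mode creates the
    # .db-wal file seen in the strace output above.
    conn = sqlite3.connect(DB_PATH, timeout=1.0)  # give up after 1s rather than waiting
    conn.execute("PRAGMA journal_mode=WAL")
    try:
        with conn:  # write transaction; SQLite allows only one writer at a time
            conn.execute("INSERT INTO reports VALUES (?)", (i,))
    except sqlite3.OperationalError as exc:
        print("worker %d: %s" % (i, exc))  # typically "database is locked"

if __name__ == "__main__":
    setup = sqlite3.connect(DB_PATH)
    setup.execute("CREATE TABLE IF NOT EXISTS reports (id INTEGER)")
    setup.close()
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

With a short busy timeout the losing writers raise "database is locked"; a client that instead waits indefinitely on the lock would look exactly like the stalled job above.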

Steps to reproduce
Create a large transfer that contains many mkv files. For example, use the createtransfers.py script from the Archivematica sample data repository: https://github.com/artefactual/archivematica-sampledata
Process the transfer, making sure to choose both 'normalize for preservation' and 'validate preservation derivatives'.

Your environment (version of Archivematica, OS version, etc)
qa/1.x in docker-compose (rdss-archivematica, but it should be the same in am.git)

@jhsimpson jhsimpson added the Jisc RDSS label Jul 27, 2018

@sevein (Contributor) commented Aug 12, 2018

I've also been able to reproduce this issue with https://github.com/artefactual/archivematica-sampledata/tree/master/SampleTransfers/Multimedia by running ten transfers simultaneously.

@ablwr commented Aug 21, 2018

I also reproduced this by running the entire SampleTransfers directory in 1.7.x without other processes running. I had previously run the Multimedia directory on its own and it went fine. The bad part of this bug is that it also blocks other transfers from succeeding and locks up the entire system, so the system has to be restarted.

@sromkey sromkey added this to the 1.8.0 milestone Aug 21, 2018

@JeromeMartinez commented Aug 21, 2018

> We speculate that MediaConch is using sqlite

Exactly.
The design we mostly worked on has one instance of MediaConch server handling all parallel jobs; if I understand correctly, here you run several instances of MediaConch, so the SQLite file gets locked at some point and we don't handle that case well.
There are definitely things to change in MediaConch about that, e.g. better handling of the SQLite file lock, or offering another SQL engine more suitable for concurrent access (more development).
A workaround is to limit the number of parallel instances, but you may still hit the problem (just less often).
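
A minimal sketch of that workaround (not MediaConch or Archivematica code; the lock path and the bare mediaconch invocation are illustrative assumptions) would serialize invocations across processes with an advisory file lock, so only one instance touches the SQLite database at a time:

import fcntl
import subprocess

LOCK_PATH = "/tmp/mediaconch.lock"  # hypothetical lock file shared by all workers

def run_mediaconch(target):
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            return subprocess.call(["mediaconch", target])
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)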

@sevein sevein self-assigned this Sep 11, 2018

qubot pushed a commit to artefactual/archivematica that referenced this issue Sep 12, 2018

MCPClient: force inline execution of validate_file
This commit ensures that we only have one instance of MediaConch at a
time to avoid the issue described in the issue linked below.

Connects to archivematica/Issues#44.

qubot pushed a commit to artefactual/archivematica that referenced this issue Sep 12, 2018

MCPClient: force inline execution of validate_file
This commit ensures that we only have one instance of MediaConch at a
time to avoid the problem described in the issue linked below. This is
achieved by removing the `concurrent_instances` attribute from the
client script which forces the code in it to be executed inline by
MCPClient.

Connects to archivematica/Issues#44.
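
As a sketch of what the commit describes (paraphrased from the commit message, not the actual diff), MCPClient looks for a module-level concurrent_instances attribute on each client script and, when present, forks that many workers:

import multiprocessing

# Before the fix (sketch): validate_file.py advertised one worker per
# CPU core, which is where the four parallel mediaconch processes came
# from.
def concurrent_instances():
    return multiprocessing.cpu_count()

# After the fix the attribute is simply gone; MCPClient then falls back
# to running the script inline, so at most one MediaConch instance
# exists at a time.
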
@sevein (Contributor) commented Sep 12, 2018

Thanks @JeromeMartinez, that's useful. We can't run MediaConch as a service yet, but I can make sure that we only have one instance running (artefactual/archivematica#1255).

After reading Temporary Files Used By SQLite I've started wondering if we could solve the issue by having the environment variable SQLITE_TMPDIR point to a new temporary directory on each run. Do you think that could work? I'm not sure where the location .local/share/MediaConch/ comes from.
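
A sketch of that experiment (the bare mediaconch invocation is an illustrative assumption) would set the variable per run before spawning the process. One caveat worth testing: per SQLite's documentation, the -wal file is always created next to the database file itself, so SQLITE_TMPDIR would only relocate SQLite's other temporary files.

import os
import subprocess
import tempfile

def run_mediaconch_isolated(target):
    # Give each run its own scratch directory for SQLite temp files.
    env = os.environ.copy()
    env["SQLITE_TMPDIR"] = tempfile.mkdtemp(prefix="mediaconch-")
    return subprocess.call(["mediaconch", target], env=env)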

@ablwr commented Oct 10, 2018

Feels great to have this one passing again! 🐚

@sromkey sromkey closed this Nov 22, 2018
