Problem: validate preservation derivatives hangs #44

Closed
jhsimpson opened this issue Jul 27, 2018 · 5 comments
Assignees
ablwr
Labels
Jisc RDSS; Severity: high (a high-priority situation where performance or other uses are significantly impacted or degraded); Type: bug (a flaw in the code that causes the software to produce an incorrect or unexpected result)
Milestone
1.8.0

Comments

@jhsimpson
Member

Expected behaviour
During the normalize microservice group, after 'normalize for preservation', the preservation derivatives are validated. When the preservation derivatives are mkv files, mediaconch is run by the validate_file.py client script. All mkv files should be tested by mediaconch and should either pass or fail. Processing should move on to the next step in the workflow.

Current behaviour
While the 'Validate preservation derivatives' job is running, processing reaches a point where no new entries appear in the MCP client logs other than debug messages. We saw this:

DEBUG     2018-07-27 21:56:45  archivematica.mcp.server:taskGroupRunner:_monitor_running_jobs:241:  0 jobs pending; 23 jobs running; 23 known task groups
DEBUG     2018-07-27 21:56:51  archivematica.mcp.server:archivematicaMCP:debugMonitor:233:  Debug monitor: datetime: 2018-07-27 21:56:51.911555+00:00
DEBUG     2018-07-27 21:56:51  archivematica.mcp.server:archivematicaMCP:debugMonitor:234:  Debug monitor: thread count: 263

That exact output (263 threads and 23 jobs running, reported once per hour) repeated without change for about 40 hours. At that point some kind of timeout appears to have been hit, perhaps in Gearman: the MCP client logs then showed many 'No output, or file specified' entries (maybe 100?), the job showed as failed in the UI, and the SIP failed.

During that 40-hour period, while the job showed 'executing commands' in the UI but nothing was actually happening, we observed a couple of things:

$ ps auxf | grep Client
artefac+  4016       \_ grep --color=auto Client
archive+ 15344   \_ /bin/sh -c /src/MCPClient/lib/archivematicaClient.py
archive+ 15391       \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17253           \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17264           |   \_ python2 /src/MCPClient/lib/fork_runner.py clientScripts.validate_file
archive+ 17254           \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17255           \_ python2 /src/MCPClient/lib/archivematicaClient.py
archive+ 17256           \_ python2 /src/MCPClient/lib/archivematicaClient.py

We also observed a mediaconch process running in the MCP client container. Running strace on that process showed:

access("/home/archivematica/.local/share/MediaConch/MediaConch.db-wal", F_OK) = -1 ENOENT (No such file or directory)

We speculate that MediaConch is using SQLite (the db-wal file extension is probably a SQLite write-ahead log). We further speculate that validate_file is running four mediaconch processes at a time (concurrent_instances() is set to the CPU count, and in this environment the MCP client container has four CPU cores), and that a deadlock is occurring, with four mediaconch processes all attempting to lock the same db-wal file at the same time.
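
To make the suspected pattern concrete, here is a minimal standalone sketch (our illustration, not Archivematica or MediaConch code; the file names and the bare mediaconch invocation are placeholders) of one mediaconch subprocess per CPU core, all sharing the same per-user MediaConch.db and its db-wal/db-shm lock files:

import multiprocessing
import subprocess

# Hypothetical reproduction of the suspected contention, not Archivematica
# code: each worker spawns its own mediaconch process, and every instance
# opens the same ~/.local/share/MediaConch/MediaConch.db.
def validate_one(mkv_path):
    return subprocess.call(["mediaconch", mkv_path])

if __name__ == "__main__":
    files = ["sample-%d.mkv" % i for i in range(8)]  # placeholder inputs
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    print(pool.map(validate_one, files))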

Steps to reproduce
1. Create a large transfer that contains many mkv files. For example, use the createtransfers.py script from the Archivematica sample data repository: https://github.com/artefactual/archivematica-sampledata
2. Process the transfer and make sure to choose both 'normalize for preservation' and 'validate'.

Your environment (version of Archivematica, OS version, etc)
qa/1.x in docker-compose (rdss-archivematica, but it should be the same in am.git)

@sevein
Contributor

sevein commented Aug 12, 2018

I've also been able to reproduce this issue simply by using https://github.com/artefactual/archivematica-sampledata/tree/master/SampleTransfers/Multimedia, with ten transfers running simultaneously.

@ablwr
Contributor

ablwr commented Aug 21, 2018

I also reproduced this by running the entire SampleTransfers directory in 1.7.x without other processes running. I had previously run the Multimedia directory and it went fine by itself. The bad thing about this bug is that it also blocks other transfers from succeeding and locks up the entire system, so the system has to be restarted.

@sromkey sromkey added this to the 1.8.0 milestone Aug 21, 2018
@JeromeMartinez

We speculate that MediaConch is using sqlite

Exactly.
The design we mostly worked on has one instance of the MediaConch server handling all parallel jobs; if I understand correctly, here you run several instances of MediaConch, so the SQLite file gets locked at some point and we don't handle that case well.
There are definitely things to change in MediaConch about that, e.g. better handling of the SQLite file lock, or offering another SQL engine more suitable for concurrent access (more development).
A workaround is to limit the number of parallel instances, but you may still hit the problem (just less often).
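For example (a rough sketch on the caller side, not something MediaConch itself provides): serialise the mediaconch invocations with an exclusive lock on a shared file, so only one process touches MediaConch.db at a time:

import fcntl
import subprocess

LOCK_PATH = "/tmp/mediaconch.lock"  # hypothetical shared lock file

def run_mediaconch_serialised(path):
    # Sketch only: flock blocks until no other worker holds the lock, so
    # concurrent workers take turns instead of racing on MediaConch.db.
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            return subprocess.call(["mediaconch", path])
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)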

@sromkey sromkey added the Status: ready (the issue is sufficiently described/scoped to be picked up by a developer) label Aug 31, 2018
@sevein sevein self-assigned this Sep 11, 2018
@sevein sevein added the Status: in progress (issue that is currently being worked on) label and removed the Status: ready label Sep 11, 2018
qubot pushed a commit to artefactual/archivematica that referenced this issue Sep 12, 2018
This commit ensures that we only have one instance of MediaConch at a
time to avoid the issue described in the issue linked below.

Connects to archivematica/Issues#44.
qubot pushed a commit to artefactual/archivematica that referenced this issue Sep 12, 2018
This commit ensures that we only have one instance of MediaConch at a
time to avoid the problem described in the issue linked below. This is
achieved by removing the `concurrent_instances` attribute from the
client script which forces the code in it to be executed inline by
MCPClient.

Connects to archivematica/Issues#44.
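
In rough terms (a sketch of the shape of the change, not the actual diff), the client script module previously advertised how many copies of itself could run in parallel; without that hook, MCPClient executes the script inline, so only one mediaconch runs at a time:

import multiprocessing

# Before (suspected shape): MCPClient reads this hook and forks one
# worker per CPU core, each running its own mediaconch.
def concurrent_instances():
    return multiprocessing.cpu_count()

# After: the module no longer defines concurrent_instances, so MCPClient
# falls back to running the client script inline, one file at a time.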
@sevein
Contributor

sevein commented Sep 12, 2018

Thanks @JeromeMartinez, that's useful. We can't run MediaConch as a service yet, but I can make sure that we only have one instance running (artefactual/archivematica#1255).

After reading 'Temporary Files Used By SQLite', I've started wondering if we could solve the issue by having the environment variable SQLITE_TMPDIR point to a new temporary directory on each run. Do you think that could work? I'm not sure where the location .local/share/MediaConch/ comes from.
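
Something like this per mediaconch invocation is what I have in mind (untested sketch):

import os
import subprocess
import tempfile

def run_mediaconch(path):
    # Untested idea: give each run its own SQLITE_TMPDIR so SQLite's
    # temporary files don't collide between concurrent mediaconch runs.
    env = dict(os.environ)
    env["SQLITE_TMPDIR"] = tempfile.mkdtemp(prefix="mediaconch-sqlite-")
    return subprocess.call(["mediaconch", path], env=env)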

@sevein sevein added the Severity: high, Type: bug, and Status: review (the issue's code has been merged and is ready for testing/review) labels and removed the Status: in progress label Sep 12, 2018
@sevein sevein removed their assignment Sep 24, 2018
@peterVG peterVG assigned ablwr and unassigned peterVG Oct 9, 2018
@ablwr ablwr added the Status: verified label and removed the Status: review label Oct 10, 2018
@ablwr
Contributor

ablwr commented Oct 10, 2018

Feels great to have this one passing again! 🐚
