Rework the MCP Server, MCP Client and MCP Client scripts to support batching tasks #1105
Conversation
I've recorded a walkthrough of the major changes to the code: http://tsp.nz/d/7ccefcf27452d9c7fdbfb805acb5dd42a13b0f5d.mp4#/mcp_batching.mp4 It runs at about 30 minutes (sorry), but if you watch it at double speed it still mostly makes sense and makes for more fun watching.
@marktriggs @jambun I'm really impressed by the work you guys have done here. I think that it makes it much simpler to understand how things work in general, but particularly in MCPServer, which is a piece that is not very well understood. While I was testing this and going through the changes I noticed a number of smaller issues. I wanted to fix them, but as I don't have access to your repo I did it here.

One question I have with the batching solution is whether running very long transactions can bring other issues. E.g. if I have a transfer with 50 files of 10G each and the clamscan script does the following:

```python
def call(jobs):
    with transaction.atomic():
        for job in jobs:
            with job.JobContext(logger=logger):
                job.set_status(scan_file(*job.args[1:]))
```

I think that means we'd be keeping that transaction open for as long as it takes to scan all the files, which could potentially be hours. Could that significantly hurt the performance of the database server?
Thanks @sevein! We've pulled in your changes now, so the branch should be up to date. Good point on the transaction length issue too. We're going to modify that client script to buffer the events and insert them all at the end of the batch, so we won't have to keep that transaction open the whole time. We'll do another pass through the other client scripts and see if any of them need the same treatment. Cheers!
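That buffering approach can be sketched roughly as follows. Everything here is a hypothetical stand-in for illustration: `FakeTransaction` mimics Django's `transaction.atomic()`, and `FakeJob` mimics the `Job` API visible in the clamscan snippet quoted above.

```python
import contextlib


class FakeTransaction:
    """Hypothetical stand-in for Django's transaction.atomic()."""
    open_count = 0

    @classmethod
    @contextlib.contextmanager
    def atomic(cls):
        cls.open_count += 1
        try:
            yield
        finally:
            cls.open_count -= 1


class FakeJob:
    """Hypothetical stand-in for the Job API used by client scripts."""
    def __init__(self, path):
        self.args = ("archivematicaClamscan", path)
        self.status = None

    @contextlib.contextmanager
    def JobContext(self, logger=None):
        yield

    def set_status(self, status):
        self.status = status


def scan_file(path):
    # The slow part; crucially, no transaction is open while it runs.
    assert FakeTransaction.open_count == 0
    return 0  # 0 == clean


def call(jobs):
    # Do the slow scanning first, buffering the results...
    results = []
    for job in jobs:
        with job.JobContext():
            results.append((job, scan_file(*job.args[1:])))
    # ...then write everything in one short-lived transaction.
    with FakeTransaction.atomic():
        for job, status in results:
            job.set_status(status)


jobs = [FakeJob("a.bin"), FakeJob("b.bin")]
call(jobs)
print([job.status for job in jobs])  # → [0, 0]
```

The key design point is that the transaction now only brackets the fast database writes, not the hours of scanning, so its lifetime no longer scales with file size.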
Thank you @marktriggs. I have one more question. I've started testing with a simple transfer and antivirus scanning failed; this is the error that I'm seeing:
Hm, curious... is your fork_runner.py script missing an execute bit? Mine looks like:

Maybe I need to set that explicitly...
It has it. Also, I've checked and the repo tracks it too, e.g. https://github.com/hudmol/archivematica/blob/mcp-batching/src/MCPClient/lib/fork_runner.py reads "Executable file". I'm going to try again; I'll let you know how it goes.
Just tried on my Linux desktop (using am.git) and it worked this time.

Strange! What were you testing on before? Was that using Vagrant perhaps? We've been exclusively using am.git for development so far...

Docker for Mac + am.git, which is what I use when I work from home. I'll give it another try tonight, but I'm not too worried about it.

Ah right. James & Payten were both testing on Macs too, so I'll get them to run another test. Otherwise, let me know if I can help troubleshoot :)
Do you think it would be a good idea to make `BATCH_SIZE` configurable? I was testing with a transfer containing a small number of large files, and a smaller batch size would have helped split the work across multiple MCP Client instances.
```python
# Generate report in plain text and store it in the database
content = get_content_for(args.unit_type, args.unit_name, args.unit_uuid, html=False)
store_report(content, args.unit_type, args.unit_name, args.unit_uuid)
with transaction.atomic():
```
Does this script need to have a db transaction? It doesn't look like anything is written to the db here, so this line could be removed?

store_report() is called on line 214; if that is writing to the db, maybe the transaction should start one line before, instead of containing the entire for loop?
Yeah, store_report is writing to the DB. This might want the same treatment as the clamscan script: separate the bits that write to the database from the rest, so that we're not holding the transaction open while sending out emails. We can look at this today, as we'll be colocated :)
I've addressed this one now. It'll do all of the (100ish) database writes in a single transaction, after having sent out the emails.
```python
import multiprocessing

def concurrent_instances():
    return multiprocessing.cpu_count()
```
Instead of doing cpu_count() here, can you have this return the value of ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS?

See https://github.com/artefactual/archivematica/tree/stable/1.7.x/src/MCPClient/install#environment-variables

This environment variable often gets set to the cpu count during deployment.
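A possible sketch of that suggestion, assuming the environment variable is read directly (the real MCPClient wires configuration through Django settings; the fallback to `cpu_count()` and the floor of 1 are my additions, prompted by the "at least 1" concern raised below):

```python
import multiprocessing
import os


def concurrent_instances():
    # Prefer the deployment-configured task count; fall back to the CPU count.
    value = os.environ.get("ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS")
    try:
        count = int(value) if value is not None else multiprocessing.cpu_count()
    except ValueError:
        # Unparseable configuration: fall back rather than crash.
        count = multiprocessing.cpu_count()
    # Guard against misconfiguration: always run at least one instance.
    return max(1, count)
```

A deployment that exports `ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS=4` would then get four instances, while an unset or invalid value degrades gracefully to the machine's core count.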
```python
numberOfTasks = 0
```

archivematicaClient already detects the number of cpu cores (see `startThreads(django_settings.NUMBER_OF_TASKS)`).
Ah, it looks like this variable got the chop in 6c33d34. I don't have strong feelings about where we get this value from, as long as we know that it's at least 1 :) Thoughts, anyone?
The purpose of `NUMBER_OF_TASKS` wasn't exactly the same, although it was very similar, but I think it'd be fine to reintroduce it if needed. This could also be done after we merge this PR; that would at least be my preference. I think it's going to be hard to keep this work up to date, as conflicts arise quickly because it touches many files.
That sounds good to me. We can leave `NUMBER_OF_TASKS` out, and consider adding it back in afterwards if we need it. We may not need to.
@sevein I think making `BATCH_SIZE` configurable could make sense for installations running multiple MCP Client instances, yeah. For the single MCP Client case there's not much benefit to configuring it; I did some testing at various batch sizes and found that 128 was a good enough number: big enough to reduce commit overhead, while going bigger didn't improve performance. If you were running multiple MCP Clients and tended to run single transfers with small numbers of large files, it might make sense to drop the number to something smaller (to encourage the batches to split across machines, like you said). If it's relatively easy to make the parameter configurable, then it seems like there's no harm in doing that.
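If `BATCH_SIZE` were made configurable, a minimal sketch might look like this. The environment variable name here is an assumption for illustration; 128 is the default that tested well as described above.

```python
import os


def batch_size():
    # Hypothetical knob; the variable name is assumed, 128 is the tested default.
    try:
        return max(1, int(os.environ.get("ARCHIVEMATICA_BATCH_SIZE", "128")))
    except ValueError:
        return 128


def split_into_batches(files):
    # Yield fixed-size groups of files, one Gearman task per group.
    size = batch_size()
    for i in range(0, len(files), size):
        yield files[i:i + size]


# With the default size, 300 files become batches of 128, 128 and 44.
print([len(b) for b in split_into_batches(list(range(300)))])  # → [128, 128, 44]
```

Dropping the value to, say, 16 would produce many more, smaller Gearman tasks, giving a second MCP Client machine more opportunities to steal work on transfers with few large files.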
@sevein Ah, we reproduced and fixed the Mac issue. It turns out that `file` was returning "fork_runner.pyc", not the .py file! It didn't affect Linux because the permissions on the lib directory don't allow the archivematica user to write out that .pyc file, so it doesn't happen there. But under Docker for Mac it's all owned by archivematica, and the .pyc file does get written out. So I've removed the dependence on `file` now: just added a constant and documented its reason for existing. Computers...
Hi @sevein, we've just pushed up a patch for
@marktriggs thanks, it's working now.
@payten cool, that's great, thank you. You may have noticed that your change broke the tests. It's hard to tell because Travis is now reporting errors at all times due to a different issue, #1164. Running the tests locally is easy though; there are some notes here: https://github.com/artefactual-labs/am/tree/master/compose#tests-are-too-slow. One more thing: I've rebased your branch again (see dev/mcp-batching); you may want to reset again, or do the rebase yourself if you're willing to solve a few conflicts. I just wanted to make sure that it was up to date before doing more testing. Looks like we're pretty close, aren't we?
Thanks @sevein. We've fixed up those tests. Hopefully it's good to go now!
…atching tasks.

* The MCP Server now batches file-level tasks into fixed-size groups, creating one Gearman task per batch, rather than one per file. It also uses fixed-size thread pools to limit contention between threads.
* The MCP Client now operates in batches, processing one batch at a time. It also supports running tasks using a pool of processes (improving throughput where tasks benefit from spanning multiple CPUs).
* The MCP Client scripts now accept a batch of jobs and process them as a single unit. There is a new `Job` API that provides a standard interface for these client scripts, and all scripts have been converted to use this.

Much more documented in issue
…vents generated by the jobs
After much internal discussion at Artefactual, I think we agree that this is looking very good. There are some issues that @sevein is going to document as follow-ups (re-instituting CAPTURE_CLIENT_OUTPUT, making BATCH_SIZE configurable), but those don't need to block this PR; they can be addressed as new issues.
We're ready to merge.
@payten @marktriggs, I've pushed 0e50ea2 which solves the most recent conflict. Can you reset? I'll merge-squash after.
I'm going to list a number of things that we want to work next once this is merged. I'll be filing new issues as needed.
- Make `BATCH_SIZE` configurable.
- Make `linkTaskManagerGetMicroserviceGeneratedListInStdOut` force `wants_output=True` so `getAipStorageLocations_v0.0` doesn't break.
- Restore `CAPTURE_CLIENT_SCRIPT_OUTPUT` in archivematicaClient.py#L179-L182. This is needed by at least one client that wants to avoid persisting the output to disk, because it's expected to see very large streams.
- Document the new use of `LIMIT_TASK_THREADS`.
- Write up some documentation on scalability and the new strategy, e.g. how and when to tweak `BATCH_SIZE` and/or `CAPTURE_CLIENT_SCRIPT_OUTPUT`, why we are batching queries, etc.
- The `requiresOutputLock` field is not needed and should be removed. Other model fields related to workflow data are probably unused too, and should be reviewed in detail.
- Test with client datasets (e.g. Columbia).
- Add `concurrent_instances` to more client scripts.
- Fixes in acceptance tests.
I've just pushed up

Thanks @sevein for all your work reviewing and testing this! And especially for handling the rebases :)
The MCP Server now batches file-level tasks into fixed-size groups, creating one Gearman task per batch, rather than one per file. It also uses fixed-size thread pools to limit contention between threads.

The MCP Client now operates in batches, processing one batch at a time. It also supports running tasks using a pool of processes (improving throughput where tasks benefit from spanning multiple CPUs).

The MCP Client scripts now accept a batch of jobs and process them as a single unit. There is a new `Job` API that provides a standard interface for these client scripts, and all scripts have been converted to use this.
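To make the batch-of-jobs shape concrete, here is a rough sketch of what a converted client script looks like. The `Job` class below is a hypothetical stand-in for illustration, based only on the calls visible in the snippets quoted earlier in this thread (`job.args`, `JobContext`, `set_status`):

```python
import contextlib


class Job:
    """Hypothetical stand-in for the MCP Client's Job API."""

    def __init__(self, *args):
        self.args = args          # args[0] is the script name, by convention
        self.status = None

    @contextlib.contextmanager
    def JobContext(self, logger=None):
        # The real version would set up per-job logging and error capture.
        yield

    def set_status(self, status):
        self.status = status


# A client script exposes a single call(jobs) entry point that is handed
# a whole batch at once, instead of being invoked once per file.
def call(jobs):
    for job in jobs:
        with job.JobContext():
            path = job.args[1]
            # Toy per-file work standing in for a real client script's logic.
            job.set_status(0 if path.endswith(".txt") else 1)


batch = [Job("identify", "a.txt"), Job("identify", "b.bin")]
call(batch)
print([job.status for job in batch])  # → [0, 1]
```

Because the whole batch passes through one `call()` invocation, a script can also hoist shared setup (database connections, subprocess pools) out of the per-file loop, which is where much of the speedup comes from.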
The motivation for this work was to improve performance on transfer and ingest workflows, and to provide an improved interface for implementing client scripts.
Our testing shows transfers and ingests taking approximately half the time they did without these changes.
These changes also permit further optimisation of client scripts, by taking advantage of processing files in batches rather than one at a time. We did some work on optimising a few of the client scripts, but there is likely more improvement to be gained.
This is connected to #938.