Problem: starting Python and importing Django in many gearman workers is inefficient and unnecessary #938
Profiling of Archivematica's MCPClient indicates that it takes roughly 1.25 seconds for a (Python-based) client script to be started in a new Python process and for it to import Django (and other heavy modules) and create a database connection. Given a package creation event with very many files, each requiring several Python client script process forks, this can result in significant inefficiency.
In many cases this "cold startup" cost is unnecessary when the following could be done instead:
For evidence of the problem, see the following figure and the AM acceptance tests issue 62:
This issue has morphed into looking at end-to-end performance of transfers and ingests, with an eye to making the improvements Joel describes above. We've had a lot of discussion via email, and @jhsimpson suggested moving the discussion here, so I'm going to try to give a potted history.
The crux of the proposal above was to switch MCP Client scripts from forking processes to loading them into a long-running Python instance where possible. To check the feasibility of the approach, we (@payten, @jambun, @marktriggs) put together a quick proof-of-concept (https://github.com/hudmol/archivematica/tree/inline-mcp-client) that used
Testing this with a 500 file transfer produced a surprising result: CPU time reduced in line with Joel's analysis above, but the overall duration of the transfer and ingest (in wall clock time) didn't budge. The CPU did less work in the same amount of time.
I put together some numbers and a screencast to show what I think is happening:
This led to more investigation of MCP Server--specifically, trying to find what was causing the MCP Client tasks to start 10s of milliseconds apart, when they could have run concurrently. Eventually that led to the spot where tasks are inserted into MySQL:
and we discovered that those inserts take about 140ms on average (historical revisionism note: in my email I originally said 10ms, but we now know it's worse than that). Changing the code to group those inserts into transactions is an easy fix and improves insert performance significantly.
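For illustration, grouping the per-task inserts into a single transaction looks roughly like this. This is a sketch only: it uses Python's built-in sqlite3 as a stand-in for MySQL/Django, and `insert_tasks_batched` and the table layout are invented for the example.

```python
import sqlite3

# sqlite3 stands in for MySQL purely for illustration; the real code
# goes through Django's ORM against the Tasks table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, file TEXT)")

def insert_tasks_batched(conn, filenames):
    """Insert all task rows inside one transaction, so the database
    pays the COMMIT (and its fsync) cost once per batch rather than
    once per row."""
    with conn:  # opens a transaction; commits on successful exit
        conn.executemany("INSERT INTO tasks (file) VALUES (?)",
                         [(name,) for name in filenames])

insert_tasks_batched(conn, ["file-%03d.jpg" % i for i in range(500)])
row_count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
```

With per-row autocommit, 500 files would mean 500 COMMITs (each around 140ms in the measurements above); batched, the same rows cost one COMMIT.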
What we've found is that changing MCP Client to run its scripts inline can improve performance significantly, but only if you make the corresponding change to MCP Server to batch those insert statements into transactions. Either change in isolation has no effect, but apply both and you get a good performance boost (a 40% reduction in duration in my test).
With all of that, I think we've validated performance benefits of further improving the MCP Client task handling. Discussing within the group, we generally agreed that using
Our sense is that the ideal API for these scripts would:
Just a quick bit of additional information from some small benchmarks I ran locally. As before, running a transfer and ingest of 500 JPEG images.
The inline-mcp-client branch is here: https://github.com/hudmol/archivematica/tree/inline-mcp-client, but is basically what we talked about before--using runpy to execute the MCP Client scripts in the current process instead of forking.
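The mechanism is roughly the following sketch (the helper name and the stand-in script contents are invented for the example; the real branch wires this into the gearman worker loop):

```python
import os
import runpy
import sys
import tempfile

# Write a tiny stand-in for an MCP Client script to a temp file
# (hypothetical contents, just for demonstration).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("import sys\nresult = 'identified: ' + sys.argv[1]\n")
    script_path = f.name

def run_client_script_inline(path, args):
    """Execute a client script inside the current interpreter with
    runpy instead of forking a new Python process, so Django and other
    heavy modules are imported once per worker rather than once per
    task."""
    old_argv = sys.argv
    sys.argv = [path] + list(args)
    try:
        # runpy.run_path returns the script's resulting globals dict.
        return runpy.run_path(path, run_name="__main__")
    finally:
        sys.argv = old_argv

module_globals = run_client_script_inline(script_path, ["objects/file1.jpg"])
os.unlink(script_path)
```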
The fsync change runs mysqld using http://www.mcgill.org.za/stuff/software/nosync, which stops any COMMIT from calling fsync (which would hit the disk). Durability goes out the window, but it helps give a sense of how much time is spent just waiting for MySQL to handle a COMMIT (which is after every DB statement by default).
So roughly speaking, it seems like my 500 file ingest running qa/1.x is spending ~15 minutes waiting for MySQL commits, ~15 minutes reloading python/django/etc., ~10 minutes doing everything else. At least this suggests that future work on improving the MCP Client interface is likely to have a good payoff.
Should we have a maximum batch size? E.g. say you're processing 500 video files and we only write the tasks at the very end then the dashboard isn't going to be able to report any status to the user until it's 100% complete.
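One simple way to cap the batch size would be to flush task rows in fixed-size chunks, so the dashboard sees progress as the batch moves along. A minimal sketch (the cap of 128 is an arbitrary illustrative value, not a proposal):

```python
def chunked(items, max_batch_size=128):
    """Yield successive slices of at most max_batch_size items, so
    task rows can be committed -- and become visible to the dashboard
    -- as processing progresses rather than only at the very end."""
    for i in range(0, len(items), max_batch_size):
        yield items[i:i + max_batch_size]

# 500 tasks flushed in chunks: the last commit is no longer the first
# time the user sees any status.
batches = list(chunked(list(range(500)), max_batch_size=128))
```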
I wish it were easier to make the inserts concurrent. Launching a thread for that purpose alone is probably too expensive (in particular in the scenario where you have many small files). Would it be better to have a queue shared with another thread that is responsible for doing the inserts? There is a nice read from the creator of SQLAlchemy (http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/) that talks about asyncio -- something I found surprising when I read it is that the queries are not I/O bound at all (well, maybe they are with autocommit?). I guess this means that in a threaded solution the GIL would become a bottleneck. Although you can also share a queue between two processes (https://docs.python.org/2/library/multiprocessing.html), so perhaps that could do it.
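The shared-queue idea could look something like this minimal sketch (names and batching policy invented for the example; the list `written` stands in for the actual INSERTs):

```python
import queue
import threading

task_queue = queue.Queue()
written = []          # stands in for rows written to MySQL
_STOP = object()      # sentinel used to shut the worker down

def insert_worker():
    """Single long-lived thread responsible for all task inserts: it
    drains whatever has accumulated on the queue into one batch, so
    the main thread never blocks on the database. A real consumer
    would issue one multi-row INSERT per batch."""
    while True:
        item = task_queue.get()
        if item is _STOP:
            break
        batch = [item]
        while True:  # greedily drain anything else already queued
            try:
                nxt = task_queue.get_nowait()
            except queue.Empty:
                break
            if nxt is _STOP:
                task_queue.put(_STOP)  # re-queue so the outer loop exits
                break
            batch.append(nxt)
        written.extend(batch)

worker = threading.Thread(target=insert_worker)
worker.start()
for i in range(100):
    task_queue.put(("task", i))
task_queue.put(_STOP)
worker.join()
```

Since the worker spends its time blocked on the database (I/O), the GIL is released while it waits, which is why a single writer thread may be enough despite the concerns above.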
One stray thought on the dashboard: I guess it would be possible to have it set its transaction isolation to READ_UNCOMMITTED when polling for task statuses. That way, it would see the latest inserts even if the batch hadn't been committed yet. Batching probably is a good idea too, but just thought I'd note the idea :)
We've put together a summary of the analysis we've done so far and a proposal for next steps. Looking forward to chatting about it all soon!
Overview of current architecture
The analysis performed so far (https://github.com/artefactual/archivematica/issues/938#issuecomment-372508380) suggests that the current architecture adds overhead in two main areas:
Illustrating these overheads, we ran an experiment where we ingested 500 very small image files. First, we used a vanilla version of the current code running under a Docker environment. End-to-end, this took 40 minutes.
Next, we modified the test environment in two ways:
Under this modified environment, the same ingest ran in 8 minutes. The two modifications contributed equally to this improvement in duration.
So, when dealing with small files, about 80% of the total time was spent on overhead: either waiting for the database to write to disk, or waiting for the MCP Client scripts to load and bootstrap.
To achieve something like this degree of improvement, it would be necessary to change the MCP Server and MCP Client programs to enable batch processing of files. When a client script is handed a single file to work on, it doesn't have much leeway to optimise its activities; but give it 1,000 files and it has much more flexibility: it can group its database reads and writes and amortise overheads across the entire batch.
To move to a system where MCP Client scripts operate on batches of files, the immediate question is where to place responsibility for defining the batches. We propose that the MCP Server is responsible for this, creating batches of files needing processing based on rough heuristics (e.g. based on types of tasks, number of MCP Client worker machines available, and so on).
The goal of batching at the MCP Server level is to ensure that work can be effectively split to run across multiple machines in parallel. However, the most effective split will depend somewhat on the amount of work (in CPU time) there is to do.
For example, if an Archivematica installation running with three MCP Client machines has a group of nine videos to transcode, it makes sense to divide the work into three equally sized sets and share it across all three machines--with a fair amount of work to do on each file, the gains from concurrency outweigh the cost of coordination. However, if the task was to produce thumbnail versions of images, splitting the work to run on multiple machines might not really pay off until you had a few hundred images to process. In short, different types of work will have different ideal batch sizes.
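The kind of heuristic described above could be sketched like this (all the numbers and the function name are invented for illustration, not part of the proposal):

```python
def plan_batches(num_files, num_workers, est_seconds_per_file,
                 min_seconds_per_batch=30.0):
    """Rough batching heuristic: only split work across workers when
    each resulting batch carries enough work to outweigh the cost of
    coordination. Returns the size of each batch."""
    total_work = num_files * est_seconds_per_file
    # How many batches can we make while keeping each one "worth it"?
    affordable = max(1, int(total_work // min_seconds_per_batch))
    num_batches = min(num_workers, affordable, num_files)
    base, extra = divmod(num_files, num_batches)
    return [base + (1 if i < extra else 0) for i in range(num_batches)]

# Nine videos at ~10 minutes each across three machines: split evenly.
heavy = plan_batches(9, 3, 600)    # -> [3, 3, 3]
# Fifty thumbnails at ~0.1s each: not worth splitting at all.
light = plan_batches(50, 3, 0.1)   # -> [50]
```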
Beyond the MCP Server breaking work into batches, our opinion is that the MCP Client scripts are best-placed to make processing decisions to optimise performance. The nature of the work performed by the MCP Client scripts is inherently heterogeneous and domain-specific, varying widely in duration and CPU usage. As such, it is difficult to make optimal resource allocation decisions without having knowledge of the exact nature of work being performed by a client script.
For example, the ffmpeg program automatically runs as many threads as the system has CPUs, so running concurrent ffmpeg processes doesn't improve throughput. Conversely, a program like ImageMagick runs in a single thread, and running multiple instances gives a linear increase in throughput. The client script that runs a given tool is in the best position to decide how to utilise system resources.
This leads to a departure from the current MCP Client, which runs multiple client scripts simultaneously. We propose instead that each MCP Client instance should run a single client script at a time, with that script taking responsibility for running its own threads in whatever way is optimal. This utilises CPUs more effectively--avoiding the situation where multiple "heavy hitting" client scripts try to run on every CPU, and also gives client scripts the freedom to batch their database accesses to reduce overhead.
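A client script making its own concurrency decision might look like the following sketch (`convert_one` is a hypothetical stand-in for invoking the underlying tool on one file):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def convert_one(path):
    """Hypothetical stand-in for running a single-threaded external
    tool (e.g. an ImageMagick convert) on one file."""
    return path + ".thumb"

def run_batch(paths, tool_is_multithreaded):
    """The client script chooses its own concurrency: a tool like
    ffmpeg already saturates every CPU, so run it one file at a time;
    a single-threaded tool gets one worker per CPU instead."""
    if tool_is_multithreaded:
        return [convert_one(p) for p in paths]
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, paths))

results = run_batch(["a.jpg", "b.jpg", "c.jpg"],
                    tool_is_multithreaded=False)
```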
New MCP Client script interface
Existing Python client scripts will need to be converted to standalone modules, where each presents a single, standardised entry point:
Each Job object is an atomic unit of work to be performed--for example, a file to be transcoded. It will provide:
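To make the shape of this concrete, here is one possible sketch of the entry point and the Job object. All names and method signatures here are illustrative assumptions, not a finalised interface:

```python
class Job:
    """One atomic unit of work -- e.g. a single file to be transcoded.
    Attribute and method names are invented for this sketch."""
    def __init__(self, args):
        self.args = args        # argv-style arguments for this unit
        self.exit_code = None   # set by the client script when done
        self.output = []        # captured output lines for the dashboard

    def print_output(self, *text):
        self.output.append(" ".join(str(t) for t in text))

    def set_status(self, exit_code):
        self.exit_code = exit_code

def call(jobs):
    """The single standardised entry point each converted client
    script would expose: it receives a whole batch of jobs and is free
    to group database access, spawn threads, etc. across the batch."""
    for job in jobs:
        job.print_output("processed", job.args[0])
        job.set_status(0)

batch = [Job(["objects/file1.jpg"]), Job(["objects/file2.jpg"])]
call(batch)
```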
Migration strategy for existing client scripts
MCP Server changes
Figuring out the batches
MCP Client changes
FPR command changes?
Joel noted that FPR scripts might be a candidate for consideration too:
This proposal looks great. I won’t pretend to understand all of the technical details around threading and approach for batching etc so will defer to others on that. But the general direction makes a lot of sense to me.
I’d like to mention two benefits that I think this design could deliver, beyond the primary objective to improve scalability and performance.
A couple of comments on the proposed design:
This design and the idea of a standard interface for FPR map very closely to the conceptual design work we did for the PAR with Jisc (I don't want to complicate this with that proposed project, but I think it’s worth noting this is all a strong step in that direction, and I don't see anything here that I think we might regret later if we do take the PAR project forward). It would be great to leverage that design work to agree on good terminology going forward. I’m not sure if the terminology above (e.g. how you are using the word ‘jobs’) maps accurately to the current MCP Server schema. But I find that schema quite complex, so I think another general benefit of implementing a standard interface is that we can base it on a good domain model and ignore some of the complexities of the particular implementation of MCP Server.
Final point on the dashboard: if I understand the proposed change there, you’d be doing this so that you get more responsive / timely feedback in the GUI? So long as the dashboard continues to clearly show when it is ‘executing’ (doing something) versus when it is completed or awaiting user input, I don’t think increasing responsiveness on individual tasks is necessary. I don’t see that much benefit to users… indeed I wonder if there would be any, given that the dashboard currently only reports at the ‘job’ level (so a batch of ‘tasks’ for format identification, say)?
Parent of artefactual-labs/archivematica-acceptance-tests#121