Rework the MCP Server, MCP Client and MCP Client scripts to support batching tasks #1105
Conversation
I've recorded a walkthrough of the major changes to the code: http://tsp.nz/d/7ccefcf27452d9c7fdbfb805acb5dd42a13b0f5d.mp4#/mcp_batching.mp4 It runs at about 30 minutes (sorry), but if you watch it at double speed it still mostly makes sense and makes for more fun watching.
@marktriggs @jambun I'm really impressed by the work you guys have done here. I think that it makes it much simpler to understand how things work in general, but particularly in MCPServer, which is a piece that is not very well understood. While I was testing this and going through the changes I noticed a number of smaller issues. I wanted to fix them, but as I don't have access to your repo I did it here.

One question I have with the batching solution is whether running very long transactions can bring other issues. E.g. if I have a transfer with 50 files of 10G each and the clamscan script does the following:

```python
def call(jobs):
    with transaction.atomic():
        for job in jobs:
            with job.JobContext(logger=logger):
                job.set_status(scan_file(*job.args[1:]))
```

I think that means we'd be keeping that transaction open for as long as it takes to scan all the files, which could potentially be hours. Could that significantly hurt the performance of the database server?
Thanks @sevein! We've pulled in your changes now, so the branch should be up to date. Good point on the transaction length issue too. We're going to modify that client script to buffer the events and insert them all at the end of the batch, so we won't have to keep that transaction open the whole time. We'll do another pass through the other client scripts and see if any of them need the same treatment. Cheers!
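That buffering approach can be sketched roughly as follows. Everything here is a hypothetical stand-in for illustration: `FakeTransaction` mimics Django's `transaction.atomic()`, and `FakeJob` mimics the `Job` API visible in the clamscan snippet quoted above.

```python
import contextlib


class FakeTransaction:
    """Hypothetical stand-in for Django's transaction.atomic()."""
    open_count = 0

    @classmethod
    @contextlib.contextmanager
    def atomic(cls):
        cls.open_count += 1
        try:
            yield
        finally:
            cls.open_count -= 1


class FakeJob:
    """Hypothetical stand-in for the Job API used by client scripts."""
    def __init__(self, path):
        self.args = ("archivematicaClamscan", path)
        self.status = None

    @contextlib.contextmanager
    def JobContext(self, logger=None):
        yield

    def set_status(self, status):
        self.status = status


def scan_file(path):
    # The slow part; crucially, no transaction is open while it runs.
    assert FakeTransaction.open_count == 0
    return 0  # 0 == clean


def call(jobs):
    # Do the slow scanning first, buffering the results...
    results = []
    for job in jobs:
        with job.JobContext():
            results.append((job, scan_file(*job.args[1:])))
    # ...then write everything in one short-lived transaction.
    with FakeTransaction.atomic():
        for job, status in results:
            job.set_status(status)


jobs = [FakeJob("a.bin"), FakeJob("b.bin")]
call(jobs)
print([job.status for job in jobs])  # → [0, 0]
```

The key design point is that the transaction now only brackets the fast database writes, not the hours of scanning, so its lifetime no longer scales with file size.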
Thank you @marktriggs. I have one more question. I've started testing with a simple transfer and antivirus scanning failed; this is the error that I'm seeing:
Hm, curious... is your fork_runner.py script missing an execute bit? Mine looks like:

Maybe I need to set that explicitly...
It has it. Also, I've checked and the repo tracks it too, e.g. https://github.com/hudmol/archivematica/blob/mcp-batching/src/MCPClient/lib/fork_runner.py reads "Executable file". I'm going to try again; I'll let you know how it goes.
Just tried on my Linux desktop (using am.git) and it worked this time.

Strange! What were you testing on before? Was that using Vagrant perhaps? We've been exclusively using am.git for development so far...

Docker for Mac + am.git, which is what I use when I work from home. I'll give it another try tonight, but I'm not too worried about it.

Ah right. James & Payten were both testing on Macs too, so I'll get them to run another test. Otherwise, let me know if I can help troubleshoot :)
Do you think it would be a good idea to make `BATCH_SIZE` configurable? I was testing with a transfer containing a small number of large files, and a smaller batch size would have helped split the work across multiple MCP Client instances.
```python
# Generate report in plain text and store it in the database
content = get_content_for(args.unit_type, args.unit_name, args.unit_uuid, html=False)
store_report(content, args.unit_type, args.unit_name, args.unit_uuid)
with transaction.atomic():
```
Does this script need to have a db transaction? It doesn't look like anything is written to the db here, so this line could be removed?

store_report() is called on line 214; if that is writing to the db, maybe the transaction should start one line before, instead of containing the entire for loop?
Yeah, store_report is writing to the DB. This might want the same treatment as the clamscan script: separate the bits that write to the database from the rest, so that we're not holding the transaction open while sending out emails. We can look at this today, as we'll be colocated :)
I've addressed this one now. It'll do all of the (100ish) database writes in a single transaction, after having sent out the emails.
```python
import multiprocessing

def concurrent_instances():
    return multiprocessing.cpu_count()
```
Instead of doing cpu_count() here, can you have this return the value of ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS?

See https://github.com/artefactual/archivematica/tree/stable/1.7.x/src/MCPClient/install#environment-variables

This environment variable often gets set to the cpu count during deployment.
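A possible sketch of that suggestion, assuming the environment variable is read directly (the real MCPClient wires configuration through Django settings; the fallback to `cpu_count()` and the floor of 1 are my additions, prompted by the "at least 1" concern raised below):

```python
import multiprocessing
import os


def concurrent_instances():
    # Prefer the deployment-configured task count; fall back to the CPU count.
    value = os.environ.get("ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS")
    try:
        count = int(value) if value is not None else multiprocessing.cpu_count()
    except ValueError:
        # Unparseable configuration: fall back rather than crash.
        count = multiprocessing.cpu_count()
    # Guard against misconfiguration: always run at least one instance.
    return max(1, count)
```

A deployment that exports `ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS=4` would then get four instances, while an unset or invalid value degrades gracefully to the machine's core count.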
```python
numberOfTasks = 0
```

archivematicaClient already detects the number of cpu cores (see `startThreads(django_settings.NUMBER_OF_TASKS)`).
Ah, it looks like this variable got the chop in 6c33d34. I don't have strong feelings about where we get this value from, as long as we know that it's at least 1 :) Thoughts, anyone?
The purpose of `NUMBER_OF_TASKS` wasn't exactly the same, although it was very similar, but I think it'd be fine to reintroduce it if needed. This could also be done after we merge this PR; that would at least be my preference. I think it's going to be hard to keep this work up to date, as conflicts arise quickly because it touches many files.
That sounds good to me. We can leave `NUMBER_OF_TASKS` out, and consider adding it back in afterwards if we need it. We may not need to.
@sevein I think making `BATCH_SIZE` configurable could make sense for installations running multiple MCP Client instances, yeah. For the single MCP Client case there's not much benefit to configuring it; I did some testing at various batch sizes and found that 128 was a good enough number: big enough to reduce commit overhead, while going bigger didn't improve performance. If you were running multiple MCP Clients and tended to run single transfers with small numbers of large files, it might make sense to drop the number to something smaller (to encourage the batches to split across machines, like you said). If it's relatively easy to make the parameter configurable, then it seems like there's no harm in doing that.
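If `BATCH_SIZE` were made configurable, a minimal sketch might look like this. The environment variable name here is an assumption for illustration; 128 is the default that tested well as described above.

```python
import os


def batch_size():
    # Hypothetical knob; the variable name is assumed, 128 is the tested default.
    try:
        return max(1, int(os.environ.get("ARCHIVEMATICA_BATCH_SIZE", "128")))
    except ValueError:
        return 128


def split_into_batches(files):
    # Yield fixed-size groups of files, one Gearman task per group.
    size = batch_size()
    for i in range(0, len(files), size):
        yield files[i:i + size]


# With the default size, 300 files become batches of 128, 128 and 44.
print([len(b) for b in split_into_batches(list(range(300)))])  # → [128, 128, 44]
```

Dropping the value to, say, 16 would produce many more, smaller Gearman tasks, giving a second MCP Client machine more opportunities to steal work on transfers with few large files.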
@sevein Ah, we reproduced and fixed the Mac issue. It turns out that `file` was returning "fork_runner.pyc", not the .py file! It didn't affect Linux because the permissions on the lib directory don't allow the archivematica user to write out that .pyc file, so it doesn't happen there. But under Docker for Mac it's all owned by archivematica, and the .pyc file does get written out. So I've removed the dependence on `file` now: just added a constant and documented its reason for existing. Computers...
Hi @sevein, we've just pushed up a patch for
@marktriggs thanks, it's working now.
@payten cool, that's great, thank you. You may have noticed that your change broke the tests. It's hard to tell because Travis is now reporting errors at all times due to a different issue, #1164. Running the tests locally is easy though; there are some notes here: https://github.com/artefactual-labs/am/tree/master/compose#tests-are-too-slow. One more thing: I've rebased your branch again (see dev/mcp-batching); you may want to reset again, or do the rebase yourself if you're willing to solve a few conflicts. I just wanted to make sure that it was up to date before doing more testing. Looks like we're pretty close, aren't we?
Thanks @sevein. We've fixed up those tests. Hopefully it's good to go now!
…atching tasks.

* The MCP Server now batches file-level tasks into fixed-size groups, creating one Gearman task per batch, rather than one per file. It also uses fixed-size thread pools to limit contention between threads.
* The MCP Client now operates in batches, processing one batch at a time. It also supports running tasks using a pool of processes (improving throughput where tasks benefit from spanning multiple CPUs).
* The MCP Client scripts now accept a batch of jobs and process them as a single unit. There is a new `Job` API that provides a standard interface for these client scripts, and all scripts have been converted to use this.

Much more documented in issue
…vents generated by the jobs
After much internal discussion at Artefactual, I think we agree that this is looking very good. There are some issues that @sevein is going to document as follow-ups (re-instituting CAPTURE_CLIENT_OUTPUT, making BATCH_SIZE configurable), but those don't need to block this PR; they can be addressed as new issues.
We're ready to merge.
@payten @marktriggs, I've pushed 0e50ea2 which solves the most recent conflict. Can you reset? I'll merge-squash after.
I'm going to list a number of things that we want to work next once this is merged. I'll be filing new issues as needed.
- Make `BATCH_SIZE` configurable.
- Make `linkTaskManagerGetMicroserviceGeneratedListInStdOut` force `wants_output=True` so `getAipStorageLocations_v0.0` doesn't break.
- Restore `CAPTURE_CLIENT_SCRIPT_OUTPUT` in archivematicaClient.py#L179-L182. This is needed by at least one client that wants to avoid persisting the output to disk, because it's expected to see very large streams.
- Document the new use of `LIMIT_TASK_THREADS`.
- Write up some documentation on scalability and the new strategy, e.g. how and when to tweak `BATCH_SIZE` and/or `CAPTURE_CLIENT_SCRIPT_OUTPUT`, why we are batching queries, etc.
- The `requiresOutputLock` field is not needed and should be removed. Other model fields related to workflow data are probably unused too, and should be reviewed in detail.
- Test with client datasets (e.g. Columbia).
- Add `concurrent_instances` to more client scripts.
- Fixes in acceptance tests.
I've just pushed up

Thanks @sevein for all your work reviewing and testing this! And especially for handling the rebases :)
The MCP Server now batches file-level tasks into fixed-size groups, creating one Gearman task per batch, rather than one per file. It also uses fixed-size thread pools to limit contention between threads.

The MCP Client now operates in batches, processing one batch at a time. It also supports running tasks using a pool of processes (improving throughput where tasks benefit from spanning multiple CPUs).

The MCP Client scripts now accept a batch of jobs and process them as a single unit. There is a new `Job` API that provides a standard interface for these client scripts, and all scripts have been converted to use this.
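To make the batch-of-jobs shape concrete, here is a rough sketch of what a converted client script looks like. The `Job` class below is a hypothetical stand-in for illustration, based only on the calls visible in the snippets quoted earlier in this thread (`job.args`, `JobContext`, `set_status`):

```python
import contextlib


class Job:
    """Hypothetical stand-in for the MCP Client's Job API."""

    def __init__(self, *args):
        self.args = args          # args[0] is the script name, by convention
        self.status = None

    @contextlib.contextmanager
    def JobContext(self, logger=None):
        # The real version would set up per-job logging and error capture.
        yield

    def set_status(self, status):
        self.status = status


# A client script exposes a single call(jobs) entry point that is handed
# a whole batch at once, instead of being invoked once per file.
def call(jobs):
    for job in jobs:
        with job.JobContext():
            path = job.args[1]
            # Toy per-file work standing in for a real client script's logic.
            job.set_status(0 if path.endswith(".txt") else 1)


batch = [Job("identify", "a.txt"), Job("identify", "b.bin")]
call(batch)
print([job.status for job in batch])  # → [0, 1]
```

Because the whole batch passes through one `call()` invocation, a script can also hoist shared setup (database connections, subprocess pools) out of the per-file loop, which is where much of the speedup comes from.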
The motivation for this work was to improve performance on transfer and ingest workflows, and to provide an improved interface for implementing client scripts.
Our testing shows transfers and ingests taking approximately half the time they did without these changes.
These changes also permit further optimisation of client scripts, by taking advantage of processing files in batches rather than one at a time. We did some work on optimising a few of the client scripts, but there is likely more improvement to be gained.
This is connected to #938.