Rework the MCP Server, MCP Client and MCP Client scripts to support batching tasks #1105

Merged
merged 11 commits into artefactual:qa/1.x from hudmol:mcp-batching on Jul 5, 2018

Conversation

jambun
Contributor

@jambun jambun commented May 23, 2018

  • The MCP Server now batches file-level tasks into fixed-size
    groups, creating one Gearman task per batch, rather than one per
    file (a rough sketch of this grouping follows the list). It also
    uses fixed-size thread pools to limit contention between threads.

  • The MCP Client now operates in batches, processing one batch at a
    time. It also supports running tasks using a pool of
    processes (improving throughput where tasks benefit from spanning
    multiple CPUs.)

  • The MCP Client scripts now accept a batch of jobs and process them
    as a single unit. There is a new Job API that provides a
    standard interface for these client scripts, and all scripts have
    been converted to use this.
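
A rough, purely illustrative sketch of the grouping idea from the first bullet (the task list and the submit step here are hypothetical stand-ins, not the actual MCPServer code):

BATCH_SIZE = 128  # the default size discussed later in this thread

def batches(tasks, size=BATCH_SIZE):
    """Yield successive fixed-size groups of file-level tasks."""
    for i in range(0, len(tasks), size):
        yield tasks[i:i + size]

# 300 hypothetical file-level tasks become 3 Gearman submissions
# (128 + 128 + 44) instead of 300 individual ones.
file_tasks = ["file-%d" % n for n in range(300)]
for group in batches(file_tasks):
    print("submitting one Gearman task covering %d files" % len(group))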

The motivation for this work was to improve performance on transfer and ingest workflows, and to provide an improved interface for implementing client scripts.

Our testing shows transfers and ingests taking approximately half the time they did without these changes.

These changes also open the door to further optimisation of individual client scripts, since they can now work on a whole batch of files rather than one file at a time. We optimised a few of the scripts as part of this work, but there is likely more improvement to be gained.
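
For example (purely illustrative; the model, field and helper names here are assumptions rather than code from this PR), a script that previously issued one database query per file can now issue a single query for the whole batch:

from main.models import File  # import path assumed for illustration

def call(jobs):
    # Collect the file UUIDs for the whole batch up front
    # (the argument position is assumed for illustration).
    uuids = [job.args[1] for job in jobs]

    # One query for the whole batch instead of one query per file.
    files_by_uuid = {f.uuid: f for f in File.objects.filter(uuid__in=uuids)}

    for job in jobs:
        process_file(files_by_uuid.get(job.args[1]))  # process_file is hypothetical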


This is connected to #938.

@marktriggs
Contributor

I've recorded a walkthrough of the major changes to the code:

http://tsp.nz/d/7ccefcf27452d9c7fdbfb805acb5dd42a13b0f5d.mp4#/mcp_batching.mp4

It runs about 30 minutes (sorry), but if you watch it at double speed it still mostly makes sense and is more fun to watch.

@sevein
Member

sevein commented Jun 25, 2018

@marktriggs @jambun I'm really impressed by the work you guys have done here. I think it makes it much simpler to understand how things work in general, and particularly in MCPServer, which is a piece that has not been very well understood.

While I was testing this and going through the changes I noticed a number of smaller issues. I wanted to fix them, but as I don't have access to your repo I did it here in dev/mcp-batching. I basically took your commit 02e0160 and rebased it onto qa/1.x so the branch is up to date again. Then I added a number of small commits addressing things like cosmetics, unused settings (you made up to five settings redundant, I believe!) and other minor changes of varying importance. You may want to reset your branch if you agree with the changes I'm suggesting.

One question I have with the batching solution is whether running very long transactions can bring other issues. E.g. if I have a transfer with 50 files of 10G each and the clamscan script does the following:

def call(jobs):
    with transaction.atomic():
        for job in jobs:
            with job.JobContext(logger=logger):
                job.set_status(scan_file(*job.args[1:]))

I think that means that we'd be keeping that transaction open for as long as it'd take to scan all the files. That could potentially take hours. Could that hurt the performance of the database server significantly?

@marktriggs
Contributor

Thanks @sevein! We've pulled in your changes now, so the branch should be up to date.

Good point on the transaction length issue too. We're going to modify that client script to buffer the events and insert them all at the end of the batch, so we won't have to keep that transaction open the whole time. We'll do another pass through the other client scripts and see if any of them need the same treatment.

Cheers!
Mark

@sevein
Member

sevein commented Jun 26, 2018

Thank you @marktriggs. I have one more question. I've started testing with a simple transfer and antivirus scanning failed - this is the error that I'm seeing:

*** RUNNING TASK: archivematicaclamscan_v0.0
Execution failed: [Errno 13] Permission denied
Execution failed: [Errno 13] Permission denied
archivematicaClient.py: ERROR     2018-06-26 15:47:11,311  archivematica.mcp.client:execute_command:200:  Exception while processing task archivematicaclamscan_v0.0:
Traceback (most recent call last):
  File "/src/MCPClient/lib/archivematicaClient.py", line 172, in execute_command
    jobs = handle_batch_task(gearman_job)
  File "/src/MCPClient/lib/archivematicaClient.py", line 135, in handle_batch_task
    fork_runner.call("clientScripts." + module_name, jobs, task_count=module.concurrent_instances())
  File "/src/MCPClient/lib/fork_runner.py", line 44, in call
    finished_jobs += result.get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
EOFError

@marktriggs
Contributor

marktriggs commented Jun 26, 2018

Hm, curious... is your fork_runner.py script missing an execute bit? Mine looks like:

-rwxr-xr-x 1 mst mst 3442 Jun 13 11:05 ./MCPClient/lib/fork_runner.py

Maybe I need to set that explicitly...

@sevein
Member

sevein commented Jun 26, 2018

Hm, curious... is your fork_runner.py script missing an execute bit?

It has it. Also I've checked and the repo tracks it too, e.g. https://github.com/hudmol/archivematica/blob/mcp-batching/src/MCPClient/lib/fork_runner.py reads "Executable file".

I'm going to try again, I'll let you know how it goes.

@sevein
Member

sevein commented Jun 26, 2018

Just tried on my Linux desktop (using am.git) and it worked this time.

@marktriggs
Contributor

Strange! What were you testing on before? Was that using Vagrant perhaps?

We've been exclusively using am.git for development so far...

@sevein
Member

sevein commented Jun 26, 2018

Strange! What were you testing on before? Was that using Vagrant perhaps?

Docker for Mac + am.git which is what I use when I work from home. I'll give it another try tonight but I'm not too worried about it.

@marktriggs
Contributor

Ah right. James & Payten were both testing on Macs too, so I'll get them to run another test. Otherwise, let me know if I can help troubleshoot :)

@sevein
Member

sevein commented Jun 26, 2018

Do you think it would be a good idea to make BATCH_SIZE configurable? We don't have to do this now. We rarely see deployments with MCPClient replicas, so it's not a priority, but if that were the case, perhaps packing tasks into smaller groups would help to parallelize the work?

I was testing with docker-compose up -d --scale archivematica-mcp-client=N and something I noticed right away is that the batches created by linkTaskManagerFiles have the side effect of allocating work mostly to a single worker in the cluster unless the number of files is bigger than the batch size.
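
To make the effect concrete (the file counts are made up; 128 is the batch size mentioned further down in this thread):

import math

BATCH_SIZE = 128
for n_files in (50, 128, 300, 1000):
    n_batches = int(math.ceil(n_files / float(BATCH_SIZE)))
    # A 50-file transfer produces a single Gearman task, so only one MCP Client
    # in the cluster does any work; 300 files produce 3 tasks that can spread out.
    print("%4d files -> %d batch task(s)" % (n_files, n_batches))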

# Generate report in plain text and store it in the database
content = get_content_for(args.unit_type, args.unit_name, args.unit_uuid, html=False)
store_report(content, args.unit_type, args.unit_name, args.unit_uuid)
with transaction.atomic():
Member

@jhsimpson jhsimpson Jun 26, 2018

Does this script need to have a db transaction? It doesn't look like anything is written to the db here, so this line could be removed?

store_report() is called on line 214; if that is writing to the db, maybe the transaction should start one line before it, instead of containing the entire for loop?

Contributor

Yeah, store_report is writing to the DB. This might want the same treatment as the clamscan script--separate the bits that write to the database from the rest, so that we're not holding open the transaction while sending out emails. We can look at this today as we'll be colocated :)

Contributor

I've addressed this one now. It'll store all 100ish reports in a single transaction after having sent out the emails.


import multiprocessing

def concurrent_instances():
    return multiprocessing.cpu_count()
Member

Instead of doing cpu_count() here, can you have this return the value of ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_NUMBEROFTASKS?
See https://github.com/artefactual/archivematica/tree/stable/1.7.x/src/MCPClient/install#environment-variables

This environment variable often gets set to the CPU count during deployment.

archivematicaClient already detects the number of CPU cores (see startThreads(django_settings.NUMBER_OF_TASKS)).
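
A minimal sketch of that suggestion, assuming the value is still exposed to Django settings as NUMBER_OF_TASKS (as in the startThreads call above):

from django.conf import settings as django_settings

def concurrent_instances():
    # Reuse the deployment-provided task count rather than probing the CPUs,
    # and guard against misconfiguration so the pool size is always at least 1.
    return max(1, int(django_settings.NUMBER_OF_TASKS))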

Contributor

Ah, it looks like this variable got the chop in 6c33d34. I don't have strong feelings about where we get this value from, as long as we know that it's at least 1 :) Thoughts, anyone?

Member

The purpose of NUMBER_OF_TASKS wasn't exactly the same, although it was very similar, so I think it'd be fine to reintroduce it if needed. This could also be done after we merge this PR; that would be my preference at least. I think it's going to be hard to keep this work up to date, since conflicts arise quickly because it touches many files.

Member

That sounds good to me. We can leave NUMBER_OF_TASKS out, and consider adding it back in after, if we need it. We may not need to.

@marktriggs
Contributor

@sevein I think making BATCH_SIZE configurable could make sense for installations running multiple MCP Client instances, yeah.

For the single MCP Client case there's not much benefit to configuring it; I did some testing at various batch sizes and found that 128 was a good enough number--big enough to reduce commit overhead, and going bigger didn't improve performance.

If you were running multiple MCP Clients and tended to run single transfers with small numbers of large files, it might make sense to drop the number to something smaller (to encourage them to split across machines, like you said). If it's relatively easy to make the parameter configurable then it seems like there's no harm in doing that.
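
If it does become configurable, a sketch of the shape it could take (the environment variable name here is an assumption, not something this PR adds):

import os

# Fall back to the tested default of 128 when the variable is unset or invalid.
try:
    BATCH_SIZE = max(1, int(os.environ.get("ARCHIVEMATICA_MCPSERVER_BATCH_SIZE", "128")))
except ValueError:
    BATCH_SIZE = 128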

@marktriggs
Contributor

@sevein Ah, we reproduced and fixed the Mac issue. It turns out that `__file__` was returning "fork_runner.pyc", not the .py file! It didn't affect Linux because the permissions on the lib directory don't allow the archivematica user to write out that pyc file, so it doesn't happen. But under Mac Docker it's all owned by archivematica and the pyc file gets written out.

So I've removed the dependence on `__file__` now. Just added a constant and documented its reason for existing.

Computers...
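
A sketch of the kind of fix described (the constant names are assumptions):

import os

# __file__ can point at the compiled "fork_runner.pyc" when the interpreter has
# been able to write the bytecode file next to the source (as happens under
# Docker for Mac), so name the script explicitly instead of deriving it.
THIS_SCRIPT = "fork_runner.py"
FORK_RUNNER_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), THIS_SCRIPT)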

@payten
Contributor

payten commented Jun 27, 2018

Hi @sevein, we've just pushed up a patch for archivematica_clamscan which moves the database transaction to only cover the creation of the events -- we collect these events for all jobs as we go and create them in one fell swoop.
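
Sketched from that description (the return shape of scan_file and the event-saving step are assumptions made for illustration), the pattern looks roughly like this:

import logging

from django.db import transaction

logger = logging.getLogger("archivematica.mcp.client")

def call(jobs):
    queued_events = []

    # Scan everything first, outside any transaction, collecting the events
    # that need to be written to the database as we go.
    for job in jobs:
        with job.JobContext(logger=logger):
            status, events = scan_file(*job.args[1:])
            job.set_status(status)
            queued_events.extend(events)

    # The transaction now only covers the event inserts, not the scanning itself.
    with transaction.atomic():
        for event in queued_events:
            event.save()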

@sevein
Member

sevein commented Jun 28, 2018

So I've removed the dependence on `__file__` now. Just added a constant and documented its reason for existing.

@marktriggs thanks, it's working now.

Hi @sevein, we've just pushed up a patch for archivematica_clamscan which moves the database transaction to only cover the creation of the events -- we collect these events for all jobs as we go and create them in one fell swoop.

@payten cool that's great, thank you. You may have noticed that your change broke the tests. It's hard to tell because Travis is now reporting errors at all times because of a different issue, #1164. Running the tests locally is easy though. There are some notes here: https://github.com/artefactual-labs/am/tree/master/compose#tests-are-too-slow.

One more thing: I've rebased your branch again (see dev/mcp-batching) - you may want to reset again, or do it yourself if you're willing to solve a few conflicts. I just wanted to make sure it was up to date before doing more testing.

Looks like we're pretty close, aren't we?

@marktriggs
Contributor

Thanks @sevein! I've just reset from your branch and @payten is looking at the test failure so we'll push a fix for that shortly...

@payten
Contributor

payten commented Jun 29, 2018

Thanks @sevein. We've fixed up those tests. Hopefully it's good to go now!

marktriggs and others added 3 commits July 4, 2018 13:17
@jhsimpson
Member

After much internal discussion at Artefactual, I think we agree that this is looking very good. There are some issues that @sevein is going to document as follow-ups (re-instituting CAPTURE_CLIENT_OUTPUT, making BATCH_SIZE configurable), but those don't need to block this PR; they can be addressed as new issues.

@jhsimpson jhsimpson added this to the 1.8.0 milestone Jul 4, 2018
Member

@sevein sevein left a comment

We're ready to merge.

@payten @marktriggs, I've pushed 0e50ea2 which solves the most recent conflict. Can you reset? I'll merge-squash after.

I'm going to list a number of things that we want to work on next once this is merged. I'll be filing new issues as needed.

  • Make BATCH_SIZE configurable.
  • Make linkTaskManagerGetMicroserviceGeneratedListInStdOut force wants_output=True so getAipStorageLocations_v0.0 doesn't break.
  • Restore CAPTURE_CLIENT_SCRIPT_OUTPUT in archivematicaClient.py#L179-L182. This is needed by at least one client that wants to avoid persisting the output to disk because very large output streams are expected.
  • Document new use of LIMIT_TASK_THREADS.
  • Write up some documentation on scalability and the new strategy, e.g. how and when to tweak BATCH_SIZE and/or CAPTURE_CLIENT_SCRIPT_OUTPUT, why we are batching queries, etc.
  • The requiresOutputLock field is not needed and should be removed. Other model fields related to workflow data are probably unused too and should be reviewed in detail.
  • Test with client datasets (e.g. Columbia) - add concurrent_instances to more client scripts.
  • Fixes in acceptance tests.

@marktriggs
Contributor

I've just pushed up hudmol/mcp-batching to match your latest rebase, so it should be good to go.

Thanks @sevein for all your work reviewing and testing this! And especially for handling the rebases :)

@sevein sevein merged commit 94d7275 into artefactual:qa/1.x Jul 5, 2018
@sevein sevein mentioned this pull request Jul 5, 2018
@jhsimpson jhsimpson mentioned this pull request Jul 5, 2018
qubot pushed a commit that referenced this pull request Jul 21, 2018
Full description in #1176.
This is connected to #1105.

Co-authored-by: Mark Triggs <mark@dishevelled.net>
@sevein sevein mentioned this pull request Jul 21, 2018
qubot pushed a commit that referenced this pull request Jul 21, 2018
Full description in #1176.
This is connected to #1105.

Co-authored-by: Mark Triggs <mark@dishevelled.net>