Need new component, InputSandboxCache? #1400

ericvaandering · 2011-04-07T23:11:00Z

We spent a while discussing this today. All of us favor an approach where the user sandbox flow is as follows:

Client uploads the sandbox to ReqMgr/CRABInterface via http/s in the same way that the CMSSW _cfg.py is uploaded. This will be secured by X509 proxy, same as posting to the CRABInterface.

The CRABInterface uploads, via REST interface, the user sandbox to the sandbox cache which responds with an identifier for the sandbox in "the cache". This identifier is returned to the client. When the job is submitted by the client, this identifier is passed along to the various work queues and is included in the job spec.

Here the handling of the config in Couch and the sandbox in a different cache would differ. The user sandbox would not be placed in the job sandbox, but would rather be downloaded directly by the worker node once the job has started. Eventually this wget would go through a squid cache at the remote site and result in smaller network loads.
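As a rough illustration of the worker-node side, the download step could look like the sketch below. The `/download/<identifier>` URL layout, the function names, and the `proxy` argument are assumptions made for this example; none of them are defined in this ticket. Note that a squid would only be able to cache the transfer if it goes over plain HTTP.

```python
import shutil
import urllib.request
from urllib.parse import quote


def sandbox_url(cache_base, identifier):
    """Build the download URL for a sandbox identifier (hypothetical URL layout)."""
    return cache_base.rstrip("/") + "/download/" + quote(identifier, safe="")


def fetch_sandbox(cache_base, identifier, dest, proxy=None):
    """Download a sandbox to `dest`, optionally through an HTTP proxy such as a site squid."""
    handlers = []
    if proxy:
        # An explicit ProxyHandler replaces the default, environment-based one.
        handlers.append(urllib.request.ProxyHandler({"http": proxy}))
    opener = urllib.request.build_opener(*handlers)
    url = sandbox_url(cache_base, identifier)
    with opener.open(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

A job wrapper might call `fetch_sandbox(base, identifier, "sandbox.tgz", proxy=site_squid)` before unpacking, where `base` and `site_squid` would come from the job spec and the site configuration.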

Presumably the identifier in the cache would be or would include a hash of the contents of the sandbox so that repeated submission of the same sandbox would not result in wasted space in the cache nor extra bandwidth between the squid and the cache.
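A minimal sketch of a content-addressed identifier, assuming SHA-256 as the hash and a flat directory as the cache (both are choices made for this example, not decisions from the ticket). Because the file name is the hash of the contents, storing the same sandbox twice leaves a single copy.

```python
import hashlib
import os
import shutil
import tempfile


def sandbox_id(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_sandbox(path, cache_dir):
    """Copy `path` into `cache_dir` under its content hash; skip the copy if already cached."""
    identifier = sandbox_id(path)
    target = os.path.join(cache_dir, identifier)
    if not os.path.exists(target):
        # Copy to a temporary name first, then rename, so that a reader
        # never sees a partially written sandbox under its final name.
        fd, tmp = tempfile.mkstemp(dir=cache_dir)
        os.close(fd)
        shutil.copyfile(path, tmp)
        os.replace(tmp, target)
    return identifier
```

The returned identifier is what would be handed back to the client and carried in the job spec.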

The other option, not favored, was to have the local work queue fetch the sandbox from the cache and include it in the job sandbox. We felt this would waste too much bandwidth between the submitting machine and the remote CE.

In any case the major issue is that we need to find or build "the cache" with a REST interface. Does any such thing exist in our software stack already or do we have the option to use a third party supplied option? This would probably not be the most difficult thing to write ourselves, but we worry about doing it right. On the other hand, something we do ourselves can easily include cleanups, diagnostics for Ops, and perhaps pinning of additional sandboxes for MC generation, etc.

This whole approach has the advantage of allowing staged testing. Initially we would use a static URL as the sandbox without any upload capability but test the WN or workqueue level stuff that will have to be added to allow HTTP-accessible sandboxes.

We'd like to have a discussion, both of the sandbox data flow and possible implementations of the cache before opening a couple more tickets to address all the details.

sfoulkes · 2011-04-08T05:31:42Z

sfoulkes: This seems reasonable to me. I'd suggest using cherrypy to serve up the files, as there is already support for that in WMCore, plus a cron'd script to prune older sandboxes as the disk fills up. Diagnostics and other bells and whistles would be built into the cherrypy server or the crab rest interface.
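One possible shape for the pruning script, sketched under the assumption that the cache is a flat directory and that "as the disk fills up" is approximated by a byte budget on the directory (the `max_bytes` parameter and the function name are hypothetical). It removes the oldest files first, by modification time.

```python
import os


def prune_cache(cache_dir, max_bytes):
    """Delete the oldest files in `cache_dir` until its total size is <= max_bytes.

    Returns the names of the removed files, oldest first.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            info = os.stat(path)
            entries.append((info.st_mtime, info.st_size, name))
    entries.sort()  # oldest modification time first
    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, name in entries:
        if total <= max_bytes:
            break
        os.remove(os.path.join(cache_dir, name))
        total -= size
        removed.append(name)
    return removed
```

Run from cron, this would be called with the cache directory and whatever budget Ops chooses; pinned sandboxes would need an exclusion list, which this sketch does not include.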

ericvaandering · 2011-04-08T20:17:39Z

ewv: A note to myself on how to implement this:

http://www.cherrypy.org/wiki/FileUpload

ericvaandering · 2011-04-19T03:06:05Z

ewv: Please review

Uses (modified) REST model for the uploading part, Page model for downloading.

DMWMBot · 2011-04-28T22:37:21Z

mmascher: Ouch... You are right, it works. I'm a moron...

ericvaandering · 2011-05-06T00:35:32Z

ewv: Simon, can you please review and then either check in or pass it on to someone else for further review?

drsm79 · 2011-05-06T18:52:20Z

metson: The code in the patch looks fine from a quick look. However, shouldn't this be in CRAB and not WMCore? What other systems will have a UserFileCache?

ericvaandering · 2011-05-06T21:04:15Z

ewv: I don't have a strong opinion, but I put it in WMCore for two reasons.

1. I wanted it started with the Local WQ/Agent cluster of things.
2. I figured it may be of more general use with MC workflows that have to ship big LHE files or whatever. Those could be run in production.

So make a decision and I will relocate it if needed.

evansde77 · 2011-05-06T23:10:49Z

evansde: For 2, in the production case the LHE files will either be converted to EDM GEN files at CERN or shipped via squids or the DM system like normal data.
So I think that would make this Crab Only.

Question: Is there a maximum size limit on the input sandbox? The idea that a user could dump a couple of GB of data in there and send it to a batch system that copies it per job could lead to some issues with load, even with caching etc.

ericvaandering · 2011-05-06T23:21:19Z

ewv: At the moment there is no limit, but we can and should enforce something in the client, I think. I think CRAB2 enforces a 50 MB limit which comes from gLite. We had issues with PAT libraries being larger than that when they weren't in the release, but I haven't heard of that recently. So maybe 50 or 100 MB will be a good starting point.

So in answer to Simon's question, it sounds like I should relocate this to CRABServer.

drsm79 · 2011-05-07T00:04:30Z

metson: Replying to [comment:11 ewv]:

> So in answer to Simon's question, it sounds like I should relocate this to CRABServer.

Yeah, I think that's best. Also, the patch has no tests in it. Can you add them at the same time?

ericvaandering · 2011-05-07T00:07:14Z

ewv: Yeah. I'll have to find an example of tests for a web service.

drsm79 · 2011-05-07T00:52:09Z

metson: https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/test/python/WMCore_t/WebTools_t

spigad · 2011-05-07T07:06:54Z

spiga: Replying to [comment:11 ewv]:

> At the moment there is no limit, but we can and should enforce something in the client, I think. I think CRAB2 enforces a 50 MB limit which comes from gLite. We had issues with PAT libraries being larger than that when they weren't in the release, but I haven't heard of that recently. So maybe 50 or 100 MB will be a good starting point.

The limit we have now should be 100 MB (the gLite limit was 10 MB and applies to direct submission only). I agree to start with 100; I'd also make it configurable.

> So in answer to Simon's question, it sounds like I should relocate this to CRABServer.
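A client-side check along the lines discussed above could be as small as the following sketch. The function name, the constant, and the choice of `ValueError` are assumptions for illustration; the 100 MB default is the starting point suggested in this thread, and the `max_mb` argument is what makes it configurable.

```python
import os

DEFAULT_MAX_SANDBOX_MB = 100  # starting point suggested in the thread


def check_sandbox_size(path, max_mb=DEFAULT_MAX_SANDBOX_MB):
    """Return the size of `path` in bytes, or raise ValueError if it exceeds `max_mb` megabytes."""
    size = os.path.getsize(path)
    limit = max_mb * 1024 * 1024
    if size > limit:
        raise ValueError(
            "sandbox %s is %d bytes, over the %s MB limit" % (path, size, max_mb)
        )
    return size
```

The client would call this before uploading, so that an oversized sandbox is rejected before any bandwidth is spent.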

ericvaandering · 2011-05-12T00:14:33Z

ewv: Please review. New and improved with Unit tests

ericvaandering · 2011-05-24T03:30:31Z

ewv: Can this please be reviewed and checked in?

spigad · 2011-05-24T15:28:37Z

spiga: As agreed, I would first give the current stuff to integration and then move ahead.

A few things are still missing or not working in the deployment (including a problem I discovered yesterday which apparently didn't show up in previous tests?!?).

To be more precise: as soon as the next wmcore tag is cut we move on.

ericvaandering mentioned this issue Jul 24, 2012

CRABInterface communications with UserFileCache #1465

Closed

This issue was closed.
