Need new component, InputSandboxCache? #1400

Closed
ericvaandering opened this issue Apr 7, 2011 · 16 comments
Comments

@ericvaandering
Member

We spent a while discussing this today. All of us favor an approach where the user sandbox flow is as follows:

Client uploads the sandbox to ReqMgr/CRABInterface via http/s in the same way that the CMSSW _cfg.py is uploaded. This will be secured by X509 proxy, same as posting to the CRABInterface.

The CRABInterface uploads, via REST interface, the user sandbox to the sandbox cache which responds with an identifier for the sandbox in "the cache". This identifier is returned to the client. When the job is submitted by the client, this identifier is passed along to the various work queues and is included in the job spec.

Here the handling of the config in Couch and the sandbox in a different cache would differ. The user sandbox would not be placed in the job sandbox, but would rather be downloaded directly by the worker node once the job has started. Eventually this wget would go through a squid cache at the remote site and result in smaller network loads.

Presumably the identifier in the cache would be, or would include, a hash of the contents of the sandbox, so that repeated submission of the same sandbox would waste neither space in the cache nor bandwidth between the squid and the cache.
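
A minimal sketch of the content-hash identifier idea described above, not the actual implementation. The function names `sandbox_id` and `store_sandbox`, the flat cache directory layout, and the choice of SHA-256 are all assumptions made for illustration; the thread does not specify any of them.

```python
import hashlib
import os
import shutil


def sandbox_id(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_sandbox(path, cache_dir):
    """Copy a sandbox into cache_dir under its content hash.

    Returns (identifier, newly_stored). If a file with the same hash is
    already cached, nothing is copied and newly_stored is False.
    """
    identifier = sandbox_id(path)
    target = os.path.join(cache_dir, identifier)
    if os.path.exists(target):
        return identifier, False
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copyfile(path, target)
    return identifier, True
```

Because the identifier depends only on the contents, uploading the same sandbox twice yields the same identifier and a single cached file, and a squid in front of the cache would see one URL for it.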

The other option, not favored, was to have the local work queue fetch the sandbox from the cache and include it in the job sandbox. We felt this would waste too much bandwidth between the submitting machine and the remote CE.

In any case the major issue is that we need to find or build "the cache" with a REST interface. Does any such thing exist in our software stack already or do we have the option to use a third party supplied option? This would probably not be the most difficult thing to write ourselves, but we worry about doing it right. On the other hand, something we do ourselves can easily include cleanups, diagnostics for Ops, and perhaps pinning of additional sandboxes for MC generation, etc.

This whole approach has the advantage of allowing staged testing. Initially we would use a static URL as the sandbox, without any upload capability, but test the WN or workqueue level stuff that will have to be added to allow HTTP-accessible sandboxes.

We'd like to have a discussion, both of the sandbox data flow and possible implementations of the cache before opening a couple more tickets to address all the details.

@sfoulkes

sfoulkes commented Apr 8, 2011

sfoulkes: This seems reasonable to me. I'd suggest using cherrypy to serve up the files, as there is already support for that in WMCore, and a cron'd script to prune older sandboxes as the disk fills up. Diagnostics and other bells and whistles would be built into the cherrypy server or the crab rest interface.
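
A rough sketch of what the cron'd pruning script suggested above might look like, under assumptions not stated in the thread: the cache is a flat directory of files, "older" means oldest modification time, and the trigger is a total size budget in bytes. The name `prune_cache` and its parameters are hypothetical.

```python
import os


def prune_cache(cache_dir, max_bytes):
    """Delete the oldest files in cache_dir until it fits in max_bytes.

    Returns the list of removed paths, oldest first.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            info = os.stat(path)
            entries.append((info.st_mtime, info.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    # Sorting the tuples orders the files by modification time, oldest first.
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

Run from cron, this would be called with the cache directory and whatever size budget Ops chooses; pinned sandboxes, if they are ever supported, would need to be excluded before the deletion loop.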

@ericvaandering
Member Author

ewv: A note to myself on how to implement this:

http://www.cherrypy.org/wiki/FileUpload

@ericvaandering
Member Author

ewv: Please review

Uses (modified) REST model for the uploading part, Page model for downloading.

@DMWMBot

DMWMBot commented Apr 28, 2011

mmascher: Ouch... You are right, it works. I'm a moron...

@ericvaandering
Member Author

ewv: Simon, can you please review and then either check in or pass it on to someone else for further review?

@drsm79

drsm79 commented May 6, 2011

metson: The code in the patch looks fine from a quick look. However, shouldn't this be in CRAB and not WMCore? What other systems will have a UserFileCache?

@ericvaandering
Member Author

ewv: I don't have a strong opinion, but I put it in WMCore for two reasons.

  1. I wanted it started with the Local WQ/Agent cluster of things
  2. I figured it may be of more general use with MC workflows that have to ship big LHE files or whatever. Those could be run in production.

So make a decision and I will relocate it if needed.

@evansde77

evansde: For 2, in the production case the LHE files will either be converted to EDM GEN files at CERN or shipped via squids or the DM system like normal data.
So I think that would make this Crab Only.

Question: Is there a maximum size limit on the input sandbox? The idea that a user could dump a couple of GB of data in there and send it to a batch system that copies it per job could lead to some issues with load, even with caching etc.

@ericvaandering
Member Author

ewv: At the moment there is no limit, but we can and should enforce something in the client, I think. I think CRAB2 enforces a 50 MB limit which comes from gLite. We had issues with PAT libraries being larger than that when they weren't in the release, but I haven't heard of that recently. So maybe 50 or 100 MB will be a good starting point.

So in answer to Simon's question, it sounds like I should relocate this to CRABServer.

@drsm79

drsm79 commented May 7, 2011

metson: Replying to [comment:11 ewv]:

So in answer to Simon's question, it sounds like I should relocate this to CRABServer.

Yeah, I think that's best. Also, the patch has no tests in it. Can you add them at the same time?

@ericvaandering
Member Author

ewv: Yeah. I'll have to find an example of tests for a web service.

@drsm79

drsm79 commented May 7, 2011

@spigad
Member

spigad commented May 7, 2011

spiga: Replying to [comment:11 ewv]:

At the moment there is no limit, but we can and should enforce something in the client, I think. I think CRAB2 enforces a 50 MB limit which comes from gLite. We had issues with PAT libraries being larger than that when they weren't in the release, but I haven't heard of that recently. So maybe 50 or 100 MB will be a good starting point.

The limit we have now should be 100 MB (the gLite limit was 10 MB and applied to direct submission only). I agree to start with 100; I'd also make it configurable.

So in answer to Simon's question, it sounds like I should relocate this to CRABServer.
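
A small sketch of a configurable client-side size check along the lines discussed above. The 100 MB default comes from the thread; the names `check_sandbox_size`, `SandboxTooLarge`, and `DEFAULT_MAX_SANDBOX_MB` are hypothetical and not taken from CRAB or WMCore.

```python
import os

# Starting point suggested in the thread; assumed to be overridable from config.
DEFAULT_MAX_SANDBOX_MB = 100


class SandboxTooLarge(Exception):
    """Raised when a sandbox exceeds the configured size limit."""


def check_sandbox_size(path, max_mb=DEFAULT_MAX_SANDBOX_MB):
    """Return the sandbox size in bytes, or raise if it exceeds max_mb."""
    size = os.path.getsize(path)
    limit = max_mb * 1024 * 1024
    if size > limit:
        raise SandboxTooLarge(
            "%s is %d bytes; the limit is %d MB" % (path, size, max_mb)
        )
    return size
```

Checking in the client before the upload avoids sending an oversized file at all; the server could presumably repeat the same check, since a client-side limit alone can be bypassed.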

@ericvaandering
Member Author

ewv: Please review. New and improved, with unit tests.

@ericvaandering
Member Author

ewv: Can this please be reviewed and checked in?

@spigad
Member

spigad commented May 24, 2011

spiga: As agreed, I would first give the current stuff to integration and then move ahead.

A few things are still missing or not working in the deployment (including a problem I discovered yesterday which apparently didn't show up in previous tests?!?).

To be more precise: as soon as the next wmcore tag is cut we move on.

This issue was closed.