Need new component, InputSandboxCache? #1400
sfoulkes: This seems reasonable to me. I'd suggest using cherrypy to serve up the files as there is already support for that in WMCore and a cron'd script to prune older sandboxes as the disk fills up. Diagnostics and other bells and whistles would be built into the cherrypy server or the crab rest interface. |
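The cron'd pruning script suggested here could be as simple as the sketch below; the cache path and the retention window are illustrative assumptions, not values from this thread.

```python
#!/usr/bin/env python
"""Sketch of a cron'd pruning script for the sandbox cache: delete cached
sandbox files whose modification time is older than a configurable cutoff.
CACHE_DIR and MAX_AGE_DAYS are hypothetical placeholders."""

import os
import time

CACHE_DIR = "/data/sandbox-cache"  # hypothetical cache location
MAX_AGE_DAYS = 14                  # hypothetical retention window


def prune(cache_dir=CACHE_DIR, max_age_days=MAX_AGE_DAYS):
    """Remove files older than max_age_days; return the pruned names."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    for name in prune():
        print("pruned", name)
```

Run from cron, this keeps disk usage bounded without any bookkeeping beyond file mtimes; a fancier version could prune by total size instead of age.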
ewv: A note to myself on how to implement this: |
ewv: Please review. Uses a (modified) REST model for the uploading part, the Page model for downloading. |
mmascher: Ouch... You are right, it works. I'm a moron... |
ewv: Simon, can you please review and then either check in or pass it on to someone else for further review? |
metson: The code in the patch looks fine from a quick look. However, shouldn't this be in CRAB and not WMCore? What other systems will have a UserFileCache? |
ewv: I don't have a strong opinion, but I put it in WMCore for two reasons.
So make a decision and I will relocate it if needed. |
evansde: For 2, in the production case the LHE files will either be converted to EDM GEN files at CERN or shipped via squids or the DM system like normal data. Question: Is there a maximum size limit on the input sandbox? The idea that a user could dump a couple of GB of data in there and send it to a batch system that copies it per job could lead to some issues with load, even with caching etc. |
ewv: At the moment there is no limit, but we can and should enforce something in the client, I think. I think CRAB2 enforces a 50 MB limit which comes from gLite. We had issues with PAT libraries being larger than that when they weren't in the release, but I haven't heard of that recently. So maybe 50 or 100 MB will be a good starting point. So in answer to Simon's question, it sounds like I should relocate this to CRABServer. |
metson: Replying to [comment:11 ewv]:
Yeah, I think that's best. Also, the patch has no tests in it. Can you add them at the same time? |
ewv: Yeah. I'll have to find an example of tests for a web service. |
spiga: Replying to [comment:11 ewv]:
The limit we have now should be 100 MB (the gLite limit was 10 MB and applied to direct submission only). I agree to start with 100; I'd also make it configurable. |
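The configurable client-side limit discussed above could look like the following sketch; the function name and the error message are illustrative, not CRAB's actual implementation.

```python
"""Sketch of a client-side sandbox size check, run before upload.
The 100 MB default matches the number discussed in this thread; the
function and constant names are hypothetical."""

import os

DEFAULT_LIMIT_MB = 100  # configurable, as suggested in the discussion


def check_sandbox_size(path, limit_mb=DEFAULT_LIMIT_MB):
    """Raise before uploading if the sandbox tarball exceeds the limit."""
    size_mb = os.path.getsize(path) / (1024.0 * 1024.0)
    if size_mb > limit_mb:
        raise RuntimeError(
            "Sandbox %s is %.1f MB, exceeding the %d MB limit"
            % (path, size_mb, limit_mb))
    return size_mb
```

Rejecting oversized sandboxes in the client avoids wasting bandwidth on an upload the server would refuse anyway.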
ewv: Please review. New and improved with Unit tests |
ewv: Can this please be reviewed and checked in? |
spiga: As agreed, I would first give the current stuff to integration and then move ahead. A few things are still missing or not working in the deployment (including a problem I discovered yesterday which apparently didn't show up in previous tests?!?). To be more precise: as soon as the next WMCore tag is cut, we move on. |
We spent a while discussing this today. All of us favor an approach where the user sandbox flow is as follows:
Client uploads the sandbox to ReqMgr/CRABInterface via http/s in the same way that the CMSSW _cfg.py is uploaded. This will be secured by X509 proxy, same as posting to the CRABInterface.
The CRABInterface uploads, via REST interface, the user sandbox to the sandbox cache which responds with an identifier for the sandbox in "the cache". This identifier is returned to the client. When the job is submitted by the client, this identifier is passed along to the various work queues and is included in the job spec.
Here the handling of the config in Couch and the sandbox in a different cache would differ. The user sandbox would not be placed in the job sandbox, but would rather be downloaded directly by the worker node once the job has started. Eventually this wget would go through a squid cache at the remote site and result in smaller network loads.
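The worker-node download step described above might look like the sketch below. The `download?hashkey=` URL layout and the use of the `http_proxy` environment variable to route through the site squid are assumptions about the deployment, not details from this thread.

```python
"""Sketch of the worker-node side: fetch the sandbox by its cache
identifier over plain HTTP, transparently going through a site squid
when http_proxy is set in the job environment. The URL scheme and
parameter names are hypothetical."""

import os
import urllib.request


def fetch_sandbox(cache_url, sandbox_id, dest):
    """Download the sandbox identified by sandbox_id to dest."""
    url = "%s/download?hashkey=%s" % (cache_url, sandbox_id)
    handlers = []
    proxy = os.environ.get("http_proxy")
    if proxy:
        # Route the request through the site squid (assumption about
        # how the proxy is advertised to jobs).
        handlers.append(urllib.request.ProxyHandler({"http": proxy}))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest
```

Because the request is an ordinary cacheable HTTP GET keyed on the identifier, a squid in front of the cache serves repeated downloads of the same sandbox locally, which is the bandwidth saving argued for above.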
Presumably the identifier in the cache would be, or would include, a hash of the contents of the sandbox, so that repeated submission of the same sandbox would not result in wasted space in the cache nor extra bandwidth between the squid and the cache.
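Deriving that identifier could be as simple as hashing the tarball, as in this sketch; the choice of SHA-256 is an assumption, not something decided in this thread.

```python
"""Sketch of computing a content-derived cache identifier for a sandbox
tarball, so identical re-uploads map to the same cache entry. The hash
algorithm choice is an assumption."""

import hashlib


def sandbox_id(path, chunk_size=1 << 20):
    """Return a hex digest of the file contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

One caveat: gzipped tarballs embed timestamps, so getting the dedup benefit in practice requires the client to build the tarball deterministically (or to hash the uncompressed member contents instead).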
The other option, not favored, was to have the local work queue fetch the sandbox from the cache and include it in the job sandbox. We felt this would waste too much bandwidth between the submitting machine and the remote CE.
In any case the major issue is that we need to find or build "the cache" with a REST interface. Does any such thing exist in our software stack already or do we have the option to use a third party supplied option? This would probably not be the most difficult thing to write ourselves, but we worry about doing it right. On the other hand, something we do ourselves can easily include cleanups, diagnostics for Ops, and perhaps pinning of additional sandboxes for MC generation, etc.
This whole approach has the advantage of allowing staged testing. Initially we would use a static URL as the sandbox, without any upload capability, but test the WN- or workqueue-level pieces that will have to be added to support HTTP-accessible sandboxes.
We'd like to have a discussion, both of the sandbox data flow and possible implementations of the cache before opening a couple more tickets to address all the details.