AWS Batch with CWL: did not localize cwl.inputs.json #4586

Open
chapmanb opened this issue Jan 25, 2019 · 25 comments

@chapmanb
commented Jan 25, 2019

I've been working on testing the AWS Batch Cromwell support, following documentation from @wleepang (https://docs.opendata.aws/genomics-workflows) in the CWL hackathon with @cjllanwarne and @aednichols.

The workflow is a small test with everything in an S3 bucket:

https://github.com/bcbio/test_bcbio_cwl/tree/master/aws

I'm happy to report that I made good progress and have bcbio-vm using CloudFormation templates to set up the Cromwell Batch-ready AMI and AWS Batch requirements. I can then generate the right Cromwell AWS configuration and launch jobs to AWS Batch. I see them get submitted, EC2 resources get spun up, and jobs get queued and run. Awesome.

Once everything is prepped and the jobs start running, the instances fail because they cannot find the cwl.inputs.json file staged into the working directory:

[2019-01-25 13:53:43,03] [info] AwsBatchAsyncBackendJobExecutionActor [2c2e5a10prep_samples_to_rec:NA:1]: Status change from Initializing to Running
[2019-01-25 13:53:59,61] [info] AwsBatchAsyncBackendJobExecutionActor [2c2e5a10alignment_to_rec:NA:1]: Status change from Initializing to Running
[2019-01-25 13:58:25,58] [info] AwsBatchAsyncBackendJobExecutionActor [2c2e5a10alignment_to_rec:NA:1]: Status change from Running to Failed
[2019-01-25 13:58:39,11] [info] AwsBatchAsyncBackendJobExecutionActor [2c2e5a10prep_samples_to_rec:NA:1]: Status change from Running to Failed
[2019-01-25 13:58:40,06] [error] WorkflowManagerActor Workflow 2c2e5a10-8c57-4f9f-8d80-c2fccacbb452 failed (during ExecutingWorkflowState): Job alignment_to_rec:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: s3://bcbio-batch-cromwell-test/cromwell-execution/main-somatic.cwl/2c2e5a10-8c57-4f9f-8d80-c2fccacbb452/call-alignment_to_rec/alignment_to_rec-stderr.log.
 Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 48, in process
    fnargs, parallel, out_keys, input_files = _world_from_cwl(args.name, fnargs[1:], work_dir)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 185, in _world_from_cwl
    assert os.path.exists(os.path.join(work_dir, "cwl.inputs.json"))
AssertionError

Job prep_samples_to_rec:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: s3://bcbio-batch-cromwell-test/cromwell-execution/main-somatic.cwl/2c2e5a10-8c57-4f9f-8d80-c2fccacbb452/call-prep_samples_to_rec/prep_samples_to_rec-stderr.log.
 Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 48, in process
    fnargs, parallel, out_keys, input_files = _world_from_cwl(args.name, fnargs[1:], work_dir)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 185, in _world_from_cwl
    assert os.path.exists(os.path.join(work_dir, "cwl.inputs.json"))
AssertionError

[2019-01-25 13:58:40,07] [info] WorkflowManagerActor WorkflowActor-2c2e5a10-8c57-4f9f-8d80-c2fccacbb452 is in a terminal state: WorkflowFailedState
[2019-01-25 13:58:59,66] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2019-01-25 13:59:03,42] [info] SingleWorkflowRunnerActor writing metadata to /home/chapmanb/drive/work/cwl/test_bcbio_cwl/aws/cromwell_work/somatic-metadata.json

I can see this file in the S3 bucket, although it's not in the execution directory, which is normally where things get staged (at least on local runs):

$ aws s3 ls s3://bcbio-batch-cromwell-test/cromwell-execution/main-somatic.cwl/2c2e5a10-8c57-4f9f-8d80-c2fccacbb452/call-prep_samples_to_rec/
                           PRE glob-b34dfc006a981a93d6da067cf50036fe/
2019-01-25 13:51:55          0
2019-01-25 13:51:59       6059 cwl.inputs.json
2019-01-25 13:58:01          0 glob-b34dfc006a981a93d6da067cf50036fe.list
2019-01-25 13:58:01          2 prep_samples_to_rec-rc.txt
2019-01-25 13:58:40        571 prep_samples_to_rec-stderr.log
2019-01-25 13:58:02          0 prep_samples_to_rec-stdout.log
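
For anyone reproducing this, here is a minimal sketch of the check that fails, plus a way to see where the file actually landed inside the container. The helper name `find_staged_file` and the `/cromwell_root` default are assumptions for illustration; neither comes from bcbio or Cromwell.

```python
import os


def find_staged_file(work_dir, name="cwl.inputs.json", search_root="/cromwell_root"):
    """Return (found_in_work_dir, other_locations) for a staged control file.

    Hypothetical helper: search_root is an assumed container mount point.
    """
    expected = os.path.join(work_dir, name)
    # Same condition as the assertion in the runfn.py traceback above.
    found = os.path.exists(expected)
    others = []
    # Walk the whole tree to report any copies outside the expected location.
    for dirpath, _dirnames, filenames in os.walk(search_root):
        candidate = os.path.join(dirpath, name)
        if name in filenames and os.path.abspath(candidate) != os.path.abspath(expected):
            others.append(candidate)
    return found, sorted(others)
```

If the first value is False and the list is non-empty, the file was localized, just not to the directory the tool checks.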

Does this error look familiar to anyone?

I'm excited to be making progress with this and will also work on writing up and documenting the setup and run process so far. Thanks for any suggestions or pointers.

@geoffjentry

Member

commented Jan 25, 2019

Hi @chapmanb - CWL support was out of scope for the group who did the AWS Batch backend support. Some things might work, but others will not. In particular, special control files like cwl.inputs.json and cwl.output.json definitely won't work, as they require special wiring on the part of a backend (i.e. not at the Cromwell engine/WOM layer) in order to be successful.

We'll get to this eventually, but it is not on our immediate roadmap. We'd certainly welcome contributions if other groups were interested in more robust AWS/CWL support in Cromwell (that's more of a general comment to any potentially interested parties who see this).

@chapmanb

Author

commented Jan 25, 2019

Jeff;
Thanks for the confirmation that this is expected. My goal was to get this to a point where we understand what the CWL limitations and roadblocks are and have an easy way to replicate and test, so we can move it forward when we have resources. I'll finish the bcbio automation, write up documentation, then sync with the AWS team as well to see about their ability to contribute/debug. I'm excited to be making some progress on this and appreciate knowing this is where we're expecting to be.

@geoffjentry

Member

commented Jan 25, 2019

@chapmanb Definitely. In particular I expect your WFs to have a rough go of things in AWS just because you do lean so heavily on those sorts of constructs, as we discovered w/ the Google backend. The hope is that the lessons learned over there make it a much easier path in AWS, but it'll still wind up as work needing to be done.

@chapmanb

Author

commented Jan 30, 2019

Jeff;
We've automated the spin up of necessary AWS components and documented the process of running with bcbio and CWL to try to make this as easy as possible to replicate and work on:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html#amazon-web-services-aws-batch

Just wanted to leave this here for anyone following the issue when they have time to push from the Cromwell side. I'm happy to help debug and improve from the bcbio side as well. Thanks again.

@wleepang

Contributor

commented Jan 30, 2019

@chapmanb - This is awesome!
@geoffjentry - Can you go into specifics of what needs to be wired and where (or point to docs / examples) in the backend?

@geoffjentry

Member

commented Jan 30, 2019

@wleepang Ultimately the answer will be "Look at how the PAPI2 backend handles CWL, map the concepts to Batch, and there ya go", but unfortunately there wasn't one single PR to make CWL work in PAPI2. Prior to the fall it was in a state where it sorta worked, but very poorly on the kinds of special case scenarios we're talking about. At that point Thibault embarked on a project with @chapmanb to make sure it worked for bcbio (which IMO is sort of a torture test for these kinds of CWL constructs).

The further bad news is that there's still not a single PR, but from digging around a bit this seems to be a decent start: #4358 #4371 #4386 #4448

@Horneth

Contributor

commented Jan 31, 2019

IIRC (which is really not guaranteed), as far as the localization of cwl.inputs.json is concerned in PAPI 2, this bit is what maps the cloud path of "ad hoc files" to the localized path (which the AWS equivalent doesn't have).
So it's possible the cwl.inputs.json is actually localized but not to the right place and so the tool can't find it.
Again my memory is fading quickly so don't take this as 💯 :)

@geoffjentry

Member

commented Jan 31, 2019

@Horneth If I knew you were lurking/answering I'd have just tagged you 😛

@Horneth

Contributor

commented Jan 31, 2019

I'm mostly lurking, whatever knowledge I had is rapidly becoming obsolete 😄
But yeah feel free to tag me if you think I can be useful :)

@chapmanb

Author

commented Jan 31, 2019

Jeff -- thanks so much for the orientation on the code. Thibault -- I really appreciate having you still helping out unofficially. From looking in the s3 bucket listed above, I think your memory about the cwl.inputs.json is right on. It looks to be in the root of the work directory rather than the execution work directory. My knowledge of the Cromwell code base is so weak though, I don't have the first clue how to propagate that adHocFile magic over. Hopefully Lee is more on top of this than I am. Thanks again for this help.

@geoffjentry geoffjentry self-assigned this Jun 17, 2019

@myazinn

commented Jun 27, 2019

Hi @geoffjentry. I'm currently trying to figure out how to fix this issue, but I'm just starting to understand this project, so I want to clarify a few things.
As far as I understand, the problem is that some special (ad hoc) files are placed in "/cromwell_root/path/to/input_file.txt", while Cromwell expects them to be in "/cromwell_root/ad_hoc_file.txt" in order to execute them.
But what exactly is wrong here? It seems like either adhoc files are placed in the wrong directory and must be moved to "/cromwell_root" or their location is correct and it's Cromwell's fault that it tries to find them in the wrong directory. Which of these options is correct?
Also, I see you assigned yourself to this issue. Does this mean that help is no longer needed?
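
To make the two candidate locations concrete, here is a minimal sketch that derives both container paths from one S3 URI. The function names and the bucket name in the usage note are hypothetical, and the assumption that the bucket name is dropped when the key is mirrored under the root is mine, based only on the example paths quoted above, not on Cromwell's code.

```python
import posixpath
from urllib.parse import urlparse

CONTAINER_ROOT = "/cromwell_root"  # taken from the paths quoted above


def mirrored_path(s3_uri, root=CONTAINER_ROOT):
    """Path if the object key is mirrored under root (assumed behaviour)."""
    key = urlparse(s3_uri).path.lstrip("/")
    return posixpath.join(root, key)


def ad_hoc_path(s3_uri, root=CONTAINER_ROOT):
    """Path if the file sits directly in root under its base name."""
    return posixpath.join(root, posixpath.basename(urlparse(s3_uri).path))
```

For `s3://my-bucket/path/to/input_file.txt`, the first function gives `/cromwell_root/path/to/input_file.txt` and the second gives `/cromwell_root/input_file.txt`. The open question is which of the two the backend should produce for ad hoc files.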

@Kirvolque

commented Jul 4, 2019

It seems there are two different ways the issue can be solved:

  1. The file can be moved to the execution folder.
  2. The application can be changed to look for the file in the directory where it is placed.

We are not sure if the solution in our PR suits you.
Looking forward to your comments.

@geoffjentry, @chapmanb, we would really appreciate if you would check our solution. Does this solve your problem?

@chapmanb

Author

commented Jul 5, 2019

Kirill -- thanks for working on this. I'll have to defer to Jeff and his team, as I'm not sure how this best fits with what Cromwell wants to do more generally. The approach you've taken seems like it would resolve the issue for me, but I'm mostly not sure about how it fits with the rest of Cromwell and how it tackles S3.

@myazinn

commented Jul 9, 2019

@geoffjentry
I hope you keep reading this branch :)
We have added a new possible solution to this problem. Please tell us which one you think is more appropriate, or tell us what is wrong with them and what can be fixed.

@geoffjentry

Member

commented Jul 9, 2019

Hi - sorry, I had missed the original question from @myazinn and needed to dig into ad hoc files a bit before opining on first #5057 from @Kirvolque and now #5064. I'm still not in a position where I can answer definitively, but my first question would be: why not follow the pattern from the GCP backend and TES backend and modify the mapCommandLineWomFile function in AwsBatchAsyncBackendJobExecutionActor?

It's entirely possible that something about the AWS backend's wiring would lead to that not being the right choice, but that's what I want to understand.

@myazinn

commented Jul 9, 2019

@geoffjentry, thanks for the answer :)
Actually, we tried to do something similar to how it was done in GCP (or TES) but it didn't work out. We added logging to the mapCommandLineWomFile method so that we can see what womFiles Cromwell passes to this method. And it turns out Cromwell never passes "ad hoc" files to this method, therefore asAdHocFile, for example, always returns None. In particular, in our integration test (PR #5057) it passes only two womFiles with values something like
s3://bucket-name/cromwell-execution/cwl_temp_file_some-numbers.cwl/some-numbers/call-test
and
s3://bucket-name/cromwell-execution/cwl_temp_file_some-numbers.cwl/some-numbers/call-test/tmp.59740063
I'm not sure, but it looks like the first womValue is somehow related to the runtimeEnvironment field in the StandardAsyncExecutionActor. The second value is something else too, since "ad hoc" files are placed in the call-test directory.
It is possible that we misunderstood something, but for now, it looks like a dead-end.
By the way, we also tried to override the localizeAdHocValues method in AwsBatchAsyncBackendJobExecutionActor so that it would copy "ad hoc" files to the "/cromwell_root" directory. It fails with an AccessDeniedException.
I hope this gives you an understanding of why we came to the proposed solutions. As I said, perhaps we misunderstood something, so we will be happy if you can give us some hint.

@geoffjentry

Member

commented Jul 9, 2019

Ah. Now that you say this I'm betting it fails due to how the proxy sidecar works. My quick thought is that the approach in #5064 is likely better, but I need to dig more into how the AWS localization is working in the first place. In the meantime I'll try to also ping @elerch and @wleepang in case they have thoughts here.

@myazinn

commented Jul 9, 2019

Okay, thanks. I look forward to hearing any updates from you.

@cjllanwarne

Contributor

commented Jul 9, 2019

Hi @myazinn - sorry for jumping in and making you explain everything yet another time... 😄

As I understand it, the problem is that Cromwell creates a list of input files somewhere in S3, and then localizes the files into the wrong place on the instance?

If so, I think I could add a third option to your list:

  1. The file can be moved to the execution folder
  2. The application can be changed to look for the file in the directory where it is placed.

The third option I would add is:
  3. Localize the inputs.json file to the correct location in the first place.

I believe one difference between regular input Files and "AdHocFile"s is that the ad-hoc files allow you to specify where they should be localized to. I wonder whether another possible option would be to change the construction of the AdHocFile object for these inputs.json files in the first place and fix it into the correct location upfront?
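
A rough sketch of that third option, written in Python rather than Cromwell's Scala and with entirely hypothetical names (`AdHocInput`, `localization_plan`): each ad hoc file carries its destination name explicitly, so the localization step does not have to guess where the tool will look. Using `/cromwell_root` as the execution directory in the usage note is an assumption taken from the paths mentioned earlier in the thread.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AdHocInput:
    """Hypothetical stand-in for an ad hoc file record, not Cromwell's type."""
    source: str                        # cloud path of the file
    target_name: Optional[str] = None  # name to use in the execution directory


def localization_plan(files: List[AdHocInput], exec_dir: str) -> List[Tuple[str, str]]:
    """Pair each source with the container path it should be copied to."""
    plan = []
    for f in files:
        # Fall back to the source's base name when no target was fixed upfront.
        name = f.target_name or posixpath.basename(f.source)
        plan.append((f.source, posixpath.join(exec_dir, name)))
    return plan
```

With `exec_dir="/cromwell_root"`, an input whose source ends in `call-x/cwl.inputs.json` would be planned for `/cromwell_root/cwl.inputs.json`, which is where the failing assertion expects it.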

@myazinn

commented Jul 9, 2019

Hi @cjllanwarne.
Yes, you correctly understood the problem. We haven't tried this option, because we were fixated on one approach :)
In general, your idea should solve the problem. Although I have a suspicion that an AccessDeniedException may be thrown there too. Anyway, I will try to do so and tell you the result.

@geoffjentry

Member

commented Jul 15, 2019

Hi @myazinn and @Kirvolque - sorry for the delay here. The conversation last week wound up with Chris & me chatting w/ @wleepang from AWS, and this looks like another case where the required bits of information are split across multiple people. I'm going to try to figure out what needs figuring out this week, but we'll see.

If that winds up not working out, one thought Lee & I had was that we could set up a 3-way call next week. Both he & I will be in Basel next week, which seemed like it'd make it easier to schedule with you all than finding a time that works for him on the west coast of the US and for you all.

@myazinn

commented Jul 15, 2019

@geoffjentry — that would be great! We usually use Skype, is it okay for you?

@geoffjentry

Member

commented Jul 15, 2019

@myazinn hey - the more i look at the rest of my week the more i'm thinking we might as well schedule the call :) Can you email me - jgentry with hostname broadinstitute.org

@myazinn

commented Aug 5, 2019

Hi @chapmanb -
We made another PR that should solve the problem, and we believe that this time it is the right solution :)
Sorry for bothering you, but can you please check if it solves your problem? I tried to run your tests myself, but I failed to do it for some reason.

@geoffjentry

Member

commented Aug 6, 2019

@myazinn I'm looking. I'm first trying to make sure I can run it on GCP (to prove to myself that I'm doing the right thing), then will try to run it on cromwell develop using AWS (to prove to myself that I'm still doing the right thing), and will then try your branch.

It'll probably take a while as I'm sure I'll screw up multiple things in steps 1 & 2 :)
