
EMR Action on failure not working #1361

Open
mtsgrd opened this Issue Feb 27, 2013 · 9 comments

9 participants

@mtsgrd
mtsgrd commented Feb 27, 2013

The action_on_failure argument to boto.emr.connection.run_jobflow seems ineffective. Has anyone else experienced problems?

@schultzy51

I've noticed this as well. I've been unable to set action_on_failure to CONTINUE or CANCEL_AND_WAIT using boto.emr.connection.run_jobflow; the console always shows TERMINATE_JOB_FLOW.

@danielgtaylor danielgtaylor added Accepted Bug EMR and removed Accepted labels Feb 25, 2014
@kjmph
kjmph commented May 16, 2014

I concur, although I don't think it is boto's fault. Boto sets the ActionOnFailure parameter to the string passed in. If CANCEL_AND_WAIT is used and the step is checked in the AWS EMR web interface, the step lists:

Action on failure: Terminate cluster

Does this work for anyone? I thought a workaround would be to set keep_alive=True, and I do see Auto-terminate: No. However, since the step still lists terminating the cluster, it shuts down anyway; a minimal sketch of that attempt follows.
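A rough sketch of the keep_alive attempt described above (the region, bucket, and script names are placeholders, and the Pig step mirrors the one used later in this thread):

    import boto.emr
    from boto.emr.step import InstallPigStep, PigStep

    conn = boto.emr.connect_to_region('us-east-1')
    step = PigStep(name='Pig_Program', pig_file='s3://mybucket/script.pig')

    # keep_alive=True yields Auto-terminate: No in the console, but the step
    # itself still shows "Action on failure: Terminate cluster", so the
    # cluster shuts down when the step fails.
    job_id = conn.run_jobflow(name='keep-alive-test',
                              steps=[InstallPigStep(), step],
                              ami_version='2.4',
                              num_instances=1,
                              keep_alive=True,
                              action_on_failure='CANCEL_AND_WAIT')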

@foscraig

Boto does not seem to have action_on_failure on the base class boto.emr.Step. It should, since that's how the Java SDK, Node.js SDK, etc. implement it: the StepConfig data type in the EMR API has ActionOnFailure.
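For what it's worth, the lower-level boto.emr.step.JarStep constructor does appear to accept an action_on_failure keyword; it's only the higher-level Pig/Hive step classes that don't expose it. A rough sketch (the jar, main class, and arguments are hypothetical):

    from boto.emr.step import JarStep

    # Hypothetical jar and arguments; the point is only that JarStep exposes
    # action_on_failure while PigStep/HiveStep do not.
    step = JarStep(name='my-step',
                   jar='s3://mybucket/my-job.jar',
                   main_class='com.example.MyJob',
                   action_on_failure='CONTINUE',
                   step_args=['arg1', 'arg2'])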

@kjmph
kjmph commented Aug 7, 2014

Can we bump this Issue?

@danielgtaylor
boto project member

Can someone provide a complete code example to reproduce this issue?

@TELSER1
TELSER1 commented Jan 13, 2015

Basic initiation script (pigtest.csv can probably be any short file; testscript1.pig just loads it and then writes to a database). This errors out for some reason, and the cluster terminates.

    import boto.emr

    conn = boto.emr.connect_to_region('us-east-1')
    pig_step = boto.emr.step.PigStep(name='Pig_Program',
                                     pig_file='s3://mybucket/testscript1.pig',
                                     pig_args=['-p', 'filename=pigtest.csv'])
    steps = [boto.emr.step.InstallPigStep(), pig_step]
    job_id = conn.run_jobflow(name='test', steps=steps, ami_version='2.4',
                              num_instances=1, keep_alive=True,
                              action_on_failure='CONTINUE')

Perhaps even more strangely, merely adding the PigStep via conn.add_jobflow_steps() to a cluster launched as follows causes the cluster to shut down after the step fails, even though the cluster has CONTINUE specified under ActionOnFailure. (This doesn't happen when a failing step is submitted to the cluster from the console or the command line.)

    aws emr create-cluster --name "testc" --ami-version 2.4 --applications Name=Pig --ec2-attributes KeyName=KEYNAME --instance-type m3.xlarge --instance-count 1 --steps Type=PIG,Name="Pig_Program",ActionOnFailure=CONTINUE,Args=[-f,"s3://mybucket/testscript1.pig",-p,filepath="pigtest.csv"]
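A possible workaround sketch (untested, and assuming run_jobflow builds each step's ActionOnFailure from the step object's action_on_failure attribute, which the underlying JarStep appears to store): override the attribute on the PigStep after construction, since the Pig/Hive constructors don't expose it.

    import boto.emr

    conn = boto.emr.connect_to_region('us-east-1')
    pig_step = boto.emr.step.PigStep(name='Pig_Program',
                                     pig_file='s3://mybucket/testscript1.pig',
                                     pig_args=['-p', 'filename=pigtest.csv'])

    # PigStep's constructor does not accept action_on_failure, but the value
    # is stored as a plain attribute on the step object, so override it
    # before submitting the job flow.
    pig_step.action_on_failure = 'CONTINUE'

    steps = [boto.emr.step.InstallPigStep(), pig_step]
    job_id = conn.run_jobflow(name='test', steps=steps, ami_version='2.4',
                              num_instances=1, keep_alive=True)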

@pumpkiny9120

Any updates, guys? TELSER1 gave a repro above; can we proceed?

@hcavalle

+1 on this, it's a pain! Repro above, please fix.

@gallamine

+1. I was able to fix the issue for HiveStep by adding action_on_failure as a keyword argument with a default value and passing it through to super():

This is boto.emr.step.HiveStep.__init__():

    def __init__(self, name, hive_file, hive_versions='latest',
                 hive_args=None, action_on_failure='TERMINATE_JOB_FLOW'):
        step_args = []
        step_args.extend(self.BaseArgs)
        step_args.extend(['--hive-versions', hive_versions])
        step_args.extend(['--run-hive-script', '--args', '-f', hive_file])
        if hive_args is not None:
            step_args.extend(hive_args)
        super(HiveStep, self).__init__(name, step_args=step_args,
                                       action_on_failure=action_on_failure)
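An analogous change should work for PigStep, which is what the repro above uses. Here is a sketch of boto.emr.step.PigStep.__init__ with the same keyword added, assuming ScriptRunnerStep/JarStep forward it the same way they do for HiveStep (the rest of the body is meant to match the existing implementation):

    def __init__(self, name, pig_file, pig_versions='latest', pig_args=[],
                 action_on_failure='TERMINATE_JOB_FLOW'):
        step_args = []
        step_args.extend(self.BaseArgs)
        step_args.extend(['--pig-versions', pig_versions])
        step_args.extend(['--run-pig-script', '--args', '-f', pig_file])
        step_args.extend(pig_args)
        super(PigStep, self).__init__(name, step_args=step_args,
                                      action_on_failure=action_on_failure)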