EMR Action on failure not working #1361

Open
mtsgrd opened this Issue Feb 27, 2013 · 9 comments

@mtsgrd

mtsgrd commented Feb 27, 2013

The action_on_failure argument to boto.emr.connection.run_jobflow seems ineffective. Has anyone else experienced problems?

@schultzy51


schultzy51 commented Feb 17, 2014

I've noticed this also. I've been unable to set action_on_failure to CONTINUE or CANCEL_AND_WAIT using boto.emr.connection.run_jobflow. It's always displaying TERMINATE_JOB_FLOW in the console.


@danielgtaylor danielgtaylor added Accepted and removed Accepted labels Feb 25, 2014

@kjmph


kjmph commented May 16, 2014

I concur, although I don't think it is boto's fault. Boto sets the ActionOnFailure parameter to whatever string is passed in. If CANCEL_AND_WAIT is used and the step is checked in the AWS EMR web interface, the step lists:

Action on failure: Terminate cluster

Does this work for anyone? I thought a workaround would be to set keep_alive=True, and the console does show Auto-terminate: No. However, since the step still lists terminating the cluster, the cluster shuts down when the step fails.
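Concretely, something along these lines (the S3 path is a placeholder, with a Pig step as a stand-in) gives a cluster whose console page shows Auto-terminate: No, while the step itself still shows Terminate cluster:

    import boto.emr
    from boto.emr.step import PigStep

    conn = boto.emr.connect_to_region('us-east-1')
    # Placeholder script; PigStep (as of the boto versions discussed here) has
    # no action_on_failure keyword of its own.
    step = PigStep(name='example', pig_file='s3://mybucket/example.pig')
    job_id = conn.run_jobflow(name='example-flow',
                              steps=[step],
                              keep_alive=True,  # console then shows Auto-terminate: No
                              action_on_failure='CANCEL_AND_WAIT')
    # The step in the console nevertheless reads "Action on failure: Terminate cluster",
    # so a failed step still brings the cluster down.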

@foscraig


foscraig commented Jul 10, 2014

Boto does not seem to expose action_on_failure on the base class boto.emr.Step. It should, since that is how the Java SDK, the Node.js SDK, etc. implement it: the StepConfig data type in the EMR API has an ActionOnFailure field.

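For comparison, JarStep does accept the keyword, while the Pig/Hive convenience steps (as of the boto versions discussed here) do not, so they silently keep JarStep's default. A small sketch, with placeholder S3 paths:

    from boto.emr.step import JarStep, PigStep

    # JarStep exposes action_on_failure directly.
    jar_step = JarStep(name='custom-jar',
                       jar='s3://mybucket/my-job.jar',  # placeholder jar
                       action_on_failure='CANCEL_AND_WAIT')

    # PigStep has no such keyword, so it falls back to the default
    # ('TERMINATE_JOB_FLOW') set by JarStep.__init__.
    pig_step = PigStep(name='Pig_Program',
                       pig_file='s3://mybucket/testscript1.pig')

    print(jar_step.action_on_failure)   # CANCEL_AND_WAIT
    print(pig_step.action_on_failure)   # TERMINATE_JOB_FLOW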

@kjmph


kjmph commented Aug 7, 2014

Can we bump this Issue?

@danielgtaylor


Member

danielgtaylor commented Aug 12, 2014

Can someone provide a complete code example to reproduce this issue?

@TELSER1


TELSER1 commented Jan 13, 2015

Basic initiation script (pigtest.csv can probably be any short file; testscript1.pig just loads it and then writes to a database). The step errors out for some reason, and the cluster terminates.

    import boto.emr

    conn = boto.emr.connect_to_region('us-east-1')
    # Pig step that runs s3://mybucket/testscript1.pig with -p filename=pigtest.csv
    pig_step = boto.emr.step.PigStep(name='Pig_Program',
                                     pig_file='s3://mybucket/testscript1.pig',
                                     pig_args=['-p', 'filename=pigtest.csv'])
    steps = [boto.emr.step.InstallPigStep(), pig_step]
    job_id = conn.run_jobflow(name='test', steps=steps, ami_version='2.4',
                              num_instances=1, keep_alive=True,
                              action_on_failure='CONTINUE')

Perhaps even more strangely, merely adding the PigStep with conn.add_jobflow_steps() to a cluster launched as follows causes the cluster to shut down after the step fails, even though the cluster has CONTINUE specified under ActionOnFailure (this doesn't happen if the failing step is submitted to the cluster from the console or the command line):

    aws emr create-cluster --name "testc" --ami-version 2.4 --applications Name=Pig --ec2-attributes KeyName=KEYNAME --instance-type m3.xlarge --instance-count 1 --steps Type=PIG,Name="Pig_Program",ActionOnFailure=CONTINUE,Args=[-f,"s3://mybucket/testscript1.pig",-p,filepath="pigtest.csv"]
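To see what the boto-launched cluster actually recorded for each step, a quick check like this can help (conn and job_id come from the script above; the step attribute name is assumed from boto's lower-cased XML parsing, so verify it against your installed version):

    # Continuing from the repro above (conn and job_id are defined there).
    jf = conn.describe_jobflow(job_id)
    for s in jf.steps:
        # attribute name assumed from boto's XML-to-attribute mapping; verify locally
        print(s.name, getattr(s, 'actiononfailure', '<not parsed>'))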

@pumpkiny9120


pumpkiny9120 commented Feb 6, 2015

Any updates, guys? TELSER1 gave the repro above; can we proceed?


@hcavalle


hcavalle commented Jun 16, 2015

+1 on this, it's a pain! Repro is above; a fix, please.


@gallamine


gallamine commented Feb 3, 2016

+1. I was able to fix the issue for HiveStep by adding action_on_failure as a keyword argument (with a default) and passing it through to the superclass:

This is in boto.emr.step.HiveStep()

    def __init__(self, name, hive_file, hive_versions='latest',
                 hive_args=None, action_on_failure='TERMINATE_JOB_FLOW'):
        step_args = []
        step_args.extend(self.BaseArgs)
        step_args.extend(['--hive-versions', hive_versions])
        step_args.extend(['--run-hive-script', '--args', '-f', hive_file])
        if hive_args is not None:
            step_args.extend(hive_args)
        super(HiveStep, self).__init__(name, step_args=step_args, action_on_failure=action_on_failure)

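The Pig step used in the repro above has the same gap. An untested sketch of the analogous change for boto.emr.step.PigStep, assuming its step_args are assembled the same way as in the installed source, and mirroring the HiveStep patch above:

    def __init__(self, name, pig_file, pig_versions='latest',
                 pig_args=None, action_on_failure='TERMINATE_JOB_FLOW'):
        # mirrors the HiveStep patch above; untested
        step_args = []
        step_args.extend(self.BaseArgs)
        step_args.extend(['--pig-versions', pig_versions])
        step_args.extend(['--run-pig-script', '--args', '-f', pig_file])
        if pig_args is not None:
            step_args.extend(pig_args)
        super(PigStep, self).__init__(name, step_args=step_args,
                                      action_on_failure=action_on_failure)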