job flows are not reused when the previous job failed #633

Closed
garrettjacobson opened this issue May 8, 2013 · 5 comments · Fixed by #750

@garrettjacobson

Running a simple mrjob that raises an exception in the input mapper leaves that job flow unable to be reused automatically on the next run. The flow is still shown as WAITING, even though pooling does not pick it up again. If you remove the exception and the job succeeds, the job flow is also shown as WAITING, but in that case it is reused on subsequent runs.
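
For context, a minimal reproduction sketch (not from the original report; the class name and exception message are made up): a job whose mapper always raises, so its EMR step fails.

    from mrjob.job import MRJob

    class MRFailingJob(MRJob):
        """Toy job whose mapper always raises, so the EMR step fails."""

        def mapper(self, _, line):
            # fail immediately; the step ends up FAILED and the pooled
            # job flow is left in the WAITING state
            raise Exception('simulated mapper failure')
            yield line, 1  # unreachable; keeps mapper a generator

    if __name__ == '__main__':
        MRFailingJob.run()

Running this against EMR with pooling enabled, then running any pooled job afterwards, should show the symptom described above: the flow stays WAITING but a fresh flow is started instead of reusing it.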

@ap-ensighten
Contributor

I also experienced this with the job flow pooling options. However, I was able to keep using the cluster by specifying its job flow id directly with --emr-job-flow-id. I suspect this has to do with the locking mechanism mrjob uses with files on S3, but I have not had a chance to dig into the code yet. I tried deleting the lock file in the S3 lock directory and running another job with the pooling options, but it still would not reuse the job flow.
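
For illustration, a hedged sketch of that workaround when launching a job programmatically (MRWordCount, its module name, and the job flow id are placeholders, not from this thread):

    from mr_word_count import MRWordCount  # placeholder job module

    # pin the run to the existing cluster instead of relying on pooling;
    # 'j-XXXXXXXXXXXXX' stands in for the real job flow id
    job = MRWordCount(args=[
        '-r', 'emr',
        '--emr-job-flow-id', 'j-XXXXXXXXXXXXX',
        'input.txt',
    ])
    with job.make_runner() as runner:
        runner.run()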

@coyotemarin
Collaborator

This might pair well with other pooling tickets: #707, #708.

@ghost assigned coyotemarin Aug 28, 2013
@coyotemarin
Collaborator

I think this is the culprit (from usable_job_flows() in mrjob/emr.py):

            # in rare cases, job flow can be WAITING *and* have incomplete
            # steps
            if any(getattr(step, 'enddatetime', None) is None
                   for step in job_flow.steps):
                return

I think we just need to be pickier about only applying this to steps that have not yet run.

@coyotemarin
Collaborator

Whoops, found a bug involving an interaction between pooling and max_hours_idle. Fixing now.

@coyotemarin
Collaborator

Aha! The problem is that if you have a multi-step job, and the first step fails, the rest of the steps go into the CANCELLED state (because their input doesn't exist) and, because they've never run, their enddatetime is None.
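
A minimal sketch of a pickier check along those lines (the helper name is made up, this is not necessarily what #750 ended up doing, and it assumes boto's step objects expose a state attribute alongside enddatetime): skip CANCELLED steps, which never ran, when deciding whether the flow still has work in flight.

    def has_steps_in_flight(job_flow):
        """Return True if any step is still pending or running.

        CANCELLED steps never ran, so their enddatetime is None; they
        should not block reuse of an otherwise WAITING job flow.
        """
        for step in getattr(job_flow, 'steps', None) or []:
            if getattr(step, 'state', None) == 'CANCELLED':
                continue
            if getattr(step, 'enddatetime', None) is None:
                return True
        return False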

scottknight added a commit to timtadh/mrjob that referenced this issue Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)
@coyotemarin removed their assignment Jul 8, 2015