job flows are not reused when the previous job failed #633
I also experienced this with the job flow pooling options. However, I was able to keep using the cluster by specifying its job flow ID directly with `--emr-job-flow-id`. I suspect this has to do with the locking mechanism mrjob uses with files on S3, but I have not had a chance to dig into the code yet. I tried deleting the lock file in the S3 lock directory and running another job with the pooling options, but it still would not reuse the job flow.
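For reference, the workaround described above looks roughly like this on the command line (a sketch only; the script name and the `j-...` job flow ID are placeholders, not values from this issue):

```shell
# Target an existing WAITING job flow directly instead of relying on
# pooling to find one (script name and job flow ID are placeholders).
python my_job.py -r emr --emr-job-flow-id j-1234567890ABC input.txt
```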
I think this is the culprit:

```python
# in rare cases, job flow can be WAITING *and* have incomplete
# steps
if any(getattr(step, 'enddatetime', None) is None
       for step in job_flow.steps):
    return
```

I think we just need to be pickier about only applying this to steps that have not yet run.
Whoops, found a bug involving an interaction between pooling and
Aha! The problem is that if you have a multi-step job, and the first step fails, the rest of the steps go into the
secondary sort and self-terminating job flows

* jobs:
  * SORT_VALUES: Secondary sort by value (Yelp#240)
    * see mrjob/examples/
  * can now override jobconf() again (Yelp#656)
  * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
* examples:
  * bash_wrap/ (mapper/reducer_cmd() example)
  * mr_most_used_word.py (two step job)
  * mr_next_word_stats.py (SORT_VALUES example)
* runners:
  * All runners:
    * single --setup option works but is not yet documented (Yelp#206)
    * setup now uses sh rather than python internally
  * EMR runner:
    * max_hours_idle: self-terminating idle job flows (Yelp#628)
      * mins_to_end_of_hour option gives finer control over self-termination.
    * Can reuse pooled job flows where previous job failed (Yelp#633)
    * Throws IOError if output path already exists (Yelp#634)
    * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
    * Automatically infers EMR/S3 endpoints from region (Yelp#658)
    * ls() supports s3n:// schema (Yelp#672)
    * Fixed log parsing crash on JarSteps (Yelp#645)
    * visible_to_all_users works with boto <2.8.0 (Yelp#701)
    * must use --interpreter with non-Python scripts (Yelp#683)
    * cat() can decompress gzipped data (Yelp#601)
  * Hadoop runner:
    * check_input_paths: can disable input path checking (Yelp#583)
    * cat() can decompress gzipped data (Yelp#601)
  * Inline/Local runners:
    * Fixed counter parsing for multi-step jobs in inline mode
    * Supports per-step jobconf (Yelp#616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
* deprecated:
  * many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
  * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
  * built-in protocols must be instances (Yelp#488)
Running a simple mrjob that raises an exception in the input mapper causes that job flow not to be reused automatically on the next run. The job flow is still shown as WAITING, even though it is not automatically reused. If you remove the exception and the job succeeds, the job flow is likewise shown as WAITING, but it will be reused on subsequent runs.