job flows are not reused when the previous job failed #633

Closed
garrettjacobson opened this issue May 8, 2013 · 5 comments · Fixed by #750

@garrettjacobson

Running a simple mrjob that raises an exception in the input mapper leaves that job flow unable to be reused automatically on the next run. The flow is still shown as WAITING, even though pooling does not pick it up again. If you remove the exception and the job succeeds, the job flow is also shown as WAITING, but in that case it is reused on subsequent runs.
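
For context, a minimal reproduction sketch (not from the original report; the class name and exception message are made up): a job whose mapper always raises, so its EMR step fails.

    from mrjob.job import MRJob

    class MRFailingJob(MRJob):
        """Toy job whose mapper always raises, so the EMR step fails."""

        def mapper(self, _, line):
            # fail immediately; the step ends up FAILED and the pooled
            # job flow is left in the WAITING state
            raise Exception('simulated mapper failure')
            yield line, 1  # unreachable; keeps mapper a generator

    if __name__ == '__main__':
        MRFailingJob.run()

Running this against EMR with pooling enabled, then running any pooled job afterwards, should show the symptom described above: the flow stays WAITING but a fresh flow is started instead of reusing it.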

@ap-ensighten
Contributor

I also experienced this with the job flow pooling options. However, I was able to keep using the cluster by specifying its job flow id directly with --emr-job-flow-id. I suspect this has to do with the locking mechanism mrjob uses with files on S3, but I have not had a chance to dig into the code yet. I tried deleting the lock file in the S3 lock directory and running another job with the pooling options, but it still would not reuse the job flow.
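
For illustration, a hedged sketch of that workaround when launching a job programmatically (MRWordCount, its module name, and the job flow id are placeholders, not from this thread):

    from mr_word_count import MRWordCount  # placeholder job module

    # pin the run to the existing cluster instead of relying on pooling;
    # 'j-XXXXXXXXXXXXX' stands in for the real job flow id
    job = MRWordCount(args=[
        '-r', 'emr',
        '--emr-job-flow-id', 'j-XXXXXXXXXXXXX',
        'input.txt',
    ])
    with job.make_runner() as runner:
        runner.run()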

@coyotemarin
Collaborator

This might pair well with other pooling tickets: #707, #708.

@ghost assigned coyotemarin Aug 28, 2013
@coyotemarin
Collaborator

I think this is the culprit (from usable_job_flows() in mrjob/emr.py):

            # in rare cases, job flow can be WAITING *and* have incomplete
            # steps
            if any(getattr(step, 'enddatetime', None) is None
                   for step in job_flow.steps):
                return

I think we just need to be pickier about only applying this to steps that have not yet run.

@coyotemarin
Collaborator

Whoops, found a bug involving an interaction between pooling and max_hours_idle. Fixing now.

@coyotemarin
Collaborator

Aha! The problem is that if you have a multi-step job, and the first step fails, the rest of the steps go into the CANCELLED state (because their input doesn't exist) and, because they've never run, their enddatetime is None.
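
A minimal sketch of a pickier check along those lines (the helper name is made up, this is not necessarily what #750 ended up doing, and it assumes boto's step objects expose a state attribute alongside enddatetime): skip CANCELLED steps, which never ran, when deciding whether the flow still has work in flight.

    def has_steps_in_flight(job_flow):
        """Return True if any step is still pending or running.

        CANCELLED steps never ran, so their enddatetime is None; they
        should not block reuse of an otherwise WAITING job flow.
        """
        for step in getattr(job_flow, 'steps', None) or []:
            if getattr(step, 'state', None) == 'CANCELLED':
                continue
            if getattr(step, 'enddatetime', None) is None:
                return True
        return False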

scottknight added a commit to timtadh/mrjob that referenced this issue Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)
@coyotemarin removed their assignment Jul 8, 2015