Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

Closed
irskep opened this issue Jun 28, 2013 · 3 comments
Closed
Labels
Milestone

Comments

@irskep
Copy link
Contributor

irskep commented Jun 28, 2013

Long explanation of the problem

You can specify jobconfs for both the entire job or for each individual step...sort of.

Command line options, config file options, and the MRJob.jobconf() method all refer to the same job-level set of jobconf values.

The step object also accepts a jobconf kwarg which it expects to be a dict.

But there's a lurking issue: if you try to override MRJob.jobconf(), without also overriding MRJob.steps(), everything explodes. (Except not in local/inline mode, due to #655.) The explosion symptoms are documented in #585.

Example demonstrating exact way to reproduce:

import os
from mrjob.job import MRJob

class Job(MRJob):

    # comment this method out to get a crash
    def steps(self):
        return [self.mr(mapper=self.mapper)]

    def jobconf(self):
        return {}

    def mapper(self, _, v):
        for k, v in os.environ.iteritems():
            yield k, v

if __name__ == '__main__':
    Job().run()

By default, steps() returns [MRJobStep(mapper=self.mapper, ..., jobconf=self.jobconf)], where each key is only included if it has been overridden in the subclass.

But including jobconf in that list has never been valid, and apparently no test cases exist for it. Oops! One could make an argument that jobconf() should be treated the same as mapper(), i.e. only applying to the first step and only if steps() isn't implemented, but its behavior for the entire lifespan of mrjob has been to return job-level jobconf values, not step-level.

Suggested course of action

  1. Remove jobconf from the list of keys taken from the MRJob class if steps() is not specified.
  2. Write test cases for overriding jobconf() and any other methods that are missing test coverage in this context.
  3. Thoroughly document when a user-specified jobconf value is job-level vs step-level.

It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.

Ping @tarnfeld and @DavidMarin to confirm everything I just said.

@tarnfeld
Copy link
Contributor

It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.

That to me makes perfect sense, as you described this is an awkward side effect of things being named the same. I fixed it by doing this - simply type checking, which there was a TODO in for anyway. However I think you're right @irskep (task number 1) that we should simply not include it.

Also, should have written regression tests for this, but never got round to it!

@coyotemarin
Copy link
Collaborator

See my comment in #585.

I think what happened is that we got a patch that implemented job-specific steps for only some of the runners, and we enthusiastically integrated it. Which is great, but now we need to finish the job. :)

@tarnfeld
Copy link
Contributor

I believe this issue is fixed by #671.

@ghost ghost assigned coyotemarin Aug 28, 2013
coyotemarin pushed a commit to coyotemarin/mrjob that referenced this issue Sep 6, 2013
scottknight added a commit to timtadh/mrjob that referenced this issue Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)
@coyotemarin coyotemarin removed their assignment Jul 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants