Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

irskep · 2013-06-28T00:09:50Z

Long explanation of the problem

You can specify jobconfs for both the entire job or for each individual step...sort of.

Command line options, config file options, and the MRJob.jobconf() method all refer to the same job-level set of jobconf values.

The step object also accepts a jobconf kwarg which it expects to be a dict.

But there's a lurking issue: if you try to override MRJob.jobconf(), without also overriding MRJob.steps(), everything explodes. (Except not in local/inline mode, due to #655.) The explosion symptoms are documented in #585.

Example demonstrating exact way to reproduce:

import os
from mrjob.job import MRJob

class Job(MRJob):

    # comment this method out to get a crash
    def steps(self):
        return [self.mr(mapper=self.mapper)]

    def jobconf(self):
        return {}

    def mapper(self, _, v):
        for k, v in os.environ.iteritems():
            yield k, v

if __name__ == '__main__':
    Job().run()

By default, steps() returns [MRJobStep(mapper=self.mapper, ..., jobconf=self.jobconf)], where each key is only included if it has been overridden in the subclass.

But including jobconf in that list has never been valid, and apparently no test cases exist for it. Oops! One could make an argument that jobconf() should be treated the same as mapper(), i.e. only applying to the first step and only if steps() isn't implemented, but its behavior for the entire lifespan of mrjob has been to return job-level jobconf values, not step-level.

Suggested course of action

Remove jobconf from the list of keys taken from the MRJob class if steps() is not specified.
Write test cases for overriding jobconf() and any other methods that are missing test coverage in this context.
Thoroughly document when a user-specified jobconf value is job-level vs step-level.

It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.

Ping @tarnfeld and @DavidMarin to confirm everything I just said.

The text was updated successfully, but these errors were encountered:

tarnfeld · 2013-06-28T09:05:31Z

It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.

That to me makes perfect sense, as you described this is an awkward side effect of things being named the same. I fixed it by doing this - simply type checking, which there was a TODO in for anyway. However I think you're right @irskep (task number 1) that we should simply not include it.

Also, should have written regression tests for this, but never got round to it!

coyotemarin · 2013-06-29T06:01:26Z

See my comment in #585.

I think what happened is that we got a patch that implemented job-specific steps for only some of the runners, and we enthusiastically integrated it. Which is great, but now we need to finish the job. :)

tarnfeld · 2013-07-20T19:10:59Z

I believe this issue is fixed by #671.

secondary sort and self-terminating job flows * jobs: * SORT_VALUES: Secondary sort by value (Yelp#240) * see mrjob/examples/ * can now override jobconf() again (Yelp#656) * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env() * examples: * bash_wrap/ (mapper/reducer_cmd() example) * mr_most_used_word.py (two step job) * mr_next_word_stats.py (SORT_VALUES example) * runners: * All runners: * single --setup option works but is not yet documented (Yelp#206) * setup now uses sh rather than python internally * EMR runner: * max_hours_idle: self-terminating idle job flows (Yelp#628) * mins_to_end_of_hour option gives finer control over self-termination. * Can reuse pooled job flows where previous job failed (Yelp#633) * Throws IOError if output path already exists (Yelp#634) * Gracefully handles SSL cert issues (Yelp#621, Yelp#706) * Automatically infers EMR/S3 endpoints from region (Yelp#658) * ls() supports s3n:// schema (Yelp#672) * Fixed log parsing crash on JarSteps (Yelp#645) * visible_to_all_users works with boto <2.8.0 (Yelp#701) * must use --interpreter with non-Python scripts (Yelp#683) * cat() can decompress gzipped data (Yelp#601) * Hadoop runner: * check_input_paths: can disable input path checking (Yelp#583) * cat() can decompress gzipped data (Yelp#601) * Inline/Local runners: * Fixed counter parsing for multi-step jobs in inline mode * Supports per-step jobconf (Yelp#616) * Documentation revamp * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686) * deprecated: * many constants in mrjob.emr replaced with functions in mrjob.aws * removed deprecated features: * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747) * built-in protocols must be instances (Yelp#488)

tarnfeld mentioned this issue Jun 28, 2013

No longer able to extend the jobconf method on a job class #585

Closed

ghost assigned coyotemarin Aug 28, 2013

coyotemarin pushed a commit to coyotemarin/mrjob that referenced this issue Sep 6, 2013

added a regression test for Yelp#656 (it fails, as expected)

90eebd0

coyotemarin mentioned this issue Sep 6, 2013

fix for overriding jobconf() method #734

Merged

coyotemarin closed this as completed in 39d98f0 Sep 6, 2013

coyotemarin removed their assignment Jul 8, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

irskep commented Jun 28, 2013

tarnfeld commented Jun 28, 2013

coyotemarin commented Jun 29, 2013

tarnfeld commented Jul 20, 2013

Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

Comments

irskep commented Jun 28, 2013

Long explanation of the problem

Suggested course of action

tarnfeld commented Jun 28, 2013

coyotemarin commented Jun 29, 2013

tarnfeld commented Jul 20, 2013