Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656
Comments
That makes perfect sense to me; as you described, this is an awkward side effect of things being named the same. I fixed it by doing this: simply type checking, which there was a TODO for anyway. However, I think you're right, @irskep (task number 1), that we should simply not include it. Also, I should have written regression tests for this, but never got round to it!
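That type check might look something like the sketch below (hypothetical helper name; not the actual patch): only treat a `jobconf` value as step-level configuration if it is actually a dict, since an overridden `jobconf()` method is job-level and should be skipped when building the step description.

```python
# Hypothetical sketch of the type-checking fix described above (not the
# actual patch). A step-level jobconf must be a dict; an overridden
# MRJob.jobconf() shows up here as a bound method and must be excluded.

def step_jobconf_or_none(value):
    """Return value if it is a usable step-level jobconf, else None."""
    if isinstance(value, dict):
        return value
    # a bound method (an overridden jobconf()) is job-level config,
    # not step-level; returning None keeps it out of the step
    return None


class FakeJob:
    def jobconf(self):
        return {'mapred.reduce.tasks': '2'}


# a plain dict passes through unchanged
assert step_jobconf_or_none({'mapred.map.tasks': '1'}) == \
    {'mapred.map.tasks': '1'}
# an overridden jobconf() method is filtered out
assert step_jobconf_or_none(FakeJob().jobconf) is None
```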
See my comment in #585. I think what happened is that we got a patch that implemented job-specific steps for only some of the runners, and we enthusiastically integrated it. Which is great, but now we need to finish the job. :)
I believe this issue is fixed by #671.
secondary sort and self-terminating job flows

* jobs:
  * SORT_VALUES: Secondary sort by value (Yelp#240)
    * see mrjob/examples/
  * can now override jobconf() again (Yelp#656)
  * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
  * examples:
    * bash_wrap/ (mapper/reducer_cmd() example)
    * mr_most_used_word.py (two step job)
    * mr_next_word_stats.py (SORT_VALUES example)
* runners:
  * All runners:
    * single --setup option works but is not yet documented (Yelp#206)
    * setup now uses sh rather than python internally
  * EMR runner:
    * max_hours_idle: self-terminating idle job flows (Yelp#628)
    * mins_to_end_of_hour option gives finer control over self-termination
    * Can reuse pooled job flows where previous job failed (Yelp#633)
    * Throws IOError if output path already exists (Yelp#634)
    * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
    * Automatically infers EMR/S3 endpoints from region (Yelp#658)
    * ls() supports s3n:// schema (Yelp#672)
    * Fixed log parsing crash on JarSteps (Yelp#645)
    * visible_to_all_users works with boto <2.8.0 (Yelp#701)
    * must use --interpreter with non-Python scripts (Yelp#683)
    * cat() can decompress gzipped data (Yelp#601)
  * Hadoop runner:
    * check_input_paths: can disable input path checking (Yelp#583)
    * cat() can decompress gzipped data (Yelp#601)
  * Inline/Local runners:
    * Fixed counter parsing for multi-step jobs in inline mode
    * Supports per-step jobconf (Yelp#616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
* deprecated:
  * many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
  * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
  * built-in protocols must be instances (Yelp#488)
Long explanation of the problem
You can specify jobconfs either for the entire job or for each individual step... sort of.

Command line options, config file options, and the `MRJob.jobconf()` method all refer to the same job-level set of jobconf values. The step object also accepts a `jobconf` kwarg, which it expects to be a dict.

But there's a lurking issue: if you try to override `MRJob.jobconf()` without also overriding `MRJob.steps()`, everything explodes. (Except not in local/inline mode, due to #655.) The explosion symptoms are documented in #585.

Example demonstrating exact way to reproduce:
By default, `steps()` returns `[MRJobStep(mapper=self.mapper, ..., jobconf=self.jobconf)]`, where each key is only included if it has been overridden in the subclass. But including `jobconf` in that list has never been valid, and apparently no test cases exist for it. Oops!

One could make an argument that `jobconf()` should be treated the same as `mapper()`, i.e. only applying to the first step and only if `steps()` isn't implemented, but its behavior for the entire lifespan of mrjob has been to return job-level jobconf values, not step-level.

Suggested course of action
1. Remove `jobconf` from the list of keys taken from the `MRJob` class if `steps()` is not specified.
2. Write regression tests for `jobconf()` and any other methods that are missing test coverage in this context.

It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.
Ping @tarnfeld and @DavidMarin to confirm everything I just said.