add support for combiners #74

Closed
coyotemarin opened this issue Dec 15, 2010 · 3 comments

@coyotemarin
Collaborator

dumbo has a clever trick for supporting combiners: it basically appends | sort | python mr_script.py --combiner to the end of the mapper command line. mrjob should do this as well.
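A minimal sketch of what building such a command line could look like. The helper name `mapper_command_with_combiner` and the `--mapper` switch are hypothetical, chosen only for illustration; this is not dumbo's or mrjob's actual code, and only the `| sort | ... --combiner` shape comes from the description above.

```python
import shlex


def mapper_command_with_combiner(script, python_bin="python"):
    """Build a shell pipeline: mapper, then sort, then combiner.

    Hypothetical helper: the name and the --mapper switch are
    assumptions made for this sketch.
    """
    quoted = shlex.quote(script)  # guard against spaces in the path
    mapper = f"{python_bin} {quoted} --mapper"
    combiner = f"{python_bin} {quoted} --combiner"
    # sort groups identical keys together so the combiner sees each
    # key's values as one contiguous run
    return f"{mapper} | sort | {combiner}"


print(mapper_command_with_combiner("mr_script.py"))
```

Because the result contains a pipe, whatever launches it has to pass it through a shell rather than executing it as a single program.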

We'll need to update our (lame) internal representation of steps, and update the --steps protocol to include combiners.

@ghost ghost assigned coyotemarin Apr 11, 2011
@irskep
Contributor

irskep commented Jun 10, 2011

It also appears that Hadoop 0.20 supports arbitrary scripts for combiners, so we could just declare that we support it for that version and up.

(Let it be known that I am interested in having/doing this, as it brings significant opportunities for optimization.)

@coyotemarin
Collaborator Author

Oh, sweet; I knew they added it, but I thought it was in a later version.

It's not a big deal to hack it with | sort | my_your_job.py --combiner for earlier versions, but we should definitely use -combiner in versions where it's available.
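A rough sketch of that version switch, assuming plain dotted version strings such as "0.18" or "0.20.2" and taking 0.20 as the cutoff mentioned earlier in this thread. The function names are hypothetical and are not part of mrjob; `-mapper` and `-combiner` are the Hadoop Streaming options being discussed.

```python
def parse_version(version):
    """Turn '0.20.2' into (0, 20, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def combiner_strategy(hadoop_version, flag_min_version="0.20"):
    """Return 'flag' to use -combiner, or 'pipe' for the | sort | hack."""
    if parse_version(hadoop_version) >= parse_version(flag_min_version):
        return "flag"
    return "pipe"


def streaming_args(hadoop_version, mapper_cmd, combiner_cmd):
    """Build the mapper/combiner part of a streaming argument list."""
    if combiner_strategy(hadoop_version) == "flag":
        return ["-mapper", mapper_cmd, "-combiner", combiner_cmd]
    # Older versions: fold the combiner into the mapper command.
    # Depending on how the command is launched, the pipeline may need
    # to be wrapped in a shell (e.g. bash -c) for '|' to be interpreted.
    return ["-mapper", f"{mapper_cmd} | sort | {combiner_cmd}"]
```

Comparing tuples of integers rather than raw strings avoids "0.9" sorting after "0.20". Version strings with non-numeric suffixes would need extra handling that this sketch leaves out.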

@irskep
Contributor

irskep commented Aug 4, 2011

My other tasks are done or blocked so I started to think about what needs to be done here.

MRJob:

  • Define combiner(), combiner_init(), and combiner_final()
  • Add those methods to the step representation, mr(), steps(), and show_steps()
  • Write run_combiner() and add it to execute()
  • Command line options / is_mapper_or_reducer()
  • Integrate with protocols
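To make the run_combiner() item concrete, here is a standalone sketch of the core loop it might need. It assumes tab-separated key/value lines that are already sorted by key and integer values; real protocol handling and the combiner_init()/combiner_final() hooks are left out, and none of this is mrjob's actual implementation.

```python
from itertools import groupby


def sum_combiner(key, values):
    """Example combiner: pre-aggregate the counts for one key."""
    yield key, sum(values)


def run_combiner(lines, combiner=sum_combiner):
    """Group sorted 'key<TAB>value' lines by key and apply the combiner.

    Assumes the input is already sorted by key (e.g. by `sort`), so
    all lines for a given key are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = (int(value) for _, value in group)
        for out_key, out_value in combiner(key, values):
            yield f"{out_key}\t{out_value}"


for out in run_combiner(["a\t1\n", "a\t2\n", "b\t5\n"]):
    print(out)
```

In a real job the lines would come from stdin and the output would go to stdout, so the same loop could serve both the -combiner path and the | sort | fallback.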

Other:

  • Add logic to mrjob.compat for choosing between -combiner and | sort...
  • Add | sort... to local and inline runners or have them call combine() directly
  • Add | sort... or -combiner to EMRJobRunner and HadoopJobRunner
  • Figure out the logging situation (where are combiner failures written? probably just need to update TASK_ATTEMPTS_LOG_URI_RE.)
  • Update examples and documentation

Oh, and boto doesn't support combiners at all.

@coyotemarin coyotemarin removed their assignment Jul 8, 2015