Skip to content
This repository

Add "migrating from dumbo" section to docs #11

Closed
davidmarin opened this Issue · 22 comments

5 participants

David Marin James Wheare Klaas Bosteels David Gleich Steve Johnson
David Marin
Collaborator

Migrating from dumbo (the other MapReduce Python module) should be pretty easy because its mappers and reducers have the same function signature.

Would be great to have some input from someone who actually uses dumbo so we're not just making stuff up. :)

James Wheare

What were the limitations of dumbo that made you write mrjob? Maybe outlining those differences, whether philosophical or technical, in the docs somewhere would be useful. At the moment I'm just confused which one I'd use.

David Marin
Collaborator

Good question! Will do. The big advantages mrjob has over dumbo are:

  • works with EMR
  • JSON and other protocols for communication between steps and with other processes
  • add custom switches to your jobs, including file options
Klaas Bosteels

That's not quite true. I haven't looked at mrjob much, but:

David Marin
Collaborator

Oh, awesome, a visit from the dumbo maintainer!

Neat, didn't know that dumbo has custom switches.

It looks like dumbo's EMR support is more like running dumbo by hand inside EMR, whereas mrjob jobs will actually start up an EMR job flow and launch themselves on EMR (and install mrjob automatically) when you run them with -r emr. But yeah, I stand corrected on EMR as well.

I'll make a point of trying to run by statements about what dumbo can and can't do by you first. I think there's room for more than one Python MapReduce library, and I'm hoping we can work together and learn from each other. :)

By the way, I've taken a look at your typedbytes library. I imagine it'll be pretty useful for mrjob once we support mixing of python scripts and Java (right now it's all Python).

David Gleich

Are there any forks that try to get mrjob to use typedbytes as the interface instead of json? I need to use numeric data almost exclusively in a project and I'm concerned that json will introduce too much overhead.

David Marin
Collaborator

Not that I know of, but I'd be happy to work with you on getting it to work.

One assumption you might bump up against is that mrjob assumes you're using a line-based format (since it's a Hadoop streaming library). However, it would not be at all difficult to extend the protocol definition to allow arbitrary reading and writing to file handles.

All the code that controls input/output format is in mrjob.job.MRJob.run_mapper(), mrjob.job.MRJob.run_reducer(), and mrjob.job.MRJob.run_combiner(). I would recommend forking master, hacking this code however you need to in order to get typed bytes to work, and then we can figure out the best way to clean it up and generalize once you have something working.

Let me know how it goes!

Klaas Bosteels

Hadoop streaming doesn't require a line-based format actually, its default InputWriter and OutputReader just happen to be line-based. With typed bytes you simply use a different InputWriter and OutputReader that rely on a more efficient binary format (and those typed bytes IO classes are also shipped with Hadoop streaming by the way).

Also, with the java and python side I meant streaming and dumbo above. Mrjob might be all python (as is dumbo), but you still need to talk to streaming's java code then. Making this more efficient can lead to very substantial performance gains.

David Gleich

It looks like it won't be too difficult to come up with a proof-of-concept that this works. The user interface for picking a protocol will be a tricky bit. To start with, I'll try and force everything to use typedbytes.

David Gleich

One issue that'll arise is how to distributed the typedbytes python module with the mrjob.tar.gz file.

David Gleich

I have a fork where dgleich@def0452 that does this in a really hacky way. At the moment, it assumes all IO from python is in typedbytes.

  • Tested on CDH3 in single node pseudo-distributed mode.
  • typedbytes module installed system-wide
  • dumbo installed via pip install dumbo

    (pyenv)dgleich@recurrent:~/devextern/mrjob$ HADOOP_HOME=/usr/lib/hadoop-0.20/ python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop -v --no-output -o counts --hadoop-output-format=org.apache.hadoop.mapred.SequenceFileOutputFormat

If I cat the output file, I get garbage now.

Now, let's see the output via dumbo cat (which should convert from typed bytes back to the usual...)

dumbo cat counts/part-* -hadoop /usr/lib/hadoop-0.20
12/02/07 12:57:00 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
_   5
a   2
e   1
g   1
r   2
an  2
by  1
David Marin
Collaborator

Oh, awesome.

--hadoop-output-format only sets the output format for the last step. You probably want to set input/output format directly with e.g. --hadoop-arg -outputformat --hadoop-arg org.apache.hadoop.mapred.SequenceFileOutputFormat (and input format as well) so that you'll be using that format to pass sequence file data between your mapper and reducer.

(And thank you for the tips, @klbostee!)

David Gleich

As I'm thinking about how to implement this in a more structured way, I'm wondering if there is any reason to use a line-based interface between hadoop streaming and mrjob.

Klaas, if you use typedbytes as the communication between the streaming jar and python, you can still write out line-based data by setting the right outputformat, right?

I'm also struggling with thinking about how the typedbytes interface fits into mrjob. That is, python can communicate with java via typedbytes. Records can be stored as typedbytes, and I think any standard writeables get converted to typedbytes to communicate with streaming.

If the answer to the above question is yes, then I don't see any reason to keep the old line based interface around. Am I missing something?

(As an aside, I'll note that one reason to keep an OPTION for a line-based mode around would be promote compatibility with other languages. For instance, if mrjob has the ability to specify other executables to run to process the data... it might not be so easy to use typedbytes with those.)

David Marin
Collaborator

We picked JSON and a line-based interface because:

  • The default interface to Hadoop Streaming is line-based
  • The line-based record format is easy to understand
  • Basically all our input files are line-based log files
  • JSON is human-readable
  • JSON is about the most interoperable format around

That being said, I'd love to know if there are significant efficiency gains to be had by using typed bytes, and I'd love to have a good use case to guide the design of the protocol interface for non line-based formats.

David Gleich

I don't see any way around radically modifying the function job.pick_protocols and job._wrap_protocols

I don't think that modifying _wrap_protocols is an issue as it's an "internal" function. However, is modifying pick_protocols an issue? What (if any) backwards compatibility should I consider with changing this function?

David Marin
Collaborator

It's totally fine to override pick_protocols() (or to rewrite it while hacking out a proof-of-concept). See http://packages.python.org/mrjob/job.html#mrjob.job.MRJob.pick_protocols.

Steve Johnson
Collaborator

Two questions:

  • Do we ever plan to actually add this to the docs?
  • Should we open a new issue about typedbytes? How's that going?
David Gleich

I'm working with a student on implementing it ... here's the current repository he's been working on. I just haven't had a chance to test it yet.

See ... joshi-prahlad@b0b005a for the most recent changes.

David

Steve Johnson
Collaborator

Cool, but it's really confusing to me that we're doing all this typedbytes discussion on the "dumbo migration in mrjob docs" issue. Can you please open a new one with what you actually plan to do?

Steve Johnson
Collaborator

And my "are we actually doing this" question was about the docs, not typedbytes.

David Gleich

Just added #430 about the typedbytes piece of this issue.

Steve Johnson
Collaborator

IMO we should close this issue. If someone wants to volunteer to write it, they have my endless thanks, but we'll never do it ourselves and no one has asked for it.

Steve Johnson irskep closed this
David Marin
Collaborator

Yep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.