Migrating from dumbo (the other MapReduce Python module) should be pretty easy because its mappers and reducers have the same function signature.
Would be great to have some input from someone who actually uses dumbo so we're not just making stuff up. :)
What were the limitations of dumbo that made you write mrjob? Maybe outlining those differences, whether philosophical or technical, in the docs somewhere would be useful. At the moment I'm just confused which one I'd use.
Good question! Will do. The big advantages mrjob has over dumbo are:
That's not quite true. I haven't looked at mrjob much, but:
Oh, awesome, a visit from the dumbo maintainer!
Neat, didn't know that dumbo has custom switches.
It looks like dumbo's EMR support is more like running dumbo by hand inside EMR, whereas mrjob jobs will actually start up an EMR job flow and launch themselves on EMR (and install mrjob automatically) when you run them with -r emr. But yeah, I stand corrected on EMR as well.
I'll make a point of trying to run by statements about what dumbo can and can't do by you first. I think there's room for more than one Python MapReduce library, and I'm hoping we can work together and learn from each other. :)
By the way, I've taken a look at your typedbytes library. I imagine it'll be pretty useful for mrjob once we support mixing of python scripts and Java (right now it's all Python).
Are there any forks that try to get mrjob to use typedbytes as the interface instead of json? I need to use numeric data almost exclusively in a project and I'm concerned that json will introduce too much overhead.
Not that I know of, but I'd be happy to work with you on getting it to work.
One assumption you might bump up against is that mrjob assumes you're using a line-based format (since it's a Hadoop streaming library). However, it would not be at all difficult to extend the protocol definition to allow arbitrary reading and writing to file handles.
All the code that controls input/output format is in mrjob.job.MRJob.run_mapper(), mrjob.job.MRJob.run_reducer(), and mrjob.job.MRJob.run_combiner(). I would recommend forking master, hacking this code however you need to in order to get typed bytes to work, and then we can figure out the best way to clean it up and generalize once you have something working.
Let me know how it goes!
Hadoop streaming doesn't require a line-based format actually, its default InputWriter and OutputReader just happen to be line-based. With typed bytes you simply use a different InputWriter and OutputReader that rely on a more efficient binary format (and those typed bytes IO classes are also shipped with Hadoop streaming by the way).
Also, with the java and python side I meant streaming and dumbo above. Mrjob might be all python (as is dumbo), but you still need to talk to streaming's java code then. Making this more efficient can lead to very substantial performance gains.
It looks like it won't be too difficult to come up with a proof-of-concept that this works. The user interface for picking a protocol will be a tricky bit. To start with, I'll try and force everything to use typedbytes.
One issue that'll arise is how to distributed the typedbytes python module with the mrjob.tar.gz file.
I have a fork where dgleich@def0452 that does this in a really hacky way. At the moment, it assumes all IO from python is in typedbytes.
dumbo installed via pip install dumbo
pip install dumbo
(pyenv)dgleich@recurrent:~/devextern/mrjob$ HADOOP_HOME=/usr/lib/hadoop-0.20/ python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop -v --no-output -o counts --hadoop-output-format=org.apache.hadoop.mapred.SequenceFileOutputFormat
If I cat the output file, I get garbage now.
Now, let's see the output via dumbo cat (which should convert from typed bytes back to the usual...)
dumbo cat counts/part-* -hadoop /usr/lib/hadoop-0.20
12/02/07 12:57:00 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
--hadoop-output-format only sets the output format for the last step. You probably want to set input/output format directly with e.g. --hadoop-arg -outputformat --hadoop-arg org.apache.hadoop.mapred.SequenceFileOutputFormat (and input format as well) so that you'll be using that format to pass sequence file data between your mapper and reducer.
--hadoop-arg -outputformat --hadoop-arg org.apache.hadoop.mapred.SequenceFileOutputFormat
(And thank you for the tips, @klbostee!)
As I'm thinking about how to implement this in a more structured way, I'm wondering if there is any reason to use a line-based interface between hadoop streaming and mrjob.
Klaas, if you use typedbytes as the communication between the streaming jar and python, you can still write out line-based data by setting the right outputformat, right?
I'm also struggling with thinking about how the typedbytes interface fits into mrjob. That is, python can communicate with java via typedbytes. Records can be stored as typedbytes, and I think any standard writeables get converted to typedbytes to communicate with streaming.
If the answer to the above question is yes, then I don't see any reason to keep the old line based interface around. Am I missing something?
(As an aside, I'll note that one reason to keep an OPTION for a line-based mode around would be promote compatibility with other languages. For instance, if mrjob has the ability to specify other executables to run to process the data... it might not be so easy to use typedbytes with those.)
We picked JSON and a line-based interface because:
That being said, I'd love to know if there are significant efficiency gains to be had by using typed bytes, and I'd love to have a good use case to guide the design of the protocol interface for non line-based formats.
I don't see any way around radically modifying the function job.pick_protocols and job._wrap_protocols
I don't think that modifying _wrap_protocols is an issue as it's an "internal" function. However, is modifying pick_protocols an issue? What (if any) backwards compatibility should I consider with changing this function?
It's totally fine to override pick_protocols() (or to rewrite it while hacking out a proof-of-concept). See http://packages.python.org/mrjob/job.html#mrjob.job.MRJob.pick_protocols.
I'm working with a student on implementing it ... here's the current repository he's been working on. I just haven't had a chance to test it yet.
See ... joshi-prahlad@b0b005a for the most recent changes.
Cool, but it's really confusing to me that we're doing all this typedbytes discussion on the "dumbo migration in mrjob docs" issue. Can you please open a new one with what you actually plan to do?
And my "are we actually doing this" question was about the docs, not typedbytes.
Just added #430 about the typedbytes piece of this issue.
IMO we should close this issue. If someone wants to volunteer to write it, they have my endless thanks, but we'll never do it ourselves and no one has asked for it.