It wouldn't actually be that difficult for MRJob to run scripts written in other languages if they implemented the MRJob protocol (--steps, --mapper, --reducer, and --step-num). Instead of prepending python to our command inside Hadoop streaming, we'd prepend ruby or java or (for shell scripts) nothing. We'd probably run them like:
python mrjob.job.MRJob --mr-job-script mr_perform_awesomeness.rb
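To make the "prepend ruby or nothing" idea concrete, here is a rough sketch of how the launcher side could pick the interpreter. The `INTERPRETERS` table and the `streaming_command` helper are made up for illustration and are not part of mrjob; Java is left out because it would need a class or jar name rather than a script path.

```python
import os

# Hypothetical mapping from file extension to the interpreter we'd prepend.
# An empty list means "prepend nothing" (the script is executable on its own,
# e.g. a shell script with a shebang line).
INTERPRETERS = {
    '.py': ['python'],
    '.rb': ['ruby'],
    '.sh': [],
}


def streaming_command(script_path, switches):
    """Build the argv for one Hadoop streaming task.

    script_path: the job script, in any language
    switches: protocol switches such as ['--step-num=0', '--mapper']
    """
    ext = os.path.splitext(script_path)[1]
    # Unknown extensions are treated like shell scripts: run them directly.
    prefix = INTERPRETERS.get(ext, [])
    return prefix + [script_path] + list(switches)
```

For example, `streaming_command('mr_wc.rb', ['--step-num=0', '--mapper'])` gives `['ruby', 'mr_wc.rb', '--step-num=0', '--mapper']`.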
The main thing is, I'm not sure there's any demand for such a feature.
Tell you what, you write the base MRJob class in your favorite language and put it up on github, and I'll hook it up to mrjob for you. :)
Also --combiner. (Just keeping this up to date)
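For anyone wondering what "implementing the protocol" would mean for a script, here is a minimal word-count sketch that answers the five switches. It is in Python only so it can be compared against a real MRJob; a Ruby or shell version would do the same thing. The text printed for `--steps` is a placeholder I made up, not the actual format mrjob expects, and the tab-separated line format is likewise an assumption.

```python
import argparse
import itertools


def mapper(lines):
    # Emit (word, 1) for every whitespace-separated word.
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    # Input must already be sorted by key, which Hadoop streaming does
    # between the map and reduce phases.
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def parse_pairs(lines):
    # ASSUMPTION: one "word<TAB>count" pair per line.
    for line in lines:
        word, count = line.rstrip('\n').split('\t')
        yield word, int(count)


def main(argv, stdin, stdout):
    parser = argparse.ArgumentParser()
    parser.add_argument('--steps', action='store_true')
    parser.add_argument('--mapper', action='store_true')
    parser.add_argument('--combiner', action='store_true')
    parser.add_argument('--reducer', action='store_true')
    # This job has a single step, so the step number is accepted but unused.
    parser.add_argument('--step-num', type=int, default=0)
    opts = parser.parse_args(argv)

    if opts.steps:
        # ASSUMPTION: placeholder description of the one step in this job.
        stdout.write('mapper combiner reducer\n')
    elif opts.mapper:
        for word, count in mapper(stdin):
            stdout.write('%s\t%d\n' % (word, count))
    elif opts.combiner or opts.reducer:
        # Summing is associative, so the combiner can reuse the reducer.
        for word, count in reducer(parse_pairs(stdin)):
            stdout.write('%s\t%d\n' % (word, count))

# A real script would finish with:
#     main(sys.argv[1:], sys.stdin, sys.stdout)
```

Taking the streams as arguments instead of touching `sys.stdin` directly is just to make the sketch easy to poke at from a test.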
Other things involved:
We need to incorporate Hadoop input/output format, partitioner, and jobconf into the steps format.
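One way this could look, purely as a strawman: have `--steps` print a JSON list with one object per step, with the Hadoop-specific settings as optional keys. Every field name below is hypothetical and nothing here is an existing mrjob format; only the Hadoop class names and the jobconf key in the usage line are real.

```python
import json


def describe_step(mapper=True, combiner=False, reducer=True,
                  input_format=None, output_format=None,
                  partitioner=None, jobconf=None):
    """Return a dict describing one step (hypothetical field names)."""
    step = {'mapper': mapper, 'combiner': combiner, 'reducer': reducer}
    # Only include the Hadoop-specific keys the job actually sets, so
    # simple jobs keep a short description.
    if input_format:
        step['input_format'] = input_format
    if output_format:
        step['output_format'] = output_format
    if partitioner:
        step['partitioner'] = partitioner
    if jobconf:
        step['jobconf'] = dict(jobconf)
    return step


def dump_steps(steps):
    # What a script might print in response to --steps.
    return json.dumps(steps, sort_keys=True)
```

For example, `dump_steps([describe_step(input_format='org.apache.hadoop.mapred.lib.NLineInputFormat', jobconf={'mapred.reduce.tasks': '1'})])` would describe a single step with a custom input format and one jobconf setting.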
Default runner is inline.
MRJobLauncher does not support inline.
We'll have to have different defaults, I suppose.
Yeah, that's true. But reasonable.
I can imagine people not even realizing that inline and local modes have names, and just thinking: the job runs itself inline, and if I want to simulate Hadoop, I can run it from mrjob-launch (or whatever we call the binary).
This pretty much works now, but the interface for invoking it is pretty hacky. Rest of the issue is in #225.