You can clone with
HTTPS or Subversion.
We've gotten a lot of mail on the mrjob mailing list lately from people who are trying to use mrjob with non line-based data and finding it a struggle.
Here's what I propose. Rather than read() and write(), protocols will need to define decode() and encode() as follows:
def decode(self, input):
"""Read bytestrings from *input* (a generator) and yield tuples of (key, value)."""
def encode(self, output):
"""Read tuples of (key, value) from *output* (a generator) and yield bytestrings."""
The --strict-protocols flag doesn't matter to these new-style protocols (they're always strict). If you really needed loose protocols, you could definitely build a protocol that could detect bad data, increment a counter, and continue (you'd have to pass the job or the increment_counter method to the protocol's constructor). Kind of depends how your data is divided into records; if it's line-based, you just discard bad lines, but if you were reading, say, XML, an open tag in the wrong place could confuse everything.
We could eventually deprecate the old-style protocols, though hiding them in an obscure corner of the documentation would probably be sufficient to reduce confusion. I mean, there's nothing wrong with the old interface; I'd just rather people only had to know about two methods rather than four.
Note this is totally independent from whether --strict-protocols should be enabled by default (see #696).
I'm thinking that passing file objects to protocols might be too much to promise. File objects have lots of methods, and a protocol could potentially use all of them.
One thing we could do is pass a function that returns a chunk of data to load() and a function that allows writing a chunk of data to dump().
However, load() is inevitably going to have to buffer some unused data, so it would really make more sense for it to be an iterator that yields key-value pairs until it gets to the end of the data.
For symmetry, it might make sense for dump() to take a write method and an iterator that yields the key, value pairs.
Using iterators basically makes it impossible for MRJob to implement loose protocols (turning exceptions into Hadoop counters) with these new-style protocols, but maybe that's not a bad thing. It certainly makes things easier to implement if you can just bail out when you encounter unexpected data.
This is an elegant model, but it's harder to understand than the current read/write one, so maybe we'll just keep the old way around forever too.
decode() and encode() are probably more apropos names for the iterator-based approach.
Updated issue description to use decode() and encode().
It might be feasible to specify that decode() is fed lines (chunks of bytes ending in \n), so as not to hurt performance in text mode. I think Hadoop adds newlines after each record no matter what, so that reading input one line at a time is a safe strategy (see my comment in #723).
Nope, there's no guarantee of newlines (see #723). We just need to ensure that buffer_iterator_to_line_iterator() (to be renamed to yield_lines()) creates as little overhead as possible when fed lines.
What is the status of this? I need this now and I can help out on ongoing work.
AFAIK no one is working on it.
I think you could have a hacky version working in 2 hours and a "correct" version in 4. But that's a vague, unreliable estimate from someone who hasn't worked with the code in two years.