rework protocols to handle non-line-based data #715

Open
davidmarin opened this Issue Aug 29, 2013 · 10 comments

3 participants
Collaborator

davidmarin commented Aug 29, 2013

We've gotten a lot of mail on the mrjob mailing list lately from people who are trying to use mrjob with non-line-based data and finding it a struggle.

Here's what I propose. Rather than read() and write(), protocols will need to define decode() and encode() as follows:

def decode(self, input):
    """Read bytestrings from *input* (a generator) and yield tuples of (key, value)."""
    ...

def encode(self, output):
    """Read tuples of (key, value) from *output* (a generator) and yield bytestrings."""
    ...
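To make the proposal concrete, here is a minimal sketch of what a new-style protocol for non-line-based data might look like. Everything in it is an assumption made for illustration: the class name LengthPrefixedJSONProtocol and the wire format (a 4-byte big-endian length followed by JSON) are hypothetical, not part of mrjob. Only the decode()/encode() signatures come from the proposal above.

```python
import json
import struct


class LengthPrefixedJSONProtocol(object):
    """Hypothetical new-style protocol (not part of mrjob).

    Each record is a 4-byte big-endian length followed by that many
    bytes of UTF-8 JSON encoding the list [key, value].
    """

    HEADER = struct.Struct('>I')

    def decode(self, input):
        """Read bytestrings from *input* and yield (key, value) tuples."""
        buf = b''
        for chunk in input:
            buf += chunk
            # Emit every complete record currently in the buffer.
            while len(buf) >= self.HEADER.size:
                (length,) = self.HEADER.unpack_from(buf)
                end = self.HEADER.size + length
                if len(buf) < end:
                    break  # record continues in the next chunk
                payload = buf[self.HEADER.size:end]
                buf = buf[end:]
                key, value = json.loads(payload.decode('utf-8'))
                yield key, value
        if buf:
            # Always strict: bail out on unexpected data.
            raise ValueError('truncated record at end of input')

    def encode(self, output):
        """Read (key, value) tuples from *output* and yield bytestrings."""
        for key, value in output:
            payload = json.dumps([key, value]).encode('utf-8')
            yield self.HEADER.pack(len(payload)) + payload
```

Because decode() keeps its own buffer, chunk boundaries don't have to line up with record boundaries; the caller can hand it data in whatever sizes are convenient.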

The --strict-protocols flag doesn't matter to these new-style protocols (they're always strict). If you really needed loose protocols, you could build a protocol that detects bad data, increments a counter, and continues (you'd have to pass the job or its increment_counter method to the protocol's constructor). How well that works depends on how your data is divided into records: if it's line-based, you just discard bad lines, but if you were reading, say, XML, an open tag in the wrong place could confuse everything.
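A rough sketch of such a "loose" protocol for the line-based case follows. The class name LooseJSONLinesProtocol, the counter group and counter names, and the two-argument call to the counter function are all assumptions for illustration. It also assumes decode() is fed one line per bytestring, which is not guaranteed in general.

```python
import json


class LooseJSONLinesProtocol(object):
    """Hypothetical loose protocol (not part of mrjob).

    Each line is UTF-8 JSON encoding [key, value]. Bad lines are
    counted and skipped rather than raising an exception.
    """

    def __init__(self, increment_counter):
        # Assumed to be a callable taking (group, counter), e.g. a
        # job's increment_counter method.
        self.increment_counter = increment_counter

    def decode(self, input):
        for line in input:
            try:
                key, value = json.loads(line.decode('utf-8'))
            except (ValueError, TypeError):
                # Not valid UTF-8/JSON, or not a two-item sequence.
                self.increment_counter('Undecodable input', 'bad line')
                continue
            yield key, value

    def encode(self, output):
        for key, value in output:
            yield json.dumps([key, value]).encode('utf-8') + b'\n'
```

This only works because a bad line can be discarded without affecting its neighbours; a format with nested structure would need its own recovery strategy.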

We could eventually deprecate the old-style protocols, though hiding them in an obscure corner of the documentation would probably be sufficient to reduce confusion. I mean, there's nothing wrong with the old interface; I'd just rather people only had to know about two methods rather than four.

Collaborator

davidmarin commented Aug 29, 2013

Note this is totally independent from whether --strict-protocols should be enabled by default (see #696).

Collaborator

davidmarin commented Sep 2, 2013

I'm thinking that passing file objects to protocols might be too much to promise. File objects have lots of methods, and a protocol could potentially use all of them.

One thing we could do is pass a function that returns a chunk of data to load() and a function that allows writing a chunk of data to dump().

However, load() is inevitably going to have to buffer some unused data, so it would really make more sense for it to be an iterator that yields key-value pairs until it gets to the end of the data.

For symmetry, it might make sense for dump() to take a write method and an iterator that yields the key, value pairs.

Using iterators basically makes it impossible for MRJob to implement loose protocols (turning exceptions into Hadoop counters) with these new-style protocols, but maybe that's not a bad thing. It certainly makes things easier to implement if you can just bail out when you encounter unexpected data.

This is an elegant model, but it's harder to understand than the current read/write one, so maybe we'll just keep the old way around forever too.

Collaborator

davidmarin commented Sep 3, 2013

decode() and encode() are probably more apropos names for the iterator-based approach.

Collaborator

davidmarin commented Dec 13, 2013

Updated issue description to use decode() and encode().

Collaborator

davidmarin commented Dec 13, 2013

It might be feasible to specify that decode() is fed lines (chunks of bytes ending in \n), so as not to hurt performance in text mode. I think Hadoop adds newlines after each record no matter what, so that reading input one line at a time is a safe strategy (see my comment in #723).

@davidmarin davidmarin modified the milestone: v0.4.4, v0.4.3 Feb 10, 2014

Collaborator

davidmarin commented Feb 10, 2014

Nope, there's no guarantee of newlines (see #723). We just need to ensure that buffer_iterator_to_line_iterator() (to be renamed to yield_lines()) creates as little overhead as possible when fed lines.
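For illustration, here is one way a yield_lines()-style helper could keep overhead low when its input already consists of lines. This is a sketch under my own assumptions, not the actual implementation of buffer_iterator_to_line_iterator(): a chunk that is exactly one complete line is passed through untouched, and anything else goes through a slower regrouping path.

```python
def yield_lines(chunks):
    """Regroup an iterator of bytestrings into newline-terminated lines.

    The last line is yielded without a trailing newline if the input
    doesn't end with one.
    """
    leftover = b''
    for chunk in chunks:
        first_newline = chunk.find(b'\n')
        if first_newline == -1:
            # No complete line yet; keep accumulating.
            leftover += chunk
            continue
        if not leftover and first_newline == len(chunk) - 1:
            # Fast path: chunk is already exactly one line.
            yield chunk
            continue
        # Slow path: split on newlines and carry over the remainder.
        pieces = (leftover + chunk).split(b'\n')
        leftover = pieces.pop()
        for piece in pieces:
            yield piece + b'\n'
    if leftover:
        yield leftover
```

On the fast path the only extra work per line is a single find() call, with no copying or concatenation.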

What is the status of this? I need this now and I can help out on ongoing work.

Contributor

irskep commented Jul 12, 2014

AFAIK no one is working on it.

Thank you for the heads up. I desperately need this to read Avro binary data. I can try to take a crack at it. Do you know the scope of the work?

Wai Yip


Contributor

irskep commented Jul 14, 2014

I think you could have a hacky version working in 2 hours and a "correct" version in 4. But that's a vague, unreliable estimate from someone who hasn't worked with the code in two years.

@davidmarin davidmarin modified the milestone: v0.4.4, on the radar Apr 7, 2015
