
hadoop_io step field #723

Closed
coyotemarin opened this issue Sep 3, 2013 · 10 comments

@coyotemarin
Collaborator

Individual steps should be able to specify a value to pass to Hadoop with the -io flag (e.g. text, rawbytes, typedbytes).

New-style protocols should have an optional HADOOP_IO field that we can use to infer this (so, for example, a TypedBytesProtocol could imply -io typedbytes).
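Roughly the shape I have in mind (`HADOOP_IO` and `TypedBytesProtocol` are hypothetical here, just to illustrate the proposal):

```python
# Hypothetical sketch only: neither HADOOP_IO nor TypedBytesProtocol exists
# in mrjob today. The idea is that a runner could inspect a step's protocol
# and add `-io typedbytes` (or the equivalent jobconf) to the streaming
# command line for that step.

class TypedBytesProtocol(object):
    """Illustrative protocol that would imply -io typedbytes."""

    HADOOP_IO = 'typedbytes'  # proposed field; value passed straight to -io

    def read(self, line):
        # decoding of a typed bytes record into (key, value) would go here
        raise NotImplementedError

    def write(self, key, value):
        # encoding of (key, value) into typed bytes would go here
        raise NotImplementedError
```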

@coyotemarin
Collaborator Author

There's an equivalent jobconf, so maybe we don't need to add an extra option?

@tarnfeld
Contributor

tarnfeld commented Dec 9, 2013

I've been using the JOBCONF options stream.map.output and stream.reduce.output with no issue, though it'd be really awesome to have a constant for it; I didn't realise you could pass it via the -io argument.

👍
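For reference, here's roughly how I've been setting it from mrjob; a minimal sketch, with typed bytes output for both phases just as an example:

```python
from mrjob.job import MRJob


class MRTypedBytesOutputJob(MRJob):
    """Minimal sketch: ask Hadoop Streaming for typed bytes output via jobconf."""

    JOBCONF = {
        'stream.map.output': 'typedbytes',
        'stream.reduce.output': 'typedbytes',
    }

    def mapper(self, key, value):
        yield key, value

    def reducer(self, key, values):
        for value in values:
            yield key, value


if __name__ == '__main__':
    MRTypedBytesOutputJob.run()
```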

@coyotemarin
Collaborator Author

Yeah, I've seen it mentioned in the Hadoop O'Reilly book but not elsewhere. It's actually really challenging to find information about non-text I/O with Hadoop Streaming. Any working example code (including example non-text input) you can share with me would be really helpful.

@tarnfeld
Contributor

Indeed... I wrote a custom OutputReader class and ended up just digging through the Hadoop source and inspecting jobconf options to figure out what was going on... pretty intense 👎


In any case, once I got the hang of it, it's pretty simple. The two options described above are strings passed into an object called an IdentifierResolver, which resolves an identifier (by default text, rawbytes or typedbytes) to a set of classes:

  • Key class
  • Value class
  • InputWriter class
  • OutputReader class
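From memory (so treat the exact class names as approximate), the built-in identifiers resolve to roughly this:

```python
# Approximate mapping, written from memory of the
# org.apache.hadoop.streaming.io classes; double-check against the source.
IDENTIFIER_TO_CLASSES = {
    'text': {
        'key/value': 'Text',
        'input_writer': 'TextInputWriter',
        'output_reader': 'TextOutputReader',
    },
    'rawbytes': {
        'key/value': 'BytesWritable',
        'input_writer': 'RawBytesInputWriter',
        'output_reader': 'RawBytesOutputReader',
    },
    'typedbytes': {
        'key/value': 'TypedBytesWritable',
        'input_writer': 'TypedBytesInputWriter',
        'output_reader': 'TypedBytesOutputReader',
    },
}
```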

I drew this flow diagram a while back; it describes the flow of records through the Hadoop Streaming code path. The K and V class types are defined by the object types returned by the identifier resolver.

See here and here to understand how the -io argument is translated; it's pretty trivial.

You'll see just below where the identifier resolver is also assigned and the actual classes listed above are resolved. I spent a long time trying to figure out why setting the jobconf options stream.{map,reduce}.input.writer.class and stream.{map,reduce}.output.reader.class did nothing; it turns out they get overwritten by the resolver.

If you want to implement a custom input writer/output reader (say, for instance, you wanted to read Protocol Buffer input and pass it to mrjob as a JSON blob), you need to go beyond the writer/reader jobconf options and plug in your own identifier resolver, since (as above) those options get overwritten by the resolver.

I hope that clears it all up! 😄

@tarnfeld
Contributor

My only concern about tying it to a protocol is that, if you're using one IO value (not text) across your whole project but several input protocols, you have to extend them all to implement the constant; and, more importantly, to use one protocol with several IO values you'd need several protocols. I'd prefer it was defined on the MRStep class, with the first step defaulting to a value on the job class.

@coyotemarin
Collaborator Author

Thanks for the detailed walkthrough of the Hadoop source!

Well, my plan was to allow protocols to specify default values for jobconf, input format, etc. that could be overridden in MRStep. Basically, I'd like to be able to write a TypedBytesProtocol that just works.

Since the -io flag is just a convenient way of setting four configuration properties, I'm thinking it doesn't offer any value over jobconf. I can imagine that we'd often want to mix up formats within a step (e.g. have a mapper read typed bytes from a sequence file as input, and output JSON as text like we normally do). It also doesn't really make sense for a protocol to control the key/value type for the entire step when it may only be used for one part of it, e.g. reading input for the mapper.

I think it would be useful to allow HADOOP_STREAM_INPUT and HADOOP_STREAM_OUTPUT fields on protocols, so that we could set one or two of the various stream.*.* configuration properties, depending on where the protocol was used. (These sound like terrible names now that I've typed them; better suggestions?)
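To make that concrete, a rough sketch; the field names and the helper function are just the proposal, nothing that exists in mrjob yet:

```python
# Sketch of the proposal only; none of this exists in mrjob yet.

class TypedBytesProtocol(object):
    # value to use for stream.*.input when this protocol reads a task's
    # input, and for stream.*.output when it writes a task's output
    HADOOP_STREAM_INPUT = 'typedbytes'
    HADOOP_STREAM_OUTPUT = 'typedbytes'


def mapper_jobconf(input_protocol, output_protocol):
    """Hypothetical helper: derive stream.map.* settings from the protocols
    actually used to read and write the mapper's records."""
    jobconf = {}
    if getattr(input_protocol, 'HADOOP_STREAM_INPUT', None):
        jobconf['stream.map.input'] = input_protocol.HADOOP_STREAM_INPUT
    if getattr(output_protocol, 'HADOOP_STREAM_OUTPUT', None):
        jobconf['stream.map.output'] = output_protocol.HADOOP_STREAM_OUTPUT
    return jobconf
```

That way a mapper could, say, read typed bytes from a sequence file but still write JSON as text, since the input and output sides are set independently.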

@coyotemarin
Collaborator Author

Hey, @tarnfeld, quick question. When you're using typedbytes or bytes, does Hadoop stick a newline after each record anyway? If so, that makes reading from a file simpler; there might still be newlines in the middle of records, but at least you know you'll hit a newline every so often.

@tarnfeld
Contributor

Sorry I've been a little absent from this discussion! Fair enough regarding the protocol; I'm all for that, so long as it's still configurable at the step (and therefore job) level. I'm not sure what better names would be, but those definitely feel a little misleading. Perhaps, in line with the jobconf options:

  • HADOOP_MAP_INPUT = "text"
  • HADOOP_MAP_OUTPUT = "typedbytes"
  • HADOOP_REDUCE_INPUT = "text"
  • HADOOP_REDUCE_OUTPUT = "typedbytes"
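Presumably those would map one-to-one onto the existing jobconf keys; a quick sketch (double-check the exact key names against the streaming docs):

```python
# Possible mapping of the proposed constants onto Hadoop Streaming jobconf
# keys (the names on the left are the proposal, not existing mrjob API).
STEP_IO_TO_JOBCONF = {
    'HADOOP_MAP_INPUT': 'stream.map.input',
    'HADOOP_MAP_OUTPUT': 'stream.map.output',
    'HADOOP_REDUCE_INPUT': 'stream.reduce.input',
    'HADOOP_REDUCE_OUTPUT': 'stream.reduce.output',
}
```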

@DavidMarin regarding your question about the newline: no, I don't believe it does (I've not actually used typedbytes, however). Looking at the code, the trailing newline is an artefact of the TextInputWriter; the TypedBytesInputWriter just seems to write the raw bytes as a continuous stream.

If so, that makes reading from a file simpler; there might still be newlines in the middle of records

If I'm honest, I'm not quite sure how you're supposed to detect the start and end of a record; perhaps there's something built into the typed bytes spec that outlines this? I assumed not, since (if I'm understanding it correctly) it's not a schema like Protocol Buffers, it's simply a serialisation format for strictly typed chunks of data...
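If it helps, my reading is that the values are self-delimiting: each one starts with a one-byte type code, and the payload is either fixed-size or length-prefixed, so a reader walks the stream rather than splitting on newlines. A very rough sketch (the type codes and layouts are written from memory, so double-check them against the spec):

```python
import struct

# Very rough sketch of reading one typed bytes value from a binary stream.
# Type codes and layouts are written from memory of the typed bytes spec;
# verify against the Hadoop source before relying on them.
TYPE_BYTES = 0    # 4-byte big-endian length, then raw bytes
TYPE_INT = 3      # 4-byte big-endian int
TYPE_LONG = 4     # 8-byte big-endian long
TYPE_STRING = 7   # 4-byte big-endian length, then UTF-8 bytes


def read_typed_bytes_value(stream):
    """Read a single self-delimiting typed bytes value (no newlines needed)."""
    code = stream.read(1)
    if not code:
        return None  # end of stream
    code = ord(code)
    if code == TYPE_BYTES:
        (length,) = struct.unpack('>i', stream.read(4))
        return stream.read(length)
    if code == TYPE_INT:
        return struct.unpack('>i', stream.read(4))[0]
    if code == TYPE_LONG:
        return struct.unpack('>q', stream.read(8))[0]
    if code == TYPE_STRING:
        (length,) = struct.unpack('>i', stream.read(4))
        return stream.read(length).decode('utf-8')
    raise ValueError('unsupported type code: %d' % code)
```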

@coyotemarin
Collaborator Author

Thinking that the -io flag is too coarse for most cases, and that the best approach is to integrate input/output type with the protocols (see #808).

@tarnfeld
Contributor

👍

