
hadoop_io step field #723

Closed
coyotemarin opened this issue Sep 3, 2013 · 10 comments

@coyotemarin
Collaborator

Individual steps should be able to specify a value to pass to Hadoop with the -io flag (e.g. text, rawbytes, typedbytes).

New-style protocols should have an optional HADOOP_IO field that we can use to infer this (so, for example, a TypedBytesProtocol could imply -io typedbytes).
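Roughly the shape I have in mind (`HADOOP_IO` and `TypedBytesProtocol` are hypothetical here, just to illustrate the proposal):

```python
# Hypothetical sketch only: neither HADOOP_IO nor TypedBytesProtocol exists
# in mrjob today. The idea is that a runner could inspect a step's protocol
# and add `-io typedbytes` (or the equivalent jobconf) to the streaming
# command line for that step.

class TypedBytesProtocol(object):
    """Illustrative protocol that would imply -io typedbytes."""

    HADOOP_IO = 'typedbytes'  # proposed field; value passed straight to -io

    def read(self, line):
        # decoding of a typed bytes record into (key, value) would go here
        raise NotImplementedError

    def write(self, key, value):
        # encoding of (key, value) into typed bytes would go here
        raise NotImplementedError
```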

@coyotemarin
Collaborator Author

There's an equivalent jobconf, so maybe we don't need to add an extra option?

@tarnfeld
Contributor

tarnfeld commented Dec 9, 2013

I've been using the JOBCONF options stream.map.output and stream.reduce.output with no issue, though it'd be really awesome to have a constant for it; I didn't realise you could pass it via the -io argument.

👍
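For reference, here's roughly how I've been setting it from mrjob; a minimal sketch, with typed bytes output for both phases just as an example:

```python
from mrjob.job import MRJob


class MRTypedBytesOutputJob(MRJob):
    """Minimal sketch: ask Hadoop Streaming for typed bytes output via jobconf."""

    JOBCONF = {
        'stream.map.output': 'typedbytes',
        'stream.reduce.output': 'typedbytes',
    }

    def mapper(self, key, value):
        yield key, value

    def reducer(self, key, values):
        for value in values:
            yield key, value


if __name__ == '__main__':
    MRTypedBytesOutputJob.run()
```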

@coyotemarin
Collaborator Author

Yeah, I've seen it mentioned in the Hadoop O'Reilly book but not elsewhere. It's actually really challenging to find information about non-text I/O with Hadoop Streaming. Any working example code (including example non-text input) you can share with me would be really helpful.

@tarnfeld
Contributor

Indeed... I wrote a custom OutputReader class and ended up just digging through the Hadoop source and inspecting jobconf options to figure out what was going on... pretty intense 👎


In any case, once I got the hang of it, it's pretty simple. The two options described above are strings passed into an object called an IdentifierResolver, which resolves an identifier (by default text, rawbytes or typedbytes) to a set of classes:

  • Key class
  • Value class
  • InputWriter class
  • OutputReader class
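From memory (so treat the exact class names as approximate), the built-in identifiers resolve to roughly this:

```python
# Approximate mapping, written from memory of the
# org.apache.hadoop.streaming.io classes; double-check against the source.
IDENTIFIER_TO_CLASSES = {
    'text': {
        'key/value': 'Text',
        'input_writer': 'TextInputWriter',
        'output_reader': 'TextOutputReader',
    },
    'rawbytes': {
        'key/value': 'BytesWritable',
        'input_writer': 'RawBytesInputWriter',
        'output_reader': 'RawBytesOutputReader',
    },
    'typedbytes': {
        'key/value': 'TypedBytesWritable',
        'input_writer': 'TypedBytesInputWriter',
        'output_reader': 'TypedBytesOutputReader',
    },
}
```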

I drew this flow diagram a while back; it describes the flow of records through the Hadoop Streaming code path. The K and V class types are defined by the object types returned by the identifier resolver.

See here and here to understand how the -io argument is translated; it's pretty trivial.

You'll see just below where the identifier resolver is also assigned and the actual classes listed above are resolved. I spent a long time trying to figure out why setting the jobconf options stream.{map,reduce}.input.writer.class and stream.{map,reduce}.output.reader.class did nothing; it turns out they get overwritten by the resolver.

If you want to implement a custom input writer/output reader (say, for instance, you wanted to read Protocol Buffer input and pass it to mrjob as a JSON blob), you need to go beyond the writer/reader jobconf options and plug in your own identifier resolver, since (as above) those options get overwritten by the resolver.

I hope that clears it all up! 😄

@tarnfeld
Contributor

My only concern about tying it to a protocol is that, if you're using one IO value (not text) across your whole project but several input protocols, you have to extend them all to implement the constant; and, more importantly, to use one protocol with several IO values you'd need several protocols. I'd prefer it was defined on the MRStep class, with the first step defaulting to a value on the job class.

@coyotemarin
Collaborator Author

Thanks for the detailed walkthrough of the Hadoop source!

Well, my plan was to allow protocols to specify default values for jobconf, input format, etc. that could be overridden in MRStep. Basically, I'd like to be able to write a TypedBytesProtocol that just works.

Since the -io flag is just a convenient way of setting four configuration properties, I'm thinking it doesn't offer any value over jobconf. I can imagine that we'd often want to mix up formats within a step (e.g. have a mapper read typed bytes from a sequence file as input, and output JSON as text like we normally do). It also doesn't really make sense for a protocol to control the key/value type for the entire step when it may only be used for one part of it, e.g. reading input for the mapper.

I think it would be useful to allow HADOOP_STREAM_INPUT and HADOOP_STREAM_OUTPUT fields on protocols, so that we could set one or two of the various stream.*.* configuration properties, depending on where the protocol was used. (These sound like terrible names now that I've typed them; better suggestions?)
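To make that concrete, a rough sketch; the field names and the helper function are just the proposal, nothing that exists in mrjob yet:

```python
# Sketch of the proposal only; none of this exists in mrjob yet.

class TypedBytesProtocol(object):
    # value to use for stream.*.input when this protocol reads a task's
    # input, and for stream.*.output when it writes a task's output
    HADOOP_STREAM_INPUT = 'typedbytes'
    HADOOP_STREAM_OUTPUT = 'typedbytes'


def mapper_jobconf(input_protocol, output_protocol):
    """Hypothetical helper: derive stream.map.* settings from the protocols
    actually used to read and write the mapper's records."""
    jobconf = {}
    if getattr(input_protocol, 'HADOOP_STREAM_INPUT', None):
        jobconf['stream.map.input'] = input_protocol.HADOOP_STREAM_INPUT
    if getattr(output_protocol, 'HADOOP_STREAM_OUTPUT', None):
        jobconf['stream.map.output'] = output_protocol.HADOOP_STREAM_OUTPUT
    return jobconf
```

That way a mapper could, say, read typed bytes from a sequence file but still write JSON as text, since the input and output sides are set independently.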

@coyotemarin
Collaborator Author

Hey, @tarnfeld, quick question. When you're using typedbytes or bytes, does Hadoop stick a newline after each record anyway? If so, that makes reading from a file simpler; there might still be newlines in the middle of records, but at least you know you'll hit a newline every so often.

@tarnfeld
Contributor

Sorry I've been a little absent from this discussion! Fair enough regarding the protocol; I'm all for that, so long as it's still configurable at the step (and therefore job) level. I'm not sure what better names would be, but those definitely feel a little misleading. Perhaps, in line with the jobconf options:

  • HADOOP_MAP_INPUT = "text"
  • HADOOP_MAP_OUTPUT = "typedbytes"
  • HADOOP_REDUCE_INPUT = "text"
  • HADOOP_REDUCE_OUTPUT = "typedbytes"
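Presumably those would map one-to-one onto the existing jobconf keys; a quick sketch (double-check the exact key names against the streaming docs):

```python
# Possible mapping of the proposed constants onto Hadoop Streaming jobconf
# keys (the names on the left are the proposal, not existing mrjob API).
STEP_IO_TO_JOBCONF = {
    'HADOOP_MAP_INPUT': 'stream.map.input',
    'HADOOP_MAP_OUTPUT': 'stream.map.output',
    'HADOOP_REDUCE_INPUT': 'stream.reduce.input',
    'HADOOP_REDUCE_OUTPUT': 'stream.reduce.output',
}
```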

@DavidMarin regarding your question about the newline: no, I don't believe it does (I've not actually used typedbytes, however). Looking at the code, the trailing newline is an artefact of the TextInputWriter; the TypedBytesInputWriter just seems to write the raw bytes as a continuous stream.

If so, that makes reading from a file simpler; there might still be newlines in the middle of records

If I'm honest, I'm not quite sure how you're supposed to detect the start and end of a record; perhaps there's something built into the typed bytes spec that outlines this? I assumed not, since (if I'm understanding it correctly) it's not a schema like Protocol Buffers, it's simply a serialisation format for strictly typed chunks of data...
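If it helps, my reading is that the values are self-delimiting: each one starts with a one-byte type code, and the payload is either fixed-size or length-prefixed, so a reader walks the stream rather than splitting on newlines. A very rough sketch (the type codes and layouts are written from memory, so double-check them against the spec):

```python
import struct

# Very rough sketch of reading one typed bytes value from a binary stream.
# Type codes and layouts are written from memory of the typed bytes spec;
# verify against the Hadoop source before relying on them.
TYPE_BYTES = 0    # 4-byte big-endian length, then raw bytes
TYPE_INT = 3      # 4-byte big-endian int
TYPE_LONG = 4     # 8-byte big-endian long
TYPE_STRING = 7   # 4-byte big-endian length, then UTF-8 bytes


def read_typed_bytes_value(stream):
    """Read a single self-delimiting typed bytes value (no newlines needed)."""
    code = stream.read(1)
    if not code:
        return None  # end of stream
    code = ord(code)
    if code == TYPE_BYTES:
        (length,) = struct.unpack('>i', stream.read(4))
        return stream.read(length)
    if code == TYPE_INT:
        return struct.unpack('>i', stream.read(4))[0]
    if code == TYPE_LONG:
        return struct.unpack('>q', stream.read(8))[0]
    if code == TYPE_STRING:
        (length,) = struct.unpack('>i', stream.read(4))
        return stream.read(length).decode('utf-8')
    raise ValueError('unsupported type code: %d' % code)
```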

@coyotemarin
Collaborator Author

Thinking that the -io flag is too coarse for most cases, and that the best approach is to integrate input/output type with the protocols (see #808).

@tarnfeld
Contributor

👍

