hadoop_io step field #723
Comments
There's an equivalent
I've been using the 👍
Yeah, I've seen it mentioned in the Hadoop O'Reilly book but not elsewhere. It's actually really challenging to find information about non-text I/O with Hadoop Streaming. Any working example code (including example non-text input) you can share with me would be really helpful.
Indeed... I wrote a custom

In any case, once I got the hang of it, it's pretty simple. The two options described above are strings passed into an object called an IdentityResolver which resolves a string (by default

I drew this flow diagram a while back, which describes the flow of records through the hadoop streaming code path, the

See here and here to understand how the

You'll see just below where the identity resolver is also assigned and the actual classes listed above are resolved.

I spent a long time trying to figure out why setting the jobconf options

If you want to implement a custom input writer/output reader (I don't know, say for instance you wanted to read Protocol Buffer input and pass it to MrJob as a JSON blob) you need to:
I hope that clears it all up! 😄
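For anyone following along later: conceptually, the `-io` value is just a key that selects a matched input-writer/output-reader pair, with text as the fallback. Here's a rough Python analogy of that resolution step (the class names below are invented purely for illustration; the real logic lives in Hadoop Streaming's Java code, not in mrjob):

```python
# Rough Python analogy of how an "-io" identifier selects reader/writer
# classes. These class names are invented for illustration only; they are
# not real mrjob or Hadoop Streaming APIs.

class TextInputWriter(object): pass
class TextOutputReader(object): pass
class RawBytesInputWriter(object): pass
class RawBytesOutputReader(object): pass
class TypedBytesInputWriter(object): pass
class TypedBytesOutputReader(object): pass

_IO_REGISTRY = {
    'text': (TextInputWriter, TextOutputReader),
    'rawbytes': (RawBytesInputWriter, RawBytesOutputReader),
    'typedbytes': (TypedBytesInputWriter, TypedBytesOutputReader),
}

def resolve(identifier='text'):
    """Map an -io identifier to an (input writer, output reader) pair."""
    try:
        return _IO_REGISTRY[identifier]
    except KeyError:
        raise ValueError('unknown -io value: %r' % (identifier,))
```

Supporting a new format then boils down to registering another pair under a new identifier, which is essentially the "implement a custom input writer/output reader" route described above.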
My only concern about it being tied to a protocol is that, if you're using one IO value across your whole project (not |
Thanks for the detailed walkthrough of the Hadoop source!

Well, my plan was to allow protocols to specify default values for jobconf, input format, etc. that could be overridden in

Since the

I think it would be useful to allow
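If it helps make that plan concrete, a minimal sketch of "protocols carry defaults, steps override them" might look like the following. The `DEFAULT_JOBCONF`/`DEFAULT_INPUT_FORMAT` attribute names and the merge helper are hypothetical, not existing mrjob API:

```python
# Hypothetical sketch only: DEFAULT_JOBCONF / DEFAULT_INPUT_FORMAT are not
# real mrjob protocol attributes; they illustrate the "protocol supplies
# defaults, step-level settings win" idea discussed above.

class SomeBinaryProtocol(object):
    DEFAULT_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileInputFormat'
    DEFAULT_JOBCONF = {}

    def read(self, line):
        raise NotImplementedError

    def write(self, key, value):
        raise NotImplementedError

def effective_jobconf(protocol, step_jobconf=None):
    """Merge protocol-level defaults with step-level overrides (step wins)."""
    merged = dict(getattr(protocol, 'DEFAULT_JOBCONF', {}))
    merged.update(step_jobconf or {})
    return merged
```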
Hey, @tarnfield, quick question. When you're using |
Sorry I've been a little absent from this discussion! Fair enough regarding the protocol, I'm all for that – so long as it's still configurable on a step (and therefore job) level. I'm not sure about the names, but those definitely feel a little misleading. Perhaps, in line with the jobconf options...
@DavidMarin regarding your question about the newline, no I don't believe it does (I've not actually used
If I'm honest, I'm not quite sure how you're supposed to detect the start and end of the record; perhaps there's something built into the typed bytes spec that outlines this? I assumed not, since it's not a schema (something like Protocol Buffers); if I'm understanding it correctly, it's simply a serialisation format for strictly typed chunks of data...
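For what it's worth, typed bytes is self-delimiting, so there's no separate record boundary to detect: every value starts with a one-byte type code, and each type is either fixed-size or carries its own length prefix, so a reader always knows how many bytes the current value occupies. Streaming input is then just key, value, key, value decoded back to back. A minimal Python sketch of that framing, based on the typed bytes spec and covering only a few type codes (no vectors, lists, or maps):

```python
import struct

def read_typedbytes_value(f):
    """Read one typed bytes value from a binary file-like object.

    Each value is a one-byte type code followed by a fixed-size or
    length-prefixed payload. Returns None at end of stream.
    """
    code = f.read(1)
    if not code:
        return None  # clean end of stream
    code = ord(code)

    if code == 0:                       # raw bytes: 4-byte length + payload
        (length,) = struct.unpack('>i', f.read(4))
        return f.read(length)
    elif code == 2:                     # boolean: single byte
        return f.read(1) != b'\x00'
    elif code == 3:                     # 32-bit integer, big-endian
        return struct.unpack('>i', f.read(4))[0]
    elif code == 4:                     # 64-bit long, big-endian
        return struct.unpack('>q', f.read(8))[0]
    elif code == 6:                     # 64-bit double
        return struct.unpack('>d', f.read(8))[0]
    elif code == 7:                     # UTF-8 string: 4-byte length + bytes
        (length,) = struct.unpack('>i', f.read(4))
        return f.read(length).decode('utf-8')
    else:
        raise ValueError('type code %d not handled in this sketch' % code)

def read_pairs(f):
    """Yield (key, value) pairs; streaming input alternates key, value."""
    while True:
        key = read_typedbytes_value(f)
        if key is None:
            return
        yield key, read_typedbytes_value(f)
```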
Thinking that the |
👍
|
Individual steps should be able to specify a value to pass to Hadoop with the `-io` flag (e.g. `text`, `rawbytes`, `typedbytes`). New-style protocols should have an optional `HADOOP_IO` field that we can use to infer this (so, for example, a `TypedBytesProtocol` could imply `-io typedbytes`).
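A sketch of how that inference might work if this lands; `HADOOP_IO` and the step-level `hadoop_io` value are the names proposed in this issue, not existing mrjob API:

```python
# Hypothetical: HADOOP_IO is the proposed protocol attribute from this
# issue; nothing here is existing mrjob API.

class TypedBytesProtocol(object):
    HADOOP_IO = 'typedbytes'   # proposed field: implies "-io typedbytes"

def hadoop_io_args(protocol, step_hadoop_io=None):
    """Build the extra Hadoop Streaming args for a step.

    An explicit step-level hadoop_io value wins; otherwise fall back to
    whatever the protocol advertises; otherwise add nothing (text is the
    Hadoop Streaming default).
    """
    value = step_hadoop_io or getattr(protocol, 'HADOOP_IO', None)
    return ['-io', value] if value else []

# e.g. hadoop_io_args(TypedBytesProtocol()) -> ['-io', 'typedbytes']
```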