Feature: Make hadoop input format configurable for batch ingestion #1177
Conversation
I have updated the code so that the configured InputFormat can give either Text or MapWritable records.
@himanshug do you mind just shooting a quick email to the dev mailing list regarding this proposal? We can direct discussion to the PR, but we would ultimately like the mailing list to be the single point of entry for outside contributors.
@xvrl sure, will do that.
//Note: this hack is required because StringInputRowParser is a
//InputRowParser<ByteBuffer> and not InputRowParser<String>
//parse() method on it expects a ByteBuffer and not String
ByteBuffer bb = ByteBuffer.wrap(value.toString().getBytes());
Please specify the charset in getBytes (UTF-8 is the general standard).
EDIT: actually, better yet, use the java-utils StringUtils converter to UTF-8.
Edit again: just as a sanity check, what does parser.parse(...) expect?
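For illustration, a minimal sketch of the charset-explicit variant being asked for here; the commented-out alternative assumes the java-util helper com.metamx.common.StringUtils.toUtf8 referred to above is on the classpath:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.Text;

class Utf8WrapSketch
{
  static ByteBuffer wrap(Text value)
  {
    // explicit charset instead of relying on the platform default
    return ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
    // or, using the java-util helper mentioned above:
    // return ByteBuffer.wrap(StringUtils.toUtf8(value.toString()));
  }
}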
I think you can avoid copying by doing ByteBuffer.wrap(value.getBytes(), 0, value.getLength()). Hadoop Texts are internally just UTF-8 byte arrays with some valid length.
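A sketch of that zero-copy variant, assuming value is an org.apache.hadoop.io.Text:

import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;

class ZeroCopyWrapSketch
{
  static ByteBuffer wrap(Text value)
  {
    // Text.getBytes() exposes the internal backing array, which may be longer
    // than the valid payload, so the buffer must be limited to getLength() bytes
    return ByteBuffer.wrap(value.getBytes(), 0, value.getLength());
  }
}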
StringInputRowParser expects UTF-8. Unless the spec somehow guarantees that Text bytes are UTF-8 encoded, I would prefer we do it this way but convert to UTF-8, so that we are absolutely sure we're passing text formats correctly.
Oooohhh! nice, org.apache.hadoop.io.Text does explicitly state they are UTF-8 encoded. So yes, I think @gianm is right in that we can simply wrap the bytes directly.
@gianm @drcrallen I actually changed the code to look like "((StringInputRowParser) parser).parse(value.toString());". It is like the old code and wouldn't need the byte-buffer wrapping business.
I have updated the code to allow option #3, which is that users can supply an arbitrary InputFormat, and it will work as long as they implement and configure an appropriate InputRowParser implementation (see the sketch below). Thanks for the feedback. I had put this code up to get feedback on the idea and the high-level implementation. I will now test it with a custom input format, update the code for the various review comments, etc.
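A hedged skeleton of the kind of parser a custom InputFormat could be paired with under option #3, converting a MapWritable record; the class name, the timestampColumn parameter, and the method shape are illustrative assumptions, not code from this PR:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

import io.druid.data.input.InputRow;
import io.druid.data.input.MapBasedInputRow;

public class ExampleMapWritableParser
{
  // convert one Hadoop MapWritable record into a druid InputRow
  public InputRow parse(MapWritable record, String timestampColumn)
  {
    Map<String, Object> event = new HashMap<>();
    for (Map.Entry<Writable, Writable> e : record.entrySet()) {
      event.put(e.getKey().toString(), e.getValue().toString());
    }
    long timestamp = Long.parseLong(String.valueOf(event.get(timestampColumn)));
    List<String> dimensions = new ArrayList<>(event.keySet());
    dimensions.remove(timestampColumn);
    return new MapBasedInputRow(timestamp, dimensions, event);
  }
}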
Similar functionality (re-indexing existing data without retaining the original data) is possibly made available via #1190, though not in a hadoop manner per se.
@@ -37,16 +39,21 @@
  @JsonProperty("paths")
  public String paths;

  @JsonProperty("inputFormat")
  private Class<? extends InputFormat> inputFormat;
Usually we add final for things that don't have setters.
I think this is a good PR overall. Is it possible to add any unit tests for this?
@himanshug would you mind adding some serde tests for the different pathSpec classes?
@xvrl @drcrallen sure, will update soon.
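A minimal sketch of the kind of serde round-trip test being requested, assuming the test lives alongside the pathSpec classes and that Jackson can resolve the "static" subtype via the annotations on PathSpec; the class names StaticPathSpec/PathSpec and the property values follow the diff above, everything else is assumed:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.Assert;
import org.junit.Test;

public class StaticPathSpecSerdeTest
{
  private final ObjectMapper mapper = new ObjectMapper();

  @Test
  public void testSerde() throws Exception
  {
    String json = "{\"type\": \"static\", \"paths\": \"/tmp/foo\", "
                  + "\"inputFormat\": \"org.apache.hadoop.mapreduce.lib.input.TextInputFormat\"}";

    // deserialize, re-serialize, and deserialize again to exercise both directions
    PathSpec spec = mapper.readValue(json, PathSpec.class);
    PathSpec roundTripped = mapper.readValue(mapper.writeValueAsString(spec), PathSpec.class);

    Assert.assertEquals("/tmp/foo", ((StaticPathSpec) roundTripped).paths);
  }
}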
@xvrl @cheddar I'm trying to add setup for more InputRowParsers inside indexing-hadoop. Please see MapWritableInputRowParserTest.testDeserialization(); currently it is failing. One way I know of fixing it is to make indexing-hadoop a druid extension module. Is there any other, smarter way to do it? Maybe just adding some annotations on the MapWritableInputRowParser class?
Made indexing-hadoop a druid extension to register MapWritableInputRowParser as a subtype of InputRowParser inside the Jackson framework.
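For illustration, the extension-module pattern described here usually boils down to registering a named Jackson subtype from a DruidModule; the module class name and the "mapWritableParser" type key are assumptions, not the PR's actual code:

import java.util.Arrays;
import java.util.List;

import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.inject.Binder;

import io.druid.initialization.DruidModule;

public class IndexingHadoopModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    // make Jackson aware of the new parser under a "type" name it can match
    return Arrays.<Module>asList(
        new SimpleModule("IndexingHadoopModule")
            .registerSubtypes(new NamedType(MapWritableInputRowParser.class, "mapWritableParser"))
    );
  }

  @Override
  public void configure(Binder binder)
  {
    // nothing to bind; this module only registers Jackson subtypes
  }
}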
@xvrl this PR is complete and ready for merging. Can you please take a look?
Closing/reopening the PR to trigger the build.
@xvrl the build is failing for an unrelated reason. Can you please merge this PR?
@cheddar I haven't had a chance to look at the latest changes; if that looks good to you, I'll merge.
Generally looks good to me. I think there is room to move the parsing entirely into the InputFormat and just pass an …
Feature: Make hadoop input format configurable for batch ingestion
Our hadoop batch pipelines mostly store data using the avro input format. Currently we need to run additional jobs just to convert avro data into text (csv/json/tsv) so that it can be ingested into druid. This is really not necessary. We can save those hadoop grid resources and additional jobs by letting users specify their own hadoop input format for batch ingestion into druid.
This patch enables an additional parameter inside the hadoop ioConfig, as follows.
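The original config example is not reproduced here; as an illustrative sketch, with the inputFormat property name taken from the diff above and everything else (the static inputSpec, path, and class value) assumed:

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "paths": "/example/path/on/hdfs",
    "inputFormat": "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat"
  }
}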
The given input format must conform to a few assumptions; in particular, it must extend from org.apache.hadoop.mapreduce.InputFormat.