Feature: Make hadoop input format configurable for batch ingestion #1177

Merged (1 commit into master, Mar 19, 2015)

Conversation

himanshug
Contributor

Our Hadoop batch pipelines mostly store data using the Avro input format. Currently we need to run additional jobs just to convert Avro data into text (CSV/JSON/TSV) so that it can be ingested into Druid. This is really not necessary. We can save those Hadoop grid resources and additional jobs by letting users specify their own Hadoop input format for batch ingestion into Druid.

This patch enables an additional parameter inside the Hadoop ioConfig, as follows:

 "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/MyDirectory/examples/indexing/avrodata",
      "inputFormat": "com.xyz.hadoop.MyAvroToTextInputFormat"
    }

The given input format must conform to the following assumptions (a rough sketch follows the list):

  1. It reads the input data location from "mapreduce.input.fileinputformat.inputdir" in the job configuration.
  2. It must extend "org.apache.hadoop.mapreduce.InputFormat".
  3. The record key is ignored; the value is assumed to carry the data.
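
For illustration only (this is not code from the PR), here is a rough sketch of an input format meeting those assumptions for a plain-text source with one record per line. The class name MyTextValueInputFormat is made up; FileInputFormat is used because, on Hadoop 2, it already resolves its paths from "mapreduce.input.fileinputformat.inputdir", and it extends org.apache.hadoop.mapreduce.InputFormat transitively.

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical example: each value is one line of text, the key is unused.
    public class MyTextValueInputFormat extends FileInputFormat<NullWritable, Text>
    {
      @Override
      public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
      {
        // Delegate the actual reading to Hadoop's line reader and expose only the value.
        final LineRecordReader lines = new LineRecordReader();

        return new RecordReader<NullWritable, Text>()
        {
          @Override
          public void initialize(InputSplit s, TaskAttemptContext c) throws IOException
          {
            lines.initialize(s, c);
          }

          @Override
          public boolean nextKeyValue() throws IOException
          {
            return lines.nextKeyValue();
          }

          @Override
          public NullWritable getCurrentKey()
          {
            return NullWritable.get(); // the key is ignored (assumption 3)
          }

          @Override
          public Text getCurrentValue()
          {
            return lines.getCurrentValue(); // the value carries the record data
          }

          @Override
          public float getProgress() throws IOException
          {
            return lines.getProgress();
          }

          @Override
          public void close() throws IOException
          {
            lines.close();
          }
        };
      }
    }

Extending FileInputFormat rather than InputFormat directly gives split calculation and path handling for free, which is why most custom formats take this route.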

@xvrl xvrl added the Discuss label Mar 10, 2015
@himanshug
Contributor Author

I have updated the code so that the configured InputFormat can produce either Text or MapWritable records.

@xvrl
Member

xvrl commented Mar 10, 2015

@himanshug do you mind just shooting a quick email to the dev mailing list regarding this proposal? We can direct discussion to the PR, but we would ultimately like the mailing list to be the single point of entry for outside contributors.

@himanshug
Contributor Author

@xvrl sure, will do that.

//Note: this hack is required because StringInputRowParser is a
//InputRowParser<ByteBuffer> and not InputRowParser<String>
//parse() method on it expects a ByteBuffer and not String
ByteBuffer bb = ByteBuffer.wrap(value.toString().getBytes());
Contributor

Please specify the charset in getBytes (UTF-8 is the general standard).

EDIT: actually, better yet, use the java-utils StringUtils converter to UTF-8.

Edit again: just as a sanity check, what does parser.parse(...) expect?

Contributor

I think you can avoid copying by doing ByteBuffer.wrap(value.getBytes(), 0, value.getLength()). Hadoop Texts are internally just UTF-8 byte arrays with some valid length.
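
A rough sketch of that suggestion, for illustration only (value and parser stand for the variables in the surrounding, unshown code; per the note above, parse() takes a ByteBuffer):

    // Wrap the Text's backing array directly: Text stores UTF-8 bytes internally,
    // and getLength() bounds the valid region, so no copy or charset conversion is needed.
    ByteBuffer bb = ByteBuffer.wrap(value.getBytes(), 0, value.getLength());
    InputRow row = parser.parse(bb);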

Contributor

StringInputRowParser expects UTF-8. Unless it is guaranteed in the spec somehow that text bytes are UTF-8 encoded, I would prefer we do it this way but convert to UTF-8 so that we are absolutely sure we're passing text formats correctly.

Contributor

Oooohhh! nice, org.apache.hadoop.io.Text does explicitly state they are UTF-8 encoded. So yes, I think @gianm is right in that we can simply wrap the bytes directly.

Contributor Author

@gianm @drcrallen
I actually changed the code to look like
"((StringInputRowParser) parser).parse(value.toString());"
It is like the old code and doesn't need the byte-buffer wrapping business.

@himanshug
Contributor Author

I have updated the code to allow option #3: users can supply an arbitrary InputFormat, and it will work as long as they implement and configure an appropriate InputRowParser implementation.
All of this is fully backward compatible.

Thanks for the feedback. I had put this code up to get feedback on the idea and the high-level implementation. I will now test it with a custom input format, update the code for the various review comments, etc.
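
For readers following along, a rough sketch of the kind of conversion such a custom InputRowParser would perform. This is not the PR's MapWritableInputRowParser; the class name, the "timestamp" column name, and the dimension handling are assumptions, and a real parser would also implement the full InputRowParser interface rather than a standalone helper.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Writable;
    import org.joda.time.DateTime;

    import io.druid.data.input.InputRow;
    import io.druid.data.input.MapBasedInputRow;

    // Hypothetical helper: turns one Hadoop MapWritable record into a Druid InputRow.
    public class MapWritableRecordConverter
    {
      public InputRow toInputRow(MapWritable value)
      {
        // Copy the Writable key/value pairs into a plain Java map.
        Map<String, Object> event = new HashMap<>();
        for (Map.Entry<Writable, Writable> e : value.entrySet()) {
          event.put(e.getKey().toString(), e.getValue().toString());
        }

        // Pull the timestamp column and treat every other column as a dimension.
        DateTime timestamp = new DateTime(event.get("timestamp"));
        List<String> dimensions = new ArrayList<>(event.keySet());
        dimensions.remove("timestamp");

        return new MapBasedInputRow(timestamp, dimensions, event);
      }
    }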

@drcrallen
Contributor

Similar functionality (re-indexing existing data without retaining the original data) is possibly made available via #1190, though not in a Hadoop manner per se.

@@ -37,16 +39,21 @@
@JsonProperty("paths")
public String paths;

@JsonProperty("inputFormat")
private Class<? extends InputFormat> inputFormat;
Contributor

Usually we add final for things that don't have setters.
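
For illustration, one common Jackson pattern that keeps such fields final is constructor injection via @JsonCreator; the class name below is assumed rather than taken from the PR's diff.

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import org.apache.hadoop.mapreduce.InputFormat;

    // Hypothetical name for the class shown in the diff above.
    public class StaticPathSpec
    {
      @JsonProperty("paths")
      private final String paths;

      @JsonProperty("inputFormat")
      private final Class<? extends InputFormat> inputFormat;

      @JsonCreator
      public StaticPathSpec(
          @JsonProperty("paths") String paths,
          @JsonProperty("inputFormat") Class<? extends InputFormat> inputFormat
      )
      {
        // Jackson populates both fields through this constructor, so no setters are needed.
        this.paths = paths;
        this.inputFormat = inputFormat;
      }
    }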

@drcrallen
Contributor

I think this is a good PR overall. Is it possible to add any unit tests for this?

@xvrl
Member

xvrl commented Mar 12, 2015

@himanshug would you mind adding some serde tests for the different pathsSpec classes?
Otherwise it looks good to me.

@xvrl xvrl removed the Discuss label Mar 12, 2015
@himanshug
Contributor Author

@xvrl @drcrallen sure, will update soon.

@himanshug
Contributor Author

@xvrl @cheddar I'm trying to add support for more InputRowParsers inside indexing-hadoop.
I have added a class called MapWritableInputRowParser which can parse a MapWritable into an InputRow.

Please see MapWritableInputRowParserTest.testDeserialization(); currently it is failing. One way I know of fixing it is to make indexing-hadoop a Druid extension module,
that is, create a DruidModule implementation in indexing-hadoop, provide the appropriate Jackson Module from the getJacksonModule() impl, and create an io.druid.initialization.DruidModule file in indexing-hadoop/..METAINF/..

Is there any other, smarter way to do it? Maybe just adding some annotations on the MapWritableInputRowParser class?

@himanshug
Contributor Author

Made indexing-hadoop a Druid extension to register MapWritableInputRowParser as a subtype of InputRowParser inside the Jackson framework.
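
For readers unfamiliar with the extension mechanism, a rough sketch of that registration follows. The module class name and the "mapWritableParser" type key are assumptions, and the exact DruidModule method names may differ slightly from this sketch.

    import java.util.List;

    import com.fasterxml.jackson.databind.Module;
    import com.fasterxml.jackson.databind.jsontype.NamedType;
    import com.fasterxml.jackson.databind.module.SimpleModule;
    import com.google.common.collect.ImmutableList;
    import com.google.inject.Binder;

    import io.druid.initialization.DruidModule;

    // Hypothetical extension module; MapWritableInputRowParser is the parser class added in this PR.
    public class IndexingHadoopDruidModule implements DruidModule
    {
      @Override
      public List<? extends Module> getJacksonModules()
      {
        // Register MapWritableInputRowParser as a named subtype of InputRowParser
        // so Jackson can deserialize parser specs with this type key.
        return ImmutableList.of(
            new SimpleModule("IndexingHadoopModule")
                .registerSubtypes(new NamedType(MapWritableInputRowParser.class, "mapWritableParser"))
        );
      }

      @Override
      public void configure(Binder binder)
      {
        // No Guice bindings are needed for this module.
      }
    }

The module class would then be listed in the io.druid.initialization.DruidModule services file under META-INF, as described in the comment above, so Druid's extension loading can discover it.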

…ven input format must extend from org.apache.hadoop.mapreduce.InputFormat
@himanshug
Contributor Author

@xvrl this PR is complete and ready for merging. Can you please take a look?

@himanshug himanshug changed the title Proposal: feature to make hadoop input format configurable for batch ingestion Feature: Make hadoop input format configurable for batch ingestion Mar 18, 2015
@himanshug
Contributor Author

Closing/reopening the PR to trigger the build.

@himanshug himanshug closed this Mar 19, 2015
@himanshug himanshug reopened this Mar 19, 2015
@himanshug
Contributor Author

@xvrl the build is failing for an unrelated reason. Can you please merge this PR?

@xvrl
Member

xvrl commented Mar 19, 2015

@cheddar I haven't had a chance to look at the latest changes, if that looks good to you, I'll merge.


@xvrl xvrl added this to the 0.7.1 milestone Mar 19, 2015
@xvrl xvrl added the Feature label Mar 19, 2015
@cheddar
Contributor

cheddar commented Mar 19, 2015

Generally looks good to me. I think there is room to move the parsing entirely into the InputFormat and just pass an InputRow into the MR jobs, but I don't think there's too much value in holding this up for that.

xvrl added a commit that referenced this pull request Mar 19, 2015
Feature:  Make hadoop input format configurable for batch ingestion
@xvrl xvrl merged commit a98187f into apache:master Mar 19, 2015