Feature: Make hadoop input format configurable for batch ingestion #1177

Merged (1 commit into master, Mar 19, 2015)

Conversation

himanshug
Contributor

Our Hadoop batch pipelines mostly store data using the Avro input format. Currently we need to run additional jobs just to convert Avro data into text (CSV/JSON/TSV) so that it can be ingested into Druid. This is really not necessary. We can save those Hadoop grid resources and additional jobs by letting users specify their own Hadoop input format for batch ingestion into Druid.

This patch enables an additional parameter inside the Hadoop ioConfig, as follows:

 "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/MyDirectory/examples/indexing/avrodata",
      "inputFormat": "com.xyz.hadoop.MyAvroToTextInputFormat"
    }

The given input format must conform to the following assumptions (a rough sketch follows the list):

  1. It reads the input data location from "mapreduce.input.fileinputformat.inputdir" in the job configuration.
  2. It must extend "org.apache.hadoop.mapreduce.InputFormat".
  3. The record key is ignored; the value is assumed to carry the data.
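
For illustration only (this is not code from the PR), here is a rough sketch of an input format meeting those assumptions for a plain-text source with one record per line. The class name MyTextValueInputFormat is made up; FileInputFormat is used because, on Hadoop 2, it already resolves its paths from "mapreduce.input.fileinputformat.inputdir", and it extends org.apache.hadoop.mapreduce.InputFormat transitively.

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical example: each value is one line of text, the key is unused.
    public class MyTextValueInputFormat extends FileInputFormat<NullWritable, Text>
    {
      @Override
      public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
      {
        // Delegate the actual reading to Hadoop's line reader and expose only the value.
        final LineRecordReader lines = new LineRecordReader();

        return new RecordReader<NullWritable, Text>()
        {
          @Override
          public void initialize(InputSplit s, TaskAttemptContext c) throws IOException
          {
            lines.initialize(s, c);
          }

          @Override
          public boolean nextKeyValue() throws IOException
          {
            return lines.nextKeyValue();
          }

          @Override
          public NullWritable getCurrentKey()
          {
            return NullWritable.get(); // the key is ignored (assumption 3)
          }

          @Override
          public Text getCurrentValue()
          {
            return lines.getCurrentValue(); // the value carries the record data
          }

          @Override
          public float getProgress() throws IOException
          {
            return lines.getProgress();
          }

          @Override
          public void close() throws IOException
          {
            lines.close();
          }
        };
      }
    }

Extending FileInputFormat rather than InputFormat directly gives split calculation and path handling for free, which is why most custom formats take this route.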

@xvrl xvrl added the Discuss label Mar 10, 2015
@himanshug
Contributor Author

I have updated the code so that the configured InputFormat can produce either Text or MapWritable records.

@xvrl
Member

xvrl commented Mar 10, 2015

@himanshug do you mind just shooting a quick email to the dev mailing list regarding this proposal? We can direct discussion to the PR, but we would ultimately like the mailing list to be the single point of entry for outside contributors.

@himanshug
Contributor Author

@xvrl sure, will do that.

//Note: this hack is required because StringInputRowParser is a
//InputRowParser<ByteBuffer> and not InputRowParser<String>
//parse() method on it expects a ByteBuffer and not String
ByteBuffer bb = ByteBuffer.wrap(value.toString().getBytes());
Contributor

Please specify the charset in getBytes (UTF-8 is the general standard).

EDIT: actually, better yet, use the java-utils StringUtils converter to UTF-8.

Edit again: just as a sanity check, what does parser.parse(...) expect?

Contributor

I think you can avoid copying by doing ByteBuffer.wrap(value.getBytes(), 0, value.getLength()). Hadoop Texts are internally just UTF-8 byte arrays with some valid length.
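
A rough sketch of that suggestion, for illustration only (value and parser stand for the variables in the surrounding, unshown code; per the note above, parse() takes a ByteBuffer):

    // Wrap the Text's backing array directly: Text stores UTF-8 bytes internally,
    // and getLength() bounds the valid region, so no copy or charset conversion is needed.
    ByteBuffer bb = ByteBuffer.wrap(value.getBytes(), 0, value.getLength());
    InputRow row = parser.parse(bb);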

Contributor

StringInputRowParser expects UTF-8. Unless it is guaranteed in the spec somehow that text bytes are UTF-8 encoded, I would prefer we do it this way but convert to UTF-8 so that we are absolutely sure we're passing text formats correctly.

Contributor

Oooohhh! nice, org.apache.hadoop.io.Text does explicitly state they are UTF-8 encoded. So yes, I think @gianm is right in that we can simply wrap the bytes directly.

Contributor Author

@gianm @drcrallen
I actually changed the code to look like
"((StringInputRowParser) parser).parse(value.toString());"
It is like the old code and doesn't need the byte-buffer wrapping business.

@himanshug
Contributor Author

I have updated the code to allow option #3: users can supply an arbitrary InputFormat, and it will work as long as they implement and configure an appropriate InputRowParser implementation.
All of this is fully backward compatible.

Thanks for the feedback. I had put this code up to get feedback on the idea and the high-level implementation. I will now test it with a custom input format, update the code for the various review comments, etc.
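
For readers following along, a rough sketch of the kind of conversion such a custom InputRowParser would perform. This is not the PR's MapWritableInputRowParser; the class name, the "timestamp" column name, and the dimension handling are assumptions, and a real parser would also implement the full InputRowParser interface rather than a standalone helper.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Writable;
    import org.joda.time.DateTime;

    import io.druid.data.input.InputRow;
    import io.druid.data.input.MapBasedInputRow;

    // Hypothetical helper: turns one Hadoop MapWritable record into a Druid InputRow.
    public class MapWritableRecordConverter
    {
      public InputRow toInputRow(MapWritable value)
      {
        // Copy the Writable key/value pairs into a plain Java map.
        Map<String, Object> event = new HashMap<>();
        for (Map.Entry<Writable, Writable> e : value.entrySet()) {
          event.put(e.getKey().toString(), e.getValue().toString());
        }

        // Pull the timestamp column and treat every other column as a dimension.
        DateTime timestamp = new DateTime(event.get("timestamp"));
        List<String> dimensions = new ArrayList<>(event.keySet());
        dimensions.remove("timestamp");

        return new MapBasedInputRow(timestamp, dimensions, event);
      }
    }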

@drcrallen
Contributor

Similar functionality (re-indexing existing data without retaining the original data) is possibly made available via #1190, though not in a Hadoop manner per se.

@@ -37,16 +39,21 @@
@JsonProperty("paths")
public String paths;

@JsonProperty("inputFormat")
private Class<? extends InputFormat> inputFormat;
Contributor

Usually we add final for things that don't have setters.
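
For illustration, one common Jackson pattern that keeps such fields final is constructor injection via @JsonCreator; the class name below is assumed rather than taken from the PR's diff.

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import org.apache.hadoop.mapreduce.InputFormat;

    // Hypothetical name for the class shown in the diff above.
    public class StaticPathSpec
    {
      @JsonProperty("paths")
      private final String paths;

      @JsonProperty("inputFormat")
      private final Class<? extends InputFormat> inputFormat;

      @JsonCreator
      public StaticPathSpec(
          @JsonProperty("paths") String paths,
          @JsonProperty("inputFormat") Class<? extends InputFormat> inputFormat
      )
      {
        // Jackson populates both fields through this constructor, so no setters are needed.
        this.paths = paths;
        this.inputFormat = inputFormat;
      }
    }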

@drcrallen
Contributor

I think this is a good PR overall. Is it possible to add any unit tests for this?

@xvrl
Member

xvrl commented Mar 12, 2015

@himanshug would you mind adding some serde tests for the different pathsSpec classes?
Otherwise it looks good to me.

@xvrl xvrl removed the Discuss label Mar 12, 2015
@himanshug
Contributor Author

@xvrl @drcrallen sure, will update soon.

@himanshug
Contributor Author

@xvrl @cheddar I'm trying to add support for more InputRowParsers inside indexing-hadoop.
I have added a class called MapWritableInputRowParser which can parse a MapWritable into an InputRow.

Please see MapWritableInputRowParserTest.testDeserialization(); currently it is failing. One way I know of fixing it is to make indexing-hadoop a Druid extension module,
that is, create a DruidModule implementation in indexing-hadoop, provide the appropriate Jackson Module from the getJacksonModule() impl, and create an io.druid.initialization.DruidModule file in indexing-hadoop/..METAINF/..

Is there any other, smarter way to do it? Maybe just adding some annotations on the MapWritableInputRowParser class?

@himanshug
Contributor Author

Made indexing-hadoop a Druid extension to register MapWritableInputRowParser as a subtype of InputRowParser inside the Jackson framework.
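
For readers unfamiliar with the extension mechanism, a rough sketch of that registration follows. The module class name and the "mapWritableParser" type key are assumptions, and the exact DruidModule method names may differ slightly from this sketch.

    import java.util.List;

    import com.fasterxml.jackson.databind.Module;
    import com.fasterxml.jackson.databind.jsontype.NamedType;
    import com.fasterxml.jackson.databind.module.SimpleModule;
    import com.google.common.collect.ImmutableList;
    import com.google.inject.Binder;

    import io.druid.initialization.DruidModule;

    // Hypothetical extension module; MapWritableInputRowParser is the parser class added in this PR.
    public class IndexingHadoopDruidModule implements DruidModule
    {
      @Override
      public List<? extends Module> getJacksonModules()
      {
        // Register MapWritableInputRowParser as a named subtype of InputRowParser
        // so Jackson can deserialize parser specs with this type key.
        return ImmutableList.of(
            new SimpleModule("IndexingHadoopModule")
                .registerSubtypes(new NamedType(MapWritableInputRowParser.class, "mapWritableParser"))
        );
      }

      @Override
      public void configure(Binder binder)
      {
        // No Guice bindings are needed for this module.
      }
    }

The module class would then be listed in the io.druid.initialization.DruidModule services file under META-INF, as described in the comment above, so Druid's extension loading can discover it.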

…ven input format must extend from org.apache.hadoop.mapreduce.InputFormat
@himanshug
Contributor Author

@xvrl this PR is complete and ready for merging. Can you please take a look?

@himanshug himanshug changed the title Proposal: feature to make hadoop input format configurable for batch ingestion Feature: Make hadoop input format configurable for batch ingestion Mar 18, 2015
@himanshug
Contributor Author

Closing/reopening the PR to trigger the build.

@himanshug himanshug closed this Mar 19, 2015
@himanshug himanshug reopened this Mar 19, 2015
@himanshug
Contributor Author

@xvrl the build is failing for an unrelated reason. Can you please merge this PR?

@xvrl
Member

xvrl commented Mar 19, 2015

@cheddar I haven't had a chance to look at the latest changes, if that looks good to you, I'll merge.


@xvrl xvrl added this to the 0.7.1 milestone Mar 19, 2015
@xvrl xvrl added the Feature label Mar 19, 2015
@cheddar
Contributor

cheddar commented Mar 19, 2015

Generally looks good to me. I think there is room to move the parsing entirely into the InputFormat and just pass an InputRow into the MR jobs, but I don't think there's too much value in holding this up for that.

xvrl added a commit that referenced this pull request Mar 19, 2015
Feature:  Make hadoop input format configurable for batch ingestion
@xvrl xvrl merged commit a98187f into apache:master Mar 19, 2015