Support parsing of BytesWritable strings in HadoopDruidIndexerMapper #1682

Merged: 1 commit, Aug 29, 2015

Conversation

jon-wei (Contributor) commented Aug 28, 2015:

Allows HadoopDruidIndexerMapper to parse InputRows from BytesWritable values containing strings.

@@ -104,6 +105,10 @@ public final static InputRow parseInputRow(Object value, InputRowParser parser)
  if(parser instanceof StringInputRowParser && value instanceof Text) {
    //Note: This is to ensure backward compatibility with 0.7.0 and before
    return ((StringInputRowParser) parser).parse(value.toString());
  } else if(parser instanceof StringInputRowParser && value instanceof BytesWritable) {
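The diff excerpt cuts off here. For context, a minimal sketch of how the new branch could hand BytesWritable content to the string parser, assuming the bytes are UTF-8 text; this is an illustration, not necessarily the exact code merged in this PR:

} else if(parser instanceof StringInputRowParser && value instanceof BytesWritable) {
  // Only the first getLength() bytes of a BytesWritable are valid; the backing
  // array may be padded, so wrap just that range before parsing.
  BytesWritable valueBytes = (BytesWritable) value;
  return ((StringInputRowParser) parser).parse(
      ByteBuffer.wrap(valueBytes.getBytes(), 0, valueBytes.getLength())
  );
}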
Contributor:
This use case does not look like something that needs special handling here. You can write an extension with an InputRowParser implementation that handles BytesWritable, or extend your choice of InputFormat so that it returns ByteBuffer.

Contributor:
You don't think it's worth having something in Druid core that knows how to read text that has been written as BytesWritable? This came up while making Druid work better with Secor (https://github.com/pinterest/secor), which writes values this way if you ask it to write SequenceFiles. With this patch it's possible to read those just by setting your inputFormat to org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat and using a normal string parser (assuming the data Secor wrote from Kafka was UTF-8 encoded and in a format Druid understands).

I'm open to deciding that only Text deserves handling without users writing custom parsers, but I'd like to understand the reason for not working out of the box with the other built-in Hadoop string-ish type.

Contributor:
I can see it going both ways, but the only reason I added Text as a special case was to maintain backward compatibility. In general, the "contract" I believe we want is that the InputFormat can return anything, and the configured InputRowParser should be able to handle that "anything" and produce an InputRow from it.

I can see your point of view about adding special handling for "popular" Hadoop InputFormat return types that are String-ish. But, at the same time, it is possible to write a BytesWritableInputRowParser (which probably just extends StringInputRowParser) and use that. It could be added as an extension (which might contain all the other string-ish-type InputRowParser implementations) or in core code.
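For illustration, a minimal sketch of such a parser, assuming the io.druid 0.8-era InputRowParser, ParseSpec, and StringInputRowParser APIs; the class name and exact signatures are hypothetical rather than taken from this PR, and it delegates to a StringInputRowParser instead of extending it, which has the same effect:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import io.druid.data.input.InputRow;
import io.druid.data.input.impl.InputRowParser;
import io.druid.data.input.impl.ParseSpec;
import io.druid.data.input.impl.StringInputRowParser;
import org.apache.hadoop.io.BytesWritable;

import java.nio.ByteBuffer;

// Hypothetical parser that reads UTF-8 text out of BytesWritable values and
// defers all row parsing to the configured StringInputRowParser.
public class BytesWritableInputRowParser implements InputRowParser<BytesWritable>
{
  private final ParseSpec parseSpec;
  private final StringInputRowParser delegate;

  @JsonCreator
  public BytesWritableInputRowParser(@JsonProperty("parseSpec") ParseSpec parseSpec)
  {
    this.parseSpec = parseSpec;
    this.delegate = new StringInputRowParser(parseSpec);
  }

  @Override
  public InputRow parse(BytesWritable input)
  {
    // Wrap only the valid portion of the backing array; BytesWritable pads it.
    return delegate.parse(ByteBuffer.wrap(input.getBytes(), 0, input.getLength()));
  }

  @Override
  @JsonProperty
  public ParseSpec getParseSpec()
  {
    return parseSpec;
  }

  @Override
  public InputRowParser withParseSpec(ParseSpec parseSpec)
  {
    return new BytesWritableInputRowParser(parseSpec);
  }
}

A user would then point the ingestion spec's parser at this type (once it is registered with Druid's Jackson modules) and set the inputFormat to SequenceFileInputFormat, rather than relying on special handling in HadoopDruidIndexerMapper.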

Contributor:
Or maybe write a core HadoopStringishInputRowParser (extending StringInputRowParser) that takes an Object record as input and handles all the possible Hadoop string-ish types in one big if-else. That way all the special handling lives in that one place and we keep things a bit cleaner here.

Contributor:
Ah, I see your point. So adding a BytesWritableInputRowParser to druid-indexing-hadoop would make sense to me; in that case we'd ask users who want to read BytesWritable SequenceFiles to change two things: the inputFormat and the type of the parser.

I think it makes sense in core rather than in an extension, because BytesWritable is a core Hadoop type. If we wanted to add official support for something like Avro at some point, that would make sense to me as an extension.

Contributor:
@gianm you are right about the deserialization issue. (Users can still do this as long as they put the Hadoop libs on the classpath, and it will work, but we can't put it in core Druid because it does not work in all cases.)
If adding this support covers a general use case needed by many users, then it might be OK to allow the special handling for now while we work out a model where parsers that depend on Hadoop libs are feasible and work in all cases.

Contributor:
I would like to add this now. It's not something a ton of people have complained about, but I do think it's good to have an easy Kafka -> S3 -> Druid path, and this patch plus Secor can provide it.

Contributor:
In addition, I think there is a reasonable case that it's good to have built-in support for the native Hadoop types.

Contributor:
OK, in that case this PR is LGTM for me, with the caveat that whenever we improve things, this will most likely not be backward compatible.

Contributor:
OK, I'll merge it then, and hopefully in the future we can find a good way of parsing Hadoop datatypes without running into class loader problems.

fjy (Contributor) commented Aug 29, 2015:

👍

gianm added a commit that referenced this pull request Aug 29, 2015
Support parsing of BytesWritable strings in HadoopDruidIndexerMapper
gianm merged commit e7b4cf9 into apache:master on Aug 29, 2015