Support parsing of BytesWritable strings in HadoopDruidIndexerMapper #1682

Merged: 1 commit, Aug 29, 2015

Conversation

jon-wei (Contributor) commented Aug 28, 2015:

Allows HadoopDruidIndexerMapper to parse InputRows from BytesWritable values containing strings.

@@ -104,6 +105,10 @@ public final static InputRow parseInputRow(Object value, InputRowParser parser)
  if(parser instanceof StringInputRowParser && value instanceof Text) {
    //Note: This is to ensure backward compatibility with 0.7.0 and before
    return ((StringInputRowParser) parser).parse(value.toString());
  } else if(parser instanceof StringInputRowParser && value instanceof BytesWritable) {
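The diff excerpt cuts off here. For context, a minimal sketch of how the new branch could hand BytesWritable content to the string parser, assuming the bytes are UTF-8 text; this is an illustration, not necessarily the exact code merged in this PR:

} else if(parser instanceof StringInputRowParser && value instanceof BytesWritable) {
  // Only the first getLength() bytes of a BytesWritable are valid; the backing
  // array may be padded, so wrap just that range before parsing.
  BytesWritable valueBytes = (BytesWritable) value;
  return ((StringInputRowParser) parser).parse(
      ByteBuffer.wrap(valueBytes.getBytes(), 0, valueBytes.getLength())
  );
}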
Contributor:
This use case does not look like something that needs special handling here. You can write an extension with an InputRowParser implementation that handles BytesWritable, or extend your choice of InputFormat so that it returns ByteBuffer.

Contributor:
You don't think it's worth having something in Druid core that knows how to read text that has been written as BytesWritable? This came up while making Druid work better with Secor (https://github.com/pinterest/secor), which writes values this way if you ask it to write SequenceFiles. With this patch it's possible to read those just by setting your inputFormat to org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat and using a normal string parser (assuming the data Secor wrote from Kafka was UTF-8 encoded and in a format Druid understands).

I'm open to deciding that only Text deserves handling without users writing custom parsers, but I'd like to understand the reason for not working out of the box with the other built-in Hadoop string-ish type.

Contributor:
I can see it going both ways, but the only reason I added Text as a special case was to maintain backward compatibility. In general, the "contract" I believe we want is that the InputFormat can return anything, and the configured InputRowParser should be able to handle that "anything" and produce an InputRow from it.

I can see your point of view about adding special handling for "popular" Hadoop InputFormat return types that are String-ish. But, at the same time, it is possible to write a BytesWritableInputRowParser (which probably just extends StringInputRowParser) and use that. It could be added as an extension (which might contain all the other string-ish-type InputRowParser implementations) or in core code.
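For illustration, a minimal sketch of such a parser, assuming the io.druid 0.8-era InputRowParser, ParseSpec, and StringInputRowParser APIs; the class name and exact signatures are hypothetical rather than taken from this PR, and it delegates to a StringInputRowParser instead of extending it, which has the same effect:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import io.druid.data.input.InputRow;
import io.druid.data.input.impl.InputRowParser;
import io.druid.data.input.impl.ParseSpec;
import io.druid.data.input.impl.StringInputRowParser;
import org.apache.hadoop.io.BytesWritable;

import java.nio.ByteBuffer;

// Hypothetical parser that reads UTF-8 text out of BytesWritable values and
// defers all row parsing to the configured StringInputRowParser.
public class BytesWritableInputRowParser implements InputRowParser<BytesWritable>
{
  private final ParseSpec parseSpec;
  private final StringInputRowParser delegate;

  @JsonCreator
  public BytesWritableInputRowParser(@JsonProperty("parseSpec") ParseSpec parseSpec)
  {
    this.parseSpec = parseSpec;
    this.delegate = new StringInputRowParser(parseSpec);
  }

  @Override
  public InputRow parse(BytesWritable input)
  {
    // Wrap only the valid portion of the backing array; BytesWritable pads it.
    return delegate.parse(ByteBuffer.wrap(input.getBytes(), 0, input.getLength()));
  }

  @Override
  @JsonProperty
  public ParseSpec getParseSpec()
  {
    return parseSpec;
  }

  @Override
  public InputRowParser withParseSpec(ParseSpec parseSpec)
  {
    return new BytesWritableInputRowParser(parseSpec);
  }
}

A user would then point the ingestion spec's parser at this type (once it is registered with Druid's Jackson modules) and set the inputFormat to SequenceFileInputFormat, rather than relying on special handling in HadoopDruidIndexerMapper.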

Contributor:
Or maybe write a core HadoopStringishInputRowParser (extending StringInputRowParser) that takes an Object record as input and handles all the possible Hadoop string-ish types in one big if-else. That way all the special handling lives in that one place and we keep things a bit cleaner here.

Contributor:
Ah, I see your point. So adding a BytesWritableInputRowParser to druid-indexing-hadoop would make sense to me; in that case we'd ask users who want to read BytesWritable SequenceFiles to change two things: the inputFormat and the type of the parser.

I think it makes sense in core rather than in an extension, because BytesWritable is a core Hadoop type. If we wanted to add official support for something like Avro at some point, that would make sense to me as an extension.

Contributor:
@gianm you are right about the deserialization issue. (Users can still do this as long as they put the Hadoop libs on the classpath, and it will work, but we can't put it in core Druid because it does not work in all cases.)
If adding this support covers a general use case needed by many users, then it might be OK to allow the special handling for now while we work out a model where parsers that depend on Hadoop libs are feasible and work in all cases.

Contributor:
I would like to add this now. It's not something a ton of people have complained about, but I do think it's good to have an easy Kafka -> S3 -> Druid path, and this patch plus Secor can provide it.

Contributor:
In addition, I think there is a reasonable case that it's good to have built-in support for the native Hadoop types.

Contributor:
OK, in that case this PR is LGTM for me, with the caveat that whenever we improve things, this will most likely not be backward compatible.

Contributor:
OK, I'll merge it then, and hopefully in the future we can find a good way of parsing Hadoop datatypes without running into class loader problems.

fjy (Contributor) commented Aug 29, 2015:

👍

gianm added a commit that referenced this pull request Aug 29, 2015
Support parsing of BytesWritable strings in HadoopDruidIndexerMapper
gianm merged commit e7b4cf9 into apache:master on Aug 29, 2015