Support parsing of BytesWritable strings in HadoopDruidIndexerMapper #1682
Conversation
@@ -104,6 +105,10 @@ public final static InputRow parseInputRow(Object value, InputRowParser parser)
    if(parser instanceof StringInputRowParser && value instanceof Text) {
      //Note: This is to ensure backward compatibility with 0.7.0 and before
      return ((StringInputRowParser) parser).parse(value.toString());
    } else if(parser instanceof StringInputRowParser && value instanceof BytesWritable) {
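For context, a minimal sketch of how the bytes behind such a value could be turned back into a string before handing it to the existing StringInputRowParser path (the helper name is hypothetical and UTF-8 encoding is assumed; the actual patch may differ):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.io.BytesWritable;

    final class BytesWritableStrings
    {
      private BytesWritableStrings() {}

      // BytesWritable.getBytes() returns the padded backing array, so only the
      // first getLength() bytes are meaningful; decode exactly that range.
      static String toUtf8String(BytesWritable value)
      {
        return new String(value.getBytes(), 0, value.getLength(), StandardCharsets.UTF_8);
      }
    }

The resulting string can then be passed to ((StringInputRowParser) parser).parse(...), just like the Text branch above.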
This use case does not look like something that needs special handling here. You can write an extension and use an InputRowParser implementation that handles BytesWritable, or extend your choice of InputFormat so that it returns ByteBuffer.
You don't think it's worth having something in Druid core that knows how to read text that has been written as BytesWritable? This came up for making Druid work better with Secor, which writes BytesWritable if you ask it to produce SequenceFiles: https://github.com/pinterest/secor. With this patch it's possible to read those just by setting your inputFormat to org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat and using a normal string parser (assuming the data written by Secor from Kafka was UTF-8 encoded and something that Druid understands).
I'm open to deciding that it only makes sense to handle Text without users writing custom parsers, but I'd like to understand the reason for not working out of the box with the other built-in Hadoop string-ish type.
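To illustrate, a rough fragment of what such a Hadoop ingestion spec could look like (the bucket, paths, and parseSpec details are placeholders, not taken from this PR):

    {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "inputFormat": "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
          "paths": "s3n://example-bucket/secor-output/events/2015-08-01"
        }
      },
      "dataSchema": {
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": []}
          }
        }
      }
    }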
I can see it going both ways. But the only reason I added Text as a special case was to maintain backward compatibility. In general, the "contract" I believe we want is that the InputFormat can return anything, and the specific InputRowParser configured should be able to handle that "anything" and produce an InputRow from it.
I can see your point of view about adding special handling for "popular" Hadoop InputFormat return types which are string-ish. But, at the same time, it is possible to write a BytesWritableInputRowParser (that probably just extends StringInputRowParser) and use that. It could be added as an extension (which might contain all the other StringishTypeInputRowParser implementations) or in core code.
Or maybe write a core HadoopStringishInputRowParser (that extends StringInputRowParser) that takes an Object-typed record as input and handles all the possible Hadoop string-ish types in a big if-else, along the lines of the sketch below. That way all the special handling would live in that one place, and we could keep things a bit cleaner here.
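A minimal sketch of that idea, wrapping (rather than extending) a StringInputRowParser to keep the example self-contained; the class name comes from the comment above, but the wiring and the Druid package names (from that era of the codebase) are assumptions:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;

    import io.druid.data.input.InputRow;
    import io.druid.data.input.impl.StringInputRowParser;

    public class HadoopStringishInputRowParser
    {
      private final StringInputRowParser delegate;

      public HadoopStringishInputRowParser(StringInputRowParser delegate)
      {
        this.delegate = delegate;
      }

      // Dispatch on the common Hadoop string-like value types and delegate the
      // actual row parsing to the configured StringInputRowParser.
      public InputRow parse(Object value)
      {
        if (value instanceof Text) {
          return delegate.parse(value.toString());
        } else if (value instanceof BytesWritable) {
          BytesWritable bw = (BytesWritable) value;
          // Decode only the first getLength() bytes of the padded backing array.
          return delegate.parse(new String(bw.getBytes(), 0, bw.getLength(), StandardCharsets.UTF_8));
        } else if (value instanceof String) {
          return delegate.parse((String) value);
        }
        throw new IllegalArgumentException("Unhandled value type: " + value.getClass().getName());
      }
    }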
Ah, I see your point. So doing BytesWritableInputRowParser as something in druid-indexing-hadoop would make sense to me; in that case we'd ask users who want to read BytesWritable SequenceFiles to change two things: the inputFormat and the type of the parser.
I think it makes sense in core rather than in an extension, because BytesWritable is a core Hadoop type. If we wanted to add official support for something like Avro at some point, that would make sense to me as an extension.
@gianm you are right about the deserialization issue. (Users can still do that as long as they put the Hadoop libs on the classpath and it will work, but we can't put it in core Druid since it does not work in all cases.)
If adding this support is a very general use case needed by many users, then it might be OK to allow the special handling for now, while we work out a model where it is feasible to write parsers that depend on Hadoop libs and work in all cases.
I would like to add this now. It's not something that a ton of people have complained about, but I do think it's good to have an easy Kafka -> S3 -> Druid path, and this patch plus Secor can provide that easy path.
In addition, I think there is a reasonable case that it is good to have built-in support for the native Hadoop types.
OK, in that case this PR is LGTM for me, with the caveat that whenever we improve things, this will most likely not be backward compatible.
OK, I'll merge it then, and hopefully in the future we can find a good way of handling parsing of Hadoop datatypes without running into classloader problems.
👍
Support parsing of BytesWritable strings in HadoopDruidIndexerMapper
Allows HadoopDruidIndexerMapper to parse InputRows from BytesWritable values containing strings.