
Parse Batch support #5081

Merged
merged 4 commits into from
Dec 4, 2017

Conversation

pjain1
Member

@pjain1 pjain1 commented Nov 13, 2017

Fixes #4373

  • Modify InputRowParser to have a List<InputRow> parseBatch(T input) method which returns a list of InputRows. Currently it has a default implementation which just wraps the output of the parse(T) method in a list.
  • Changed implementations of InputRowParser to implement parseBatch instead of parse, although they currently just wrap the output of parsing the input in a list.
  • Changed all usages of parse in the codebase to parseBatch.
  • Changed Firehose iterators to work on top of parseBatch.
  • Modified Appenderator's add method to accept a flag to disable automatic persists, so that callers can decide not to persist while adding a batch of rows and can call persist explicitly later.

We have been running this code on our internal cluster with an InputRowParser that multiplies each row a certain number of times, using KafkaIndexingService to index these rows.
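The default-method pattern described above can be sketched roughly as follows. This is a simplified illustration with hypothetical stand-ins for Druid's InputRow and InputRowParser types, not the actual Druid code:

```java
import java.util.Collections;
import java.util.List;

// Simplified, hypothetical stand-ins for Druid's InputRow / InputRowParser,
// sketching the default-method pattern described in the PR summary.
public class ParseBatchSketch {
    interface InputRow {}

    interface InputRowParser<T> {
        InputRow parse(T input);

        // Default implementation wraps the single-row parse() result in a
        // list, so existing single-row parsers keep working unchanged.
        default List<InputRow> parseBatch(T input) {
            return Collections.singletonList(parse(input));
        }
    }

    static final class MapRow implements InputRow {}

    public static void main(String[] args) {
        // A parser that only implements parse() still supports parseBatch().
        InputRowParser<String> parser = line -> new MapRow();
        List<InputRow> rows = parser.parseBatch("some input line");
        System.out.println(rows.size()); // prints 1
    }
}
```

Batch-aware parsers can then override parseBatch directly to return several rows per input, which is what enables the row-multiplying parser mentioned above.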

@pjain1 pjain1 added this to the 0.11.0 milestone Nov 13, 2017
@gianm
Contributor

gianm commented Nov 13, 2017

@pjain1 could this be moved to 0.11.1? It doesn't seem like a regression fix or something that critically needs to be in 0.11.0. So I'd rather avoid delaying 0.11.0.

Although if there is a reason I'm missing please call it out.

@pjain1 pjain1 modified the milestones: 0.11.0, 0.11.1 Nov 13, 2017
@pjain1
Member Author

pjain1 commented Nov 13, 2017

0.11.1 is fine

@himanshug
Contributor

@gianm reason for originally putting it in 0.11.0 was that some customers have been blocked on this feature for quite a while.
It is fine to target it for 0.11.1, and we will backport this (possibly with #4815) and release it internally on 0.11.0 as soon as it's merged. Also, note that we are already testing it on our metrics cluster and did a few verifications.
@gianm @dclim my only request is to prioritize this in the reviews so that it gets merged as early as possible. thanks.

@gianm
Contributor

gianm commented Nov 13, 2017

@gianm @dclim my only request is to prioritize this in the reviews so that it gets merged as early as possible. thanks.

Sure thing, we will prioritize reviewing it along with the Kafka incremental publishing patch.

@himanshug
Contributor

@gianm thanks.

@pjain1
Member Author

pjain1 commented Nov 19, 2017

I don't understand the TeamCity inspection; now it is complaining about a line from a commit from 2014.

@gianm
Contributor

gianm commented Nov 20, 2017

I restarted it. I've noticed sometimes it does that and I'm not sure why.

@gianm
Contributor

gianm commented Nov 21, 2017

@pjain1 could you please merge master into this branch and push it? May help with CI.

add addAll method, skip max rows in memory check for it

remove parse method from implemetations

transform transformers

add string multiplier input row parser

fix withParseSpec

fix kafka batch indexing

fix isPersistRequired

comments
@pjain1
Member Author

pjain1 commented Nov 21, 2017

@gianm rebased with master

@dclim
Contributor

dclim commented Nov 28, 2017

@pjain1 thanks! reviewing this now

Contributor

@dclim dclim left a comment


Generally looks good, just a few comments.

{
-  return parseMap(buildStringKeyMap(input));
+  return ImmutableList.of(parseMap(buildStringKeyMap(input)));
Contributor


Throws NPE if parseMap returns a null. Maybe in general it's better to prefer your Utils.nullableListOf() instead of ImmutableList.of() throughout - for example NoopInputRowParser will also throw an NPE if input is null which is not the same behavior as previously.
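The NPE difference can be demonstrated with plain JDK lists. A null-tolerant helper along the lines of the Utils.nullableListOf() mentioned above might look like this (a sketch, not Druid's actual implementation):

```java
import java.util.Arrays;
import java.util.List;

public class NullableListSketch {
    // Sketch of a null-tolerant single-element list, similar in spirit to
    // the Utils.nullableListOf() helper mentioned in the review (hypothetical,
    // not Druid's actual implementation).
    static <T> List<T> nullableListOf(T element) {
        // Arrays.asList accepts a null element, whereas ImmutableList.of(null)
        // and List.of(null) both throw NullPointerException.
        return Arrays.asList(element);
    }

    public static void main(String[] args) {
        List<String> rows = nullableListOf(null);
        System.out.println(rows.size());  // prints 1
        System.out.println(rows.get(0));  // prints null
    }
}
```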

* Parse an input into a list of {@link InputRow}. The list can contain nulls for rows that should be thrown away,
* or throws {@code ParseException} if the input is unparseable.
*/
default List<InputRow> parseBatch(T input)
Contributor


Consider using @NotNull or adding a comment to the Javadoc that implementations must not return null. Lots of things would break if a null was returned.

@Override
public boolean hasMore()
{
-  return iter.hasNext();
+  return iter.hasNext() || nextIterator.hasNext();
Contributor


You should check nextIterator.hasNext() first before ConsumerIterator.hasNext() in case their implementation is blocking, for example in old implementations: https://issues.apache.org/jira/browse/KAFKA-520
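The suggested ordering can be sketched with a minimal, hypothetical class (simplified names; Druid's actual firehose code differs):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified sketch of the ordering suggested in the review:
// consult the already-buffered batch iterator before the underlying source
// iterator, whose hasNext() may block (see KAFKA-520).
public class BatchingIteratorSketch<T> {
    private final Iterator<T> iter;                                  // source; hasNext() may block
    private Iterator<T> nextIterator = Collections.emptyIterator();  // current parsed batch

    public BatchingIteratorSketch(Iterator<T> iter) {
        this.iter = iter;
    }

    public boolean hasMore() {
        // Buffered rows first: if a batch is already parsed, there is no
        // need to touch the possibly-blocking source iterator.
        return nextIterator.hasNext() || iter.hasNext();
    }

    public T nextRow() {
        if (!nextIterator.hasNext()) {
            nextIterator = parseBatch(iter.next()).iterator();
        }
        return nextIterator.next();
    }

    // Stand-in for InputRowParser.parseBatch(): a one-element batch.
    private List<T> parseBatch(T input) {
        return Collections.singletonList(input);
    }
}
```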

@@ -92,20 +96,20 @@ protected void map(Object key, Object value, Context context) throws IOException
}

@Nullable
Contributor


This shouldn't be marked as @Nullable anymore. Also method can be made private.

@pjain1
Member Author

pjain1 commented Dec 4, 2017

@dclim thanks for the review, I have addressed your comments.

@dclim
Contributor

dclim commented Dec 4, 2017

👍

@himanshug himanshug merged commit 7c01f77 into apache:master Dec 4, 2017
@quenlang

@pjain1 @himanshug
I did not find any description of this in the 0.12.0 docs. Can you give me any advice?

@dragonls
Contributor

@quenlang Here are the code changes.
Code location:
Batch support: https://github.com/druid-io/druid/blob/druid-0.12.0/api/src/main/java/io/druid/data/input/impl/InputRowParser.java#L47

@pjain1 pjain1 deleted the batch_indexing branch March 28, 2020 14:52