Add stringLast and stringFirst aggregators extension #5789
Conversation
Looks good!
@@ -0,0 +1,41 @@
---
This file's in the extensions-core folder, but the link to it in extensions.md points at extensions-contrib.
You are right!
@JsonCreator
public StringFirstAggregatorFactory(
    @JsonProperty("name") String name,
    @JsonProperty("fieldName") final String fieldName,
It'd be cool if this (and last) supported expressions like the *sum and min/max aggregators
What is the expressions usage? I have seen that:
But I can't find anything in the Druid documentation about expressions. Taking a look at the class:
I think that is another way to define the fieldName.
Could you give me some documentation about that, or some example?
I'm afraid there's no comprehensive documentation at the moment. I haven't been digging into Druid's source for long, so feel free to ignore me!
There's some documentation on the language itself here: https://github.com/druid-io/druid/blob/master/docs/content/misc/math-expr.md
I think adding support for it would mean making the constructor of this class similar to LongSumAggregatorFactory's, taking an injected macroTable and a string for the expression.
Then in your factorize methods you'd check for an expression; if one is present, you'd return an ExprEvalSelector instead of the result of metricFactory.makeColumnValueSelector.
Making the ExprEvalSelector would look something like this:
final Expr expr = Parser.parse(fieldExpression, macroTable);
return ExpressionSelectors.makeExprEvalSelector(metricFactory, expr);
Druid's expression system is not documented yet because it was an experimental feature (but maybe it's time to document it).
I think we can add an expression field here later.
@andresgomezfrr thanks for the contribution! I'm reviewing this PR.
@andresgomezfrr thanks for the PR. I left some comments.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializablePairSerde extends ComplexMetricSerde
Would you please add some Javadoc for this class? It should describe the value types to be serialized/deserialized and where this class is used.
public class SerializablePairSerde extends ComplexMetricSerde
{

  public SerializablePairSerde()
Unnecessary class constructor.
@Override
public String getTypeName()
{
  return "serializablePairLongString";
Can we define this string as a static variable somewhere and use it?
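A minimal sketch of that suggestion; the constant name and holder class are hypothetical, not from the patch:

```java
public class TypeNameSketch
{
  // Hypothetical shared constant; the real patch would put this somewhere
  // both the serde and the aggregator factories can reference.
  public static final String TYPE_NAME = "serializablePairLongString";

  public static String getTypeName()
  {
    return TYPE_NAME; // instead of repeating the string literal at each use site
  }

  public static void main(String[] args)
  {
    System.out.println(getTypeName());
  }
}
```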
@Override
public ObjectStrategy getObjectStrategy()
{
  return new ObjectStrategy<SerializablePair>()
Please specify type parameters.
I will create a new class SerializablePairLongString, because if we specify type parameters I can't write SerializablePair<Long, String>.class.
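The restriction being worked around here is that Java has no class literal for a parameterized type (`SerializablePair<Long, String>.class` does not compile), while a concrete subclass captures the type arguments and does have a usable `.class` token. A self-contained sketch with a stand-in pair type:

```java
public class PairLiteralSketch
{
  // Stand-in for Druid's SerializablePair.
  static class SerializablePair<L, R>
  {
    final L lhs;
    final R rhs;

    SerializablePair(L lhs, R rhs)
    {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  // A concrete subtype pins down <Long, String>, so its class literal
  // can be passed wherever a Class token is required.
  static class SerializablePairLongString extends SerializablePair<Long, String>
  {
    SerializablePairLongString(Long lhs, String rhs)
    {
      super(lhs, rhs);
    }
  }

  public static void main(String[] args)
  {
    Class<SerializablePairLongString> clazz = SerializablePairLongString.class;
    System.out.println(clazz.getSimpleName());
  }
}
```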
return new ObjectStrategy<SerializablePair>()
{
  @Override
  public int compare(SerializablePair o1, SerializablePair o2)
Please specify type parameters.
long lastTime = mutationBuffer.getLong(position);
if (time >= lastTime) {
  byte[] valueBytes = lastString.getBytes(StandardCharsets.UTF_8);
lastString is nullable. You should check whether it's null.
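A hedged sketch of the null guard being asked for, isolated from the surrounding buffer logic (the empty-array fallback is an assumption for the example, not necessarily what the patch should do):

```java
import java.nio.charset.StandardCharsets;

public class NullGuardSketch
{
  // Stand-in for the aggregator's serialization step: treat a null
  // lastString as an empty payload instead of throwing an NPE.
  static byte[] toBytes(String lastString)
  {
    return lastString == null
           ? new byte[0]
           : lastString.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    System.out.println(toBytes("ab").length);
  }
}
```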
byte[] valueBytes = new byte[stringSizeBytes];
mutationBuffer.position(position + Long.BYTES + Integer.BYTES);
mutationBuffer.get(valueBytes, 0, stringSizeBytes);
serializablePair = new SerializablePair<>(timeValue, new String(valueBytes, StandardCharsets.UTF_8));
Please use StringUtils.toUtf8() instead.
public void aggregate()
{
  SerializablePair<Long, String> pair = (SerializablePair<Long, String>) selector.getObject();
  if (pair.lhs >= lastTime) {
pair can be null.
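One way to sketch the guard, with stand-in types rather than Druid's real aggregator classes (keeping the current value when the selector returns nothing):

```java
public class PairGuardSketch
{
  // Stand-in for SerializablePair<Long, String>.
  static class Pair
  {
    final long lhs;
    final String rhs;

    Pair(long lhs, String rhs)
    {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  // Returns the updated "last" pair, keeping the current one when the
  // incoming pair is null or older. The null check comes first, which
  // is the reviewer's point.
  static Pair aggregate(Pair current, Pair incoming)
  {
    if (incoming == null) {
      return current;
    }
    if (current == null || incoming.lhs >= current.lhs) {
      return incoming;
    }
    return current;
  }

  public static void main(String[] args)
  {
    Pair kept = aggregate(new Pair(5, "a"), null);
    System.out.println(kept.rhs);
  }
}
```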
SerializablePair<Long, String> pair = (SerializablePair<Long, String>) selector.getObject();
long lastTime = mutationBuffer.getLong(position);
if (pair.lhs >= lastTime) {
pair can be null.
long lastTime = mutationBuffer.getLong(position);
if (pair.lhs >= lastTime) {
  mutationBuffer.putLong(position, pair.lhs);
  byte[] valueBytes = pair.rhs.getBytes(StandardCharsets.UTF_8);
Please use StringUtils.toUtf8() instead.
I have one more comment. Can we move this to druid core (i.e.,
I did all the improvements and refactors. Now we will start to move the source to the druid core. @jihoonson thanks for the code review!! 😃
@andresgomezfrr thank you for the quick fix!
long time = timeSelector.getLong();
if (time < firstTime) {
  firstTime = time;
  Object value = valueSelector.getObject();
I mean, value might accidentally be another type because of bugs. We need to add a sanity check.
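A sketch of that sanity check: verify the selector's object is the expected type before casting, failing early with a descriptive message instead of a ClassCastException downstream. The stand-in type is invented for the example:

```java
public class SanityCheckSketch
{
  // Stand-in for the expected value type (e.g. a SerializablePair).
  static class SerializablePairStandIn
  {
  }

  // Null passes through; any other unexpected type fails loudly.
  static SerializablePairStandIn checkedValue(Object value)
  {
    if (value != null && !(value instanceof SerializablePairStandIn)) {
      throw new IllegalStateException(
          "Unexpected value type: " + value.getClass().getName());
    }
    return (SerializablePairStandIn) value;
  }

  public static void main(String[] args)
  {
    System.out.println(checkedValue(new SerializablePairStandIn()) != null);
  }
}
```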
@andresgomezfrr thanks. Please check my comment here. Also, this unit test failure looks legitimate.
Fixed the test, I didn't see it, sorry! Removed the
@andresgomezfrr thanks for the quick fix! I left my last comments.
@Override
public List<AggregatorFactory> getRequiredColumns()
{
  return Arrays.asList(new StringLastAggregatorFactory(fieldName, fieldName, maxStringBytes));
nit: can use Collections.singletonList() instead.
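For reference, the difference is that Collections.singletonList builds an immutable one-element list without the varargs array that Arrays.asList allocates. A minimal sketch (method and field names are illustrative only):

```java
import java.util.Collections;
import java.util.List;

public class SingletonListSketch
{
  // Illustrative replacement for the getRequiredColumns() body above.
  static List<String> requiredColumns(String fieldName)
  {
    return Collections.singletonList(fieldName);
  }

  public static void main(String[] args)
  {
    System.out.println(requiredColumns("last_user").size());
  }
}
```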
👍
@Override
public List<AggregatorFactory> getRequiredColumns()
{
  return Arrays.asList(new StringFirstAggregatorFactory(fieldName, fieldName, maxStringBytes));
nit: can use Collections.singletonList() instead.
👍
@Override
public void fold(ColumnValueSelector selector)
{
  if (firstString == null) {
It looks like this is checking whether reset() has been called or not. But firstValue can be null even when reset() has been called, because selector.getObject() can return null. I think we need an isReset flag to check this.
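A hedged sketch of that suggestion: track whether the combiner has seen a value with an explicit flag, since a null first value is legitimate. The Combiner class here is a stand-in, not Druid's AggregateCombiner:

```java
public class IsResetSketch
{
  static class Combiner
  {
    private boolean isReset = false; // becomes true once reset() runs
    private Object firstValue;

    void reset(Object selectorValue)
    {
      firstValue = selectorValue; // may legitimately be null
      isReset = true;
    }

    void fold(Object selectorValue)
    {
      if (!isReset) { // not "firstValue == null"
        reset(selectorValue);
      }
      // otherwise keep the existing firstValue
    }

    Object get()
    {
      return firstValue;
    }
  }

  public static void main(String[] args)
  {
    Combiner c = new Combiner();
    c.fold(null);    // first fold acts as reset, even when the value is null
    c.fold("later"); // must NOT overwrite the null first value
    System.out.println(c.get());
  }
}
```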
Yeah, that's true. Good point!
The latest change looks good to me. Thanks @andresgomezfrr!
Does stringFirst actually work at ingestion time? The doc change made here (changing the existing claim that first/last aggregators don't work at ingestion time to say that only numeric ones don't), plus the implementation of makeAggregateCombiner, makes it seem like it should. When I define a Kafka indexing service data source with a stringFirst aggregator, I can properly query the metric against data as the indexing task indexes it. But the indexing task's publish stage fails (in 0.13) with errors like:
Has anyone actually successfully used stringFirst at ingestion time?
Yes, I use it at indexing time. Could you share your ingestion spec and some example input data?
OK, I attempted a trivial reproduction by working through the Kafka stream tutorial but removing
This worked just fine (including actually publishing), so it's unclear what happened when I ran it in our cluster. In our cluster, this was the ingestion spec. Note that we use a custom parser implementation which parses some protobufs into MapBasedInputRow, but it should just end up mapping
I'll keep investigating, but definitely curious to hear if there's anything obviously strange here!
I don't really understand the variety of slightly different entry points involved in the metric aggregation process, but it seems like if StringFirstAggregateCombiner used the same logic as
I tried running this again with DEBUG logging enabled but nothing obvious showed up. Same stack trace. Is there a good way to poke at the persisted files while a task runs and see if they're in the right format?
@glasser Hmm, I just noticed the AggregateCombiner type is a
@gianm I could believe that — it would explain why some small reproductions I tried worked but running in our QA cluster didn't. Though I don't know what a spill file is :) |
@glasser By 'spill file' I mean the files that get written to disk every maxRowsInMemory or intermediatePersistPeriod (and are merged later into a single segment) |
Hmm, I'm not sure if that's exactly it. I've been trying the standard quickstart Kafka ingestion example with this supervisor:
Note the maxRowsInMemory: 3000, which is less than the number of rows in wikiticker-2015-09-12-sampled.json. (I tried setting it to just 1, but that leads to OOMs.) This job runs successfully. I should probably try with just an index task instead of Kafka to make it simpler, though.
Yeah, running this task against a fresh download of 0.13-incubating succeeds, even though I would think it would need to invoke AggregateCombiner?
Oh hmm.
Oh, right. We need to actually make things roll up with each other, so set a non-trivial queryGranularity. I now have an actual reproduction so I'll open an issue: #7243. |
@glasser @andresgomezfrr Did you happen to find any workaround for this issue? Or is it solved in any of the latest versions? I am facing this exact issue when using stringLast aggregation during ingestion. |
I didn't have time to look into this and we switched to working on migrating this particular weird data source away from Druid instead. Somebody told me this might have been fixed though. |
Hi all,
This PR contains a Druid extension module that adds stringLast and stringFirst aggregators.