Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contributing Moving-Average Query to open source. #6430

Merged
merged 55 commits into from
Apr 27, 2019
Merged

Contributing Moving-Average Query to open source. #6430

merged 55 commits into from
Apr 27, 2019

Conversation

yurmix
Copy link
Contributor

@yurmix yurmix commented Oct 8, 2018

Implements #6320

Includes documentation and comments but I'd be glad to add more if needed.

@nishantmonu51 nishantmonu51 self-assigned this Oct 9, 2018
@yurmix
Copy link
Contributor Author

yurmix commented Oct 9, 2018

@nishantmonu51 Thank you!
Please look at the documentation .md for user perspective
Also please look at the high level comments in MovingAverageQueryRunner (Main one at class level and more inlined) to walk through the code.

Copy link
Contributor Author

@yurmix yurmix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build fails due to useDefaultValueForNull=false (A flag introduced in 0.13.0).
I will update the tests accordingly.

@fjy fjy added this to the 0.13.1 milestone Oct 10, 2018
@yurmix
Copy link
Contributor Author

yurmix commented Oct 15, 2018

@nishantmonu51 the build passes successfully now. The extension doesn't support useDefaultValueForNull=false yet. In the meantime I had to explicitly turn off the flag in unit tests. (See issue #6472).
Please let me know if you would like to review the feature.
Thanks!

@yurmix
Copy link
Contributor Author

yurmix commented Dec 3, 2018

@niketh would you like to review?

@peferron
Copy link
Contributor

Very useful feature. Right now, moving averages are doable in SQL via nested SELECTs, but native queries don't support it. This extension would be great to bring native queries up to par. Hope it gets merged soon!

@jon-wei jon-wei removed this from the 0.14.0 milestone Feb 5, 2019
Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurmix thanks for the nice PR! I'm still looking at, but have a question for the general design.

Why do we need a new query type for moving average? Rather, can we generalize Averager and PostAverager to support any types of queries like PostAggregator? This question is related to MovingAverageQueryRunner. It internally generates a query and computes moving average on top of it. I don't think this is necessary, but we can generalize this logic similar to PostAggregator.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reviewing, but left some more comments.

{

private final Collection<DimensionSpec> dims;
private final Map<Map<String, Object>, Collection<Averager<?>>> averagers = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a brief description of what are keys and values.

Copy link
Contributor Author

@yurmix yurmix Mar 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public class LongMaxAverager extends BaseAverager<Number, Long>
{

private int startFrom = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean?

Copy link
Contributor Author

@yurmix yurmix Mar 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment (see commit e2a5317).

private Set<Map<String, Object>> seenKeys = new HashSet<>();
private Row saveNext;
private Map<String, AggregatorFactory> aggMap;
private Map<String, Object> fakeEvents;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is fakeEvents?

Copy link
Contributor Author

@yurmix yurmix Mar 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment (see commit 4b425b2).

@yurmix
Copy link
Contributor Author

yurmix commented Feb 23, 2019

@yurmix thanks for the nice PR! I'm still looking at, but have a question for the general design.

Why do we need a new query type for moving average? Rather, can we generalize Averager and PostAverager to support any types of queries like PostAggregator? This question is related to MovingAverageQueryRunner. It internally generates a query and computes moving average on top of it. I don't think this is necessary, but we can generalize this logic similar to PostAggregator.

I completely agree the work could be incorporated within the general Query. The reason it was created as a separate query type is for the sake of easier implementation (due to the separation of concerns). Calling the underlying query through an internal API is easier than making those changes directly to Queries or another class. Perhaps refactoring it into the core query framework could be a future work?

@yurmix
Copy link
Contributor Author

yurmix commented Feb 23, 2019

@jihoonson BTW, I have a couple of my own design concerns and I welcome your thoughts on the matter:

  1. The complexity of MovingAverageIterable. But I would be careful in changing that for the sake of refactoring.
  2. Averager implements their own accumulation formulas (computeResults) which are separate from the formulas of Aggregator. I think this could be refactored in order to easily add more types of Averagers and perhaps in order to allign null-handling with the new standards.

@jihoonson
Copy link
Contributor

jihoonson commented Feb 25, 2019

I completely agree the work could be incorporated within the general Query. The reason it was created as a separate query type is for the sake of easier implementation (due to the separation of concerns). Calling the underlying query through an internal API is easier than making those changes directly to Queries or another class. Perhaps refactoring it into the core query framework could be a future work?

@yurmix I see. It makes sense to me.

  1. The complexity of MovingAverageIterable. But I would be careful in changing that for the sake of refactoring.

It looks complex, but I think it would be easier to understand if you can add more comments. But, simpler implementation would be great if possible. Would you tell me a bit of details of what kind of refactoring you think?

  1. Averager implements their own accumulation formulas (computeResults) which are separate from the formulas of Aggregator. I think this could be refactored in order to easily add more types of Averagers and perhaps in order to allign null-handling with the new standards.

It sounds good, but I'm wondering we can use our Aggregator for moving average instead of adding a new type of Averager. It would be best if we can because we don't have to implement almost same aggregation logic again (ex, DoubleMaxAverager and DoubleMaxAggregator). Probably possible if we can integrate Accumulator with Aggregator. Or, maybe it's even easier if we can use Grouper (like StreamingMergeSortedGrouper, but simpler non-thread-safe one).

…ollowing once DI conflicts with datasketches are resolved.
@yurmix
Copy link
Contributor Author

yurmix commented Mar 19, 2019

@jihoonson, thanks so much for your effort on this thorough review and sorry it took me that long to complete my response. I have addressed all comments, feel free to review and raise other concerns.

@drcrallen
Copy link
Contributor

Ping to keep open

@jihoonson
Copy link
Contributor

@drcrallen thanks for the ping.

@yurmix I'll take another look once 0.14.0 release is finalized.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurmix thank you for the fix. Left some more comments.

private final List<AveragerFactory<?, ?>> factories;
private final Map<String, PostAggregator> postAggMap;
private final Map<String, AggregatorFactory> aggMap;
private final Map<String, Object> fakeEvents;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe emptyEvents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

*
* <p>Usually, the contents of key will be contained by the row R being passed in, but in the case of a
* dummy row, its possible that the dimensions will be known but the row empty. Hence, the values are
* passed as two separate arguments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why it should accept key and r separately instead of accepting MapBasedRow. Would you elaborate more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row's key (only dimensions, no metrics) is required but is not provided by Row's interface.
We use MovingAverageHelper.getDimKeyFromRow(dims, r) to extract the key.

I was able to remove the redundant parameter by calling getDimKeyFromRow inside computeMovingAverage() as well.

final I[] buckets;
private int index;

/* startFrom is needed because `buckets` field is a fixed array, not a list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we don't use multi line comments widely. How about changing it to javadoc? I think it's more clear anyway.

@@ -0,0 +1,2 @@
druid.processing.buffer.sizeBytes=655360
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please let me know what tests failed without this file? I think you should be able to set these properties in each test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the property (druid.processing.buffer.sizeBytes=655360) to MovingAverageQueryTest and removed runtime.properties.

// standard case. return regular row
yielder = yielder.next(currentBucket);
expectedBucket = expectedBucket.plus(period);
return currentBucket;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me add some more details. yielder is updated in this if clause which should be used to iterate all values in it. However, in hasNext(), it only checks expectedBucket is less than endTime. Since expectedBucket is also updated in this if clause, hasNext() can return false even though yielder is not used yet.

@jihoonson
Copy link
Contributor

@yurmix sorry, I've just checked your last update. Most of my last comments were addressed except these two: #6430 (comment), #6430 (comment). Would you please check them?

@yurmix
Copy link
Contributor Author

yurmix commented Apr 25, 2019

@yurmix sorry, I've just checked your last update. Most of my last comments were addressed except these two [...]

Thanks for reminding me, I have addressed one of them and currently reviewing the other.

@yurmix
Copy link
Contributor Author

yurmix commented Apr 26, 2019

Hmm, let me add some more details. yielder is updated in this if clause which should be used to iterate all values in it. However, in hasNext(), it only checks expectedBucket is less than endTime. Since expectedBucket is also updated in this if clause, hasNext() can return false even though yielder is not used yet.

The reason we need expectedBucket is that RowBucketIterable needs to return empty row buckets for periods with no rows. That's why it traverses over intervals in addition to the rows seq.
I added a check for yielder as well to hasNext(), in addition to expectedBucket.

I think there could be a more elegant way for traversing over two levels (intervals/periods and rows) when one does not directly contain the other, but I won't be able to refactor it for this release.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there could be a more elegant way for traversing over two levels (intervals/periods and rows) when one does not directly contain the other, but I won't be able to refactor it for this release.

It sounds good to me. I don't think this refactoring is strictly required for this PR.

The latest change looks good to me. The CI failure looks a flaky Travis timeout and I just restarted it. +1 after CI.

@yurmix thank you for all your hard work!

@jihoonson jihoonson merged commit f02251a into apache:master Apr 27, 2019
@yurmix
Copy link
Contributor Author

yurmix commented Apr 27, 2019

@jihoonson thanks for your dedication and for the insightful review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants