Add Variance and Standard Deviation Aggregation Functions #9910

Merged
merged 3 commits into from Dec 8, 2022

snleee
Contributor

@snleee snleee commented Dec 5, 2022

This PR adds VAR_POP(), VAR_SAMP(), STDDEV_POP(), STDDEV_SAMP() aggregation functions.
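For reference, the usual textbook definitions behind these four names are given below. The PR description does not spell them out, so treat this as the standard convention rather than a statement about the implementation; $x_1, \dots, x_n$ are the non-null input values and $\bar{x}$ is their mean.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
\mathrm{VAR\_POP} &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\mathrm{VAR\_SAMP} &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\mathrm{STDDEV\_POP} &= \sqrt{\mathrm{VAR\_POP}} \\
\mathrm{STDDEV\_SAMP} &= \sqrt{\mathrm{VAR\_SAMP}}
\end{aligned}
```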

@snleee snleee force-pushed the add-variance-function branch 3 times, most recently from b5fa523 to 30e7324 Compare December 6, 2022 08:04
@snleee snleee changed the title [WIP] Add Variance and Standard Deviation Aggregation Functions Add Variance and Standard Deviation Aggregation Functions Dec 6, 2022
@codecov-commenter

codecov-commenter commented Dec 6, 2022

Codecov Report

Merging #9910 (38c8821) into master (ecf41be) will increase coverage by 0.03%.
The diff coverage is 74.62%.

@@             Coverage Diff              @@
##             master    #9910      +/-   ##
============================================
+ Coverage     64.30%   64.33%   +0.03%     
- Complexity     5034     5531     +497     
============================================
  Files          1928     1931       +3     
  Lines        103874   104243     +369     
  Branches      15823    15882      +59     
============================================
+ Hits          66793    67067     +274     
- Misses        32254    32333      +79     
- Partials       4827     4843      +16     
Flag Coverage Δ
unittests1 68.01% <74.62%> (+0.08%) ⬆️
unittests2 15.82% <0.00%> (+0.02%) ⬆️

Impacted Files Coverage Δ
...ion/utils/StatisticalAggregationFunctionUtils.java 50.00% <50.00%> (ø)
...inot/segment/local/customobject/VarianceTuple.java 65.00% <65.00%> (ø)
...gation/function/CovarianceAggregationFunction.java 66.66% <66.66%> (+1.23%) ⬆️
...org/apache/pinot/core/common/ObjectSerDeUtils.java 91.43% <75.00%> (+0.70%) ⬆️
...regation/function/VarianceAggregationFunction.java 79.68% <79.68%> (ø)
.../pinot/common/function/scalar/StringFunctions.java 67.54% <100.00%> (+11.29%) ⬆️
...gregation/function/AggregationFunctionFactory.java 81.56% <100.00%> (+0.53%) ⬆️
...che/pinot/segment/spi/AggregationFunctionType.java 90.42% <100.00%> (+0.42%) ⬆️
.../pinot/segment/spi/store/ColumnIndexDirectory.java 28.57% <0.00%> (-11.43%) ⬇️
...ache/pinot/segment/spi/store/SegmentDirectory.java 45.45% <0.00%> (-10.11%) ⬇️
... and 74 more

Contributor

@jasperjiaguo jasperjiaguo left a comment

LGTM overall.

checkWithPrecisionForStandardDeviation((VarianceTuple) aggregationResult.get(7), NUM_RECORDS, expectedDoubleSum,
expectedStdDevs[7].getResult(), true);

// Update the expected result by 3 more times (broker query will compute 4 identical segments)
Contributor

The variance on 4 identical segments is the same as the variance on any one of them. We should probably consider using getDistinctInstances in BaseQueriesTest.

Contributor Author

@jasperjiaguo Can you elaborate on your suggestion?

I referred to the covariance test, and it also uses getBrokerResponse() to get the result. According to the documentation, getBrokerResponse() seems to compute the result from 4 identical segments. FYI, getBrokerResponse() internally calls getDistinctInstances().

Contributor

Yes. In CovarianceQueriesTest, @SabrinaZhaozyf added getDistinctInstances() to use 4 distinct segments for these statistical functions, because Covar on 4 identical segments is the same as Covar on any one of them, which kind of defeats the purpose of testing the merge logic over multiple segments. The getDistinctInstances in the base class still returns 4 identical segments for backward compatibility, though. We can probably have VarianceQueriesTest inherit from CovarianceQueriesTest.
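To illustrate why identical segments are a weak test of merge logic, here is a minimal sketch of the pairwise merge from the "Parallel algorithm" section referenced in the class Javadoc. `SegmentStats`, `of`, `merge`, and `mergeWithoutCrossTerm` are hypothetical names made up for this example; they are not Pinot's `VarianceTuple` API. When both segments have the same mean, the cross term is zero, so a merge that forgets it still produces the right population variance.

```java
import java.util.Arrays;

/** Hypothetical per-segment summary for illustration; this is not Pinot's VarianceTuple. */
final class SegmentStats {
  final long count;
  final double sum;
  final double m2; // sum of squared deviations from this segment's mean

  SegmentStats(long count, double sum, double m2) {
    this.count = count;
    this.sum = sum;
    this.m2 = m2;
  }

  /** Builds the summary for one non-empty segment. */
  static SegmentStats of(double... values) {
    double sum = Arrays.stream(values).sum();
    double mean = sum / values.length;
    double m2 = 0.0;
    for (double v : values) {
      m2 += (v - mean) * (v - mean);
    }
    return new SegmentStats(values.length, sum, m2);
  }

  /** Pairwise merge including the cross term; assumes both sides are non-empty. */
  SegmentStats merge(SegmentStats other) {
    double delta = other.sum / other.count - sum / count;
    long n = count + other.count;
    double cross = delta * delta * count * other.count / n;
    return new SegmentStats(n, sum + other.sum, m2 + other.m2 + cross);
  }

  /** Deliberately wrong merge that drops the cross term. */
  SegmentStats mergeWithoutCrossTerm(SegmentStats other) {
    return new SegmentStats(count + other.count, sum + other.sum, m2 + other.m2);
  }

  double populationVariance() {
    return m2 / count;
  }

  public static void main(String[] args) {
    SegmentStats a = SegmentStats.of(1, 2, 3, 4);
    SegmentStats b = SegmentStats.of(10, 20, 30, 40);
    // Identical segments: both merges agree, so the missing term goes unnoticed.
    System.out.println(a.merge(a).populationVariance());
    System.out.println(a.mergeWithoutCrossTerm(a).populationVariance());
    // Distinct segments: the two merges disagree.
    System.out.println(a.merge(b).populationVariance());
    System.out.println(a.mergeWithoutCrossTerm(b).populationVariance());
  }
}
```

With distinct segments the means differ, the cross term is non-zero, and the broken merge is caught. Note that the equality on identical segments holds for the population variance; the sample variance changes slightly because its divisor is n - 1.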

Contributor Author

@jasperjiaguo Thanks for pointing that out. I merged the Covariance and Variance tests, which helped me identify a bug when merging an empty VarianceTuple with a non-empty VarianceTuple!

Can you go over the pr once more?
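The thread does not say what the bug was. As a sketch of one plausible failure mode only: if an empty intermediate result (count 0) goes through the pairwise merge formula, its mean is 0.0 / 0, which is NaN in Java, and the NaN then poisons the merged result. `VarianceAccumulator` and its methods below are hypothetical names for illustration, not the actual `VarianceTuple` code.

```java
/** Hypothetical accumulator for illustration; this is not Pinot's VarianceTuple. */
final class VarianceAccumulator {
  final long count;
  final double sum;
  final double m2; // sum of squared deviations from the mean

  VarianceAccumulator(long count, double sum, double m2) {
    this.count = count;
    this.sum = sum;
    this.m2 = m2;
  }

  static VarianceAccumulator empty() {
    return new VarianceAccumulator(0, 0.0, 0.0);
  }

  /** Pairwise merge with no empty check: 0.0 / 0 yields NaN when either side is empty. */
  VarianceAccumulator mergeUnguarded(VarianceAccumulator other) {
    double delta = other.sum / other.count - sum / count;
    long n = count + other.count;
    double cross = delta * delta * count * other.count / n;
    return new VarianceAccumulator(n, sum + other.sum, m2 + other.m2 + cross);
  }

  /** Same merge, but an empty side simply yields the other side unchanged. */
  VarianceAccumulator merge(VarianceAccumulator other) {
    if (count == 0) {
      return other;
    }
    if (other.count == 0) {
      return this;
    }
    return mergeUnguarded(other);
  }

  public static void main(String[] args) {
    VarianceAccumulator data = new VarianceAccumulator(4, 10.0, 5.0); // values 1, 2, 3, 4
    System.out.println(empty().mergeUnguarded(data).m2); // NaN leaks into the result
    System.out.println(empty().merge(data).m2);          // guarded merge keeps the data intact
  }
}
```

A test over distinct segments, some of which match no rows, is the kind of case that would surface this; four identical non-empty segments would not.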

Contributor

LGTM, thanks!

* Sample Variances" by Chan et al. Please refer to the "Parallel Algorithm" section from the following wiki:
* - https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
*/
public class VarianceAggregationFunction extends BaseSingleInputAggregationFunction<VarianceTuple, Double> {
Contributor

The majority of this code is shared with the Covariance agg. Can we abstract out some common utilities?

Contributor Author

Interestingly, they look very similar but the implementations of all functions are slightly different due to the algorithm difference. Also, we use a different intermediate class VarianceTuple instead of CovarianceTuple.

Contributor Author

I extracted a common function that can be shared and put it in a util class. Please take a look.

This PR adds `VAR_POP()`, `VAR_SAMP()`, `STDDEV_POP()`,
`STDDEV_SAMP()` aggregation functions.
@snleee snleee force-pushed the add-variance-function branch 2 times, most recently from bf2e529 to 01c4be6 Compare December 7, 2022 07:03
Contributor

@walterddr walterddr left a comment

lgtm overall. will let @Jackie-Jiang take another pass.

refactoring of taking out common utils can be done separately IMO.

@snleee snleee merged commit 852477b into master Dec 8, 2022
@snleee snleee deleted the add-variance-function branch December 8, 2022 20:51
@snleee snleee restored the add-variance-function branch December 8, 2022 22:31
@snleee snleee deleted the add-variance-function branch December 8, 2022 23:15
}

@Test
public void testStandardDeviationAggregationOnly() {
Contributor

Should be fixed with #9948

5 participants