Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add geometric_mean Presto aggregate function #6678

Conversation

xumingming
Copy link
Contributor

@xumingming xumingming commented Sep 21, 2023

Fixes #6666

@netlify
Copy link

netlify bot commented Sep 21, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 25ee1e7
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/651193534938cb000853a4ec

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 21, 2023
@xumingming xumingming force-pushed the implement-presto-geometric-mean branch from 15cfc86 to 425d45f Compare September 21, 2023 14:46
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great % a few small comments.

@kagamiori Wei, would you also take a look?

@@ -82,6 +82,12 @@ General Aggregate Functions
each input value occurs. Supports integral, floating-point,
boolean, timestamp, and date input types.

.. function:: geometric_mean(x) -> double

Returns the geometric mean of all input values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, turn "geometric mean" into a link to https://en.wikipedia.org/wiki/Geometric_mean to allow folks who are not familiar with this term to quickly learn about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -298,7 +298,7 @@ Here is a list of all scalar and aggregate Presto functions with functions that
:func:`array_sum` flatten_geometry_collections localtimestamp :func:`sign` :func:`timezone_minute` :func:`entropy`
:func:`array_union` :func:`floor` :func:`log10` simplify_geometry :func:`to_base` evaluate_classifier_predictions
:func:`arrays_overlap` fnv1_32 :func:`log2` :func:`sin` :func:`to_base64` :func:`every`
:func:`asin` fnv1_64 :func:`lower` :func:`slice` :func:`to_base64url` geometric_mean
:func:`asin` fnv1_64 :func:`lower` :func:`slice` :func:`to_base64url` :func:`geometric_mean`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not update coverage maps in this PR. These need to be updated in a separate PR. Also, make sure to auto-generate the file, not edit by hand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -57,10 +58,10 @@ target_link_libraries(
velox_functions_util
Folly::folly)

if(${VELOX_BUILD_TESTING})
if (${VELOX_BUILD_TESTING})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

*/

#include <cmath>
#include "velox/exec/Aggregate.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this include needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimpleAggregateAdapter.h already includes Aggregate.h. So I think we don't need to include Aggregate.h again here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


using OutputType = double;

static bool toIntermediate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kagamiori Wei, can we allow void toIntermediate method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can add support for void toIntermediate(...). The adapter will assume this method always return true.

SimpleAggregateAdapter<GeometricMeanAggregate<double>>>(
resultType);
default:
VELOX_FAIL(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_USER_FAIL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

default:
VELOX_FAIL(
"Unsupported result type for final aggregation: {}",
resultType->kindName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resultType->toString()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

inputType->kindName());
}
} else {
switch (resultType->kind()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For intermediate agg, resultType will be ROW. Can you make sure to handle this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


TEST_F(GeometricMeanTest, globalEmpty) {
auto data = makeRowVector({
makeFlatVector<int64_t>(std::vector<int64_t>{}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makeFlatVector<int64_t> -> makeFlatVector

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

static bool toIntermediate(
exec::out_type<Row<double, int64_t>>& out,
exec::arg_type<T> in) {
out.copy_from(std::make_tuple(static_cast<double>(in), 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make sure this code path is covered with tests?

Please, also run AggregationFuzzer and make sure this function is covered well. Use --only geometric_mean --duration_sec 1800

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to test companion functions with AggregationFuzzer since you registered companion functions altogether. i.e., use --only "geometric_mean,geometric_mean_partial,geometric_mean_merge,geometric_mean_merge_extract" --duration_sec 1800

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions here:

  1. How to we test toIntermediate individually? Is there examples I can take a look at?
  2. I tried the AggregationFuzzer, it fails with the following error message:
1 extra rows, 1 missing rows
1 of extra rows:
	null | null | null | [-2.6020417535176192,4]

1 of missing rows:
	null | null | null | [-2.602041753517619,4]

It seems fails with precision issue? is this expected for floating type functions? Or something wrong with my implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to we test toIntermediate individually? Is there examples I can take a look at?

AggregationTestBase::testAggregations has logic to test this. See velox/functions/lib/aggregates/tests/AggregationTestBase.cpp

SCOPED_TRACE("Run partial + final with abandon partial agg");

However, this logic applies only if input has 2 or more vectors. Hence, to trigger that logic pass 2 or more vectors to testAggregations.

It seems fails with precision issue?

Fuzzer uses assertEqualResults which has logic to compare with epsilon. Maybe this logic doesn't work for some reason. Would you, please, investigate?

// Compare actualRows with expectedRows and return whether they match. Compare
// actualRows and expectedRows with epsilon if needed and allowed. Otherwise,
// compare their values directly. The underlying assumption is that aggregation
// results can be sorted by unique keys and floating-point values in them are
// computed in different ways and hence require epsilon comparison. For results
// of other operations, floating-point values are likely copied from inputs and
// hence can be compared directly.
bool assertEqualResults(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure with precision issue is probably due to this: #4481.

To confirm that toIntermediate is tested, you can write a test case as @mbasmanova suggested and set a breakpoint in toIntermediate to make sure the breakpoint gets hit when you run the test.

*/

#include <cmath>
#include "velox/exec/Aggregate.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimpleAggregateAdapter.h already includes Aggregate.h. So I think we don't need to include Aggregate.h again here.


using OutputType = double;

static bool toIntermediate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can add support for void toIntermediate(...). The adapter will assume this method always return true.


void addInput(HashStringAllocator* /*allocator*/, exec::arg_type<T> data) {
logSum_ += std::log(data);
count_ = checkedPlus<int64_t>(count_, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine() would need checkedPlus because the merge companion function (if registered) can receive arbitrarily large count_ values. I think it would be more consistent using checkedPlus here too.

SimpleAggregateAdapter<GeometricMeanAggregate<double>>>(
resultType);
default:
VELOX_FAIL(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova, Looks like some of our existing UDAFs use VELOX_FAIL here and some others use VELOX_USER_FAIL here. We should make them consistent. Should it throw VeloxUserError here since signature binding should fail first if it's end users giving arguments of incorrect type?

static bool toIntermediate(
exec::out_type<Row<double, int64_t>>& out,
exec::arg_type<T> in) {
out.copy_from(std::make_tuple(static_cast<double>(in), 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to test companion functions with AggregationFuzzer since you registered companion functions altogether. i.e., use --only "geometric_mean,geometric_mean_partial,geometric_mean_merge,geometric_mean_merge_extract" --duration_sec 1800


testAggregations({data}, {"c0"}, {"geometric_mean(c1)"}, {expected});
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to also add a test case that compares the aggregation result with DuckDB geometric_mean? See examples in velox/functions/prestosql/aggregates/tests/ArrayAggTest.cpp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version of DuckDB seems does not have geometric_mean yet:

Function:verifyDuckDBResult, Expression: result->success DuckDB query failed: Catalog Error: Scalar Function with name geometric_mean does not exist!
Did you mean "median"?
LINE 1: SELECT geometric_mean(c0) FROM tmp
               ^
SELECT geometric_mean(c0) FROM tmp, Source: RUNTIME, ErrorCode: INVALID_STATE

And I checked with DuckDB's source code, seems geometric_mean is added recently: https://github.com/duckdb/duckdb/blame/239f51293c429168774c3943e96ddf2451253a07/src/catalog/default/default_functions.cpp#L102

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kagamiori do I need to upgrade DuckDB to add the test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrading is not possible. There is a GitHub issue about that with details. Look it up. We are working on changing AggregationFuzzer to use Presto as source of truth: #6595 Once this is done we'll be able to verify all aggregate functions.

@xumingming
Copy link
Contributor Author

AggregationFuzzer reports:

Function:combine, Expression: other.at<1>().has_value() , Source: RUNTIME, ErrorCode: INVALID_STATE
libc++abi: terminating with uncaught exception of type facebook::velox::VeloxRuntimeError: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Retriable: False
Expression: other.at<1>().has_value()
Function: combine
File: /Users/abei/Code/velox_commmunity/velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp
Line: 64

Which corresponding to this check:

// Both field of an intermediate result should be non-null because
// writeIntermediateResult() never make an intermediate result with a
// single null.
VELOX_CHECK(other.at<0>().has_value());
VELOX_CHECK(other.at<1>().has_value());

Which I learned from SimpleAverageAggregate.cpp, the assumption does not hold?

@kagamiori
Copy link
Contributor

AggregationFuzzer reports:

Function:combine, Expression: other.at<1>().has_value() , Source: RUNTIME, ErrorCode: INVALID_STATE
libc++abi: terminating with uncaught exception of type facebook::velox::VeloxRuntimeError: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Retriable: False
Expression: other.at<1>().has_value()
Function: combine
File: /Users/abei/Code/velox_commmunity/velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp
Line: 64

Which corresponding to this check:

// Both field of an intermediate result should be non-null because
// writeIntermediateResult() never make an intermediate result with a
// single null.
VELOX_CHECK(other.at<0>().has_value());
VELOX_CHECK(other.at<1>().has_value());

Which I learned from SimpleAverageAggregate.cpp, the assumption does not hold?

Hi @xumingming, did this failure happen on geometric_mean itself or on its companion functions (e.g., geometric_mean_merge or geometric_mean_merge_extract)? I would assume the latter. If that's the case, it's possible that fuzzer creates random input data that contain null fields to the companion functions. (The check here expects both fields to be non-null because for geometric_mean itself, combine() consumes intermediate states produced by extractAccumulators() and extractAccumulators() doesn't produce null fields. But this is not the case for geometric_mean_merge and geometric_mean_merge_extract whose input doesn't have to be from extractAccumulators(). SimpleAverageAggregate didn't have the issue with fuzzer because its companion functions were not tested in fuzzer.)

The Aggregate class has a member Aggregate::validateIntermediateInputs_ to differentiate when the function needs to check incoming intermediate states, but this flag is not accessible from the simple function interface yet. To unblock yourself for now, you can change VELOX_CHECK to VELOX_USER_CHECK in combine().

@xumingming
Copy link
Contributor Author

After these changes:

  • Change VELOX_FAIL to VELOX_USER_FAIL to make fuzzer happy.

I re-run the Aggregation Fuzzer, result:

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean" --duration_sec 1800
I0923 10:21:54.069901 3168036 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean_merge_extract_double" --duration_sec 1800
I0923 09:48:07.440472 2632530 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean_merge" --duration_sec 1800
I0923 10:56:05.960763 3538955 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean_partial" --duration_sec 1800
Failed
Expected 938, got 938
2 extra rows, 2 missing rows
2 of extra rows:
	null | null | 2554184283254315974 | [213.98455794385993,5]
	1243795782953767913 | null | null | [253.0308178339107,6]

2 of missing rows:
	null | null | 2554184283254315974 | [213.9845579438599,5]
	1243795782953767913 | null | null | [253.03081783391067,6]

And as @kagamiori said geometric_mean_partial due to known issue: #4481

}
}
},
true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to false to skip registering companion functions; this should help avoid fuzzer failures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

void combine(
HashStringAllocator* /*allocator*/,
exec::arg_type<Row<double, int64_t>> other) {
// Use VELOX_USER_CHECK here to make aggregation fuzzer happy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment and change registration code to not register companion functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming Looks great % a few minor comments.

const TypePtr& resultType,
const core::QueryConfig& /*config*/)
-> std::unique_ptr<exec::Aggregate> {
VELOX_CHECK_EQ(argTypes.size(), 1, "{} takes one argument", name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

USER check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

TEST_F(GeometricMeanTest, groupByTwoPhases) {
// Use two data vectors to test two-phase agg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groupByTwoPhases

Usage of "two phases" term here is confusing. testAggregation runs lots of plans including two phase plan partial + final.

Here, you are running aggregation on multiple batches / vectors of inputs. Hence, let's rename to groupByMultipleBatches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


TEST_F(GeometricMeanTest, groupByTwoPhases) {
// Use two data vectors to test two-phase agg
auto data1 = makeRowVector({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data1, data2 names are an anti-pattern. Consider

std::vector<RowVectorPtr> data = {
   makeRowVector(...),
   makeRowVector(...),
   makeRowVector(...),
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

[](auto row) {
double logSum = 0;
int64_t count = 0;
for (int32_t i = 0; i < 10; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These computations repeat multiple times. Consider refactoring:

makeFlatVector<double>(
          10,
          [](auto row) { 
                return geometricMean(10, [] (auto row) { return row * 10 + i}; 
          }),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@xumingming
Copy link
Contributor Author

By the way, one question about the new SimpleAggregateAdapter, is this adapter suitable for all aggregate functions? some aggs need some dedicated data structure such as heap to implement, don't whether it is suitable to implement using SimpleAggregateAdapter?

@mbasmanova
Copy link
Contributor

By the way, one question about the new SimpleAggregateAdapter, is this adapter suitable for all aggregate functions? some aggs need some dedicated data structure such as heap to implement, don't whether it is suitable to implement using SimpleAggregateAdapter?

Yes, it should be possible to write all aggregate functions this way (except lambda aggregate functions, e.g. reduce_agg https://facebookincubator.github.io/velox/functions/presto/aggregate.html#reduce_agg).

That said, this is a new framework and there might be some rough edges or bugs, but we are committed to improve it as needed.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. One last ask. Would you run the AggregationFuzzer and share (1) command like you used to run it; (2) the results (a few last lines of the output starting at "Total functions tested:..."). Make sure to run the fuzzer for at least 45 min. Thanks.

@xumingming
Copy link
Contributor Author

I run the AggregationFuzzer, result is the following:

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean" --duration_sec 3600
I0924 10:36:58.475366 2070634 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

@mbasmanova
Copy link
Contributor

@xumingming Oh... I don't see any results here. Can you re-run with --logtostderr?

@xumingming
Copy link
Contributor Author

The command:

 ./velox/exec/tests/velox_aggregation_fuzzer_test --logtostderr --only "geometric_mean" --duration_sec 3600

The result:

I20230924 18:54:47.324401 2828931 AggregationFuzzer.cpp:780] ==============================> Done with iteration 5585
I20230924 18:54:47.324419 2828931 AggregationFuzzer.cpp:1503] Total functions tested: 1
I20230924 18:54:47.324442 2828931 AggregationFuzzer.cpp:1504] Total masked aggregations: 883 (15.81%)
I20230924 18:54:47.324810 2828931 AggregationFuzzer.cpp:1506] Total global aggregations: 478 (8.56%)
I20230924 18:54:47.324817 2828931 AggregationFuzzer.cpp:1508] Total group-by aggregations: 4040 (72.32%)
I20230924 18:54:47.324823 2828931 AggregationFuzzer.cpp:1510] Total distinct aggregations: 559 (10.01%)
I20230924 18:54:47.324829 2828931 AggregationFuzzer.cpp:1512] Total window expressions: 509 (9.11%)
I20230924 18:54:47.324834 2828931 AggregationFuzzer.cpp:1514] Total aggregations verified against DuckDB: 495 (8.86%)
I20230924 18:54:47.324839 2828931 AggregationFuzzer.cpp:1516] Total failed aggregations: 0 (0.00%)
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kagamiori merged this pull request in bab2e7e.

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit bab2e7ed.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
Summary:
Fixes facebookincubator#6666

Pull Request resolved: facebookincubator#6678

Reviewed By: amitkdutta

Differential Revision: D49580085

Pulled By: kagamiori

fbshipit-source-id: 359acd31e17664ffca0fe48006063748eddb3df1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add geometric_mean Presto aggregate function
4 participants