Add geometric_mean Presto aggregate function #6678

xumingming · 2023-09-21T14:42:46Z

netlify · 2023-09-21T14:42:51Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`25ee1e7`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/651193534938cb000853a4ec

mbasmanova

Looks great % a few small comments.

@kagamiori Wei, would you also take a look?

mbasmanova · 2023-09-21T14:44:18Z

velox/docs/functions/presto/aggregate.rst

@@ -82,6 +82,12 @@ General Aggregate Functions
    each input value occurs. Supports integral, floating-point,
    boolean, timestamp, and date input types.

+.. function:: geometric_mean(x) -> double
+
+    Returns the geometric mean of all input values.


perhaps, turn "geometric mean" into a link to https://en.wikipedia.org/wiki/Geometric_mean to allow folks who are not familiar with this term to quickly learn about it

mbasmanova · 2023-09-21T14:44:50Z

velox/docs/functions/presto/coverage.rst

@@ -298,7 +298,7 @@ Here is a list of all scalar and aggregate Presto functions with functions that
    :func:`array_sum`                         flatten_geometry_collections              localtimestamp                            :func:`sign`                              :func:`timezone_minute`                       :func:`entropy`
    :func:`array_union`                       :func:`floor`                             :func:`log10`                             simplify_geometry                         :func:`to_base`                               evaluate_classifier_predictions
    :func:`arrays_overlap`                    fnv1_32                                   :func:`log2`                              :func:`sin`                               :func:`to_base64`                             :func:`every`
-    :func:`asin`                              fnv1_64                                   :func:`lower`                             :func:`slice`                             :func:`to_base64url`                          geometric_mean
+    :func:`asin`                              fnv1_64                                   :func:`lower`                             :func:`slice`                             :func:`to_base64url`                          :func:`geometric_mean`


Do not update coverage maps in this PR. These need to be updated in a separate PR. Also, make sure to auto-generate the file, not edit by hand.

mbasmanova · 2023-09-21T14:45:07Z

velox/functions/prestosql/aggregates/CMakeLists.txt

@@ -57,10 +58,10 @@ target_link_libraries(
  velox_functions_util
  Folly::folly)

-if(${VELOX_BUILD_TESTING})
+if (${VELOX_BUILD_TESTING})


unrelated changes?

mbasmanova · 2023-09-21T14:45:22Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+ */
+
+#include <cmath>
+#include "velox/exec/Aggregate.h"


Is this include needed?

SimpleAggregateAdapter.h already includes Aggregate.h. So I think we don't need to include Aggregate.h again here.

mbasmanova · 2023-09-21T14:46:01Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+
+  using OutputType = double;
+
+  static bool toIntermediate(


@kagamiori Wei, can we allow void toIntermediate method?

Yeah, we can add support for void toIntermediate(...). The adapter will assume this method always returns true.

mbasmanova · 2023-09-21T14:48:52Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+                  SimpleAggregateAdapter<GeometricMeanAggregate<double>>>(
+                  resultType);
+            default:
+              VELOX_FAIL(


VELOX_USER_FAIL

mbasmanova · 2023-09-21T14:49:04Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+            default:
+              VELOX_FAIL(
+                  "Unsupported result type for final aggregation: {}",
+                  resultType->kindName());


resultType->toString()

mbasmanova · 2023-09-21T14:49:28Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+                  inputType->kindName());
+          }
+        } else {
+          switch (resultType->kind()) {


For intermediate agg, resultType will be ROW. Can you make sure to handle this?

mbasmanova · 2023-09-21T14:49:58Z

velox/functions/prestosql/aggregates/tests/GeometricMeanTest.cpp

+
+TEST_F(GeometricMeanTest, globalEmpty) {
+  auto data = makeRowVector({
+      makeFlatVector<int64_t>(std::vector<int64_t>{}),


makeFlatVector<int64_t> -> makeFlatVector

mbasmanova · 2023-09-21T14:51:20Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+  static bool toIntermediate(
+      exec::out_type<Row<double, int64_t>>& out,
+      exec::arg_type<T> in) {
+    out.copy_from(std::make_tuple(static_cast<double>(in), 1));


Can you make sure this code path is covered with tests?

Please, also run AggregationFuzzer and make sure this function is covered well. Use --only geometric_mean --duration_sec 1800

It would also be good to test companion functions with AggregationFuzzer since you registered companion functions altogether. i.e., use --only "geometric_mean,geometric_mean_partial,geometric_mean_merge,geometric_mean_merge_extract" --duration_sec 1800

Two questions here:

How do we test toIntermediate individually? Are there examples I can take a look at?

I tried the AggregationFuzzer, and it fails with the following error message:

1 extra rows, 1 missing rows
1 of extra rows:
	null | null | null | [-2.6020417535176192,4]

1 of missing rows:
	null | null | null | [-2.602041753517619,4]

It seems to fail because of a precision issue. Is this expected for floating-point functions, or is something wrong with my implementation?

How do we test toIntermediate individually? Are there examples I can take a look at?

AggregationTestBase::testAggregations has logic to test this. See velox/functions/lib/aggregates/tests/AggregationTestBase.cpp

SCOPED_TRACE("Run partial + final with abandon partial agg");

However, this logic applies only if input has 2 or more vectors. Hence, to trigger that logic pass 2 or more vectors to testAggregations.

It seems to fail because of a precision issue?

Fuzzer uses assertEqualResults which has logic to compare with epsilon. Maybe this logic doesn't work for some reason. Would you, please, investigate?

// Compare actualRows with expectedRows and return whether they match. Compare
// actualRows and expectedRows with epsilon if needed and allowed. Otherwise,
// compare their values directly. The underlying assumption is that aggregation
// results can be sorted by unique keys and floating-point values in them are
// computed in different ways and hence require epsilon comparison. For results
// of other operations, floating-point values are likely copied from inputs and
// hence can be compared directly.
bool assertEqualResults(

The failure with precision issue is probably due to this: #4481.

To confirm that toIntermediate is tested, you can write a test case as @mbasmanova suggested and set a breakpoint in toIntermediate to make sure the breakpoint gets hit when you run the test.
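
To illustrate the precision discussion above, here is a minimal standalone sketch, not Velox code. It shows how summing the same doubles in one pass versus in two batches (as a partial + final plan would) can produce results that differ in the last bits, and how a relative-epsilon comparison tolerates that. The function names and the 1e-9 tolerance are assumptions made for illustration; they are not the logic or the threshold used by assertEqualResults.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sums all values left to right in a single pass.
inline double sumSequential(const std::vector<double>& values) {
  double sum = 0;
  for (double v : values) {
    sum += v;
  }
  return sum;
}

// Sums the values in two batches and then adds the two partial sums,
// which is roughly what a partial + final aggregation does.
inline double sumInTwoBatches(
    const std::vector<double>& values,
    std::size_t split) {
  double first = 0;
  double second = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    (i < split ? first : second) += values[i];
  }
  return first + second;
}

// Hypothetical relative-epsilon comparison. The default tolerance is an
// assumption chosen for this sketch only.
inline bool nearlyEqual(double a, double b, double relEps = 1e-9) {
  if (a == b) {
    return true;
  }
  const double scale = std::max(std::fabs(a), std::fabs(b));
  return std::fabs(a - b) <= relEps * scale;
}
```

With this helper, the two log sums from the fuzzer output above (-2.6020417535176192 and -2.602041753517619) compare as equal, while an exact comparison of the two values would not.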

kagamiori · 2023-09-21T20:10:25Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+ */
+
+#include <cmath>
+#include "velox/exec/Aggregate.h"


SimpleAggregateAdapter.h already includes Aggregate.h. So I think we don't need to include Aggregate.h again here.

kagamiori · 2023-09-21T20:25:56Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+
+  using OutputType = double;
+
+  static bool toIntermediate(


Yeah, we can add support for void toIntermediate(...). The adapter will assume this method always returns true.

kagamiori · 2023-09-21T20:42:53Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+
+    void addInput(HashStringAllocator* /*allocator*/, exec::arg_type<T> data) {
+      logSum_ += std::log(data);
+      count_ = checkedPlus<int64_t>(count_, 1);


combine() would need checkedPlus because the merge companion function (if registered) can receive arbitrarily large count_ values. I think it would be more consistent using checkedPlus here too.
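
The accumulator pattern discussed here can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not the code in the PR: `GeoMeanAccumulator` and `checkedAdd` are hypothetical names, and `checkedAdd` only stands in for an overflow-checked addition such as checkedPlus. The geometric mean is computed as exp(logSum / count), and the count is added with an overflow check in both addInput() and combine(), since combine() may receive arbitrarily large counts.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical overflow-checked addition, standing in for a helper such as
// checkedPlus. Throws instead of silently wrapping around.
inline int64_t checkedAdd(int64_t a, int64_t b) {
  if ((b > 0 && a > std::numeric_limits<int64_t>::max() - b) ||
      (b < 0 && a < std::numeric_limits<int64_t>::min() - b)) {
    throw std::overflow_error("integer overflow while adding counts");
  }
  return a + b;
}

// Hypothetical accumulator that tracks the sum of logarithms and the number
// of inputs seen so far.
struct GeoMeanAccumulator {
  double logSum = 0;
  int64_t count = 0;

  // Consumes one raw input value.
  void addInput(double value) {
    logSum += std::log(value);
    count = checkedAdd(count, 1);
  }

  // Merges a partial state. The incoming count is not bounded by the number
  // of rows seen locally, so the checked addition matters most here.
  void combine(double otherLogSum, int64_t otherCount) {
    logSum += otherLogSum;
    count = checkedAdd(count, otherCount);
  }

  // Returns exp(mean of logs). The caller must ensure count > 0.
  double result() const {
    return std::exp(logSum / static_cast<double>(count));
  }
};
```

For example, adding 2.0 and 8.0 gives a result of approximately 4.0, and merging a partial state whose count is close to the int64_t maximum throws rather than wrapping around.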

kagamiori · 2023-09-21T21:00:38Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+                  SimpleAggregateAdapter<GeometricMeanAggregate<double>>>(
+                  resultType);
+            default:
+              VELOX_FAIL(


@mbasmanova, Looks like some of our existing UDAFs use VELOX_FAIL here and some others use VELOX_USER_FAIL here. We should make them consistent. Should it throw VeloxUserError here since signature binding should fail first if it's end users giving arguments of incorrect type?

kagamiori · 2023-09-21T22:23:47Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+  static bool toIntermediate(
+      exec::out_type<Row<double, int64_t>>& out,
+      exec::arg_type<T> in) {
+    out.copy_from(std::make_tuple(static_cast<double>(in), 1));


It would also be good to test companion functions with AggregationFuzzer since you registered companion functions altogether. i.e., use --only "geometric_mean,geometric_mean_partial,geometric_mean_merge,geometric_mean_merge_extract" --duration_sec 1800

kagamiori · 2023-09-21T22:29:18Z

velox/functions/prestosql/aggregates/tests/GeometricMeanTest.cpp

+
+  testAggregations({data}, {"c0"}, {"geometric_mean(c1)"}, {expected});
+}
+


Would it make sense to also add a test case that compares the aggregation result with DuckDB geometric_mean? See examples in velox/functions/prestosql/aggregates/tests/ArrayAggTest.cpp.

The DuckDB version we use does not seem to have geometric_mean yet:

Function:verifyDuckDBResult, Expression: result->success DuckDB query failed: Catalog Error: Scalar Function with name geometric_mean does not exist! Did you mean "median"? LINE 1: SELECT geometric_mean(c0) FROM tmp ^ SELECT geometric_mean(c0) FROM tmp, Source: RUNTIME, ErrorCode: INVALID_STATE

I also checked DuckDB's source code, and it seems geometric_mean was added only recently: https://github.com/duckdb/duckdb/blame/239f51293c429168774c3943e96ddf2451253a07/src/catalog/default/default_functions.cpp#L102

@kagamiori do I need to upgrade DuckDB to add the test?

Upgrading is not possible. There is a GitHub issue about that with details. Look it up. We are working on changing AggregationFuzzer to use Presto as source of truth: #6595 Once this is done we'll be able to verify all aggregate functions.

xumingming · 2023-09-22T05:27:07Z

AggregationFuzzer reports:

Function:combine, Expression: other.at<1>().has_value() , Source: RUNTIME, ErrorCode: INVALID_STATE
libc++abi: terminating with uncaught exception of type facebook::velox::VeloxRuntimeError: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Retriable: False
Expression: other.at<1>().has_value()
Function: combine
File: /Users/abei/Code/velox_commmunity/velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp
Line: 64

This corresponds to the following check:

// Both field of an intermediate result should be non-null because
// writeIntermediateResult() never make an intermediate result with a
// single null.
VELOX_CHECK(other.at<0>().has_value());
VELOX_CHECK(other.at<1>().has_value());

I took this check from SimpleAverageAggregate.cpp. Does the assumption not hold?

kagamiori · 2023-09-22T18:00:00Z

Hi @xumingming, did this failure happen on geometric_mean itself or on its companion functions (e.g., geometric_mean_merge or geometric_mean_merge_extract)? I would assume the latter. If so, the fuzzer may be generating random input for the companion functions that contains null fields. The check expects both fields to be non-null because, for geometric_mean itself, combine() consumes intermediate states produced by extractAccumulators(), which doesn't produce null fields. This is not the case for geometric_mean_merge and geometric_mean_merge_extract, whose input doesn't have to come from extractAccumulators(). (SimpleAverageAggregate didn't hit this issue in the fuzzer because its companion functions were not tested there.)

The Aggregate class has a member Aggregate::validateIntermediateInputs_ to differentiate when the function needs to check incoming intermediate states, but this flag is not accessible from the simple function interface yet. To unblock yourself for now, you can change VELOX_CHECK to VELOX_USER_CHECK in combine().
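
The distinction described above can be sketched in standalone C++. This is only an illustration and does not use Velox types or macros: `PartialState`, `UserError`, and `MergeAccumulator` are hypothetical names. The idea is that an intermediate state arriving from arbitrary input is validated and rejected with a user-facing error, instead of being treated as an internal invariant violation.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Hypothetical intermediate state in which either field may be null when the
// state does not come from the function's own partial aggregation.
struct PartialState {
  std::optional<double> logSum;
  std::optional<int64_t> count;
};

// Hypothetical user-facing error, playing the role of a "user error" as
// opposed to an internal runtime error.
struct UserError : std::invalid_argument {
  using std::invalid_argument::invalid_argument;
};

struct MergeAccumulator {
  double logSum = 0;
  int64_t count = 0;

  // Validates the incoming state before merging it. A null field is reported
  // as bad input rather than as a broken internal assumption.
  void combine(const PartialState& other) {
    if (!other.logSum.has_value() || !other.count.has_value()) {
      throw UserError("intermediate state must not contain null fields");
    }
    logSum += *other.logSum;
    count += *other.count;
  }
};
```

In this sketch the accumulator is left unchanged when validation fails, because the check runs before any field is updated.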

xumingming · 2023-09-23T05:41:01Z

After these changes:

Change VELOX_FAIL to VELOX_USER_FAIL to make the fuzzer happy.

I re-ran the AggregationFuzzer. Results:

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean" --duration_sec 1800
I0923 10:21:54.069901 3168036 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean_merge_extract_double" --duration_sec 1800
I0923 09:48:07.440472 2632530 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean_merge" --duration_sec 1800
I0923 10:56:05.960763 3538955 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean_partial" --duration_sec 1800
Failed
Expected 938, got 938
2 extra rows, 2 missing rows
2 of extra rows:
	null | null | 2554184283254315974 | [213.98455794385993,5]
	1243795782953767913 | null | null | [253.0308178339107,6]

2 of missing rows:
	null | null | 2554184283254315974 | [213.9845579438599,5]
	1243795782953767913 | null | null | [253.03081783391067,6]

As @kagamiori said, the geometric_mean_partial failure is due to a known issue: #4481

mbasmanova · 2023-09-23T09:06:28Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+          }
+        }
+      },
+      true);


change this to false to skip registering companion functions; this should help avoid fuzzer failures

mbasmanova · 2023-09-23T09:07:09Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+    void combine(
+        HashStringAllocator* /*allocator*/,
+        exec::arg_type<Row<double, int64_t>> other) {
+      // Use VELOX_USER_CHECK here to make aggregation fuzzer happy.


Remove this comment and change registration code to not register companion functions.

mbasmanova

@xumingming Looks great % a few minor comments.

mbasmanova · 2023-09-23T09:38:03Z

velox/functions/prestosql/aggregates/GeometricMeanAggregate.cpp

+          const TypePtr& resultType,
+          const core::QueryConfig& /*config*/)
+          -> std::unique_ptr<exec::Aggregate> {
+        VELOX_CHECK_EQ(argTypes.size(), 1, "{} takes one argument", name);


mbasmanova · 2023-09-23T09:40:44Z

velox/functions/prestosql/aggregates/tests/GeometricMeanTest.cpp

+}
+
+TEST_F(GeometricMeanTest, groupByTwoPhases) {
+  // Use two data vectors to test two-phase agg


groupByTwoPhases

The term "two phases" is confusing here. testAggregations runs many plans, including a two-phase plan (partial + final).

Here, you are running aggregation on multiple batches / vectors of inputs. Hence, let's rename to groupByMultipleBatches.

mbasmanova · 2023-09-23T09:41:16Z

velox/functions/prestosql/aggregates/tests/GeometricMeanTest.cpp

+
+TEST_F(GeometricMeanTest, groupByTwoPhases) {
+  // Use two data vectors to test two-phase agg
+  auto data1 = makeRowVector({


data1, data2 names are an anti-pattern. Consider

std::vector<RowVectorPtr> data = {
    makeRowVector(...),
    makeRowVector(...),
    makeRowVector(...),
};

mbasmanova · 2023-09-23T09:44:36Z

velox/functions/prestosql/aggregates/tests/GeometricMeanTest.cpp

+          [](auto row) {
+            double logSum = 0;
+            int64_t count = 0;
+            for (int32_t i = 0; i < 10; ++i) {


These computations repeat multiple times. Consider refactoring:

makeFlatVector<double>(
    10,
    [](auto row) {
      return geometricMean(10, [&](auto i) { return row * 10 + i; });
    }),
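
One possible shape for the helper suggested above is sketched below. The signature is an assumption made for illustration (the review only names a `geometricMean` helper and does not define it): it takes the number of values and a callable that returns the i-th value, and computes the expected result as exp of the mean of the logs.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical test helper: returns the geometric mean of the n values
// produced by valueAt(0), ..., valueAt(n - 1). The caller must pass n > 0
// and strictly positive values.
template <typename F>
double geometricMean(int32_t n, F&& valueAt) {
  double logSum = 0;
  for (int32_t i = 0; i < n; ++i) {
    logSum += std::log(static_cast<double>(valueAt(i)));
  }
  return std::exp(logSum / n);
}
```

A test could then build each expected value with one call, for example `geometricMean(10, [&](auto i) { return row * 10 + i + 1; })`, instead of repeating the loop in every lambda. The `+ 1` in this usage example is an assumption that keeps every value positive.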

xumingming · 2023-09-23T15:53:28Z

By the way, one question about the new SimpleAggregateAdapter: is this adapter suitable for all aggregate functions? Some aggregates need a dedicated data structure, such as a heap, and I don't know whether those can be implemented using SimpleAggregateAdapter.

mbasmanova · 2023-09-23T16:11:40Z

Yes, it should be possible to write all aggregate functions this way (except lambda aggregate functions, e.g. reduce_agg https://facebookincubator.github.io/velox/functions/presto/aggregate.html#reduce_agg).

That said, this is a new framework and there might be some rough edges or bugs, but we are committed to improving it as needed.

mbasmanova

Looks great. One last ask. Would you run the AggregationFuzzer and share (1) the command line you used to run it; (2) the results (the last few lines of the output, starting at "Total functions tested:...")? Make sure to run the fuzzer for at least 45 min. Thanks.

xumingming · 2023-09-24T03:40:25Z

I ran the AggregationFuzzer. The result is as follows:

./velox/exec/tests/velox_aggregation_fuzzer_test --only "geometric_mean" --duration_sec 3600
I0924 10:36:58.475366 2070634 Compression.cpp:474] Initialized zstd compressor with compression level 7
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

mbasmanova · 2023-09-24T05:13:37Z

@xumingming Oh... I don't see any results here. Can you re-run with --logtostderr?

xumingming · 2023-09-24T11:05:34Z

The command:

 ./velox/exec/tests/velox_aggregation_fuzzer_test --logtostderr --only "geometric_mean" --duration_sec 3600

The result:

I20230924 18:54:47.324401 2828931 AggregationFuzzer.cpp:780] ==============================> Done with iteration 5585
I20230924 18:54:47.324419 2828931 AggregationFuzzer.cpp:1503] Total functions tested: 1
I20230924 18:54:47.324442 2828931 AggregationFuzzer.cpp:1504] Total masked aggregations: 883 (15.81%)
I20230924 18:54:47.324810 2828931 AggregationFuzzer.cpp:1506] Total global aggregations: 478 (8.56%)
I20230924 18:54:47.324817 2828931 AggregationFuzzer.cpp:1508] Total group-by aggregations: 4040 (72.32%)
I20230924 18:54:47.324823 2828931 AggregationFuzzer.cpp:1510] Total distinct aggregations: 559 (10.01%)
I20230924 18:54:47.324829 2828931 AggregationFuzzer.cpp:1512] Total window expressions: 509 (9.11%)
I20230924 18:54:47.324834 2828931 AggregationFuzzer.cpp:1514] Total aggregations verified against DuckDB: 495 (8.86%)
I20230924 18:54:47.324839 2828931 AggregationFuzzer.cpp:1516] Total failed aggregations: 0 (0.00%)
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

facebook-github-bot · 2023-09-24T18:46:33Z

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-09-25T21:33:16Z

@kagamiori merged this pull request in bab2e7e.

conbench-facebook · 2023-09-25T21:54:04Z

Conbench analyzed the 1 benchmark run on commit bab2e7ed.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Summary: Fixes facebookincubator#6666
Pull Request resolved: facebookincubator#6678
Reviewed By: amitkdutta
Differential Revision: D49580085
Pulled By: kagamiori
fbshipit-source-id: 359acd31e17664ffca0fe48006063748eddb3df1

facebook-github-bot added the CLA Signed label Sep 21, 2023

mbasmanova requested review from laithsakka and kagamiori September 21, 2023 14:43

Add geometric_mean Presto aggregate function

425d45f

xumingming force-pushed the implement-presto-geometric-mean branch from 15cfc86 to 425d45f Compare September 21, 2023 14:46

mbasmanova reviewed Sep 21, 2023

kagamiori reviewed Sep 21, 2023

Update according to review comments

c2fe523

mbasmanova reviewed Sep 23, 2023

Skip registering companion function for geometric_mean

09b743f

mbasmanova reviewed Sep 23, 2023

mbasmanova approved these changes Sep 23, 2023

Update according to comments

25ee1e7

xumingming force-pushed the implement-presto-geometric-mean branch from caf9ca1 to 25ee1e7 Compare September 25, 2023 14:04

facebook-github-bot closed this in bab2e7e Sep 25, 2023

facebook-github-bot added the Merged label Sep 25, 2023

