-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add geometric_mean Presto aggregate function #6678
Add geometric_mean Presto aggregate function #6678
Conversation
✅ Deploy Preview for meta-velox canceled.
|
15cfc86
to
425d45f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great % a few small comments.
@kagamiori Wei, would you also take a look?
@@ -82,6 +82,12 @@ General Aggregate Functions | |||
each input value occurs. Supports integral, floating-point, | |||
boolean, timestamp, and date input types. | |||
|
|||
.. function:: geometric_mean(x) -> double | |||
|
|||
Returns the geometric mean of all input values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, turn "geometric mean" into a link to https://en.wikipedia.org/wiki/Geometric_mean to allow folks who are not familiar with this term to quickly learn about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -298,7 +298,7 @@ Here is a list of all scalar and aggregate Presto functions with functions that | |||
:func:`array_sum` flatten_geometry_collections localtimestamp :func:`sign` :func:`timezone_minute` :func:`entropy` | |||
:func:`array_union` :func:`floor` :func:`log10` simplify_geometry :func:`to_base` evaluate_classifier_predictions | |||
:func:`arrays_overlap` fnv1_32 :func:`log2` :func:`sin` :func:`to_base64` :func:`every` | |||
:func:`asin` fnv1_64 :func:`lower` :func:`slice` :func:`to_base64url` geometric_mean | |||
:func:`asin` fnv1_64 :func:`lower` :func:`slice` :func:`to_base64url` :func:`geometric_mean` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not update coverage maps in this PR. These need to be updated in a separate PR. Also, make sure to auto-generate the file, not edit by hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -57,10 +58,10 @@ target_link_libraries( | |||
velox_functions_util | |||
Folly::folly) | |||
|
|||
if(${VELOX_BUILD_TESTING}) | |||
if (${VELOX_BUILD_TESTING}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
*/ | ||
|
||
#include <cmath> | ||
#include "velox/exec/Aggregate.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this include needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleAggregateAdapter.h already includes Aggregate.h. So I think we don't need to include Aggregate.h again here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
|
||
using OutputType = double; | ||
|
||
static bool toIntermediate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori Wei, can we allow void toIntermediate method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can add support for void toIntermediate(...)
. The adapter will assume this method always return true.
SimpleAggregateAdapter<GeometricMeanAggregate<double>>>( | ||
resultType); | ||
default: | ||
VELOX_FAIL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VELOX_USER_FAIL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
default: | ||
VELOX_FAIL( | ||
"Unsupported result type for final aggregation: {}", | ||
resultType->kindName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resultType->toString()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
inputType->kindName()); | ||
} | ||
} else { | ||
switch (resultType->kind()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For intermediate agg, resultType will be ROW. Can you make sure to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
TEST_F(GeometricMeanTest, globalEmpty) { | ||
auto data = makeRowVector({ | ||
makeFlatVector<int64_t>(std::vector<int64_t>{}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makeFlatVector<int64_t> -> makeFlatVector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
static bool toIntermediate( | ||
exec::out_type<Row<double, int64_t>>& out, | ||
exec::arg_type<T> in) { | ||
out.copy_from(std::make_tuple(static_cast<double>(in), 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make sure this code path is covered with tests?
Please, also run AggregationFuzzer and make sure this function is covered well. Use --only geometric_mean --duration_sec 1800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to test companion functions with AggregationFuzzer since you registered companion functions altogether. i.e., use --only "geometric_mean,geometric_mean_partial,geometric_mean_merge,geometric_mean_merge_extract" --duration_sec 1800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions here:
- How to we test
toIntermediate
individually? Is there examples I can take a look at? - I tried the AggregationFuzzer, it fails with the following error message:
1 extra rows, 1 missing rows
1 of extra rows:
null | null | null | [-2.6020417535176192,4]
1 of missing rows:
null | null | null | [-2.602041753517619,4]
It seems fails with precision issue? is this expected for floating type functions? Or something wrong with my implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to we test toIntermediate individually? Is there examples I can take a look at?
AggregationTestBase::testAggregations has logic to test this. See velox/functions/lib/aggregates/tests/AggregationTestBase.cpp
SCOPED_TRACE("Run partial + final with abandon partial agg");
However, this logic applies only if input has 2 or more vectors. Hence, to trigger that logic pass 2 or more vectors to testAggregations.
It seems fails with precision issue?
Fuzzer uses assertEqualResults which has logic to compare with epsilon. Maybe this logic doesn't work for some reason. Would you, please, investigate?
// Compare actualRows with expectedRows and return whether they match. Compare
// actualRows and expectedRows with epsilon if needed and allowed. Otherwise,
// compare their values directly. The underlying assumption is that aggregation
// results can be sorted by unique keys and floating-point values in them are
// computed in different ways and hence require epsilon comparison. For results
// of other operations, floating-point values are likely copied from inputs and
// hence can be compared directly.
bool assertEqualResults(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failure with precision issue is probably due to this: #4481.
To confirm that toIntermediate is tested, you can write a test case as @mbasmanova suggested and set a breakpoint in toIntermediate to make sure the breakpoint gets hit when you run the test.
*/ | ||
|
||
#include <cmath> | ||
#include "velox/exec/Aggregate.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleAggregateAdapter.h already includes Aggregate.h. So I think we don't need to include Aggregate.h again here.
|
||
using OutputType = double; | ||
|
||
static bool toIntermediate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can add support for void toIntermediate(...)
. The adapter will assume this method always return true.
|
||
void addInput(HashStringAllocator* /*allocator*/, exec::arg_type<T> data) { | ||
logSum_ += std::log(data); | ||
count_ = checkedPlus<int64_t>(count_, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combine() would need checkedPlus because the merge companion function (if registered) can receive arbitrarily large count_
values. I think it would be more consistent using checkedPlus here too.
SimpleAggregateAdapter<GeometricMeanAggregate<double>>>( | ||
resultType); | ||
default: | ||
VELOX_FAIL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova, Looks like some of our existing UDAFs use VELOX_FAIL here and some others use VELOX_USER_FAIL here. We should make them consistent. Should it throw VeloxUserError here since signature binding should fail first if it's end users giving arguments of incorrect type?
static bool toIntermediate( | ||
exec::out_type<Row<double, int64_t>>& out, | ||
exec::arg_type<T> in) { | ||
out.copy_from(std::make_tuple(static_cast<double>(in), 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to test companion functions with AggregationFuzzer since you registered companion functions altogether. i.e., use --only "geometric_mean,geometric_mean_partial,geometric_mean_merge,geometric_mean_merge_extract" --duration_sec 1800
|
||
testAggregations({data}, {"c0"}, {"geometric_mean(c1)"}, {expected}); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also add a test case that compares the aggregation result with DuckDB geometric_mean? See examples in velox/functions/prestosql/aggregates/tests/ArrayAggTest.cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version of DuckDB seems does not have geometric_mean
yet:
Function:verifyDuckDBResult, Expression: result->success DuckDB query failed: Catalog Error: Scalar Function with name geometric_mean does not exist!
Did you mean "median"?
LINE 1: SELECT geometric_mean(c0) FROM tmp
^
SELECT geometric_mean(c0) FROM tmp, Source: RUNTIME, ErrorCode: INVALID_STATE
And I checked with DuckDB's source code, seems geometric_mean is added recently: https://github.com/duckdb/duckdb/blame/239f51293c429168774c3943e96ddf2451253a07/src/catalog/default/default_functions.cpp#L102
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori do I need to upgrade DuckDB to add the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrading is not possible. There is a GitHub issue about that with details. Look it up. We are working on changing AggregationFuzzer to use Presto as source of truth: #6595 Once this is done we'll be able to verify all aggregate functions.
AggregationFuzzer reports:
Which corresponding to this check:
Which I learned from |
Hi @xumingming, did this failure happen on geometric_mean itself or on its companion functions (e.g., geometric_mean_merge or geometric_mean_merge_extract)? I would assume the latter. If that's the case, it's possible that fuzzer creates random input data that contain null fields to the companion functions. (The check here expects both fields to be non-null because for geometric_mean itself, combine() consumes intermediate states produced by extractAccumulators() and extractAccumulators() doesn't produce null fields. But this is not the case for geometric_mean_merge and geometric_mean_merge_extract whose input doesn't have to be from extractAccumulators(). SimpleAverageAggregate didn't have the issue with fuzzer because its companion functions were not tested in fuzzer.) The |
After these changes:
I re-run the Aggregation Fuzzer, result:
And as @kagamiori said geometric_mean_partial due to known issue: #4481 |
} | ||
} | ||
}, | ||
true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change this to false to skip registering companion functions; this should help avoid fuzzer failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
void combine( | ||
HashStringAllocator* /*allocator*/, | ||
exec::arg_type<Row<double, int64_t>> other) { | ||
// Use VELOX_USER_CHECK here to make aggregation fuzzer happy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this comment and change registration code to not register companion functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xumingming Looks great % a few minor comments.
const TypePtr& resultType, | ||
const core::QueryConfig& /*config*/) | ||
-> std::unique_ptr<exec::Aggregate> { | ||
VELOX_CHECK_EQ(argTypes.size(), 1, "{} takes one argument", name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
USER check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
TEST_F(GeometricMeanTest, groupByTwoPhases) { | ||
// Use two data vectors to test two-phase agg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupByTwoPhases
Usage of "two phases" term here is confusing. testAggregation runs lots of plans including two phase plan partial + final.
Here, you are running aggregation on multiple batches / vectors of inputs. Hence, let's rename to groupByMultipleBatches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
TEST_F(GeometricMeanTest, groupByTwoPhases) { | ||
// Use two data vectors to test two-phase agg | ||
auto data1 = makeRowVector({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data1, data2 names are an anti-pattern. Consider
std::vector<RowVectorPtr> data = {
makeRowVector(...),
makeRowVector(...),
makeRowVector(...),
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
[](auto row) { | ||
double logSum = 0; | ||
int64_t count = 0; | ||
for (int32_t i = 0; i < 10; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These computations repeat multiple times. Consider refactoring:
makeFlatVector<double>(
10,
[](auto row) {
return geometricMean(10, [] (auto row) { return row * 10 + i};
}),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
By the way, one question about the new SimpleAggregateAdapter, is this adapter suitable for all aggregate functions? some aggs need some dedicated data structure such as heap to implement, don't whether it is suitable to implement using SimpleAggregateAdapter? |
Yes, it should be possible to write all aggregate functions this way (except lambda aggregate functions, e.g. reduce_agg https://facebookincubator.github.io/velox/functions/presto/aggregate.html#reduce_agg). That said, this is a new framework and there might be some rough edges or bugs, but we are committed to improve it as needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. One last ask. Would you run the AggregationFuzzer and share (1) command like you used to run it; (2) the results (a few last lines of the output starting at "Total functions tested:..."). Make sure to run the fuzzer for at least 45 min. Thanks.
I run the AggregationFuzzer, result is the following:
|
@xumingming Oh... I don't see any results here. Can you re-run with --logtostderr? |
The command:
The result:
|
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
caf9ca1
to
25ee1e7
Compare
@kagamiori merged this pull request in bab2e7e. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: Fixes facebookincubator#6666 Pull Request resolved: facebookincubator#6678 Reviewed By: amitkdutta Differential Revision: D49580085 Pulled By: kagamiori fbshipit-source-id: 359acd31e17664ffca0fe48006063748eddb3df1
Fixes #6666