[RFC] Introduce uniqCombined64() to get sane results for cardinality > UINT_MAX #7222

Merged
merged 4 commits into from
Oct 9, 2019
65 changes: 38 additions & 27 deletions dbms/src/AggregateFunctions/AggregateFunctionUniqCombined.cpp
@@ -6,6 +6,8 @@
#include <DataTypes/DataTypeDate.h>
#include <DataTypes/DataTypeDateTime.h>

#include <functional>

namespace DB
{
namespace ErrorCodes
@@ -17,17 +19,17 @@ namespace ErrorCodes

namespace
{
template <UInt8 K>
template <UInt8 K, typename HashValueType>
struct WithK
{
template <typename T>
using AggregateFunction = AggregateFunctionUniqCombined<T, K>;
using AggregateFunction = AggregateFunctionUniqCombined<T, K, HashValueType>;

template <bool is_exact, bool argument_is_tuple>
using AggregateFunctionVariadic = AggregateFunctionUniqCombinedVariadic<is_exact, argument_is_tuple, K>;
using AggregateFunctionVariadic = AggregateFunctionUniqCombinedVariadic<is_exact, argument_is_tuple, K, HashValueType>;
};

template <UInt8 K>
template <UInt8 K, typename HashValueType>
AggregateFunctionPtr createAggregateFunctionWithK(const DataTypes & argument_types, const Array & params)
{
/// We use exact hash function if the arguments are not contiguous in memory, because only exact hash function has support for this case.
@@ -37,36 +39,45 @@ namespace
{
const IDataType & argument_type = *argument_types[0];

AggregateFunctionPtr res(createWithNumericType<WithK<K>::template AggregateFunction>(*argument_types[0], argument_types, params));
AggregateFunctionPtr res(createWithNumericType<WithK<K, HashValueType>::template AggregateFunction>(*argument_types[0], argument_types, params));

WhichDataType which(argument_type);
if (res)
return res;
else if (which.isDate())
return std::make_shared<typename WithK<K>::template AggregateFunction<DataTypeDate::FieldType>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunction<DataTypeDate::FieldType>>(argument_types, params);
else if (which.isDateTime())
return std::make_shared<typename WithK<K>::template AggregateFunction<DataTypeDateTime::FieldType>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunction<DataTypeDateTime::FieldType>>(argument_types, params);
else if (which.isStringOrFixedString())
return std::make_shared<typename WithK<K>::template AggregateFunction<String>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunction<String>>(argument_types, params);
else if (which.isUUID())
return std::make_shared<typename WithK<K>::template AggregateFunction<DataTypeUUID::FieldType>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunction<DataTypeUUID::FieldType>>(argument_types, params);
else if (which.isTuple())
{
if (use_exact_hash_function)
return std::make_shared<typename WithK<K>::template AggregateFunctionVariadic<true, true>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunctionVariadic<true, true>>(argument_types, params);
else
return std::make_shared<typename WithK<K>::template AggregateFunctionVariadic<false, true>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunctionVariadic<false, true>>(argument_types, params);
}
}

/// "Variadic" method also works as a fallback generic case for a single argument.
if (use_exact_hash_function)
return std::make_shared<typename WithK<K>::template AggregateFunctionVariadic<true, false>>(argument_types, params);
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunctionVariadic<true, false>>(argument_types, params);
else
return std::make_shared<typename WithK<K, HashValueType>::template AggregateFunctionVariadic<false, false>>(argument_types, params);
}

template <UInt8 K>
AggregateFunctionPtr createAggregateFunctionWithHashType(bool use_64_bit_hash, const DataTypes & argument_types, const Array & params)
{
if (use_64_bit_hash)
return createAggregateFunctionWithK<K, UInt64>(argument_types, params);
else
return std::make_shared<typename WithK<K>::template AggregateFunctionVariadic<false, false>>(argument_types, params);
return createAggregateFunctionWithK<K, UInt32>(argument_types, params);
}

AggregateFunctionPtr createAggregateFunctionUniqCombined(
AggregateFunctionPtr createAggregateFunctionUniqCombined(bool use_64_bit_hash,
const std::string & name, const DataTypes & argument_types, const Array & params)
{
/// log2 of the number of cells in HyperLogLog.
@@ -80,12 +91,10 @@ namespace
"Aggregate function " + name + " requires one parameter or less.", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);

UInt64 precision_param = applyVisitor(FieldVisitorConvertToNumber<UInt64>(), params[0]);

// This range is hardcoded below
if (precision_param > 20 || precision_param < 12)
throw Exception(
"Parameter for aggregate function " + name + "is out or range: [12, 20].", ErrorCodes::ARGUMENT_OUT_OF_BOUND);

"Parameter for aggregate function " + name + " is out of range: [12, 20].", ErrorCodes::ARGUMENT_OUT_OF_BOUND);
👍

precision = precision_param;
}

@@ -95,23 +104,23 @@ namespace
switch (precision)
{
case 12:
return createAggregateFunctionWithK<12>(argument_types, params);
return createAggregateFunctionWithHashType<12>(use_64_bit_hash, argument_types, params);
case 13:
return createAggregateFunctionWithK<13>(argument_types, params);
return createAggregateFunctionWithHashType<13>(use_64_bit_hash, argument_types, params);
case 14:
return createAggregateFunctionWithK<14>(argument_types, params);
return createAggregateFunctionWithHashType<14>(use_64_bit_hash, argument_types, params);
case 15:
return createAggregateFunctionWithK<15>(argument_types, params);
return createAggregateFunctionWithHashType<15>(use_64_bit_hash, argument_types, params);
case 16:
return createAggregateFunctionWithK<16>(argument_types, params);
return createAggregateFunctionWithHashType<16>(use_64_bit_hash, argument_types, params);
case 17:
return createAggregateFunctionWithK<17>(argument_types, params);
return createAggregateFunctionWithHashType<17>(use_64_bit_hash, argument_types, params);
case 18:
return createAggregateFunctionWithK<18>(argument_types, params);
return createAggregateFunctionWithHashType<18>(use_64_bit_hash, argument_types, params);
case 19:
return createAggregateFunctionWithK<19>(argument_types, params);
return createAggregateFunctionWithHashType<19>(use_64_bit_hash, argument_types, params);
case 20:
return createAggregateFunctionWithK<20>(argument_types, params);
return createAggregateFunctionWithHashType<20>(use_64_bit_hash, argument_types, params);
}

__builtin_unreachable();
@@ -121,7 +130,9 @@ namespace

void registerAggregateFunctionUniqCombined(AggregateFunctionFactory & factory)
{
factory.registerFunction("uniqCombined", createAggregateFunctionUniqCombined);
using namespace std::placeholders;
factory.registerFunction("uniqCombined", std::bind(createAggregateFunctionUniqCombined, false, _1, _2, _3));
factory.registerFunction("uniqCombined64", std::bind(createAggregateFunctionUniqCombined, true, _1, _2, _3));
}

}
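
The registration change above turns a runtime flag into a choice between two template instantiations, then binds that flag per function name. A minimal sketch of the same pattern, assuming a hypothetical `Creator` signature and a `std::map` registry in place of `AggregateFunctionFactory` (none of these names come from the PR):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical creator type: returns the hash width in bits instead of a real aggregate function.
using Creator = std::function<int(const std::string & name)>;

// Stand-in for the templated creation step; HashValueType is fixed at compile time.
template <typename HashValueType>
int createWithHashType(const std::string &)
{
    return static_cast<int>(sizeof(HashValueType) * 8);
}

// The runtime flag selects one of the two compile-time instantiations.
int create(bool use_64_bit_hash, const std::string & name)
{
    if (use_64_bit_hash)
        return createWithHashType<std::uint64_t>(name);
    return createWithHashType<std::uint32_t>(name);
}

// Hypothetical registry keyed by SQL function name; the flag is bound once per name.
std::map<std::string, Creator> makeRegistry()
{
    using namespace std::placeholders;
    std::map<std::string, Creator> registry;
    registry["uniqCombined"] = std::bind(create, false, _1);
    registry["uniqCombined64"] = std::bind(create, true, _1);
    return registry;
}
```

Binding the flag at registration keeps a single creator function for both names; a lambda capturing the flag would work equally well.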
70 changes: 39 additions & 31 deletions dbms/src/AggregateFunctions/AggregateFunctionUniqCombined.h
@@ -24,43 +24,43 @@ namespace DB
{
namespace detail
{
/** Hash function for uniqCombined.
/** Hash function for uniqCombined/uniqCombined64 (based on Ret).
*/
template <typename T>
template <typename T, typename Ret>
struct AggregateFunctionUniqCombinedTraits
{
static UInt32 hash(T x)
static Ret hash(T x)
{
return static_cast<UInt32>(intHash64(x));
return static_cast<Ret>(intHash64(x));
}
};

template <>
struct AggregateFunctionUniqCombinedTraits<UInt128>
template <typename Ret>
struct AggregateFunctionUniqCombinedTraits<UInt128, Ret>
{
static UInt32 hash(UInt128 x)
static Ret hash(UInt128 x)
{
return sipHash64(x);
}
};

template <>
struct AggregateFunctionUniqCombinedTraits<Float32>
template <typename Ret>
struct AggregateFunctionUniqCombinedTraits<Float32, Ret>
{
static UInt32 hash(Float32 x)
static Ret hash(Float32 x)
{
UInt64 res = ext::bit_cast<UInt64>(x);
return static_cast<UInt32>(intHash64(res));
return static_cast<Ret>(intHash64(res));
}
};

template <>
struct AggregateFunctionUniqCombinedTraits<Float64>
template <typename Ret>
struct AggregateFunctionUniqCombinedTraits<Float64, Ret>
{
static UInt32 hash(Float64 x)
static Ret hash(Float64 x)
{
UInt64 res = ext::bit_cast<UInt64>(x);
return static_cast<UInt32>(intHash64(res));
return static_cast<Ret>(intHash64(res));
}
};

@@ -98,29 +98,34 @@ struct AggregateFunctionUniqCombinedDataWithKey<Key, 17>
};


template <typename T, UInt8 K>
struct AggregateFunctionUniqCombinedData : public AggregateFunctionUniqCombinedDataWithKey<UInt32, K>
template <typename T, UInt8 K, typename HashValueType>
struct AggregateFunctionUniqCombinedData : public AggregateFunctionUniqCombinedDataWithKey<HashValueType, K>
{
};


template <UInt8 K>
struct AggregateFunctionUniqCombinedData<String, K> : public AggregateFunctionUniqCombinedDataWithKey<UInt64, K>
/// For String keys, 64 bit hash is always used (both for uniqCombined and uniqCombined64),
/// because of backwards compatibility (64 bit hash was already used for uniqCombined).
template <UInt8 K, typename HashValueType>
struct AggregateFunctionUniqCombinedData<String, K, HashValueType> : public AggregateFunctionUniqCombinedDataWithKey<UInt64 /*always*/, K>
{
};


template <typename T, UInt8 K>
template <typename T, UInt8 K, typename HashValueType>
class AggregateFunctionUniqCombined final
: public IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<T, K>, AggregateFunctionUniqCombined<T, K>>
: public IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<T, K, HashValueType>, AggregateFunctionUniqCombined<T, K, HashValueType>>
{
public:
AggregateFunctionUniqCombined(const DataTypes & argument_types_, const Array & params_)
: IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<T, K>, AggregateFunctionUniqCombined<T, K>>(argument_types_, params_) {}
: IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<T, K, HashValueType>, AggregateFunctionUniqCombined<T, K, HashValueType>>(argument_types_, params_) {}

String getName() const override
{
return "uniqCombined";
if constexpr (std::is_same_v<HashValueType, UInt64>)
return "uniqCombined64";
else
return "uniqCombined";
}

DataTypePtr getReturnType() const override
@@ -133,7 +138,7 @@ class AggregateFunctionUniqCombined final
if constexpr (!std::is_same_v<T, String>)
{
const auto & value = assert_cast<const ColumnVector<T> &>(*columns[0]).getElement(row_num);
this->data(place).set.insert(detail::AggregateFunctionUniqCombinedTraits<T>::hash(value));
this->data(place).set.insert(detail::AggregateFunctionUniqCombinedTraits<T, HashValueType>::hash(value));
}
else
{
@@ -172,17 +177,17 @@ class AggregateFunctionUniqCombined final
* You can pass multiple arguments as is; You can also pass one argument - a tuple.
* But (for the possibility of efficient implementation), you can not pass several arguments, among which there are tuples.
*/
template <bool is_exact, bool argument_is_tuple, UInt8 K>
class AggregateFunctionUniqCombinedVariadic final : public IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<UInt64, K>,
AggregateFunctionUniqCombinedVariadic<is_exact, argument_is_tuple, K>>
template <bool is_exact, bool argument_is_tuple, UInt8 K, typename HashValueType>
class AggregateFunctionUniqCombinedVariadic final : public IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<UInt64, K, HashValueType>,
AggregateFunctionUniqCombinedVariadic<is_exact, argument_is_tuple, K, HashValueType>>
{
private:
size_t num_args = 0;

public:
explicit AggregateFunctionUniqCombinedVariadic(const DataTypes & arguments, const Array & params)
: IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<UInt64, K>,
AggregateFunctionUniqCombinedVariadic<is_exact, argument_is_tuple, K>>(arguments, params)
: IAggregateFunctionDataHelper<AggregateFunctionUniqCombinedData<UInt64, K, HashValueType>,
AggregateFunctionUniqCombinedVariadic<is_exact, argument_is_tuple, K, HashValueType>>(arguments, params)
{
if (argument_is_tuple)
num_args = typeid_cast<const DataTypeTuple &>(*arguments[0]).getElements().size();
@@ -192,7 +197,10 @@ class AggregateFunctionUniqCombinedVariadic final : public IAggregateFunctionDat

String getName() const override
{
return "uniqCombined";
if constexpr (std::is_same_v<HashValueType, UInt64>)
return "uniqCombined64";
else
return "uniqCombined";
}

DataTypePtr getReturnType() const override
@@ -202,7 +210,7 @@ class AggregateFunctionUniqCombinedVariadic final : public IAggregateFunctionDat

void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena *) const override
{
this->data(place).set.insert(typename AggregateFunctionUniqCombinedData<UInt64, K>::Set::value_type(
this->data(place).set.insert(typename AggregateFunctionUniqCombinedData<UInt64, K, HashValueType>::Set::value_type(
UniqVariadicHash<is_exact, argument_is_tuple>::apply(num_args, columns, row_num)));
}
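
Two details of the header change can be sketched on their own: the `String` partial specialization that pins the key type to 64 bits regardless of `HashValueType`, and the `if constexpr` name selection in `getName()`. `DataSketch` and `functionName` below are hypothetical, simplified stand-ins (the `K` parameter is dropped and `std::string` replaces ClickHouse's `String`).

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Primary template: the set key type follows HashValueType.
template <typename T, typename HashValueType>
struct DataSketch
{
    using Key = HashValueType;
};

// Partial specialization: string keys always use a 64-bit hash, whatever HashValueType is.
template <typename HashValueType>
struct DataSketch<std::string, HashValueType>
{
    using Key = std::uint64_t;
};

// The reported name is chosen at compile time from the hash type.
template <typename HashValueType>
std::string functionName()
{
    if constexpr (std::is_same_v<HashValueType, std::uint64_t>)
        return "uniqCombined64";
    else
        return "uniqCombined";
}
```

One consequence visible in this sketch: for string arguments the two functions use the same key width and differ only in the name they report.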

2 changes: 2 additions & 0 deletions dbms/tests/queries/0_stateless/01016_uniqCombined64.reference
@@ -0,0 +1,2 @@
10021957
10021969
9 changes: 9 additions & 0 deletions dbms/tests/queries/0_stateless/01016_uniqCombined64.sql
@@ -0,0 +1,9 @@
-- For small cardinalities the 64-bit hash performs worse, but for 1e10 distinct values:
-- 4 byte hash: 2.8832809652e10
-- 8 byte hash: 0.9998568925e10
-- Checking with 1e10 values takes too much time (~45 secs), so this
-- test only ensures that the two results differ (and documents the
-- outcome).

SELECT uniqCombined(number) FROM numbers(toUInt64(1e7));
SELECT uniqCombined64(number) FROM numbers(toUInt64(1e7));
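
One way to see why a 32-bit hash cannot support 1e10 distinct values is to count how many distinct hash values can be expected at all. The sketch below uses the idealized occupancy formula `m * (1 - exp(-n / m))` for `n` items hashed uniformly into `m = 2^bits` slots. It assumes a perfectly uniform hash and does not simulate the estimator itself, so it does not reproduce the figures quoted in the test comment; `expectedDistinctHashes` is a hypothetical helper.

```cpp
#include <cmath>

// Expected number of distinct hash values when n items are hashed
// uniformly into 2^bits slots: m * (1 - exp(-n / m)).
double expectedDistinctHashes(double n, int bits)
{
    const double m = std::ldexp(1.0, bits); // 2^bits
    // expm1 keeps precision when n / m is tiny (the 64-bit case).
    return -std::expm1(-n / m) * m;
}
```

Under this model, `expectedDistinctHashes(1e10, 32)` is roughly 3.9e9, well under half of the true cardinality, while `expectedDistinctHashes(1e10, 64)` stays within a few units of 1e10.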
10 changes: 9 additions & 1 deletion docs/en/query_language/agg_functions/reference.md
@@ -546,6 +546,7 @@ We recommend using this function in almost all scenarios.
**See Also**

- [uniqCombined](#agg_function-uniqcombined)
- [uniqCombined64](#agg_function-uniqcombined64)
- [uniqHLL12](#agg_function-uniqhll12)
- [uniqExact](#agg_function-uniqexact)

@@ -573,13 +574,16 @@ The function takes a variable number of parameters. Parameters can be `Tuple`, `

Function:

- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Calculates a hash (64-bit hash for `String` and 32-bit otherwise) for all parameters in the aggregate, then uses it in calculations.
- Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.

For a small number of distinct elements, an array is used. When the set size is larger, a hash table is used. For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.

- Provides the result deterministically (it doesn't depend on the query processing order).

!!! note "Note"
    Because it uses a 32-bit hash for non-`String` types, the result has a very high error for cardinalities significantly larger than `UINT_MAX` (the error rises quickly after a few tens of billions of distinct values). In this case, use [uniqCombined64](#agg_function-uniqcombined64).

Compared to the [uniq](#agg_function-uniq) function, the `uniqCombined`:

- Consumes several times less memory.
@@ -589,9 +593,13 @@ Compared to the [uniq](#agg_function-uniq) function, the `uniqCombined`:
**See Also**

- [uniq](#agg_function-uniq)
- [uniqCombined64](#agg_function-uniqcombined64)
- [uniqHLL12](#agg_function-uniqhll12)
- [uniqExact](#agg_function-uniqexact)

## uniqCombined64 {#agg_function-uniqcombined64}

Same as [uniqCombined](#agg_function-uniqcombined), but uses a 64-bit hash for all data types.

## uniqHLL12 {#agg_function-uniqhll12}
