Add APPROX_COUNT_DISTINCT Function #15338
…t_distinct function.
| "Aggregate functions [%s] should only have two arguments", functionName)); | ||
| } | ||
| if (argumentTypes.size() == 2 && !DOUBLE.equals(argumentTypes.get(1))) { |
Does the second parameter just need to be a number? If so, you can use the isSupportedMathNumericType function in this class.
The second parameter should only be a literal.
| @@ -0,0 +1,7 @@ | |||
| package org.apache.iotdb.db.queryengine.execution.operator.source.relational.aggregation; | |||
| public interface HyperLogLogState { | |||
What is this interface used for?
| private static final long INSTANCE_SIZE = | ||
| RamUsageEstimator.shallowSizeOfInstance(ApproxCountDistinctAccumulator.class); | ||
| private final TSDataType seriesDataType; | ||
| private final HyperLogLogStateFactory.SingleHyperLogLogState state = |
You don't need SingleHyperLogLogState to wrap the HyperLogLog; you can refer to TableModeAccumulator.
| @Override | ||
| public void addInput(Column[] arguments, AggregationMask mask) { | ||
| HyperLogLog hll; |
HyperLogLog hll should be initialized in the constructor instead of calling getOrCreate each time addInput is called.
| private final int m; | ||
| // Number of bits used for register indexing | ||
| private final int b; | ||
| private final double maxStandardError; |
| private final double maxStandardError; |
It seems that we don't need this field
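For context, one likely reason the field is redundant: in standard HyperLogLog the relative standard error is roughly 1.04 / sqrt(m), so b and m can be derived once from the requested error and the error itself need not be kept. The sketch below illustrates that derivation only. The class name, method names, and the 16-bit cap are hypothetical and are not taken from this PR.

```java
// Hypothetical helper, not part of the PR: derives the register-index width
// from a requested max standard error using the standard HLL error bound.
final class HllPrecision {
    // Standard HyperLogLog constant: relative standard error ~ 1.04 / sqrt(m).
    private static final double ERROR_CONSTANT = 1.04;
    // Assumed upper bound on index bits, chosen only for this illustration.
    private static final int MAX_INDEX_BITS = 16;

    private HllPrecision() {}

    /** Returns the smallest b such that 1.04 / sqrt(2^b) <= maxStandardError. */
    static int indexBitsFor(double maxStandardError) {
        if (!(maxStandardError > 0 && maxStandardError < 1)) {
            throw new IllegalArgumentException("max standard error must be in (0, 1)");
        }
        double minRegisters = Math.pow(ERROR_CONSTANT / maxStandardError, 2);
        for (int b = 0; b <= MAX_INDEX_BITS; b++) {
            if ((1L << b) >= minRegisters) {
                return b;
            }
        }
        throw new IllegalArgumentException("max standard error too small for this sketch");
    }

    /** Number of registers m = 2^b. */
    static int registerCountFor(int b) {
        return 1 << b;
    }

    public static void main(String[] args) {
        int b = indexBitsFor(0.023);
        int m = registerCountFor(b);
        System.out.println("b=" + b + ", m=" + m); // prints "b=11, m=2048"
    }
}
```

Once b and m are fixed in the constructor, nothing else in the estimator needs the original error value.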
| } | ||
| } | ||
| // 序列化 |
| // 序列化 | |
| // serialize |
| public void add(Binary value) { | ||
| offer( | ||
| hashFunction | ||
| .hashString(value.getStringValue(TSFileConfig.STRING_CHARSET), StandardCharsets.UTF_8) |
| .hashString(value.getStringValue(TSFileConfig.STRING_CHARSET), StandardCharsets.UTF_8) | |
| .hashBytes(value.getValues()) |
| @Override | ||
| public long getEstimatedSize() { | ||
| return INSTANCE_SIZE; |
You should also add the memory size of the HyperLogLog to the estimate.
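A minimal sketch of what that could look like, assuming the sketch's registers live in a byte array created once in the constructor. The class name, the fixed INSTANCE_SIZE value, and the 16-byte array header are assumptions for illustration; the PR computes the shallow size with RamUsageEstimator, and its HyperLogLog may store registers differently.

```java
// Hypothetical accumulator, not the PR's class: shows the retained size
// including the register array, not just the shallow instance size.
final class SizedSketchAccumulator {
    // Assumed shallow size; the PR uses RamUsageEstimator.shallowSizeOfInstance.
    private static final long INSTANCE_SIZE = 24;
    // Assumed JVM array header size in bytes.
    private static final long ARRAY_HEADER_BYTES = 16;

    // Created once in the constructor, not on every addInput call.
    private final byte[] registers;

    SizedSketchAccumulator(int indexBits) {
        this.registers = new byte[1 << indexBits];
    }

    long getEstimatedSize() {
        long arrayBytes = ARRAY_HEADER_BYTES + registers.length;
        long aligned = (arrayBytes + 7) & ~7L; // round up to 8-byte alignment
        return INSTANCE_SIZE + aligned;
    }
}
```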
| RamUsageEstimator.shallowSizeOfInstance(GroupedApproxCountDistinctAccumulator.class); | ||
| private final TSDataType seriesDataType; | ||
| private final HyperLogLogStateFactory.GroupedHyperLogLogState state = |
You should create a class named HyperLogLogBigArray; you can refer to BinaryBigArray.
| ObjectBigArray<HyperLogLog> hlls; | ||
| if (arguments.length == 1) { | ||
| hlls = HyperLogLogStateFactory.getOrCreateHyperLogLog(state); | ||
| } else if (arguments.length == 2) { | ||
| double maxStandardError = arguments[1].getDouble(0); | ||
| hlls = HyperLogLogStateFactory.getOrCreateHyperLogLog(state, maxStandardError); | ||
| } else { | ||
| throw new IllegalArgumentException( | ||
| "argument of APPROX_COUNT_DISTINCT should be one column with Max Standard Error"); | ||
| } |
Initialize it in the constructor.
…ata partitions are involved
…igArray and MapBigArray
…improved max standard error handling
…ApproxCountDistinctAccumulator and HyperLogLog
…ifying reset methods in ApproxCountDistinctAccumulator and GroupedApproxCountDistinctAccumulator
…perLogLog when creating a new instance using get.
…y province and time
This pull request introduces the ApproxCountDistinctAccumulator and its associated classes to support the APPROX_COUNT_DISTINCT aggregation function in the IoTDB query engine. The changes include the addition of new classes and modifications to existing factory methods to incorporate the new accumulator.

Key Changes:

New Classes:
- ApproxCountDistinctAccumulator: Implements the logic for the APPROX_COUNT_DISTINCT aggregation function, including methods for adding input data, merging intermediate results, and evaluating the final result.
- HyperLogLog: A probabilistic data structure used by ApproxCountDistinctAccumulator to estimate the cardinality of a data set.
- HyperLogLogState: An interface for managing the state of HyperLogLog instances.
- HyperLogLogStateFactory: Provides factory methods to create and manage HyperLogLog states, including single and grouped states.

Factory Method Modifications:
- AccumulatorFactory.java: Updated to include cases for APPROX_COUNT_DISTINCT in the createBuiltinGroupedAccumulator and createBuiltinAccumulator methods.
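
To make the add / merge / evaluate flow described above concrete, here is a minimal, self-contained HyperLogLog sketch. It is a textbook-style illustration and not the HyperLogLog class from this PR: the class name MiniHyperLogLog, the SplitMix64-style stand-in hash, and the restriction to long inputs are all assumptions made for the example.

```java
// Illustrative only: a textbook HyperLogLog, not the PR's implementation.
final class MiniHyperLogLog {
    private final int b;            // number of bits used for register indexing
    private final int m;            // number of registers, m = 2^b
    private final byte[] registers; // max rank observed per register

    MiniHyperLogLog(int b) {
        if (b < 4 || b > 16) {
            throw new IllegalArgumentException("b must be in [4, 16]");
        }
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    // Stand-in 64-bit hash (SplitMix64-style mixer); the PR uses its own hash function.
    static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long h = mix64(value);
        int idx = (int) (h >>> (64 - b)); // top b bits select the register
        long rest = h << b;               // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - b + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Merging intermediate results is an element-wise max over registers.
    void merge(MiniHyperLogLog other) {
        if (other.b != b) {
            throw new IllegalArgumentException("cannot merge sketches with different b");
        }
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    long cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r); // 2^(-r)
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = m == 16 ? 0.673 : m == 32 ? 0.697 : m == 64 ? 0.709
                : 0.7213 / (1 + 1.079 / m);
        double estimate = alpha * m * (double) m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {
            estimate = m * Math.log((double) m / zeros); // small-range linear counting
        }
        return Math.round(estimate);
    }
}
```

Because merge is an element-wise max, merging two partial sketches gives exactly the same registers as feeding every input into a single sketch, which is the property that lets intermediate results from different partitions be combined.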