Add APPROX_COUNT_DISTINCT Function #15338

Merged
JackieTien97 merged 18 commits into apache:master from FearfulTomcat27:approx_count_distinct
Apr 25, 2025

@FearfulTomcat27
Contributor

This pull request introduces the ApproxCountDistinctAccumulator and its associated classes to support the APPROX_COUNT_DISTINCT aggregation function in the IoTDB query engine. The changes include the addition of new classes and modifications to existing factory methods to incorporate the new accumulator.

Key Changes:

New Classes:

  • ApproxCountDistinctAccumulator: Implements the logic for the APPROX_COUNT_DISTINCT aggregation function, including methods for adding input data, merging intermediate results, and evaluating the final result.
  • HyperLogLog: A probabilistic data structure used by ApproxCountDistinctAccumulator to estimate the cardinality of a data set.
  • HyperLogLogState: An interface for managing the state of HyperLogLog instances.
  • HyperLogLogStateFactory: Provides factory methods to create and manage HyperLogLog states, including single and grouped states.

Factory Method Modifications:

  • AccumulatorFactory.java: Updated to include cases for APPROX_COUNT_DISTINCT in the createBuiltinGroupedAccumulator and createBuiltinAccumulator methods.
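
For readers unfamiliar with the data structure, here is a minimal sketch of how a HyperLogLog estimates cardinality. It is not the HyperLogLog class added in this PR: the class name MiniHyperLogLog, the mix helper (a splitmix64-style finalizer standing in for a real hash function), and the bias constants from the textbook algorithm are assumptions made for illustration. Only the field names m and b mirror those visible in the diff.

```java
/** Illustrative HyperLogLog; a sketch, not the class added in this PR. */
class MiniHyperLogLog {
    private final int b;            // number of bits used for register indexing
    private final int m;            // number of registers, m = 2^b
    private final byte[] registers; // each register keeps the largest rank seen

    MiniHyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    /** Stand-in 64-bit mixer (splitmix64-style finalizer); assumed, not from the PR. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Feed one 64-bit hash value into the sketch. */
    void offer(long hash) {
        int idx = (int) (hash >>> (64 - b)); // top b bits select the register
        long rest = hash << b;               // remaining bits determine the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - b + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Merge another sketch of the same size by taking the register-wise maximum. */
    void merge(MiniHyperLogLog other) {
        if (other.m != m) {
            throw new IllegalArgumentException("register counts differ");
        }
        for (int i = 0; i < m; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    /** Estimate the number of distinct hashes offered so far. */
    long cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant, valid for m >= 128
        double estimate = alpha * m * m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            estimate = m * Math.log((double) m / zeros);
        }
        return Math.round(estimate);
    }
}
```

Because merging is a register-wise maximum, merging two partial sketches gives the same registers as feeding all values into one sketch, which is what makes intermediate results mergeable in an accumulator.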

"Aggregate functions [%s] should only have two arguments", functionName));
}

if (argumentTypes.size() == 2 && !DOUBLE.equals(argumentTypes.get(1))) {

Does the second parameter just need to be a number? If so, you can use the isSupportedMathNumericType function in this class.

The second parameter should only be a literal.

@@ -0,0 +1,7 @@
package org.apache.iotdb.db.queryengine.execution.operator.source.relational.aggregation;

public interface HyperLogLogState {

What's this used for?

private static final long INSTANCE_SIZE =
RamUsageEstimator.shallowSizeOfInstance(ApproxCountDistinctAccumulator.class);
private final TSDataType seriesDataType;
private final HyperLogLogStateFactory.SingleHyperLogLogState state =

You don't need to use SingleHyperLogLogState to wrap the HyperLogLog; you can refer to TableModeAccumulator.


@Override
public void addInput(Column[] arguments, AggregationMask mask) {
HyperLogLog hll;

HyperLogLog hll should be initialized in the constructor instead of calling getOrCreate each time addInput is called.

private final int m;
// Number of bits used for register indexing
private final int b;
private final double maxStandardError;

Suggested change
private final double maxStandardError;

It seems that we don't need this field
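
One plausible reading of this comment is that the error bound only matters when choosing b (and therefore m) at construction time, so it does not need to be kept as a field afterwards. A sketch of that derivation, assuming the textbook HyperLogLog relation that the standard error is about 1.04 / sqrt(m); the class and method names are hypothetical, and the formula actually used in this PR is not shown in the excerpt:

```java
/** Hypothetical helper: derives the register index width from an error bound. */
class HllPrecision {
    /** Smallest b such that 1.04 / sqrt(2^b) <= maxStandardError. */
    static int indexBitsFor(double maxStandardError) {
        if (!(maxStandardError > 0 && maxStandardError < 1)) {
            throw new IllegalArgumentException("max standard error must be in (0, 1)");
        }
        double requiredRegisters = Math.pow(1.04 / maxStandardError, 2);
        return (int) Math.ceil(Math.log(requiredRegisters) / Math.log(2));
    }

    /** Approximate standard error of a sketch with 2^b registers. */
    static double standardError(int b) {
        return 1.04 / Math.sqrt(1 << b);
    }
}
```

Under this relation, a bound of 0.023 needs 11 index bits (2048 registers), and halving the error roughly quadruples the register count.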

}
}

// 序列化

Suggested change
// 序列化
// serialize

public void add(Binary value) {
offer(
hashFunction
.hashString(value.getStringValue(TSFileConfig.STRING_CHARSET), StandardCharsets.UTF_8)

Suggested change
.hashString(value.getStringValue(TSFileConfig.STRING_CHARSET), StandardCharsets.UTF_8)
.hashBytes(value.getValues())
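
A possible motivation for hashing the raw bytes, sketched below with the JDK only: decoding a byte array to a String and encoding it again costs an extra pass and is lossy for byte sequences that are not valid in the charset. The helper name is hypothetical, and the sketch assumes both charsets involved are UTF-8, which the excerpt does not confirm for TSFileConfig.STRING_CHARSET.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of what a decode/encode round trip does to raw bytes. */
class BytesVsStringRoundTrip {
    /** Decode as UTF-8, then encode again, as hashing via a String would. */
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }
}
```

Valid UTF-8 survives the round trip unchanged, but malformed bytes such as 0xFF and 0xFE are each replaced by U+FFFD, so two different binary values would hash identically and be counted as a single distinct value. Hashing the bytes directly avoids both the extra work and that collapse.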


@Override
public long getEstimatedSize() {
return INSTANCE_SIZE;

You should also add the memory size of the HyperLogLog.
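
A rough sketch of the idea, with the sketch's registers created once in the constructor and counted in the size estimate. Everything here is hypothetical: the class name, the placeholder constants (real shallow sizes and array headers are JVM-dependent and would come from RamUsageEstimator), and the assumption that the registers are a byte array.

```java
/** Hypothetical accumulator that reports its own size plus its sketch's registers. */
class SizedAccumulatorSketch {
    // Placeholder for the shallow instance size; the real value is JVM-dependent.
    private static final long INSTANCE_SIZE = 24;
    // Assumed array header size; also JVM-dependent.
    private static final long ARRAY_HEADER_BYTES = 16;

    // Created once in the constructor rather than on each addInput call.
    private final byte[] registers;

    SizedAccumulatorSketch(int indexBits) {
        this.registers = new byte[1 << indexBits];
    }

    /** Shallow size of this object plus the retained register array. */
    long getEstimatedSize() {
        return INSTANCE_SIZE + ARRAY_HEADER_BYTES + registers.length;
    }
}
```

With 12 index bits the register array dominates the estimate, which is why returning only the shallow instance size would under-report memory use.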

RamUsageEstimator.shallowSizeOfInstance(GroupedApproxCountDistinctAccumulator.class);
private final TSDataType seriesDataType;

private final HyperLogLogStateFactory.GroupedHyperLogLogState state =

You should create a class named HyperLogLogBigArray; you can refer to BinaryBigArray.

Comment on lines +55 to +64
ObjectBigArray<HyperLogLog> hlls;
if (arguments.length == 1) {
hlls = HyperLogLogStateFactory.getOrCreateHyperLogLog(state);
} else if (arguments.length == 2) {
double maxStandardError = arguments[1].getDouble(0);
hlls = HyperLogLogStateFactory.getOrCreateHyperLogLog(state, maxStandardError);
} else {
throw new IllegalArgumentException(
"argument of APPROX_COUNT_DISTINCT should be one column with Max Standard Error");
}

Initialize it in the constructor.

@JackieTien97 merged commit 923bb2c into apache:master on Apr 25, 2025
56 of 57 checks passed
@FearfulTomcat27 deleted the approx_count_distinct branch on October 9, 2025 14:23
2 participants