Describe the bug
Wrong count_distinct_approx values from pre-aggregations when using a PostgreSQL database
Please see the Main problem section for the exact bug we are experiencing; everything else provides context for better understanding the problem.
We are getting wrong pre-aggregation values when using the PostgreSQL cube. Here is the entire PostgreSQL cube code:
cube(`ExampleCube`, {
  sql: `
    select
      timestamp,
      "memberId"::text as member_id
    from
      activities
  `,
  dimensions: {
    member_id: {
      sql: `${CUBE}."member_id"`,
      type: `string`,
    },
    timestamp: {
      sql: `${CUBE}.timestamp`,
      type: `time`,
    },
  },
  measures: {
    count: {
      sql: `${CUBE}."member_id"`,
      type: `count_distinct_approx`,
    },
  },
  preAggregations: {
    main: {
      measures: [ExampleCube.count],
      dimensions: [],
      timeDimension: ExampleCube.timestamp,
      granularity: `day`,
    },
  },
  joins: {},
  segments: {},
});
We just count the number of distinct member_id in the entire time range.
If I comment out the main pre-aggregation and run the following query:
{
  "measures": [
    "ExampleCube.count"
  ],
  "timeDimensions": [
    {
      "dimension": "ExampleCube.timestamp",
      "dateRange": [
        "2000-01-01",
        "2024-01-01"
      ]
    }
  ],
  "filters": []
}
It generates the following SQL (PostgreSQL):
SELECT
  round(hll_cardinality(hll_add_agg(hll_hash_any("example_cube"."member_id")))) "example_cube__count"
FROM
  (
    select
      timestamp,
      "memberId"::text as member_id
    from
      activities
  ) AS "example_cube"
WHERE ("example_cube".timestamp >= $1::timestamptz AND "example_cube".timestamp <= $2::timestamptz)
Which gives the following count: 178751.
This value is correct compared with the exact, non-additive count(distinct ...):
crowd=> select round(hll_cardinality(hll_add_agg(hll_hash_any("memberId")))), count(distinct "memberId") from activities;
round | count
--------+--------
178918 | 176937
(1 row)
The delta is 1.1%, which is less than 2% (the expected maximum delta from HLL approximation).
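As a side note on where that 2% figure comes from: HLL's standard error is roughly 1.04/sqrt(m), where m = 2^log2m is the number of registers. A quick sanity check in plain PostgreSQL, assuming postgresql-hll's default log2m = 11 applies here (an assumption; the cube does not set it explicitly):

-- Expected HLL standard error ~= 1.04 / sqrt(2^log2m).
-- With postgresql-hll's default log2m = 11 (2048 registers):
select round((100 * 1.04 / sqrt(2 ^ 11))::numeric, 2) as expected_error_pct;
-- => 2.30

So a per-sketch error slightly above 2% would already be expected at log2m = 11, but nowhere near the 8.7% shown below.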
Now when I uncomment the main pre-aggregation, it generates the following SQL query (this one runs in Cube Store's MySQL-like dialect):
SELECT cardinality(merge(`example_cube__count`)) `example_cube__count` FROM dev_pre_aggregations.example_cube_main AS `example_cube__main` WHERE (`example_cube__timestamp_day` >= to_timestamp(?) AND `example_cube__timestamp_day` <= to_timestamp(?))
And it reports that it used the pre-aggregation:
./scripts/cube-query.sh: pre-aggregations used:
{
"dev_pre_aggregations.example_cube_main": {
"lastUpdatedAt": 1693202345000,
"refreshKeyValues": [],
"targetTableName": "dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9"
}
}
The value returned is 192401:
[
{
"ExampleCube.count": "192401"
}
]
This gives a delta of 8.74% (192401 vs. 176937), which is way over the promised 2% delta from HLL.
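For reference, that delta is just:

-- How the 8.74% figure is derived:
select round(abs(100.0 * 192401 / 176937 - 100.0), 2) as delta_pct;  -- => 8.74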
Running this query directly on Cube Store's MySQL-like instance:
MySQL [dev_pre_aggregations]> SELECT cardinality(merge(`example_cube__count`)) `example_cube__count` FROM dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9 AS `example_cube__main` WHERE (`example_cube__timestamp_day` >= to_timestamp('2000-01-01 00:00:00') AND `example_cube__timestamp_day` <= to_timestamp('2024-01-01 00:00:00'));
+---------------------+
| example_cube__count |
+---------------------+
| 192401 |
+---------------------+
1 row in set (0.025 sec)
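To narrow down whether a single bad daily sketch or the merge step itself is at fault, one could also look at per-day cardinalities straight from the pre-aggregation table (a sketch, assuming Cube Store's cardinality() also accepts a single HLL value outside of merge()):

-- Per-day cardinalities from the PostgreSQL-sourced pre-aggregation
-- (assumes scalar cardinality() over one sketch; unverified):
SELECT `example_cube__timestamp_day`,
       cardinality(`example_cube__count`) AS daily_count
FROM dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9
ORDER BY `example_cube__timestamp_day`
LIMIT 10;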
Now, to show that the problem is in Cube.js/Cube Store, I'm creating the exact same cube, but using a Snowflake data source.
This DB is an exact copy of the PostgreSQL one: column types are the same (member_id is text and timestamp is timestamp), the table name is the same, and the data is the same:
cube(`SnowExampleCube`, {
  data_source: `snowflake`,
  sql: `
    select
      timestamp,
      memberId::text as member_id
    from
      crowd_prod_public.activities
  `,
  dimensions: {
    member_id: {
      sql: `${CUBE}.member_id`,
      type: `string`,
    },
    timestamp: {
      sql: `${CUBE}.timestamp`,
      type: `time`,
    },
  },
  measures: {
    count: {
      sql: `${CUBE}.member_id`,
      type: `count_distinct_approx`,
    },
  },
  preAggregations: {
    main: {
      measures: [SnowExampleCube.count],
      dimensions: [],
      timeDimension: SnowExampleCube.timestamp,
      granularity: `day`,
    },
  },
  joins: {},
  segments: {},
});
The query I run is a mirror of the previous PostgreSQL one, replacing ExampleCube with SnowExampleCube:
{
  "measures": [
    "SnowExampleCube.count"
  ],
  "timeDimensions": [
    {
      "dimension": "SnowExampleCube.timestamp",
      "dateRange": [
        "2000-01-01",
        "2024-01-01"
      ]
    }
  ],
  "filters": []
}
First I run this with the pre-aggregation commented out; it generates the following Snowflake SQL:
SELECT
  APPROX_COUNT_DISTINCT("snow_example_cube".member_id) "snow_example_cube__count"
FROM
  (
    select
      timestamp,
      memberId::text as member_id
    from
      crowd_prod_public.activities
  ) AS "snow_example_cube"
WHERE ("snow_example_cube".timestamp >= ?::timestamp_tz AND "snow_example_cube".timestamp <= ?::timestamp_tz)
The result is:
[
{
"SnowExampleCube.count": "174089"
}
]
The delta is 1.7%, which is OK (below 2%), compared with the exact count when running directly on the Snowflake DB:
cube_js_user#CUBE_JS_WH@FIVETRAN_INGEST.CROWD_PROD_PUBLIC>select APPROX_COUNT_DISTINCT(memberId), count(distinct memberId) from crowd_prod_public.activities;
+---------------------------------+--------------------------+
| APPROX_COUNT_DISTINCT(MEMBERID) | COUNT(DISTINCT MEMBERID) |
|---------------------------------+--------------------------|
| 174089 | 177125 |
+---------------------------------+--------------------------+
1 Row(s) produced. Time Elapsed: 1.502s
Now when I uncomment the main pre-aggregation, it generates the following SQL in Cube Store's MySQL-like dialect:
SELECT cardinality(merge(`snow_example_cube__count`)) `snow_example_cube__count` FROM dev_pre_aggregations.snow_example_cube_main AS `snow_example_cube__main` WHERE (`snow_example_cube__timestamp_day` >= to_timestamp(?) AND `snow_example_cube__timestamp_day` <= to_timestamp(?))
Which gives the result 174077:
[
{
"SnowExampleCube.count": "174077"
}
]
And it reports that it used the main pre-aggregation:
./scripts/cube-query.sh: pre-aggregations used:
{
"dev_pre_aggregations.snow_example_cube_main": {
"lastUpdatedAt": 1693203291000,
"refreshKeyValues": [],
"targetTableName": "dev_pre_aggregations.snow_example_cube_main_4cgg35nv_5uvzwhjs_1ieoeqr"
}
}
The delta is 1.72% (174077 vs. 177125), which is OK and below the allowed 2%.
Main problem
Now the question is: why does the pre-aggregation generated from the PostgreSQL cube give an 8.74% approximate count distinct difference, while the Snowflake one gives only 1.72%?
The generated Cube Store SQL queries are basically the same; they differ only in column and pre-aggregation table names (due to the different cube names), while the syntax and logic are identical.
See them one below the other:
SELECT cardinality(merge(`example_cube__count`)) `example_cube__count` FROM dev_pre_aggregations.example_cube_main AS `example_cube__main` WHERE (`example_cube__timestamp_day` >= to_timestamp(?) AND `example_cube__timestamp_day` <= to_timestamp(?))
SELECT cardinality(merge(`snow_example_cube__count`)) `snow_example_cube__count` FROM dev_pre_aggregations.snow_example_cube_main AS `snow_example_cube__main` WHERE (`snow_example_cube__timestamp_day` >= to_timestamp(?) AND `snow_example_cube__timestamp_day` <= to_timestamp(?))
This problem blocks us from using pre-aggregations, because the counts are too different from the real count distinct values (we cannot use plain count distinct because it is not additive and cannot be used in pre-aggregations, and without pre-aggregations the queries are too slow).
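To illustrate the non-additivity point: summing per-day distinct counts overcounts every member who is active on more than one day, which is exactly why only mergeable sketches like HLL work in pre-aggregations (plain PostgreSQL, no HLL extension needed):

select
  sum(daily_distinct) as sum_of_daily_distincts,  -- overcounts repeat members
  (select count(distinct "memberId") from activities) as exact_distinct
from (
  select
    date_trunc('day', "timestamp") as day,
    count(distinct "memberId") as daily_distinct
  from activities
  group by 1
) d;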
To sum up:
- When we calculate approx. count distinct in PostgreSQL without pre-aggregations, the value is OK (delta below 2%).
- When we calculate approx. count distinct in Snowflake without pre-aggregations, the value is OK (delta below 2%).
- When we calculate approx. count distinct in Snowflake with pre-aggregations, the value is OK (delta below 2%).
- When we calculate approx. count distinct in PostgreSQL with pre-aggregations, the value is NOT OK (delta is > 8.7%).
My blind shot: the problem seems to be in how Cube Store stores the HLL data for pre-aggregations generated from PostgreSQL queries. Here are a few rows from both pre-aggregation tables, generated from PostgreSQL and Snowflake:
PostgreSQL one:
MySQL [dev_pre_aggregations]> select * from dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9 limit 3;
+-----------------------------+----------------------------+
| example_cube__timestamp_day | example_cube__count |
+-----------------------------+----------------------------+
| 2001-01-01T00:00:00.000Z | 0x020B02000000F0370000624A |
| 2002-01-08T00:00:00.000Z | 0x020B02000000E4380000F09D |
| 2002-01-10T00:00:00.000Z | 0x020B01000000C24B |
+-----------------------------+----------------------------+
3 rows in set (0.032 sec)
Snowflake one:
MySQL [dev_pre_aggregations]> select * from dev_pre_aggregations.snow_example_cube_main_4cgg35nv_5uvzwhjs_1ieoeqr limit 3;
+----------------------------------+----------------------------+
| snow_example_cube__timestamp_day | snow_example_cube__count |
+----------------------------------+----------------------------+
| 2001-01-01T00:00:00.000Z | 0x020C02000000E2A5000084FF |
| 2002-01-08T00:00:00.000Z | 0x020C02000000581F0000A8F7 |
| 2002-01-10T00:00:00.000Z | 0x020C010000002280 |
+----------------------------------+----------------------------+
3 rows in set (0.038 sec)
All I can see is that those from PostgreSQL start with 0x020B while those from Snowflake start with 0x020C...
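A guess at what that difference might mean (an assumption on my side, not verified against the Cube Store source): if the second byte of the sketch encodes its precision, then 0x0B = 11 and 0x0C = 12 would match postgresql-hll's default log2m of 11 versus the precision of 12 that Snowflake's APPROX_COUNT_DISTINCT reportedly uses, i.e. the two sources would be producing sketches of different precisions:

-- Decoding the second byte of each sketch prefix (plain PostgreSQL):
select get_byte('\x020b'::bytea, 1) as pg_second_byte,    -- 11
       get_byte('\x020c'::bytea, 1) as snow_second_byte;  -- 12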
Other suggestions
- It was also suggested that this can happen due to partitioning. Please take a look at the pre-aggregation definition: it does not use any partitioning.
- It was also suggested that this is due to granularity (it is set to day in the pre-aggregation), so the HLL is first computed daily and then the final value is constructed from those daily values.
Yes, it is saved daily and then the final value is calculated from the daily values, but this is exactly the correct usage of HLL: the entire idea of HLL is that we can pre-calculate sketches for arbitrary time ranges, merge those pre-calculated values, and still get correct results.
Here is the proof that we can do this and have zero delta in both PostgreSQL and Snowflake HLL versions:
PostgreSQL HLL difference between calculating for all time at once and calculating daily HLL results first and then merging them:
with hll as (
  select
    date_trunc('day', "timestamp")::date as day,
    count(distinct "memberId"::text) as contributors,
    hll_add_agg(hll_hash_text("memberId"::text)) as hll_data
  from
    activities
  group by
    date_trunc('day', "timestamp")::date
), hll_agg as (
  select
    round(hll_cardinality(hll_union_agg(hll_data))) as contributors
  from
    hll
), stats as (
  select
    h.contributors as approx_from_daily_aggregations,
    round(hll_cardinality(hll_add_agg(hll_hash_any(a."memberId"::text)))) as approx_from_all_data,
    count(distinct a."memberId"::text) as exact
  from
    activities a,
    hll_agg h
  group by
    1
)
select
  approx_from_daily_aggregations,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_daily_aggregations_delta,
  approx_from_all_data,
  abs(((100.0 * approx_from_all_data::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_all_data_delta,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / approx_from_all_data::decimal) - 100.0)::text || '%' as delta_between_all_time_and_daily_aggs,
  exact
from
  stats
;
Results (note 0% delta):
approx_from_daily_aggregations | approx_from_daily_aggregations_delta | approx_from_all_data | approx_from_all_data_delta | delta_between_all_time_and_daily_aggs | exact
--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------
178782 | 1.0473127638770354% | 178782 | 1.0473127638770354% | 0.0000000000000000% | 176929
(1 row)
Snowflake HLL difference between calculating for all time at once and calculating daily HLL results first and then merging them:
with hll_daily as (
  select
    date_trunc('day', timestamp)::date as day,
    count(distinct memberId::text) as contributors,
    hll_accumulate(memberId::text) as hll_data
  from
    crowd_prod_public.activities
  group by
    date_trunc('day', timestamp)::date
), hll_agg as (
  select
    hll_estimate(hll_combine(hll_data)) as contributors
  from
    hll_daily
), stats as (
  select
    h.contributors as approx_from_daily_aggregations,
    approx_count_distinct(a.memberId::text) as approx_from_all_data,
    count(distinct a.memberId::text) as exact
  from
    activities a,
    hll_agg h
  group by
    1
)
select
  approx_from_daily_aggregations,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_daily_aggregations_delta,
  approx_from_all_data,
  abs(((100.0 * approx_from_all_data::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_all_data_delta,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / approx_from_all_data::decimal) - 100.0)::text || '%' as delta_between_all_time_and_daily_aggs,
  exact
from
  stats
;
Results (note 0% delta):
+--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------+
| APPROX_FROM_DAILY_AGGREGATIONS | APPROX_FROM_DAILY_AGGREGATIONS_DELTA | APPROX_FROM_ALL_DATA | APPROX_FROM_ALL_DATA_DELTA | DELTA_BETWEEN_ALL_TIME_AND_DAILY_AGGS | EXACT |
|--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------|
| 174089 | 1.714044% | 174089 | 1.714044% | 0.000000% | 177125 |
+--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------+
1 Row(s) produced. Time Elapsed: 215.618s
Final words
- Please fix count_distinct_approx counting when using pre-aggregations with a PostgreSQL database.
This was first reported here.