Describe the bug
Wrong count_distinct_approx values from pre-aggregations when using a PostgreSQL database
Please see the Main problem section for the exact bug we are experiencing; everything else provides context for better understanding the problem.
We are getting wrong pre-aggregation values when using the PostgreSQL cube. Here is the entire PostgreSQL cube code:
cube(`ExampleCube`, {
  sql: `
    select
      timestamp,
      "memberId"::text as member_id
    from
      activities
  `,
  dimensions: {
    member_id: {
      sql: `${CUBE}."member_id"`,
      type: `string`,
    },
    timestamp: {
      sql: `${CUBE}.timestamp`,
      type: `time`,
    },
  },
  measures: {
    count: {
      sql: `${CUBE}."member_id"`,
      type: `count_distinct_approx`,
    },
  },
  preAggregations: {
    main: {
      measures: [ExampleCube.count],
      dimensions: [],
      timeDimension: ExampleCube.timestamp,
      granularity: `day`,
    },
  },
  joins: {},
  segments: {},
});
We just count the number of distinct member_id in the entire time range.
If I comment out the main pre-aggregation and run the following query:
{
  "measures": [
    "ExampleCube.count"
  ],
  "timeDimensions": [
    {
      "dimension": "ExampleCube.timestamp",
      "dateRange": [
        "2000-01-01",
        "2024-01-01"
      ]
    }
  ],
  "filters": []
}
It generates the following SQL (PostgreSQL):
SELECT
  round(hll_cardinality(hll_add_agg(hll_hash_any("example_cube"."member_id")))) "example_cube__count"
FROM
  (
    select
      timestamp,
      "memberId"::text as member_id
    from
      activities
  ) AS "example_cube"
WHERE ("example_cube".timestamp >= $1::timestamptz AND "example_cube".timestamp <= $2::timestamptz)
Which gives the following count: 178751.
This value is correct compared with the exact, non-additive count(distinct ...):
crowd=> select round(hll_cardinality(hll_add_agg(hll_hash_any("memberId")))), count(distinct "memberId") from activities;
round | count
--------+--------
178918 | 176937
(1 row)
The delta is 1.1%, which is less than 2% (the expected maximum delta from HLL approximation).
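As a side note on where that 2% figure comes from: HLL's standard error is roughly 1.04/sqrt(m), where m = 2^log2m is the number of registers. A quick sanity check in plain PostgreSQL, assuming postgresql-hll's default log2m = 11 applies here (an assumption; the cube does not set it explicitly):

-- Expected HLL standard error ~= 1.04 / sqrt(2^log2m).
-- With postgresql-hll's default log2m = 11 (2048 registers):
select round((100 * 1.04 / sqrt(2 ^ 11))::numeric, 2) as expected_error_pct;
-- => 2.30

So a per-sketch error slightly above 2% would already be expected at log2m = 11, but nowhere near the 8.7% shown below.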
Now when I uncomment the main pre-aggregation, it generates the following SQL query (this one runs in Cube Store's MySQL-like dialect):
SELECT cardinality(merge(`example_cube__count`)) `example_cube__count` FROM dev_pre_aggregations.example_cube_main AS `example_cube__main` WHERE (`example_cube__timestamp_day` >= to_timestamp(?) AND `example_cube__timestamp_day` <= to_timestamp(?))
And it reports that it used the pre-aggregation:
./scripts/cube-query.sh: pre-aggregations used:
{
"dev_pre_aggregations.example_cube_main": {
"lastUpdatedAt": 1693202345000,
"refreshKeyValues": [],
"targetTableName": "dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9"
}
}
The value returned is 192401:
[
{
"ExampleCube.count": "192401"
}
]
This gives a delta of 8.74% (192401 vs. 176937), which is way over the promised 2% delta from HLL.
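For reference, that delta is just:

-- How the 8.74% figure is derived:
select round(abs(100.0 * 192401 / 176937 - 100.0), 2) as delta_pct;  -- => 8.74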
Running this query directly on Cube Store's MySQL-like instance:
MySQL [dev_pre_aggregations]> SELECT cardinality(merge(`example_cube__count`)) `example_cube__count` FROM dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9 AS `example_cube__main` WHERE (`example_cube__timestamp_day` >= to_timestamp('2000-01-01 00:00:00') AND `example_cube__timestamp_day` <= to_timestamp('2024-01-01 00:00:00'));
+---------------------+
| example_cube__count |
+---------------------+
| 192401 |
+---------------------+
1 row in set (0.025 sec)
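To narrow down whether a single bad daily sketch or the merge step itself is at fault, one could also look at per-day cardinalities straight from the pre-aggregation table (a sketch, assuming Cube Store's cardinality() also accepts a single HLL value outside of merge()):

-- Per-day cardinalities from the PostgreSQL-sourced pre-aggregation
-- (assumes scalar cardinality() over one sketch; unverified):
SELECT `example_cube__timestamp_day`,
       cardinality(`example_cube__count`) AS daily_count
FROM dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9
ORDER BY `example_cube__timestamp_day`
LIMIT 10;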
Now, to show that the problem is in Cube.js/Cube Store, I'm creating the exact same cube, but using a Snowflake data source.
This DB is an exact copy of the PostgreSQL one: column types are the same (member_id is text and timestamp is timestamp), the table name is the same, and the data is the same:
cube(`SnowExampleCube`, {
  data_source: `snowflake`,
  sql: `
    select
      timestamp,
      memberId::text as member_id
    from
      crowd_prod_public.activities
  `,
  dimensions: {
    member_id: {
      sql: `${CUBE}.member_id`,
      type: `string`,
    },
    timestamp: {
      sql: `${CUBE}.timestamp`,
      type: `time`,
    },
  },
  measures: {
    count: {
      sql: `${CUBE}.member_id`,
      type: `count_distinct_approx`,
    },
  },
  preAggregations: {
    main: {
      measures: [SnowExampleCube.count],
      dimensions: [],
      timeDimension: SnowExampleCube.timestamp,
      granularity: `day`,
    },
  },
  joins: {},
  segments: {},
});
The query I run is a mirror of the previous PostgreSQL one, replacing ExampleCube with SnowExampleCube:
{
  "measures": [
    "SnowExampleCube.count"
  ],
  "timeDimensions": [
    {
      "dimension": "SnowExampleCube.timestamp",
      "dateRange": [
        "2000-01-01",
        "2024-01-01"
      ]
    }
  ],
  "filters": []
}
First I run this with the pre-aggregation commented out; it generates the following Snowflake SQL:
SELECT
  APPROX_COUNT_DISTINCT("snow_example_cube".member_id) "snow_example_cube__count"
FROM
  (
    select
      timestamp,
      memberId::text as member_id
    from
      crowd_prod_public.activities
  ) AS "snow_example_cube"
WHERE ("snow_example_cube".timestamp >= ?::timestamp_tz AND "snow_example_cube".timestamp <= ?::timestamp_tz)
The result is:
[
{
"SnowExampleCube.count": "174089"
}
]
The delta is 1.7%, which is OK (below 2%), compared with the exact count when running directly on the Snowflake DB:
cube_js_user#CUBE_JS_WH@FIVETRAN_INGEST.CROWD_PROD_PUBLIC>select APPROX_COUNT_DISTINCT(memberId), count(distinct memberId) from crowd_prod_public.activities;
+---------------------------------+--------------------------+
| APPROX_COUNT_DISTINCT(MEMBERID) | COUNT(DISTINCT MEMBERID) |
|---------------------------------+--------------------------|
| 174089 | 177125 |
+---------------------------------+--------------------------+
1 Row(s) produced. Time Elapsed: 1.502s
Now when I uncomment the main pre-aggregation, it generates the following SQL in Cube Store's MySQL-like dialect:
SELECT cardinality(merge(`snow_example_cube__count`)) `snow_example_cube__count` FROM dev_pre_aggregations.snow_example_cube_main AS `snow_example_cube__main` WHERE (`snow_example_cube__timestamp_day` >= to_timestamp(?) AND `snow_example_cube__timestamp_day` <= to_timestamp(?))
Which gives the result 174077:
[
{
"SnowExampleCube.count": "174077"
}
]
And it reports that it used the main pre-aggregation:
./scripts/cube-query.sh: pre-aggregations used:
{
"dev_pre_aggregations.snow_example_cube_main": {
"lastUpdatedAt": 1693203291000,
"refreshKeyValues": [],
"targetTableName": "dev_pre_aggregations.snow_example_cube_main_4cgg35nv_5uvzwhjs_1ieoeqr"
}
}
The delta is 1.72% (174077 vs. 177125), which is OK and below the allowed 2%.
Main problem
Now the question is: why does the pre-aggregation generated from the PostgreSQL cube give an 8.74% approximate count distinct difference, while the Snowflake one gives only 1.72%?
The generated Cube Store SQL queries are basically the same; they differ only in column and pre-aggregation table names (due to the different cube names), while the syntax and logic are identical.
See them one below the other:
SELECT cardinality(merge(`example_cube__count`)) `example_cube__count` FROM dev_pre_aggregations.example_cube_main AS `example_cube__main` WHERE (`example_cube__timestamp_day` >= to_timestamp(?) AND `example_cube__timestamp_day` <= to_timestamp(?))
SELECT cardinality(merge(`snow_example_cube__count`)) `snow_example_cube__count` FROM dev_pre_aggregations.snow_example_cube_main AS `snow_example_cube__main` WHERE (`snow_example_cube__timestamp_day` >= to_timestamp(?) AND `snow_example_cube__timestamp_day` <= to_timestamp(?))
This problem blocks us from using pre-aggregations, because the counts are too different from the real count distinct values (we cannot use plain count distinct because it is not additive and cannot be used in pre-aggregations, and without pre-aggregations the queries are too slow).
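To illustrate the non-additivity point: summing per-day distinct counts overcounts every member who is active on more than one day, which is exactly why only mergeable sketches like HLL work in pre-aggregations (plain PostgreSQL, no HLL extension needed):

select
  sum(daily_distinct) as sum_of_daily_distincts,  -- overcounts repeat members
  (select count(distinct "memberId") from activities) as exact_distinct
from (
  select
    date_trunc('day', "timestamp") as day,
    count(distinct "memberId") as daily_distinct
  from activities
  group by 1
) d;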
To sum up:
- When we calculate approx. count distinct in PostgreSQL without pre-aggregations, the value is OK (delta below 2%).
- When we calculate approx. count distinct in Snowflake without pre-aggregations, the value is OK (delta below 2%).
- When we calculate approx. count distinct in Snowflake with pre-aggregations, the value is OK (delta below 2%).
- When we calculate approx. count distinct in PostgreSQL with pre-aggregations, the value is NOT OK (delta is > 8.7%).
My blind shot: the problem seems to be in how Cube Store stores the HLL data for pre-aggregations generated from PostgreSQL queries. Here are a few rows from both pre-aggregation tables, generated from PostgreSQL and Snowflake:
PostgreSQL one:
MySQL [dev_pre_aggregations]> select * from dev_pre_aggregations.example_cube_main_ec4opi43_z5l3wp4l_1ieodt9 limit 3;
+-----------------------------+----------------------------+
| example_cube__timestamp_day | example_cube__count |
+-----------------------------+----------------------------+
| 2001-01-01T00:00:00.000Z | 0x020B02000000F0370000624A |
| 2002-01-08T00:00:00.000Z | 0x020B02000000E4380000F09D |
| 2002-01-10T00:00:00.000Z | 0x020B01000000C24B |
+-----------------------------+----------------------------+
3 rows in set (0.032 sec)
Snowflake one:
MySQL [dev_pre_aggregations]> select * from dev_pre_aggregations.snow_example_cube_main_4cgg35nv_5uvzwhjs_1ieoeqr limit 3;
+----------------------------------+----------------------------+
| snow_example_cube__timestamp_day | snow_example_cube__count |
+----------------------------------+----------------------------+
| 2001-01-01T00:00:00.000Z | 0x020C02000000E2A5000084FF |
| 2002-01-08T00:00:00.000Z | 0x020C02000000581F0000A8F7 |
| 2002-01-10T00:00:00.000Z | 0x020C010000002280 |
+----------------------------------+----------------------------+
3 rows in set (0.038 sec)
All I can see is that those from PostgreSQL start with 0x020B while those from Snowflake start with 0x020C...
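A guess at what that difference might mean (an assumption on my side, not verified against the Cube Store source): if the second byte of the sketch encodes its precision, then 0x0B = 11 and 0x0C = 12 would match postgresql-hll's default log2m of 11 versus the precision of 12 that Snowflake's APPROX_COUNT_DISTINCT reportedly uses, i.e. the two sources would be producing sketches of different precisions:

-- Decoding the second byte of each sketch prefix (plain PostgreSQL):
select get_byte('\x020b'::bytea, 1) as pg_second_byte,    -- 11
       get_byte('\x020c'::bytea, 1) as snow_second_byte;  -- 12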
Other suggestions
- It was also suggested that this can happen due to partitioning. Please take a look at the pre-aggregation definition: it does not use any partitioning.
- It was also suggested that this is due to granularity (it is set to day in the pre-aggregation), so the HLL is first computed daily and then the final value is constructed from those daily values.
Yes, it is saved daily and then the final value is calculated from the daily values, but this is exactly the correct usage of HLL: the entire idea of HLL is that we can pre-calculate sketches for arbitrary time ranges, merge those pre-calculated values, and still get correct results.
Here is the proof that we can do this and have zero delta in both PostgreSQL and Snowflake HLL versions:
PostgreSQL HLL difference between calculating for all time at once and calculating daily HLL results first and then merging them:
with hll as (
  select
    date_trunc('day', "timestamp")::date as day,
    count(distinct "memberId"::text) as contributors,
    hll_add_agg(hll_hash_text("memberId"::text)) as hll_data
  from
    activities
  group by
    date_trunc('day', "timestamp")::date
), hll_agg as (
  select
    round(hll_cardinality(hll_union_agg(hll_data))) as contributors
  from
    hll
), stats as (
  select
    h.contributors as approx_from_daily_aggregations,
    round(hll_cardinality(hll_add_agg(hll_hash_any(a."memberId"::text)))) as approx_from_all_data,
    count(distinct a."memberId"::text) as exact
  from
    activities a,
    hll_agg h
  group by
    1
)
select
  approx_from_daily_aggregations,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_daily_aggregations_delta,
  approx_from_all_data,
  abs(((100.0 * approx_from_all_data::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_all_data_delta,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / approx_from_all_data::decimal) - 100.0)::text || '%' as delta_between_all_time_and_daily_aggs,
  exact
from
  stats
;
Results (note 0% delta):
approx_from_daily_aggregations | approx_from_daily_aggregations_delta | approx_from_all_data | approx_from_all_data_delta | delta_between_all_time_and_daily_aggs | exact
--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------
178782 | 1.0473127638770354% | 178782 | 1.0473127638770354% | 0.0000000000000000% | 176929
(1 row)
Snowflake HLL difference between calculating for all time at once and calculating daily HLL results first and then merging them:
with hll_daily as (
  select
    date_trunc('day', timestamp)::date as day,
    count(distinct memberId::text) as contributors,
    hll_accumulate(memberId::text) as hll_data
  from
    crowd_prod_public.activities
  group by
    date_trunc('day', timestamp)::date
), hll_agg as (
  select
    hll_estimate(hll_combine(hll_data)) as contributors
  from
    hll_daily
), stats as (
  select
    h.contributors as approx_from_daily_aggregations,
    approx_count_distinct(a.memberId::text) as approx_from_all_data,
    count(distinct a.memberId::text) as exact
  from
    activities a,
    hll_agg h
  group by
    1
)
select
  approx_from_daily_aggregations,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_daily_aggregations_delta,
  approx_from_all_data,
  abs(((100.0 * approx_from_all_data::decimal) / exact::decimal) - 100.0)::text || '%' as approx_from_all_data_delta,
  abs(((100.0 * approx_from_daily_aggregations::decimal) / approx_from_all_data::decimal) - 100.0)::text || '%' as delta_between_all_time_and_daily_aggs,
  exact
from
  stats
;
Results (note 0% delta):
+--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------+
| APPROX_FROM_DAILY_AGGREGATIONS | APPROX_FROM_DAILY_AGGREGATIONS_DELTA | APPROX_FROM_ALL_DATA | APPROX_FROM_ALL_DATA_DELTA | DELTA_BETWEEN_ALL_TIME_AND_DAILY_AGGS | EXACT |
|--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------|
| 174089 | 1.714044% | 174089 | 1.714044% | 0.000000% | 177125 |
+--------------------------------+--------------------------------------+----------------------+----------------------------+---------------------------------------+--------+
1 Row(s) produced. Time Elapsed: 215.618s
Final words
- Please fix count_distinct_approx counting when using pre-aggregations with a PostgreSQL database.
This was first reported here.