Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: full support of group by nullable column #4079

Closed
Tracked by #3428
sundy-li opened this issue Feb 9, 2022 · 8 comments · Fixed by #4340
Closed
Tracked by #3428

Feature: full support of group by nullable column #4079

sundy-li opened this issue Feb 9, 2022 · 8 comments · Fixed by #4340
Assignees
Labels
A-query Area: databend query C-feature Category: feature community-take

Comments

@sundy-li
Copy link
Member

sundy-li commented Feb 9, 2022

Summary

Description for this feature.

Currently

we ignore the nulls to generate the hash key of the group by columns.

  pub fn fixed_hash(column: &ColumnRef, ptr: *mut u8, step: usize) -> Result<()> {
        let column = column.convert_full_column();
        // TODO support nullable
        let column = Series::remove_nullable(&column);
        ...
    }

So the results are not correct in this query:

MySQL [(none)]> create table t(a UInt64, b UInt32) Engine = Fuse;
Query OK, 0 rows affected (0.003 sec)

MySQL [(none)]>
MySQL [(none)]> insert into t(a,b)  select if (number % 3 = 1, null, number) as a, number + 3 as b from numbers(10);
Query OK, 0 rows affected (0.006 sec)

MySQL [(none)]> select a%3 as c, count(1) from t group by c;
+------+----------+
| c    | count(1) |
+------+----------+
|    0 |        7 |
|    2 |        3 |
+------+----------+
2 rows in set (0.005 sec)

How:

  1. a % 3 as c is Nullable(UInt8). If c is UInt8, we can hash column c into UInt8 Column using fixed_hash. But we should use extra bits to identify the null value, so UInt16 is used: [ 8bit(value) -- 8bits(just use 1 bit to identify null) ]

  2. If there are multiple nullable columns, say: N, N * 8 bits will be used.

  3. The final bits must be rounded into primitive types: [ Nullabe(UInt8), UInt8 ] --> 24bits ---> round into 32bits, UInt32Column will be used to store the hashkeys.

This task can be taken after #4074 is merged.

Related:

  • DataBlock::choose_hash_method
  • Series::fixed_hash
  • Series::serialize
@sundy-li sundy-li added C-feature Category: feature A-query Area: databend query labels Feb 9, 2022
@wangzhen11aaa
Copy link
Contributor

wangzhen11aaa commented Feb 9, 2022

/assignme

@wangzhen11aaa
Copy link
Contributor

wangzhen11aaa commented Feb 10, 2022

Compared with Clickhouse with the same sql.
Clickhouse's result is more reasonable.

image

If there are multiple nullable columns, say: N, N * 8 bits will be used.

About this statement, I have a question. If we have N = 40 maybe, then 320 bits will be used, what type will we use?

@sundy-li
Copy link
Member Author

If we have N = 40 maybe, then 320 bits will be used, what type will we use?

If it overflows u64, defaults to use String(binary) type.

@wangzhen11aaa
Copy link
Contributor

If we have N = 40 maybe, then 320 bits will be used, what type will we use?

If it overflows u64, defaults to use String(binary) type.

ok

@jiangzhe
Copy link

If we have N = 40 maybe, then 320 bits will be used, what type will we use?

If it overflows u64, defaults to use String(binary) type.

Why not combine the null bits for multiple columns to save space?
Use prefix or suffix byte to store all null bits in case of multiple columns.
Therefore group by no more than 8 columns, only requires one extra byte.

@sundy-li
Copy link
Member Author

Therefore group by no more than 8 columns, only requires one extra byte.

Yes, I also thought about this. But we should introduce native UInt24, UInt48 columns firstly.

@wangzhen11aaa
Copy link
Contributor

wangzhen11aaa commented Feb 12, 2022

Therefore group by no more than 8 columns, only requires one extra byte.

Yes, I also thought about this. But we should introduce native UInt24, UInt48 columns firstly.

This kind of type may be not friendly with the cache line.That's may be in too low-level layer to consider this question.

@wangzhen11aaa
Copy link
Contributor

wangzhen11aaa commented Feb 14, 2022

Summary

Description for this feature.

Currently

we ignore the nulls to generate the hash key of the group by columns.

  pub fn fixed_hash(column: &ColumnRef, ptr: *mut u8, step: usize) -> Result<()> {
        let column = column.convert_full_column();
        // TODO support nullable
        let column = Series::remove_nullable(&column);
        ...
    }

So the results are not correct in this query:

MySQL [(none)]> create table t(a UInt64, b UInt32) Engine = Fuse;
Query OK, 0 rows affected (0.003 sec)

MySQL [(none)]>
MySQL [(none)]> insert into t(a,b)  select if (number % 3 = 1, null, number) as a, number + 3 as b from numbers(10);
Query OK, 0 rows affected (0.006 sec)

MySQL [(none)]> select a%3 as c, count(1) from t group by c;
+------+----------+
| c    | count(1) |
+------+----------+
|    0 |        7 |
|    2 |        3 |
+------+----------+
2 rows in set (0.005 sec)

How:

  1. a % 3 as c is Nullable(UInt8). If c is UInt8, we can hash column c into UInt8 Column using fixed_hash. But we should use extra bits to identify the null value, so UInt16 is used: [ 8bit(value) -- 8bits(just use 1 bit to identify null) ]
  2. If there are multiple nullable columns, say: N, N * 8 bits will be used.
  3. The final bits must be rounded into primitive types: [ Nullabe(UInt8), UInt8 ] --> 24bits ---> round into 32bits, UInt32Column will be used to store the hashkeys.

This task can be taken after #4074 is merged.

This method will be very similar with clickhouse's way. It uses the array<UInt8, getBitmapSize()> to represent the nullable.
We need some Type promotion method to compute the next-higher level Type. right?
@sundy-li

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-query Area: databend query C-feature Category: feature community-take
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants