[Feature] support function hll_from_base64 #31320

Closed
2 of 3 tasks
learner1212 opened this issue Feb 23, 2024 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@learner1212

Search before asking

  • I had searched in the issues and found no similar issues.

Description

Doris has the bitmap function bitmap_from_base64. Could Doris also support an hll_from_base64 function for HLL, analogous to the bitmap one?

Use case

No response

Related issues

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@learner1212 learner1212 added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 23, 2024
@superdiaodiao
Contributor

Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?

@learner1212
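Author

> Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?

Our current scenario: before loading into Doris, we aggregate the data by different dimensions in Spark or Flink to reduce the data volume, and our metrics need approximate computation for UV values. In this scenario we currently have no choice but to load the detail data into Doris directly, which puts very heavy pressure on Doris.
We therefore hope Doris can support this function. With it, we could write a hll_to_base64 UDAF in Spark or Flink, compute the base64 value during aggregation, and then use hll_from_base64 when loading into Doris.
Thank you very much for looking into this issue.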
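
As a rough illustration of the pre-aggregation flow described above, the following Python sketch builds one small HyperLogLog sketch per dimension and base64-encodes it so it can travel through a text-based load. Everything here is an assumption made for illustration: the register count, the SHA-256-based hash, and the byte layout are not Doris's HLL format, so these strings could not be passed to hll_from_base64 as-is. A real UDAF would have to emit Doris's own serialization.

```python
import base64
import hashlib
import math

P = 10          # precision: 2**10 registers (illustrative, not Doris's setting)
M = 1 << P      # number of registers


def _hash64(value: str) -> int:
    """64-bit hash for the toy sketch; Doris uses its own hash function."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch() -> bytearray:
    return bytearray(M)


def add(sketch: bytearray, value: str) -> None:
    h = _hash64(value)
    index = h >> (64 - P)                     # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 54 bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    sketch[index] = max(sketch[index], rank)


def merge(a: bytearray, b: bytearray) -> bytearray:
    """Union of two sketches: register-wise maximum."""
    return bytearray(max(x, y) for x, y in zip(a, b))


def estimate(sketch: bytearray) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)        # small-range correction
    return raw


def to_base64(sketch: bytearray) -> str:
    return base64.b64encode(bytes(sketch)).decode("ascii")


def from_base64(text: str) -> bytearray:
    return bytearray(base64.b64decode(text))


def pre_aggregate(rows):
    """Group (dimension, user_id) rows into one base64 sketch per dimension."""
    sketches = {}
    for dimension, user_id in rows:
        add(sketches.setdefault(dimension, new_sketch()), user_id)
    return {dim: to_base64(s) for dim, s in sketches.items()}


if __name__ == "__main__":
    rows = [("cn", "u1"), ("cn", "u2"), ("cn", "u1"), ("us", "u3")]
    for dim, encoded in pre_aggregate(rows).items():
        print(dim, round(estimate(from_base64(encoded))), len(encoded))
```

The point of the design is that sketches merge by register-wise maximum, so rows aggregated upstream can still be combined after loading without going back to the detail data.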

@cambyzju
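Contributor

> > Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?
>
> Our current scenario: before loading into Doris, we aggregate the data by different dimensions in Spark or Flink to reduce the data volume, and our metrics need approximate computation for UV values. In this scenario we currently have no choice but to load the detail data into Doris directly, which puts very heavy pressure on Doris. We therefore hope Doris can support this function. With it, we could write a hll_to_base64 UDAF in Spark or Flink, compute the base64 value during aggregation, and then use hll_from_base64 when loading into Doris. Thank you very much for looking into this issue.

Doris also supports UDFs. Have you tried that?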

@learner1212
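Author

> > > Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?
> >
> > Our current scenario: before loading into Doris, we aggregate the data by different dimensions in Spark or Flink to reduce the data volume, and our metrics need approximate computation for UV values. In this scenario we currently have no choice but to load the detail data into Doris directly, which puts very heavy pressure on Doris. We therefore hope Doris can support this function. With it, we could write a hll_to_base64 UDAF in Spark or Flink, compute the base64 value during aggregation, and then use hll_from_base64 when loading into Doris. Thank you very much for looking into this issue.
>
> Doris also supports UDFs. Have you tried that?

The supported types listed in the official documentation do not include a binary type, so I assume this is not supported?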

@learner1212
Author

> Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?

I couldn't find this function in any other OLAP engine. I need to convert a base64 string (from a Spark/Flink aggregation result) into an HLL binary. We found that Doris supports bitmap_from_base64, which converts a base64 string into a bitmap binary, and that looks similar to what we are looking for.

@DO12345whatever

I have a similar issue: we want to import data from Apache Druid, which stores HLL and Theta sketch columns as base64-encoded strings.
Additionally, it is not even possible to export/import Doris tables with HLL columns, because ORC and Parquet do not support them and CSV simply drops their content: https://doris.apache.org/docs/dev/data-operate/export/outfile/#notice

@superdiaodiao
Contributor

> I couldn't find this function in any other OLAP engine. I need to convert a base64 string (from a Spark/Flink aggregation result) into an HLL binary. We found that Doris supports bitmap_from_base64, which converts a base64 string into a bitmap binary, and that looks similar to what we are looking for.

OK, I will do it, please assign this to me.

@learner1212
Author

> > I couldn't find this function in any other OLAP engine. I need to convert a base64 string (from a Spark/Flink aggregation result) into an HLL binary. We found that Doris supports bitmap_from_base64, which converts a base64 string into a bitmap binary, and that looks similar to what we are looking for.
>
> OK, I will do it, please assign this to me.

@morningman Can you assign this issue to him?

morningman pushed a commit that referenced this issue Apr 16, 2024
…e64 (#32089)

Issue Number: #31320 

Support two HLL functions:

- hll_from_base64: converts a base64 string (the result of hll_to_base64) into an HLL.
- hll_to_base64: converts an input HLL to a base64 string.
yiguolei pushed a commit that referenced this issue Apr 16, 2024
…e64 (#32089)

yiguolei pushed a commit that referenced this issue Apr 17, 2024
…e64 (#32089)

@superdiaodiao
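Contributor

> > Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?
>
> Our current scenario: before loading into Doris, we aggregate the data by different dimensions in Spark or Flink to reduce the data volume, and our metrics need approximate computation for UV values. In this scenario we currently have no choice but to load the detail data into Doris directly, which puts very heavy pressure on Doris. We therefore hope Doris can support this function. With it, we could write a hll_to_base64 UDAF in Spark or Flink, compute the base64 value during aggregation, and then use hll_from_base64 when loading into Doris. Thank you very much for looking into this issue.

Moreover, after this PR you can use our HLL UDFs, including to_hll, in Hive/Spark to load data into Doris: #33896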

@learner1212
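Author

> > > Could you give us more info? For example, how does this function behave in other OLAP engines, MySQL, or elsewhere?
> >
> > Our current scenario: before loading into Doris, we aggregate the data by different dimensions in Spark or Flink to reduce the data volume, and our metrics need approximate computation for UV values. In this scenario we currently have no choice but to load the detail data into Doris directly, which puts very heavy pressure on Doris. We therefore hope Doris can support this function. With it, we could write a hll_to_base64 UDAF in Spark or Flink, compute the base64 value during aggregation, and then use hll_from_base64 when loading into Doris. Thank you very much for looking into this issue.
>
> Moreover, after this PR you can use our HLL UDFs, including to_hll, in Hive/Spark to load data into Doris: #33896

When I use hll_from_base64 in Stream Load, I get an error: reason: [ANALYSIS_ERROR]TStatus: errCode = 2, detailMessage = HLL column must use hll_hash function, like online_cnt=hll_hash(xxx) or online_cnt=hll_empty()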
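
The error text names the two mappings the analyzer accepted for an HLL column at that point: hll_hash(...) and hll_empty(). As a hedged sketch, a mapping of that form could be assembled for the Stream Load `columns` header as shown below. The label, `dt`, and `user_id` are hypothetical names; `online_cnt` is taken from the error message. Whether `hll_from_base64(...)` is accepted in the same position depends on the Doris version, which this thread does not settle.

```python
def stream_load_headers(label, source_columns, derived_columns, separator=","):
    """Build the HTTP headers for a Stream Load request (no network I/O)."""
    # Each derived column is written as "target=expression", the form
    # shown in the error message (online_cnt=hll_hash(xxx)).
    mappings = [f"{name}={expr}" for name, expr in derived_columns.items()]
    return {
        "label": label,
        "column_separator": separator,
        "columns": ", ".join(list(source_columns) + mappings),
        "Expect": "100-continue",
    }


if __name__ == "__main__":
    headers = stream_load_headers(
        label="uv_load_001",                      # hypothetical label
        source_columns=["dt", "user_id"],         # hypothetical file columns
        derived_columns={"online_cnt": "hll_hash(user_id)"},
    )
    print(headers["columns"])  # dt, user_id, online_cnt=hll_hash(user_id)
```

Swapping the expression for one the analyzer rejects would reproduce the ANALYSIS_ERROR quoted above, so this mapping is the place to check first.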
