Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] About supporting sequenceCount function and retention function #10473

Open
2 of 3 tasks
Tracked by #11706 ...
FreeOnePlus opened this issue Jun 28, 2022 · 7 comments
Open
2 of 3 tasks
Tracked by #11706 ...
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@FreeOnePlus
Copy link
Contributor

Search before asking

  • I had searched in the issues and found no similar issues.

Description

In ClickHouse, support for sequenceCount function and retention function is provided, but Doris currently does not have these two functions, and hopes to gain support

Use case

The following are the explanations and use cases of the two functions in ClickHouse

sequenceCount(pattern)(time, cond1, cond2, …)

Counts the number of event chains that matched the pattern. The function searches event chains that do not overlap. It starts to search for the next chain after the current chain is matched.

WARNING
Events that occur at the same second may lay in the sequence in an undefined order affecting the result.

sequenceCount(pattern)(timestamp, cond1, cond2, ...)

Arguments

timestamp — Column considered to contain time data. Typical data types are Date and DateTime. You can also use any of the supported UInt data types.

cond1, cond2 — Conditions that describe the chain of events. Data type: UInt8. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them.

Parameters

pattern — Pattern string. See Pattern syntax.
Returned values

Number of non-overlapping event chains that are matched.
Type: UInt64.

Example

Consider data in the t table:

┌─time─┬─number─┐
│ 1 │ 1 │
│ 2 │ 3 │
│ 3 │ 2 │
│ 4 │ 1 │
│ 5 │ 3 │
│ 6 │ 2 │
└──────┴────────┘

Count how many times the number 2 occurs after the number 1 with any amount of other numbers between them:

SELECT sequenceCount('(?1).*(?2)')(time, number = 1, number = 2) FROM t

┌─sequenceCount('(?1).*(?2)')(time, equals(number, 1), equals(number, 2))─┐
│ 2 │
└─────────────────────────────────────────────────────────────────────────┘

retention function

The function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8 that indicate whether a certain condition was met for the event. Any condition can be specified as an argument (as in WHERE).

The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and third are true, etc.

Syntax

retention(cond1, cond2, ..., cond32);

Arguments

cond — An expression that returns a UInt8 result (1 or 0).
Returned value

The array of 1 or 0.

1 — Condition was met for the event.
0 — Condition wasn’t met for the event.
Type: UInt8.

Example

Let’s consider an example of calculating the retention function to determine site traffic.

  1. Сreate a table to illustrate an example.

CREATE TABLE retention_test(date Date, uid Int32) ENGINE = Memory;

INSERT INTO retention_test SELECT '2020-01-01', number FROM numbers(5);
INSERT INTO retention_test SELECT '2020-01-02', number FROM numbers(10);
INSERT INTO retention_test SELECT '2020-01-03', number FROM numbers(15);

Input table:

Query:

SELECT * FROM retention_test

Result:

┌───────date─┬─uid─┐
│ 2020-01-01 │ 0 │
│ 2020-01-01 │ 1 │
│ 2020-01-01 │ 2 │
│ 2020-01-01 │ 3 │
│ 2020-01-01 │ 4 │
└────────────┴─────┘
┌───────date─┬─uid─┐
│ 2020-01-02 │ 0 │
│ 2020-01-02 │ 1 │
│ 2020-01-02 │ 2 │
│ 2020-01-02 │ 3 │
│ 2020-01-02 │ 4 │
│ 2020-01-02 │ 5 │
│ 2020-01-02 │ 6 │
│ 2020-01-02 │ 7 │
│ 2020-01-02 │ 8 │
│ 2020-01-02 │ 9 │
└────────────┴─────┘
┌───────date─┬─uid─┐
│ 2020-01-03 │ 0 │
│ 2020-01-03 │ 1 │
│ 2020-01-03 │ 2 │
│ 2020-01-03 │ 3 │
│ 2020-01-03 │ 4 │
│ 2020-01-03 │ 5 │
│ 2020-01-03 │ 6 │
│ 2020-01-03 │ 7 │
│ 2020-01-03 │ 8 │
│ 2020-01-03 │ 9 │
│ 2020-01-03 │ 10 │
│ 2020-01-03 │ 11 │
│ 2020-01-03 │ 12 │
│ 2020-01-03 │ 13 │
│ 2020-01-03 │ 14 │
└────────────┴─────┘

  1. Group users by unique ID uid using the retention function.

Query:

SELECT
uid,
retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r
FROM retention_test
WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03')
GROUP BY uid
ORDER BY uid ASC

Result:

┌─uid─┬─r───────┐
│ 0 │ [1,1,1] │
│ 1 │ [1,1,1] │
│ 2 │ [1,1,1] │
│ 3 │ [1,1,1] │
│ 4 │ [1,1,1] │
│ 5 │ [0,0,0] │
│ 6 │ [0,0,0] │
│ 7 │ [0,0,0] │
│ 8 │ [0,0,0] │
│ 9 │ [0,0,0] │
│ 10 │ [0,0,0] │
│ 11 │ [0,0,0] │
│ 12 │ [0,0,0] │
│ 13 │ [0,0,0] │
│ 14 │ [0,0,0] │
└─────┴─────────┘

  1. Calculate the total number of site visits per day.

Query:

SELECT
sum(r[1]) AS r1,
sum(r[2]) AS r2,
sum(r[3]) AS r3
FROM
(
SELECT
uid,
retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r
FROM retention_test
WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03')
GROUP BY uid
)

Result:

┌─r1─┬─r2─┬─r3─┐
│ 5 │ 5 │ 5 │
└────┴────┴────┘

Where:

r1- the number of unique visitors who visited the site during 2020-01-01 (the cond1 condition).
r2- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-02 (cond1 and cond2 conditions).
r3- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-03 (cond1 and cond3 conditions).

Related issues

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@FreeOnePlus FreeOnePlus added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 28, 2022
@luzhijing luzhijing mentioned this issue Aug 11, 2022
36 tasks
@isHuangXin
Copy link
Contributor

isHuangXin commented Aug 30, 2022

[WeOpen-Star] I want to try it. please assign this issue to me~

@Yukang-Lian
Copy link
Collaborator

[WeOpen-Star] I want to try it. please assign this issue to me~

Hello handsome, my mentor asked me to try to implement the retention function, and I see thst you are trying to implement it. I think if you haven't implemented it yet, could you please give me this task (just Retention function)? thank you very much!

@isHuangXin
Copy link
Contributor

Plz unassign me. Because doris does not support building the project in the M1 ARM programming environment. At that time, my computer is an M1 chip.

@Yukang-Lian
Copy link
Collaborator

Thank you very much! In fact, I am also using m1 macbook. I usually develop on macbook and compile on Alibaba Cloud server. This is my suggestion and I hope it can help you. 😄

@isHuangXin
Copy link
Contributor

Thank you very much! In fact, I am also using m1 macbook. I usually develop on macbook and compile on Alibaba Cloud server. This is my suggestion and I hope it can help you. 😄

Get it. Thanks for your advice and I'll try that.

@dataroaring
Copy link
Contributor

retention is picked by a guy. #12800.

@Yukang-Lian
Copy link
Collaborator

retention function has been finished #13056
I will start to implement sequenceCount #13118 and sequenceMatch #13117 soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

4 participants