-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement sequenceNextNode()
function useful for flow analysis
.
#19766
Conversation
flow analysis
.sequenceNextNode()
function useful for flow analysis
.
d97f77f
to
8db0caf
Compare
I added documentation. |
It seems that the failed tests are not related to the PR.
|
Probably, It is possible to reduce memory usage or network traffic by encoding data and timestamps when aggregated data are large. |
I added |
It seems that the failed tests are irrelevant. @alexey-milovidov could you help review this PR? If any problem, please let me know. |
@achimbab First of all, I think that we need more details in description, about terminology and what do we call Also it's good to add general description to code in comments, more details about how generally algorithm works, what it stores in aggregate function data, does it store all Nodes in memory, why do we use arena allocator for them and so on. |
@vdimir I implemented
It refers to activities of users of websites.
The
In the below tables schema, the action column refers to the event. CREATE TABLE test_flow (dt DateTime, id int, action String)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(dt)
ORDER BY id; For example,
I will write comments as soon as possible.
The aggregate function stores an array of If the pattern is
It uses arena allocator for efficiency as GroupArray does. |
I added comments to the source code. |
@vdimir |
Yes, I'm going to look at code. Could you describe in general what changes you made since last version? |
I modified the core algorithm because the core algorithm had the exceptional case. The changes from the last version:
Modified parameters
Added
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions
@@ -276,6 +276,28 @@ class AggregateFunctionNullVariadic final | |||
this->nested_function->add(this->nestedPlace(place), nested_columns, row_num, arena); | |||
} | |||
|
|||
void insertResultInto(AggregateDataPtr place, IColumn & to, Arena * arena) const override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build succeed and test passed without this function? Do we need it? If really need there should be test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, I will remove the function.
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
@vdimir
Thank you. |
If we consider exception case from #19766 (comment)
We got with and NULL. But why not wish for id = 1, like
Is it true that in most cases base_cond and first event will be same? Also order of arguments looks confusing, because we have base_cond between timestamp_column and event_column. Maybe Actually, maybe we can consider base as part of chain? Could you provide explicit example why we distinct it?
|
The exceptional case in flow analysis is that: The number of people decreases from
So I fixed the algorithm to return NULL for
|
I considered semantic of this function close to |
I will explain why Prepare Data CREATE TABLE bar (timestamp DateTime, id Int32, page String, ref_url String) ENGINE=TinyLog;
INSERT INTO bar VALUES (1, 1, 'home', ''), (2, 1, 'home', ''), (3, 1, 'giftbox', 'black_friday'), (4, 1, 'wish', ''), (1, 2, 'home', 'black_friday'), (2, 2, 'giftbox', ''), (3, 2, 'wish', ''), (1, 3, 'default', 'black_friday'), (2, 3, 'promotion', ''); Get first pages with ref_url = 'black_friday' I want to get the pages where the beginning of rows for each id. SELECT
id,
sequenceNextNode('forward', 'head')(timestamp, ref_url = 'black_friday', page)
FROM bar
GROUP BY id
FORMAT PrettySpace
id sequenceNextNode('forward', 'head')(timestamp, equals(ref_url, 'black_friday'), page)
3 default
2 home
1 ᴺᵁᴸᴸ
It is not achieved with SELECT
id,
sequenceNextNode('forward', 'first_match')(timestamp, 1, page, ref_url = 'black_friday')
FROM bar
GROUP BY id
FORMAT PrettySpace
id sequenceNextNode('forward', 'first_match')(timestamp, 1, page, equals(ref_url, 'black_friday'))
3 promotion // this value is not the same with the above result.
2 giftbox // this value is not the same with the above result.
1 wish // this value is not the same with the above result.
|
If |
Second part of #19766 (comment)
What do you think? Also function still looks a bit special for me, so I'm inclined to merge it available under flag (like |
That's better, I will fix the order of arguments.
In most cases end-users will not use
I agree with you. |
I fixed source codes d13d69e
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge it after #22762 to pass settings explicitly into function
{ | ||
if (CurrentThread::isInitialized()) | ||
{ | ||
const Context * query_context = CurrentThread::get().getQueryContext(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to merge PR #22762 first to pass setting explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you a lot.
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/parametric-functions.md
Outdated
Show resolved
Hide resolved
Internal documentation ticket: DOCSUP-10025. |
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement
sequenceNextNode()
function useful forflow analysis
.Detailed description / Documentation draft:
The
sequenceNextNode(timestamp_column, event_column, event1, event2, ...)
is an aggregate function that returns a value of next event that matched the events.For instance, it can be used when events are
A->B->C->E->F
and you want to know the event followingB->C
, which isE
.It is useful for
flow analysis
.The query statement searching the event following
B->C
:Thank you.