Implement sequenceNextNode() function useful for flow analysis. #19766

Merged
merged 49 commits into from
Jun 3, 2021

Conversation

achimbab
Contributor

@achimbab achimbab commented Jan 28, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement sequenceNextNode() function useful for flow analysis.

Detailed description / Documentation draft:

sequenceNextNode(timestamp_column, event_column, event1, event2, ...) is an aggregate function that returns the value of the next event after the matched chain of events.

sequenceNextNode([descending_order])(timestamp, event_column, event1, event2, ... eventN)
return:
  event_column[next_index], if the pattern is matched and a next value exists.
  NULL, if the pattern is not matched or the next value doesn't exist.

For instance, it can be used when events are A->B->C->E->F and you want to know the event following B->C, which is E.

It is useful for flow analysis.

A query that searches for the event following B->C:

CREATE TABLE test_flow (dt DateTime, id int, action String) 
ENGINE = MergeTree() 
PARTITION BY toYYYYMMDD(dt) 
ORDER BY id;

INSERT INTO test_flow VALUES (1, 1, 'A') (2, 1, 'B') (3, 1, 'C') (4, 1, 'E') (5, 1, 'F');

SELECT id, sequenceNextNode(0)(dt, action, action = 'B', action = 'C') FROM test_flow GROUP BY id;
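The matching described above can be sketched in Python (a simplified model with illustrative names, not the actual C++ implementation; it assumes the strict, consecutive matching discussed later in this thread):

```python
from typing import Callable, List, Optional

def sequence_next_node(rows: List[str],
                       conds: List[Callable[[str], bool]]) -> Optional[str]:
    # rows: event_column values for one id, already sorted by timestamp
    # conds: the event1..eventN predicates
    n = len(conds)
    for start in range(len(rows) - n + 1):
        # the chain must match consecutive rows
        if all(cond(rows[start + i]) for i, cond in enumerate(conds)):
            nxt = start + n
            return rows[nxt] if nxt < len(rows) else None  # NULL if no next row
    return None  # NULL if the pattern is not matched

# A -> B -> C -> E -> F: the event following B -> C is E
print(sequence_next_node(list("ABCEF"),
                         [lambda e: e == 'B', lambda e: e == 'C']))  # E
```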

Thank you.

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Jan 28, 2021
@achimbab achimbab changed the title Implement sequenceNextNode as a helper function of flow analysis. Implement sequenceNextNode() function useful for flow analysis. Jan 28, 2021
@achimbab
Contributor Author

achimbab commented Jan 31, 2021

I added documentation.

@achimbab
Contributor Author

achimbab commented Feb 2, 2021

It seems that the failed tests are not related to the PR.

  1. test_distributed_ddl_parallel/test.py::test_all_in_parallel

Code: 36. DB::Exception: Received from 172.18.0.6:9000. DB::Exception: There was an error on [n1:9000]: Code: 36, e.displayText() = DB::Exception: external dictionary 'slow_dict' not found (version 21.2.1.5869). Stack trace:

  2. test_testkeeper_back_to_back/test.py::test_multitransactions

assert len(list(minio.list_objects(cluster.minio_bucket, 'data/'))) == FILES_OVERHEAD_PER_PART_WIDE + FILES_OVERHEAD
AssertionError: assert 54 == (15 + 1)

@achimbab
Contributor Author

achimbab commented Feb 2, 2021

It is probably possible to reduce memory usage or network traffic by encoding values and timestamps when the aggregated data are large.

@achimbab
Contributor Author

achimbab commented Feb 4, 2021

I added sequenceFirstNode to get the starting points.
When the event arguments are omitted in sequenceNextNode, the call is automatically replaced with sequenceFirstNode.

@achimbab
Contributor Author

achimbab commented Feb 9, 2021

It seems that the failed tests are irrelevant.

@alexey-milovidov could you help review this PR?

If there is any problem, please let me know.

@vdimir vdimir self-assigned this Feb 17, 2021
@vdimir
Member

vdimir commented Feb 17, 2021

@achimbab
Sorry for the long delay. Thank you for your PR!

First of all, I think we need more details in the description: the terminology (what we call an event and a chain), how we model the data in the table (some special form of graph, I suppose), and maybe some links to external resources about these concepts.

It would also be good to add a general description to the code in comments: more details about how the algorithm works, what it stores in the aggregate function data, whether it stores all Nodes in memory, why we use the arena allocator for them, and so on.

@achimbab
Contributor Author

achimbab commented Feb 17, 2021

@vdimir
It's okay.

I implemented sequenceNextNode for flow analysis.

event

It refers to an activity of a user on a website.

chain

A chain consists of the ordered events of a user.

data in table

In the table schema below, the action column holds the event.

CREATE TABLE test_flow (dt DateTime, id int, action String) 
ENGINE = MergeTree() 
PARTITION BY toYYYYMMDD(dt) 
ORDER BY id;

For example,

dt                    id   action
2021-01-01 00:00:00    1   visit_our_shoppingmall
2021-01-01 00:01:00    1   sign_in
2021-01-01 00:02:00    1   browse_item
2021-01-01 00:03:00    1   add_to_cart
2021-01-01 00:04:00    1   buy_all_items

Also it's good to add general description to code in comments, more details about how generally algorithm works,

I will write comments as soon as possible.

what it stores in aggregate function data

The aggregate function stores an array of (value of event_column, timestamp, bitset of matched eventN) tuples.

If the pattern is visit_our_shoppingmall -> sign_in -> browse_item, the matched events and the next value are shown in the table below.

dt                    id   action                   note
2021-01-01 00:00:00    1   visit_our_shoppingmall   matched event1
2021-01-01 00:01:00    1   sign_in                  matched event2
2021-01-01 00:02:00    1   browse_item              matched event3; the whole chain is matched
2021-01-01 00:03:00    1   add_to_cart              the next value, i.e. the result
2021-01-01 00:04:00    1   buy_all_items            -

does it store all Nodes in memory, why do we use arena allocator for them and so on.

It uses the arena allocator for efficiency, as groupArray does.
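The stored state can be modeled roughly as follows (a Python sketch based on the description above; names are illustrative, and the real code keeps an arena-allocated array of nodes in C++):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    value: str          # value of event_column for this row
    timestamp: int      # keeps nodes ordered when partial states are merged
    events_bitset: int  # bit k set => this row matched eventK

def add_row(state: List[Node], value: str, timestamp: int,
            conds: List[Callable[[str], bool]]) -> None:
    # Evaluate every event condition once and pack the results into a
    # bitset, so later passes over the state never re-run the predicates.
    bits = 0
    for k, cond in enumerate(conds):
        if cond(value):
            bits |= 1 << k
    state.append(Node(value, timestamp, bits))

def merge(a: List[Node], b: List[Node]) -> List[Node]:
    # Partial aggregation states are combined and kept sorted by timestamp.
    return sorted(a + b, key=lambda n: n.timestamp)
```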

@achimbab
Contributor Author

@vdimir

I added comments to the source code.
Feel free to comment if there is anything I have to do in order to merge this PR. Have a nice day.

@achimbab
Contributor Author

achimbab commented Mar 31, 2021

@vdimir
It seems that the failed test cases are irrelevant to this PR.
May I get your code-review again?

@vdimir
Member

vdimir commented Mar 31, 2021

@achimbab

Yes, I'm going to look at code.

Could you describe in general what changes you made since last version?

@achimbab
Contributor Author

achimbab commented Apr 1, 2021

@vdimir

I modified the core algorithm because it mishandled an exceptional case.

The changes from the last version:

  • Deleted the Boyer-Moore-like algorithm. The current version uses a simple for-loop to check whether the chain of events is matched.
  • Modified parameters
  • Added base condition argument

Modified parameters

  • option 1: direction - Used to set the direction of traversal.

    • forward : Moving forward
    • backward: Moving backward
  • option 2: base - Used to set the base point.

    • head : Set the base point to the first event
    • tail : Set the base point to the last event
    • first_match : Set the base point to the first matched event1
    • last_match : Set the base point to the last matched event1

Added base condition argument.

  • base_condition — Condition that the base point must fulfill.
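A sketch of how these options could resolve the base point (illustrative Python under the parameter descriptions above; the real logic lives in the C++ aggregate function):

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, str]  # (timestamp, value); assumed sorted by timestamp

def base_point(rows: List[Row], direction: str, base: str,
               base_cond: Callable[[Row], bool],
               event1: Callable[[Row], bool]) -> Optional[Tuple[int, int]]:
    # Returns (index of the base point, iteration step) or None.
    if not rows:
        return None
    if base == 'head':
        idx: Optional[int] = 0
    elif base == 'tail':
        idx = len(rows) - 1
    elif base == 'first_match':
        idx = next((i for i, r in enumerate(rows) if event1(r)), None)
    else:  # 'last_match'
        idx = next((i for i in range(len(rows) - 1, -1, -1)
                    if event1(rows[i])), None)
    if idx is None or not base_cond(rows[idx]):
        return None  # the base point must also fulfill base_condition
    return idx, 1 if direction == 'forward' else -1
```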

Examples

Behavior for forward and head

ALTER TABLE test_flow DELETE WHERE 1 = 1 settings mutations_sync = 1;
INSERT INTO test_flow VALUES (1, 1, 'Home') (2, 1, 'Gift') (3, 1, 'Exit');
INSERT INTO test_flow VALUES (1, 2, 'Home') (2, 2, 'Home') (3, 2, 'Gift') (4, 2, 'Basket');
INSERT INTO test_flow VALUES (1, 3, 'Gift') (2, 3, 'Home') (3, 3, 'Gift') (4, 3, 'Basket');
SELECT id, sequenceNextNode('forward', 'head')(dt, page = 'Home', page, page = 'Home', page = 'Gift') FROM test_flow GROUP BY id;
 
                  dt   id   page
 1970-01-01 09:00:01    1   Home // Base point, Matched with Home
 1970-01-01 09:00:02    1   Gift // Matched with Gift
 1970-01-01 09:00:03    1   Exit // The result 
 1970-01-01 09:00:01    2   Home // Base point, Matched with Home
 1970-01-01 09:00:02    2   Home // Unmatched with Gift
 1970-01-01 09:00:03    2   Gift
 1970-01-01 09:00:04    2   Basket    
 
 1970-01-01 09:00:01    3   Gift // Base point, Unmatched with Home
 1970-01-01 09:00:02    3   Home      
 1970-01-01 09:00:03    3   Gift      
 1970-01-01 09:00:04    3   Basket    

Behavior for backward and tail

SELECT id, sequenceNextNode('backward', 'tail')(dt, page = 'Basket', page, page = 'Basket', page = 'Gift') FROM test_flow GROUP BY id;
                 dt   id   page
1970-01-01 09:00:01    1   Home
1970-01-01 09:00:02    1   Gift
1970-01-01 09:00:03    1   Exit // Base point, Unmatched with Basket
                                     
1970-01-01 09:00:01    2   Home 
1970-01-01 09:00:02    2   Home // The result 
1970-01-01 09:00:03    2   Gift // Matched with Gift
1970-01-01 09:00:04    2   Basket // Base point, Matched with Basket
                                     
1970-01-01 09:00:01    3   Gift
1970-01-01 09:00:02    3   Home // The result 
1970-01-01 09:00:03    3   Gift // Matched with Gift
1970-01-01 09:00:04    3   Basket // Base point, Matched with Basket

Behavior for forward and first_match

SELECT id, sequenceNextNode('forward', 'first_match')(dt, page = 'Gift', page, page = 'Gift') FROM test_flow GROUP BY id;
                 dt   id   page
1970-01-01 09:00:01    1   Home
1970-01-01 09:00:02    1   Gift // Base point
1970-01-01 09:00:03    1   Exit // The result
                                     
1970-01-01 09:00:01    2   Home 
1970-01-01 09:00:02    2   Home 
1970-01-01 09:00:03    2   Gift // Base point
1970-01-01 09:00:04    2   Basket // The result
                                     
1970-01-01 09:00:01    3   Gift // Base point
1970-01-01 09:00:02    3   Home // The result
1970-01-01 09:00:03    3   Gift   
1970-01-01 09:00:04    3   Basket    
SELECT id, sequenceNextNode('forward', 'first_match')(dt, page = 'Gift', page, page = 'Gift', page = 'Home') FROM test_flow GROUP BY id;
                 dt   id   page
1970-01-01 09:00:01    1   Home
1970-01-01 09:00:02    1   Gift // Base point
1970-01-01 09:00:03    1   Exit // Unmatched with Home
                                     
1970-01-01 09:00:01    2   Home 
1970-01-01 09:00:02    2   Home 
1970-01-01 09:00:03    2   Gift // Base point
1970-01-01 09:00:04    2   Basket // Unmatched with Home
                                     
1970-01-01 09:00:01    3   Gift // Base point
1970-01-01 09:00:02    3   Home // Matched with Home
1970-01-01 09:00:03    3   Gift // The result
1970-01-01 09:00:04    3   Basket    

Behavior for backward and last_match

SELECT id, sequenceNextNode('backward', 'last_match')(dt, page = 'Gift', page, page = 'Gift') FROM test_flow GROUP BY id;
                 dt   id   page
1970-01-01 09:00:01    1   Home // The result
1970-01-01 09:00:02    1   Gift // Base point
1970-01-01 09:00:03    1   Exit 
                                     
1970-01-01 09:00:01    2   Home 
1970-01-01 09:00:02    2   Home // The result
1970-01-01 09:00:03    2   Gift // Base point
1970-01-01 09:00:04    2   Basket    
                                     
1970-01-01 09:00:01    3   Gift 
1970-01-01 09:00:02    3   Home // The result
1970-01-01 09:00:03    3   Gift // Base point  
1970-01-01 09:00:04    3   Basket    
SELECT id, sequenceNextNode('backward', 'last_match')(dt, page = 'Gift', page, page = 'Gift', page = 'Home') FROM test_flow GROUP BY id;
                 dt   id   page
1970-01-01 09:00:01    1   Home // Matched with Home, but there is no previous row, so the result is null
1970-01-01 09:00:02    1   Gift // Base point
1970-01-01 09:00:03    1   Exit 
                                     
1970-01-01 09:00:01    2   Home // The result
1970-01-01 09:00:02    2   Home // Matched with Home
1970-01-01 09:00:03    2   Gift // Base point
1970-01-01 09:00:04    2   Basket    
                                     
1970-01-01 09:00:01    3   Gift // The result
1970-01-01 09:00:02    3   Home // Matched with Home
1970-01-01 09:00:03    3   Gift // Base point  
1970-01-01 09:00:04    3   Basket    

Behavior for base_condition

CREATE TABLE test_flow_basecond
(
    `dt` DateTime,
    `id` int,
    `page` String,
    `ref` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(dt)
ORDER BY id;

INSERT INTO test_flow_basecond VALUES (1, 1, 'A', 'ref4') (2, 1, 'A', 'ref3') (3, 1, 'B', 'ref2') (4, 1, 'B', 'ref1');
SELECT id, sequenceNextNode('forward', 'head')(dt, ref = 'ref1', page, page = 'A') FROM test_flow_basecond GROUP BY id;
                  dt   id   page   ref 
 1970-01-01 09:00:01    1   A      ref4 // The head can't be the base point because its ref column does not match 'ref1'.
 1970-01-01 09:00:02    1   A      ref3 
 1970-01-01 09:00:03    1   B      ref2 
 1970-01-01 09:00:04    1   B      ref1 
SELECT id, sequenceNextNode('backward', 'tail')(dt, ref = 'ref4', page, page = 'B') FROM test_flow_basecond GROUP BY id;
                  dt   id   page   ref 
 1970-01-01 09:00:01    1   A      ref4
 1970-01-01 09:00:02    1   A      ref3 
 1970-01-01 09:00:03    1   B      ref2 
 1970-01-01 09:00:04    1   B      ref1 // The tail can't be the base point because its ref column does not match 'ref4'.
SELECT id, sequenceNextNode('forward', 'first_match')(dt, ref = 'ref3', page, page = 'A') FROM test_flow_basecond GROUP BY id;
                  dt   id   page   ref 
 1970-01-01 09:00:01    1   A      ref4 // This row can't be the base point because its ref column does not match 'ref3'.
 1970-01-01 09:00:02    1   A      ref3 // Base point
 1970-01-01 09:00:03    1   B      ref2 // The result
 1970-01-01 09:00:04    1   B      ref1 
SELECT id, sequenceNextNode('backward', 'last_match')(dt, ref = 'ref2', page, page = 'B') FROM test_flow_basecond GROUP BY id;
                  dt   id   page   ref 
 1970-01-01 09:00:01    1   A      ref4
 1970-01-01 09:00:02    1   A      ref3 // The result
 1970-01-01 09:00:03    1   B      ref2 // Base point
 1970-01-01 09:00:04    1   B      ref1 // This row can't be the base point because its ref column does not match 'ref2'.

Member

@vdimir vdimir left a comment

Some questions

@@ -276,6 +276,28 @@ class AggregateFunctionNullVariadic final
this->nested_function->add(this->nestedPlace(place), nested_columns, row_num, arena);
}

void insertResultInto(AggregateDataPtr place, IColumn & to, Arena * arena) const override
Member

The build succeeds and tests pass without this function? Do we need it? If we really need it, there should be a test.

Contributor Author

That's right, I will remove the function.

@achimbab
Contributor Author

achimbab commented Apr 2, 2021

@vdimir
I fixed some code:

  • Remove AggregateFunctionNullVariadic::insertResultInto
  • Prevent forward with tail and backward with head
  • Minor fix for exception messages

Thank you.

@vdimir
Member

vdimir commented Apr 6, 2021

@achimbab

If we consider exception case from #19766 (comment)

CREATE TABLE foo (timestamp DateTime, id Int32, page String) ENGINE=TinyLog;

INSERT INTO foo VALUES (1, 1, 'home'), (2, 1, 'home'), (3, 1, 'giftbox'), (4, 1, 'wish'), (1, 2, 'home'), (2, 2, 'giftbox'), (3, 2, 'wish');
SELECT
    id,
    sequenceNextNode('forward', 'first_match')(timestamp, page = 'home', page, page = 'home', page = 'giftbox') AS next
FROM foo
GROUP BY id

┌─id─┬─next─┐
│  2 │ wish │
│  1 │ ᴺᵁᴸᴸ │
└────┴──────┘

We got wish and NULL. But why not wish for id = 1, like

1	1	home
2	1	home    // base (first_match)
3	1	giftbox // second event
4	1	wish    // result for id = 1
1	2	home    // base (first_match)
2	2	giftbox // second event
3	2	wish    // result for id = 2


But base_condition is required with head and tail because there is no other way to specify that condition.
i.e.
When an end-user wants to navigate from the first event with referrer_sns = 'facebook', they can use sequenceNextNode() with a base_condition of referrer_sns = 'facebook'.

Is it true that in most cases base_cond and the first event will be the same? Also, the order of arguments looks confusing, because we have base_cond between timestamp_column and event_column. Maybe (timestamp, event_column, base_cond, first_cond, second_cond, ...) would be better?

Actually, maybe we can consider the base as part of the chain? Could you provide an explicit example of why we keep it distinct?
What if we consider the first event as the base and iterate from 1 here:

for (i = 0; i < events_size && base + i < data.value.size(); ++i)

@achimbab
Contributor Author

achimbab commented Apr 6, 2021

@vdimir

We got wish and NULL. But why not wish for id = 1, like

The exceptional case in flow analysis is that:

The number of people decreases from home -> giftbox, but the number of people increases at wish.

home (2 people) -> home (1 person)
                -> giftbox (1 person) -> wish (2 people)

// Oops, there are 2 people in the `wish`, but there is only 1 person in the previous flow `giftbox`.

So I fixed the algorithm to return NULL for id = 1 in this case.

1	1	home    // base (first_match) because page = 'home'
2	1	home    // does not match 'giftbox', so the result is NULL
3	1	giftbox 
4	1	wish    
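The fixed, strict matching can be modeled like this (illustrative Python, assuming the base index was already resolved; the names are mine, not the PR's):

```python
from typing import Callable, List, Optional

def next_node_from_base(rows: List[str], base: int,
                        conds: List[Callable[[str], bool]]) -> Optional[str]:
    # Strict matching: conds[0] must match the base row itself, and each
    # later condition must match the immediately following row.
    for i, cond in enumerate(conds):
        if base + i >= len(rows) or not cond(rows[base + i]):
            return None  # any gap in the chain yields NULL
    nxt = base + len(conds)
    return rows[nxt] if nxt < len(rows) else None

conds = [lambda e: e == 'home', lambda e: e == 'giftbox']
# id = 1: base is the first 'home', but the next row is 'home', not 'giftbox'
print(next_node_from_base(['home', 'home', 'giftbox', 'wish'], 0, conds))  # None
# id = 2: 'home' -> 'giftbox' matches consecutively, so the following row wins
print(next_node_from_base(['home', 'giftbox', 'wish'], 0, conds))  # wish
```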

@vdimir
Member

vdimir commented Apr 6, 2021

The number of people decreases from home -> giftbox, but the number of people increases at wish.

I considered the semantics of this function to be close to windowFunnel with strict_order, but with additional features like forward/backward and head/tail, and returning the next event rather than the number of matches.

@achimbab
Contributor Author

achimbab commented Apr 6, 2021

@vdimir

Actually, maybe we can consider the base as part of the chain? Could you provide an explicit example of why we keep it distinct?

I will explain by example why base_cond is required for our cases.

Prepare Data

CREATE TABLE bar (timestamp DateTime, id Int32, page String, ref_url String) ENGINE=TinyLog;
INSERT INTO bar VALUES (1, 1, 'home', ''), (2, 1, 'home', ''), (3, 1, 'giftbox', 'black_friday'), (4, 1, 'wish', ''), (1, 2, 'home', 'black_friday'), (2, 2, 'giftbox', ''), (3, 2, 'wish', ''), (1, 3, 'default', 'black_friday'), (2, 3, 'promotion', '');

Get first pages with ref_url = 'black_friday'

I want to get the page at the beginning of the rows for each id.

SELECT
    id,
    sequenceNextNode('forward', 'head')(timestamp, ref_url = 'black_friday', page)
FROM bar
GROUP BY id
FORMAT PrettySpace

 id   sequenceNextNode('forward', 'head')(timestamp, equals(ref_url, 'black_friday'), page)

  3   default                                                                               
  2   home                                                                                  
  1   ᴺᵁᴸᴸ                                                                                  

This cannot be achieved with first_match, because first_match does not return the page at the beginning of the rows for each id.

SELECT
    id,
    sequenceNextNode('forward', 'first_match')(timestamp, 1, page, ref_url = 'black_friday')
FROM bar
GROUP BY id
FORMAT PrettySpace

 id   sequenceNextNode('forward', 'first_match')(timestamp, 1, page, equals(ref_url, 'black_friday'))

  3   promotion    // not the same as the result above
  2   giftbox      // not the same as the result above
  1   wish         // not the same as the result above

@achimbab
Contributor Author

achimbab commented Apr 6, 2021

If sequenceNextNode('forward', 'first_match')(timestamp, 1, page, ref_url = 'black_friday') returned the current value instead of the next value, then sequenceNextNode() would require another argument indicating the position of the return value.

@vdimir
Member

vdimir commented Apr 7, 2021

@achimbab

Second part of #19766 (comment)

Is it true that in most cases base_cond and the first event will be the same? Also, the order of arguments looks confusing, because we have base_cond between timestamp_column and event_column. Maybe (timestamp, event_column, base_cond, first_cond, second_cond, ...) would be better?

What do you think?


Also, the function still looks a bit special to me, so I'm inclined to merge it under a flag (like SET allow_experimental_funnel_functions = 1), so we will be able to change the signature or remove the function later (maybe once it becomes possible to do this with window functions or the like).

@achimbab
Contributor Author

achimbab commented Apr 7, 2021

@vdimir

Also, the order of arguments looks confusing, because we have base_cond between timestamp_column and event_column. Maybe (timestamp, event_column, base_cond, first_cond, second_cond, ...) would be better?

That's better, I will fix the order of arguments.

Is it true that in most cases base_cond and the first event will be the same?

In most cases end-users will not use base_cond; only a few people will want to.

Also, the function still looks a bit special to me, so I'm inclined to merge it under a flag (like SET allow_experimental_funnel_functions = 1), so we will be able to change the signature or remove the function later (maybe once it becomes possible to do this with window functions or the like).

I agree with you.

@achimbab
Contributor Author

achimbab commented Apr 7, 2021

I fixed the source code in d13d69e:

  • Add allow_experimental_funnel_functions
  • Modify the order of arguments

Member

@vdimir vdimir left a comment

I'll merge it after #22762, to pass the settings explicitly into the function.

{
if (CurrentThread::isInitialized())
{
const Context * query_context = CurrentThread::get().getQueryContext();
Member

I'll try to merge PR #22762 first to pass the settings explicitly.

Contributor Author

Thank you a lot.

@achimbab
Contributor Author

achimbab commented Jun 2, 2021

@vdimir
I made createAggregateFunctionSequenceNode use the explicit settings from #22762.
Thank you.

@vdimir vdimir merged commit 17f0900 into ClickHouse:master Jun 3, 2021
@gyuton
Contributor

gyuton commented Jun 3, 2021

Internal documentation ticket: DOCSUP-10025.
