
feat(buffer_worker): refactor buffer/resource workers to always use queue and use offload mode #9642

Merged
merged 8 commits into emqx:master on Jan 5, 2023

Conversation

thalesmg
Contributor

@thalesmg thalesmg commented Dec 29, 2022

https://emqx.atlassian.net/browse/EMQX-8623

Currently, we face several issues trying to keep resource metrics
reasonable. For example, when a resource is re-created and has its
metrics reset, its durable queue then resumes its previous work,
which leads to strange (often negative) metrics.

Instead of using counters shared by more than one worker to manage
gauges, we introduce an ETS table whose key is scoped not only by the
Resource ID as before, but also by the worker ID. This way, when a
worker starts or terminates, it sets its own gauges to their correct
values (often 0, or replayq:count when resuming off a queue). With
this scoping and initialization procedure, we hopefully avoid hitting
those strange metrics scenarios and have better control over the
gauges.
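
For illustration, a minimal sketch of the scoping idea; the table
name, key layout, and helper names below are assumptions made for
this sketch, not the actual emqx_resource_metrics implementation:

```erlang
-module(gauge_scoping_sketch).
-export([create_table/0, worker_started/3, worker_terminated/2, total/2]).

%% Gauges are keyed by {ResourceId, WorkerId, Metric}, so each worker
%% owns (and is responsible for initializing) its own entries.
create_table() ->
    ets:new(buffer_worker_gauges, [named_table, public, set]).

%% On start, a worker sets its own gauge: replayq:count(Q) when
%% resuming off a durable queue, or 0 for a fresh one.
worker_started(ResId, WorkerId, InitialCount) ->
    ets:insert(buffer_worker_gauges, {{ResId, WorkerId, queuing}, InitialCount}).

worker_terminated(ResId, WorkerId) ->
    ets:delete(buffer_worker_gauges, {ResId, WorkerId, queuing}).

%% The reported gauge is the sum over all workers of the resource, so
%% one worker restarting cannot drive the total negative.
total(ResId, Metric) ->
    ets:foldl(
        fun({{R, _W, M}, V}, Acc) when R =:= ResId, M =:= Metric -> Acc + V;
           (_, Acc) -> Acc
        end,
        0,
        buffer_worker_gauges
    ).
```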

This makes the buffer/resource workers always use replayq for
queuing, along with collecting multiple requests in a single call.
This avoids long message queues for the buffer workers and relies on
replayq's ability to offload to disk and to detect overflow.
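
As a sketch of the collection step (the batch size and helper name
are illustrative; replayq's pop/ack API is as documented in its
README):

```erlang
%% At flush time, take up to BatchSize requests from the queue in one
%% go, instead of handling one request per worker call (Q0 is an open
%% replayq; send_batch_to_resource/1 is a hypothetical send step).
flush_batch(Q0, BatchSize) ->
    {Q1, AckRef, Batch} = replayq:pop(Q0, #{count_limit => BatchSize}),
    ok = send_batch_to_resource(Batch),
    ok = replayq:ack(Q1, AckRef),  %% only now remove the items from the queue
    Q1.
```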

Also, this deprecates the enable_batch and enable_queue resource
creation options, since: i) queuing is now always enabled; ii)
batch_size > 1 is now equivalent to batching being enabled. The
corresponding metric dropped.queue_not_enabled is dropped, along with
batching. Batching is too ephemeral, especially considering the
default batch time of 20 ms, and is not shown in the dashboard, so
the batching metric was removed.
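
In other words (a hypothetical illustration, not actual EMQX code):

```erlang
%% Batching is enabled iff batch_size > 1; no separate flag is needed.
BatchEnabled = maps:get(batch_size, CreationOpts, 1) > 1.
```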

Also, this fixes a bug related to message loss in the Kafka producer when the connection is down. Currently, the Kafka producer bridge tests the connection to Kafka itself to decide whether the resource is connected. However, if Kafka or its connection is down, messages will be lost: since there are no buffer workers for this bridge and the resource is considered “down”, the resource won't call the wolff producers, leading to both message loss and wrong metrics (no failed counters are bumped).

@thalesmg
Contributor Author

Based/depends on #9619

@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 2 times, most recently from 9fb4fc6 to 2a4a502 on December 29, 2022 20:52
@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 9 times, most recently from 4793230 to 0ed22de on December 31, 2022 20:19
@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 3 times, most recently from 346b530 to 7731d38 on January 2, 2023 14:13
@@ -80,11 +98,13 @@ start_link(Id, Index, Opts) ->
sync_query(Id, Request, Opts) ->
    PickKey = maps:get(pick_key, Opts, self()),
    Timeout = maps:get(timeout, Opts, infinity),
    emqx_resource_metrics:matched_inc(Id),
Contributor Author
Matched moved here because otherwise retries would bump the matched metric as if they were new requests.

get_first_n_from_queue(_Q, 0, Acc) ->
    lists:reverse(Acc);
get_first_n_from_queue(Q, N, Acc) when N > 0 ->
    case replayq:peek(Q) of
Contributor Author
Changed to pop because, previously, the head of the queue would be duplicated N times.
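
A minimal sketch of the fixed helper (ack bookkeeping is omitted for
brevity; the real worker must keep the ack references so items can be
acked after they are handled). replayq:peek/1 does not advance the
queue, so peeking in a loop returned the same head item N times,
whereas popping does advance it:

```erlang
get_first_n_from_queue(Q, N) ->
    get_first_n_from_queue(Q, N, []).

get_first_n_from_queue(_Q, 0, Acc) ->
    lists:reverse(Acc);
get_first_n_from_queue(Q, N, Acc) when N > 0 ->
    %% pop one item; unlike peek, this moves the queue head forward
    case replayq:pop(Q, #{count_limit => 1}) of
        {_Q1, _AckRef, []} ->
            lists:reverse(Acc);
        {Q1, _AckRef, [Item]} ->
            get_first_n_from_queue(Q1, N - 1, [Item | Acc])
    end.
```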

@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 2 times, most recently from 9ff1329 to b39adb4 on January 2, 2023 14:35
@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 2 times, most recently from ae23aaf to 2cf4a99 on January 2, 2023 16:58
@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 4 times, most recently from fc459c9 to 1db6a80 on January 2, 2023 18:48
@thalesmg
Contributor Author

thalesmg commented Jan 2, 2023

Fixme: the telemetry handler ID changes every time the resource is created, leading to duplicate metrics.

qzhuyan previously approved these changes Jan 3, 2023
id previously approved these changes Jan 3, 2023
@@ -110,7 +129,9 @@ simple_async_query(Id, Request, ReplyFun) ->
    %% would mess up the metrics anyway. `undefined' is ignored by
    %% `emqx_resource_metrics:*_shift/3'.
    Index = undefined,
    Result = call_query(async, Id, Index, ?QUERY(ReplyFun, Request, false), #{}),
    QueryOpts = #{},
    emqx_resource_metrics:matched_inc(Id),
Member
Maybe block/2 also needs this line: emqx_resource_metrics:matched_inc(Id, length(Query))?

Contributor Author
Makes sense. But searching the code base, I could not find any usage of block/2 🤔

I think I'll just remove it. 😺

     true ->
-        {keep_state, St, {state_timeout, ResumeT, resume}};
+        {keep_state, Data0, {state_timeout, ResumeT, resume}};
Member
There may be a bug: if data is in the replayq, its HasBeenSent flag will always be false, even though it may have been sent N times.

Contributor Author
@thalesmg thalesmg Jan 4, 2023
Before this refactoring, the only requests that bumped the retried.* counters were async requests, and they are the only ones that get inserted into the inflight table and marked as sent.

I'm not entirely sure this was the intended behavior. But I think this behavior is kept because, after this refactoring, the path an async request takes is (see the sketch at the end of this comment):

  1. The request gets enqueued into replayq.
  2. It's then popped (possibly in a batch).
  3. It's appended to the inflight table (if there's room).
  4. In the first ("outer") handle_query_result, since it'll return {async_return, ok}, it'll be ack'ed in replayq, removing it from there.
  5. Eventually, when {batch_,}reply_after_query is called, it bumps the appropriate retried.* counter.

For sync requests, you are right: they won't bump the retried.* counters. Do you think that's the original intention? We would need to track what has been retried somehow.

We can't change the replayq items without re-appending them to the queue, and that would change the order of the requests, unfortunately. One way could be to keep a table of hashes of requests and then check if those have been sent or not. 🤔

I'm assuming that we want to keep the order in replayq to prevent things like re-ordering messages from the same client.

What do you think?
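
For illustration, a minimal sketch of the path above (all module,
function, and counter names here are hypothetical stand-ins, not the
real buffer worker code):

```erlang
-module(async_path_sketch).
-export([flush/3]).

%% Steps 2-5 from the list above.
flush(Q0, InflightTab, Id) ->
    {Q1, AckRef, Batch} = replayq:pop(Q0, #{count_limit => 10}), %% step 2
    Ref = make_ref(),
    true = ets:insert(InflightTab, {Ref, Batch}),                %% step 3
    ok = replayq:ack(Q1, AckRef),                                %% step 4
    %% step 5 happens later, in the driver's reply callback
    send_batch_async(Id, Batch,
        fun(Result) -> reply_after_query(InflightTab, Id, Ref, Result) end),
    Q1.

%% On reply, drop the inflight entry and bump the retried.* counters.
reply_after_query(InflightTab, Id, Ref, Result) ->
    [{Ref, Batch}] = ets:take(InflightTab, Ref),
    Counter = case Result of ok -> 'retried.success'; _ -> 'retried.failed' end,
    bump(Id, Counter, length(Batch)).

%% Hypothetical stubs standing in for the real driver/metrics calls.
send_batch_async(_Id, _Batch, ReplyFun) -> ReplyFun(ok).
bump(_Id, _Counter, _N) -> ok.
```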

Member
Personally, I think we can just ignore this here, create a tracking ticket for the product owner, and leave the discussion to them.

Contributor Author
Sounds good to me: we can revisit this point in a future PR, if required. 😺

@thalesmg thalesmg dismissed stale reviews from id and qzhuyan via 3e0bde0 January 4, 2023 12:25
@thalesmg thalesmg requested a review from a team as a code owner January 4, 2023 12:25
@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch 3 times, most recently from 43eb968 to 9e099be on January 4, 2023 16:32
thalesmg and others added 8 commits January 5, 2023 10:11
To avoid confusion for the users as to what persistence guarantees we
offer when buffering bridges/resources, we will always enable offload
mode for `replayq`.  With this, when the buffer size is above the max
segment size, it'll flush the queue to disk, but on recovery after a
restart it'll clean the existing segments rather than resuming from
them.
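
As a sketch (the directory and sizes are illustrative), this is what
opening a `replayq` in offload mode looks like, per the options
documented in the replayq README:

```erlang
%% Offload mode: items stay in memory until the buffer exceeds
%% seg_bytes, at which point the excess is flushed to disk segments.
%% On restart, leftover segments are cleaned, not replayed.
Q0 = replayq:open(#{
        dir => "data/bridge_buffer",   %% hypothetical queue directory
        seg_bytes => 10 * 1024 * 1024, %% max segment size
        offload => true
    }),
Q1 = replayq:append(Q0, [<<"req1">>, <<"req2">>]),
{Q2, AckRef, _Items} = replayq:pop(Q1, #{count_limit => 2}),
ok = replayq:ack(Q2, AckRef).
```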
…ueue

This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.

Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, since: i) queuing is now always enabled; ii)
batch_size > 1 is now equivalent to batching being enabled.  The
corresponding metric `dropped.queue_not_enabled` is dropped, along
with `batching`.  Batching is too ephemeral, especially considering
the default batch time of 20 ms, and is not shown in the dashboard,
so the `batching` metric was removed.
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
Thanks to @qzhuyan for the corrections.
@thalesmg thalesmg force-pushed the refactor-buffer-collect-calls-v50 branch from 9e099be to 70eb5ff on January 5, 2023 13:16
@thalesmg thalesmg merged commit 3437151 into emqx:master Jan 5, 2023
@thalesmg thalesmg deleted the refactor-buffer-collect-calls-v50 branch January 5, 2023 14:37
6 participants