Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(buffer_worker): avoid setting flush timer when inflight is full #10717

Merged
merged 2 commits into from
May 17, 2023

Conversation

thalesmg
Copy link
Contributor

@thalesmg thalesmg commented May 16, 2023

Targeting release-50

Fixes https://emqx.atlassian.net/browse/EMQX-9902

When the buffer worker inflight window is full, we don’t need to set a timer to flush the messages again because there’s no more room, and one of the inflight windows will flush the buffer worker by calling flush_worker.

Currently, we do set the timer on such situation, and this fact combined with the default batch time of 0 yields a busy loop situation where the CPU spins a lot while inflight messages do not return.

This original change uncovered another bug that was hidden by it:
maybe_flush_after_async_reply was sending a message to the wrong
PID. It was sending a message to self() meaning to target a buffer
worker, but self() in that context is never the buffer worker, it's
the connector's worker.

This change also revealed a race condition where the buffer workers
could stop flushing messages. So we piggy-backed on the atomic update
of the table size count to check if the buffer worker should be poked
to continue flushing. This allows us to get rid of
maybe_flush_after_async_reply altogether.

Summary

🤖 Generated by Copilot at a77f6e5

Optimize resource buffer worker performance by removing redundant timer calls. This affects the file emqx_resource_buffer_worker.erl and the function ensure_flush_timer/1.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@thalesmg thalesmg force-pushed the fix-bw-flush-timer-full-infl-r50 branch from a77f6e5 to 617b032 Compare May 16, 2023 12:51
Fixes https://emqx.atlassian.net/browse/EMQX-9902

When the buffer worker inflight window is full, we don’t need to set a
timer to flush the messages again because there’s no more room, and
one of the inflight windows will flush the buffer worker by calling
`flush_worker`.

Currently, we do set the timer on such situation, and this fact
combined with the default batch time of 0 yields a busy loop situation
where the CPU spins a lot while inflight messages do not return.
@thalesmg thalesmg force-pushed the fix-bw-flush-timer-full-infl-r50 branch from 617b032 to 657df05 Compare May 16, 2023 14:29
@zmstone zmstone self-requested a review May 16, 2023 19:54
…table room is made

The previous commit uncovered another bug that was hidden by it:
`maybe_flush_after_async_reply` was sending a message to the wrong
PID.  It was sending a message to `self()` meaning to target a buffer
worker, but `self()` in that context is never the buffer worker, it's
the connector's worker.

This change also revealed a race condition where the buffer workers
could stop flushing messages.  So we piggy-backed on the atomic update
of the table size count to check if the buffer worker should be poked
to continue flushing.  This allows us to get rid of
`maybe_flush_after_async_reply` altogether.
@thalesmg thalesmg marked this pull request as ready for review May 17, 2023 13:18
@thalesmg thalesmg requested a review from a team as a code owner May 17, 2023 13:18
@thalesmg thalesmg requested a review from zmstone May 17, 2023 16:36
@@ -337,7 +325,8 @@ resume_from_blocked(Data) ->
{next_state, running, Data}
end;
{expired, Ref, Batch} ->
IsAcked = ack_inflight(InflightTID, Ref, Id, Index),
WorkerPid = self(),
IsAcked = ack_inflight(InflightTID, Ref, Id, Index, WorkerPid),
Copy link
Contributor

@savonarola savonarola May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m.b. an inline fun

ack_inflight_from_worker(InflightTID, Ref, Id, Index) ->
  ack_inflight(InflightTID, Ref, Id, Index, self()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another PR underway that will also touch this function, I'll try to do it in that PR.

@thalesmg thalesmg merged commit 10f6edd into emqx:release-50 May 17, 2023
@thalesmg thalesmg deleted the fix-bw-flush-timer-full-infl-r50 branch May 17, 2023 17:49
FlushCheck = dec_inflight_remove(InflightTID, Count, Removed),
case FlushCheck of
continue -> ok;
flush -> ?MODULE:flush_worker(WorkerPid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not quite clear how continue opposes to flush 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would no_flush or dont_flush be better? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here: #10740

true ->
ok;
false ->
ack_inflight(InflightTID, Ref, Id, Index)
ack_inflight(InflightTID, Ref, Id, Index, WorkerPid)
end,
{Data1, WorkerMRef} = ensure_async_worker_monitored(Data0, Result),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: ensure_async_worker_monitored we also have WorkerPids, couldn't be this confusing?

true ->
ok;
false ->
ack_inflight(InflightTID, Ref, Id, Index)
ack_inflight(InflightTID, Ref, Id, Index, WorkerPid)
end,
{Data1, WorkerMRef} = ensure_async_worker_monitored(Data0, Result),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: In ensure_async_worker_monitored we also have WorkerPids, couldn't be that confusing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, there are workers all around 😅
I think the differentiating point is that we have buffer workers (the current module) and async workers (from the connectors). I'll add the Async modifier to the variable name to help with that differentiation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here: #10740

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants