Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: willmsg not published in takeover #11868

Merged
merged 14 commits into from Feb 29, 2024

Conversation

qzhuyan
Copy link
Contributor

@qzhuyan qzhuyan commented Nov 2, 2023

Fixes #10551
Following the content in MQTT 5.0 spec I draw the state diagram of willmsg.

stateDiagram-v2
  [*] --> will_msg_present: will_msg present in connect
  will_msg_present --> scheduled: server conn \nabnormal shutdown (a.)
  will_msg_present --> scheduled: client conn \nabnormal shutdown (a.)
  will_msg_present --> scheduled: take over (d.)
  will_msg_present --> [*]: discconnect normally (RC 0) (e.)\n will_msg removed
  scheduled --> cancelled: reconnected (c.)
  scheduled --> publish: session ends timeout (a.)
  scheduled --> publish: will_delay timeout (a.)
  will_msg_present --> publish: session ends (clean start) (f.)
  cancelled --> [*] :will_msg removed
  publish --> [*] :will_msg removed

(a.) MQTT-3.1.2-8
(b.) MQTT-3.1.2-10
(c.) MQTT 3.1.3-9
(d.) MQTT-3.1.4-3
(e.) MQTT-3.14.4-3
(f.) ch 3.1.3.2.2

and it could be simplified to (in EMQX terms)

stateDiagram-v2
[*] --> will_msg_present: will_msg present in connect
will_msg_present --> scheduled: sock_close, takeover\n connection abort
will_msg_present --> [*]: disconnect RC 0\n will_msg_removed
will_msg_present --> publish: discard, kick\n session ends
scheduled --> cancelled: reconnect\n taken over
scheduled --> publish: expired\nsession ends
scheduled --> publish: will_delay timeout (a.)
cancelled --> [*] :will_msg removed
publish --> [*] :will_msg removed

So following applies

  1. will_msg is stored in the conn proc heap when connected.
  2. No will_msg publishing if will_msg is absent.
  3. Receiving MQTT.Disconnect RC:0 from client removes the will_msg from proc heap.
  4. When session ends, EMQX publish will_msg.
    applies to 'discard', 'internal_error', 'kick' and 'expired'
  5. EMQX defer will_msg publish for connection abortion.
    applies to 'takeover' , 'sock_close' (socket error) this is also the default error handling
  6. will_msg and scheduled willmsg publishing must be removed after publish [note 4.]

@note,

  1. EMQX named two takeover scenarios to reflact the term 'takenover' in MQTT 5.0
    a. Discard: takeover with clean_session = True
    b. Takeover: takeover with clean_session = False

  2. For 'internal_error', we don't have chance to keep the session?

  3. 'kick' in EMQX also removes the session

  4. Already triggered will_msg publishing will not work if will_msg is absent.

Summary

🤖 Generated by Copilot at b497f84

This pull request fixes a bug with will messages not being published after session takeover, adds a feature to gracefully close the transport layer when shutting down a connection, refactors and simplifies the code for handling will messages and channel termination, and updates the test suite to cover different scenarios for session takeover. It also adds a change log file changes/ce/fix-11868.en.md to document the bug fix.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch 4 times, most recently from 482d4af to a4c202e Compare November 6, 2023 14:22
@qzhuyan qzhuyan marked this pull request as ready for review November 6, 2023 14:41
@qzhuyan qzhuyan requested review from lafirest and a team as code owners November 6, 2023 14:41
Copy link
Contributor

@keynslug keynslug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

apps/emqx/test/emqx_takeover_SUITE.erl Outdated Show resolved Hide resolved
apps/emqx/test/emqx_takeover_SUITE.erl Outdated Show resolved Hide resolved
apps/emqx/test/emqx_takeover_SUITE.erl Outdated Show resolved Hide resolved
apps/emqx/src/emqx_channel.erl Outdated Show resolved Hide resolved
apps/emqx/src/emqx_channel.erl Outdated Show resolved Hide resolved
apps/emqx/test/emqx_takeover_SUITE.erl Outdated Show resolved Hide resolved
@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from a4c202e to f60eb8b Compare November 10, 2023 10:55
@qzhuyan
Copy link
Contributor Author

qzhuyan commented Nov 10, 2023

I did some updates. but need to dig more about the possible reasons of shutdown and connection states where willmsg publishing is allowed.

@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch 2 times, most recently from 805ff6e to 6f779dc Compare November 11, 2023 15:51
Copy link
Contributor

@keynslug keynslug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

?assertNot(IsWill2),
emqtt:stop(CPid2),
emqtt:stop(CPidSub),
?assert(not is_process_alive(CPid1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit. The assert_client_exit/3 call several lines before returns only when CPid1 process exits, so this line seems kinda unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave it it is test code, assertions just helps.

apps/emqx/test/emqx_shared_sub_SUITE.erl Outdated Show resolved Hide resolved
apps/emqx/src/emqx_channel.erl Outdated Show resolved Hide resolved
%% a. expired (session expired)
%% c. discarded (Session ends because another process starts new session with the same clientid)
%% b. kicked. (kicked by operation)
%% d. internal_error (maybe not recoverable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. It seems a bit unsafe to assume that in the event of hitting internal error publishing should still work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just list them, may miss something, the entire call path, call stack is hard to follow.

Copy link
Contributor Author

@qzhuyan qzhuyan Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this in spec:

In the case of a Server shutdown or failure, the Server MAY defer publication of Will Messages until a subsequent restart. If this happens, there might be a delay between the time the Server experienced failure and when the Will Message is published.

So I think it is ok to either publish/not publish/delay publish willmsg when server error.

Comment on lines 1439 to 1446
terminate({shutdown, Reason}, Channel) when
Reason =:= discarded;
Reason =:= takenover
Reason =:= expired orelse
Reason =:= takenover orelse
Reason =:= kicked orelse
Reason =:= discarded
->
run_terminate_hook(Reason, Channel);
terminate(Reason, Channel = #channel{clientinfo = ClientInfo, will_msg = WillMsg}) ->
%% since will_msg is set to undefined as soon as it is published,
%% if will_msg still exists when the session is terminated, it
%% must be published immediately.
WillMsg =/= undefined andalso publish_will_msg(ClientInfo, WillMsg),
run_terminate_hook(Reason, Channel).
Channel1 = maybe_publish_will_msg(Reason, Channel),
run_terminate_hook(Reason, Channel1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. This clause's logic is identical to the next catch-all clause, the only difference is taking the inner atom from shutdown tuple. I'd argue this makes the shutdown logic harder to follow.

In general, I suspect that one thing that could help to make things simpler to follow is separating maybe_publish_will_msg into a set of 2 functions: one is for scheduling willmsg publishing (e.g. essentially called only when sock_closed), another one is for deciding what to do on channel termination (when there's no point to schedule anything). This should also eliminate the need to do something explicitly on handle_call(kick, ...), i.e. here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. This clause's logic is identical to the next catch-all clause, the only difference is taking the inner atom from shutdown tuple. I'd argue this makes the shutdown logic harder to follow.

I thought the same and I did remove it until I found run_terminate_hook relies on the Reason (kicked) instead of {shutdown, kicked}, the hooks are exposed to the users so I think it better to keep it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I suspect that one thing that could help to make things simpler to follow is separating maybe_publish_will_msg into a set of 2 functions: one is for scheduling willmsg publishing (e.g. essentially called only when sock_closed), another one is for deciding what to do on channel termination (when there's no point to schedule anything). This should also eliminate the need to do something explicitly on handle_call(kick, ...), i.e. here.

I found scheduling willmsg publishing will not always work when the process terminates before the timer get fired.
Simple fix would be to just extend the process lifetime but that is costly and it cannot be done in terminate without changing the whole call stack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. However, it also means that terminate hook now will be called with expired, instead of {shutdown, expired} as before. Although I'm not sure if former behavior is important to preserve here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found scheduling willmsg publishing will not always work when the process terminates before the timer get fired.

I mean, that seems the reason why the logic is hard to follow: currently maybe_publish_will_msg is called both from sock_closed context (where scheduling timers makes sense) and terminate context (where scheduling timers doesn't make sense). Yet some of its codepaths end up with timers being scheduled anyway, which is confusing: it's very hard to tell in which context they are scheduled. Thus it seems worthwhile to try to separate this into maybe_schedule_will_msg/1 and maybe_publish_will_msg_on_terminate/2, at least on the first look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I suspect that one thing that could help to make things simpler to follow is separating maybe_publish_will_msg into a set of 2 functions: one is for scheduling willmsg publishing (e.g. essentially called only when sock_closed), another one is for deciding what to do on channel termination (when there's no point to schedule anything). This should also eliminate the need to do something explicitly on handle_call(kick, ...), i.e. here.

I found scheduling willmsg publishing will not always work when the process terminates before the timer get fired. Simple fix would be to just extend the process lifetime but that is costly and it cannot be done in terminate without changing the whole call stack.

I am wrong. we could only schedule willmsg publishing when session expire > 0.

Reason =:= expired orelse
Reason =:= discarded orelse
Reason =:= kicked orelse
Reason =:= ?chan_terminating orelse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this keeps the old behavior.

@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from 6aff0da to 23c7e59 Compare November 15, 2023 21:11
@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch 2 times, most recently from 78f4b14 to a0c8c06 Compare November 22, 2023 14:01
@qzhuyan qzhuyan changed the base branch from release-53 to master November 22, 2023 14:02
@qzhuyan qzhuyan closed this Nov 23, 2023
@qzhuyan qzhuyan reopened this Nov 23, 2023
@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from a0c8c06 to 125091a Compare February 22, 2024 16:04
@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from 125091a to 2ff33f9 Compare February 22, 2024 16:14
apps/emqx/src/emqx_channel.erl Outdated Show resolved Hide resolved
apps/emqx/src/emqx_channel.erl Show resolved Hide resolved
@qzhuyan qzhuyan marked this pull request as draft February 23, 2024 11:22
@qzhuyan qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from 26fc157 to 6c7b774 Compare February 26, 2024 22:06
@qzhuyan qzhuyan marked this pull request as ready for review February 27, 2024 11:16
@zmstone
Copy link
Member

zmstone commented Feb 29, 2024

kick should not trigger will message due to security reason.

@qzhuyan
Copy link
Contributor Author

qzhuyan commented Feb 29, 2024

kick should not trigger will message due to security reason.

we decided to handle it in separate PR.

@qzhuyan qzhuyan merged commit b82973b into emqx:master Feb 29, 2024
166 checks passed
@qzhuyan qzhuyan deleted the fix/william/takeover-pub-willmsg branch February 29, 2024 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Will messages not sent during client session takeover, violating MQTT 3.1.1 specification
4 participants