Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush chttpd_db monitor refs on demonitor #4906

Merged
merged 1 commit into from
Dec 12, 2023

Conversation

chewbranca
Copy link
Contributor

This ensures that the coordinator process flushes on demonitoring of the attachment refs in chttpd_db. The problem here is that it's possible to receive a 'DOWN' message for the monitor ref that is not receive'ed, causing it to stick around in the coordinator message queue while the next http request is handled. The pending message will not become apparent until the next fabric call is invoked, as fabric expects to have full access to all messages in the calling process, an expectation which is violated by the pending message, causing a case clause crash in the fabric receive message callbacks.

I noticed this during eunit runs with stubbed attachment handles that generate an immediate noproc message on the monitor call. Normal operations should not result in an immediate noproc result on monitoring the attachment process, however, any failure that causes the attachment process to fail between acquisition of the refs and the demonitor calls will induce this bug, causing the next http request handled by the particular chttpd coordinator pool processs to fail on whatever next fabric call is made.

@chewbranca chewbranca force-pushed the flush-chttpd-attachment-monitors branch 3 times, most recently from 5c07273 to c7e3587 Compare December 12, 2023 20:46
This ensures that the coordinator process flushes on demonitoring of the
attachment refs in chttpd_db. The problem here is that it's possible to
receive a 'DOWN' message for the monitor ref that is not receive'ed,
causing it to stick around in the coordinator message queue while the
next http request is handled. The pending message will not become
apparent until the next fabric call is invoked, as fabric expects to
have full access to all messages in the calling process, an expectation
which is violated by the pending message and causes a case clause crash
in the fabric receive message callbacks.

I noticed this during eunit runs with stubbed attachment handles that
generate an immediate noproc message on the monitor call. Normal
operations should not result in an immediate noproc result on monitoring
the attachment process, however, any failure that causes the attachment
process to fail between acquisition of the refs and the demonitor calls
will induce this bug, causing the next http request handled by the
particular chttpd coordinator pool processs to fail on whatever next
fabric call is made.
@chewbranca chewbranca force-pushed the flush-chttpd-attachment-monitors branch from c7e3587 to c3c364c Compare December 12, 2023 21:26
@chewbranca chewbranca merged commit fae8761 into main Dec 12, 2023
14 checks passed
chewbranca added a commit that referenced this pull request Dec 18, 2023
We need to potentially extract the usage delta from the incoming RPC
message. Rather than pattern match on all possible message formats that
could potentially include usage deltas, we instead utilize
rexi_util:extract_delta which matches against tuples ending in
`{delta, Delta}`, and separates that out from the underlying message.

The subtlety here is that receiving the message to extract the delta
changes the behavior as this was previously doing a selective receive
keyed off of the Ref, and then ignoring any other messages that arrived.
I don't know if the selective receive was intended, but I don't think
it's appropriate to leave unexpected messages floating around,
especially given things like issue #4909.

Instead of utilizing a selective receive, this switches to extracting
the message and delta like we need to do, and then in the event it finds
unexpected messages they're logged and skipped.

This selective receive was masking the lack of unlink on the linked
rexi_mon pid in fix #4906. I've also noticed some rpc responses arriving
late as well, but I haven't tracked that down, so let's log when it does
happen.
@janl janl deleted the flush-chttpd-attachment-monitors branch December 20, 2023 18:07
chewbranca added a commit that referenced this pull request Feb 12, 2024
We need to potentially extract the usage delta from the incoming RPC
message. Rather than pattern match on all possible message formats that
could potentially include usage deltas, we instead utilize
rexi_util:extract_delta which matches against tuples ending in
`{delta, Delta}`, and separates that out from the underlying message.

The subtlety here is that receiving the message to extract the delta
changes the behavior as this was previously doing a selective receive
keyed off of the Ref, and then ignoring any other messages that arrived.
I don't know if the selective receive was intended, but I don't think
it's appropriate to leave unexpected messages floating around,
especially given things like issue #4909.

Instead of utilizing a selective receive, this switches to extracting
the message and delta like we need to do, and then in the event it finds
unexpected messages they're logged and skipped.

This selective receive was masking the lack of unlink on the linked
rexi_mon pid in fix #4906. I've also noticed some rpc responses arriving
late as well, but I haven't tracked that down, so let's log when it does
happen.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants