erl_check_io: do not discard ERTS_POLL_EV_IN from active_events #2701

Merged 1 commit on Aug 7, 2020

@max-au max-au commented Jul 29, 2020

This commit fixes a race condition that can be reproduced by
running a large number of bidirectional TCP streams.
When the system is fully loaded, a socket that has migrated to the
scheduler pollset may still fire an EPOLLIN event. In this case the
active_events field of the ErtsDrvEventState structure desynchronises
from the actual state (with EPOLLIN requested via the scheduler pollset).
This, in turn, causes all ERTS_POLL_EV_IN events to be disregarded, so
all reads from the socket stop.
This can only happen in {active, N} mode of a TCP socket, where the
fd migrates back and forth to and from the scheduler pollset.
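
To make the desynchronisation concrete, here is a minimal C model of the bookkeeping described above. It is only a sketch under assumptions: `FdState`, `EV_IN`, `EV_OUT`, `handle_buggy`, `handle_fixed` and `delivers_input` are hypothetical names invented for illustration, and the logic is a simplification rather than the actual erl_check_io code or the exact change in this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ERTS_POLL_EV_* flags; real values differ. */
enum { EV_IN = 1, EV_OUT = 2 };

/* Hypothetical, reduced model of the per-fd state (not ErtsDrvEventState). */
typedef struct {
    uint32_t events;        /* events the driver has selected */
    uint32_t active_events; /* events believed to be armed in a pollset */
} FdState;

/* Buggy bookkeeping (assumed): every reported event is treated as consumed,
 * so a stale input event clears EV_IN even though input is still armed
 * in the scheduler pollset. */
uint32_t handle_buggy(FdState *s, uint32_t revents) {
    s->active_events &= ~revents;
    return s->active_events;
}

/* Fixed bookkeeping (assumed): while the fd lives in the scheduler pollset,
 * EV_IN is not discarded from active_events. */
uint32_t handle_fixed(FdState *s, uint32_t revents, int in_scheduler_pollset) {
    uint32_t consumed = revents;
    if (in_scheduler_pollset)
        consumed &= ~(uint32_t)EV_IN;
    s->active_events &= ~consumed;
    return s->active_events;
}

/* In this model, an input event is delivered only if EV_IN is still
 * recorded in active_events; otherwise it is disregarded. */
int delivers_input(const FdState *s, uint32_t revents) {
    return (revents & s->active_events & EV_IN) != 0;
}
```

In this model, calling `handle_buggy` with a stale `EV_IN` leaves `active_events` without `EV_IN`, so `delivers_input` returns 0 for every later input event. That matches the "all reads stop" symptom. `handle_fixed` keeps the bit set, so later input events are still delivered.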

@rickard-green rickard-green added the team:VM (Assigned to OTP team VM) and testing (currently being tested, tag is used by OTP internal CI) labels Jul 31, 2020
@rickard-green rickard-green merged commit 9339899 into erlang:maint Aug 7, 2020
@garazdawi
Contributor

A bunch (> 15) of testcases have started to fail sporadically after this change was merged. I'm trying to figure out why. The testcases that fail the most are in kernel:

pg_SUITE:thundering_herd
logger_std_h_SUITE:op_switch_to_flush_file
logger_disk_log_h_SUITE:op_switch_to_flush

I'm trying to understand what is going on; any help you can give would be most appreciated.

@garazdawi
Contributor

I think I have narrowed it down to a testcase problem. The recv processes exit with reason normal, so the teardown of the Sender does not work. This leaves a bunch of processes that consume a lot of CPU, which affects the other testcases.

I'll merge a fix in the relevant places.

@max-au
Contributor Author

max-au commented Aug 12, 2020

Thank you! Indeed, this leaves many senders running when the testcase does not fail.

@max-au max-au deleted the max-au/erl-check-io-desync branch June 30, 2021 21:08