event_monitor: refactor the implementation to support concurrent access #5633
22be4e3 to 992aae7
The current implementation looks good to me. I wonder how the "event-monitor" thread exits gracefully? Also, it looks like this thread won't terminate together with the other threads upon `exit_evt`.
Thanks for the explanation, Rob.
Do you think that we should write to …
Thanks to @brkp's point my reply was slightly off. There is an …
Yes - please make it consistent with the other VMM support threads. I think there should also be some new seccomp rules too?
@brkp I've drafted this as I think there are still some more bits to do...?
Thanks -- yeah, sorry I haven't had the time to continue working on this. I also want to improve the error handling in here, alongside the previously mentioned things (seccomp rules, making the thread behavior more consistent with the rest of the VMM threads, etc.).
This patch modifies `event_monitor` to ensure that concurrent access to `event_log` from multiple threads is safe. Previously, the `event_log` function would acquire a reference to the event log file and write to it without doing any synchronization, which made it prone to data races. This issue likely went under the radar because the relevant `SAFETY` comment on the unsafe block was incomplete.

The new implementation spawns a dedicated thread named `event-monitor` solely for writing to the file. It uses the MPMC channel exposed by `flume` to pass messages to the `event-monitor` thread. Since `flume::Sender<T>` implements `Sync`, it is safe for multiple threads to share it and send messages to the `event-monitor` thread. This is not possible with `std::sync::mpsc::Sender<T>` since it's `!Sync`, meaning it is not safe for it to be shared between different threads.

The `event_monitor::set_monitor` function now only initializes the required global state and returns an instance of the `Monitor` struct. This decouples the actual logging logic from the `event_monitor` crate. The `event-monitor` thread is then spawned by the `vmm` crate.

Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
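A minimal sketch of the single-writer pattern described above, using only the standard library so that it stays self-contained. It is not the code from this patch: `spawn_writer` and `run_demo` are hypothetical names, an in-memory `Vec<u8>` stands in for the event log file, and each producer gets its own `Sender` clone instead of sharing one `flume::Sender`.

```rust
use std::io::Write;
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper (not the PR's API): spawns a thread that owns `sink`
/// exclusively, so writes need no lock. The thread returns the sink once
/// every sender has been dropped.
fn spawn_writer<W: Write + Send + 'static>(
    mut sink: W,
) -> (mpsc::Sender<String>, thread::JoinHandle<W>) {
    let (tx, rx) = mpsc::channel::<String>();
    let handle = thread::Builder::new()
        .name("event-monitor".to_string())
        .spawn(move || {
            // Iteration ends when all `Sender` clones are gone.
            for line in rx {
                // A real implementation would report write errors.
                let _ = writeln!(sink, "{line}");
            }
            sink
        })
        .expect("failed to spawn writer thread");
    (tx, handle)
}

/// Starts `producers` threads that each send `per_thread` events, then
/// returns everything the writer thread wrote.
fn run_demo(producers: usize, per_thread: usize) -> String {
    let (tx, writer) = spawn_writer(Vec::<u8>::new());
    let workers: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone(); // one clone per thread, nothing is shared by reference
            thread::spawn(move || {
                for n in 0..per_thread {
                    tx.send(format!("producer={id} event={n}")).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // drop the original so the writer can observe disconnection
    for w in workers {
        w.join().unwrap();
    }
    String::from_utf8(writer.join().unwrap()).unwrap()
}
```

Because only the writer thread ever touches the sink, lines from different producers cannot interleave mid-line; their relative order across producers is still unspecified.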
992aae7 to fea89bd
I've modified the … This change separates the logging logic from …
Changes look good to me. Just one thing about the seccomp filter: I think it would be better to include `brk` and `mmap` to avoid potential violations when allocating memory on the `event_monitor` thread.
fea89bd to 9c647cd
Thanks for catching that! @likebreath Added …
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
9c647cd to 8d104b2
@brkp Could you please share a way to verify this patch on aarch64?
@peng6662001 Would you mind providing a bit more context? I'm not quite sure what you mean by "verify".
@brkp Have you ever encountered a "concurrent access" bug? How can it be reproduced?
@peng6662001 Hey, sorry for the late reply. The previous implementation of … This becomes problematic when two threads race to write to the log file at the same time. While this may not have been a significant issue in the past due to the scarce use of …
Since the 'write()' to the event file was moved to its own thread (see cloud-hypervisor#5633), we have no reliable way to read the latest contents of the event file from our integration tests, because we can't ensure that the 'read()' from our test always happens after the 'write()' from Cloud Hypervisor has completed. This is also why we started to see random failures on snapshot_restore tests (particularly when the system workload is high).

This patch adds a 1s sleep before reading the event file to mitigate the random failures.

Signed-off-by: Bo Chen <chen.bo@intel.com>
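For comparison only, here is a sketch of a different mitigation from the fixed 1s sleep that the patch adds: re-reading the file until the expected text appears or a deadline passes. The helper name `wait_for_event`, the 10 ms poll interval, and the substring match are all assumptions for illustration and do not come from the Cloud Hypervisor test suite.

```rust
use std::fs;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical test helper: re-reads `path` until `needle` shows up or
/// `timeout` elapses. Returns the file contents on success, `None` on timeout.
fn wait_for_event(path: &Path, needle: &str, timeout: Duration) -> Option<String> {
    let deadline = Instant::now() + timeout;
    loop {
        // A missing or unreadable file is treated as "not written yet".
        if let Ok(contents) = fs::read_to_string(path) {
            if contents.contains(needle) {
                return Some(contents);
            }
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

Polling returns as soon as the writer thread catches up and only waits the full timeout on failure, at the cost of the test needing to know which event text to look for.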
This patch modifies `event_monitor` to ensure that concurrent access to `event_log` from multiple threads is safe. Previously, the `event_log` function would acquire a reference to the event log file and write to it without doing any synchronization, which made it prone to data races. This issue likely went under the radar because the relevant `SAFETY` comment on the unsafe block was incomplete.

The new implementation spawns a dedicated thread named `event-monitor` solely for writing to the file. It uses the MPMC channel exposed by `flume` to pass messages to the `event-monitor` thread. Since `flume::Sender<T>` implements `Sync`, it is safe for multiple threads to share it and send messages to the `event-monitor` thread.

I looked into doing this with the unbounded MPSC in the standard library but unfortunately, it's `!Sync`, which actually is considered to be an API mistake. Meaning, the following snippet has soundness issues and is not safe if `tx` were to be a `std::sync::mpsc::Sender<T>`:

If anyone is aware of a workaround/better pattern for implementing this with the MPSC in the standard library, I'd love to hear about it.
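One possible standard-library workaround, sketched under assumptions: wrapping the sender in a `Mutex` makes the global `Sync` regardless of whether `Sender<T>` itself is, because `Mutex<T>` is `Sync` whenever `T` is `Send`. The names `EVENT_TX`, `init_events`, and the `String` payload are hypothetical, and this `event_log` is a stand-in rather than the crate's real function. As far as I know, Rust 1.72 and later also make `std::sync::mpsc::Sender<T>` itself `Sync`, which would remove the need for the lock.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Mutex, OnceLock};

// Hypothetical global. `Mutex<Sender<String>>` is `Sync` because
// `Sender<String>` is `Send`, so this static is accepted by the compiler.
static EVENT_TX: OnceLock<Mutex<Sender<String>>> = OnceLock::new();

/// Installs the global sender once and hands back the receiving end for the
/// writer thread. Returns `None` if a sender was already installed.
fn init_events() -> Option<Receiver<String>> {
    let (tx, rx) = mpsc::channel();
    EVENT_TX.set(Mutex::new(tx)).ok().map(|_| rx)
}

/// Callable from any thread. Returns `false` if the global is uninitialised,
/// the lock is poisoned, or the receiver has been dropped.
fn event_log(msg: &str) -> bool {
    let Some(tx) = EVENT_TX.get() else { return false };
    let Ok(guard) = tx.lock() else { return false };
    let sent = guard.send(msg.to_string()).is_ok();
    sent
}
```

The lock is held only for the duration of `send`, which does not block on an unbounded channel, so contention should stay low; the trade-off against `flume` is one extra mutex acquisition per event.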
Here are some links that can provide more context for this PR: