Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chrome: Higher rate of "connection retry failed" renderer process crashes #3664

Open
magreenblatt opened this issue Mar 7, 2024 · 3 comments
Labels
bug Bug report chrome Related to the Chrome runtime

Comments

@magreenblatt
Copy link
Collaborator

magreenblatt commented Mar 7, 2024

Describe the bug
Chrome runtime is seeing a higher rate of "connection retry failed" renderer process crashes (code here) compared to Alloy (same web content). We should revisit the discussion in #3260 and either (a) increase the timeouts for Chrome runtime, or (b) change the FATAL to ERROR to avoid crashing. Highlights from the previous discussion:

Original comment by Bernard de Champlain

In all cases that were logged either all of the connection retries for a specific frame eventually were acknowledged (some with well over 30 seconds of delay in rapid burst of ACKs from the browser process) or none of them were. So even though in theory the connection request message isn’t guaranteed to be delivered, I’ve seen no case where some of the request weren’t received (and I have an idea why, in some cases, none were ACKed). The ACK can take a very long time to be received in the browser process UI thread where they are processed. So, I don’t see any point to increase the number of retries, I’d even think that 2 retires are good enough to be sure the first request didn’t just get lost by some very rare occurrence. I do see value in being more patient for the timeout and possible even having an extra grace period for the last retry.

I’ve tried CEF builds with FATAL “Connection retry failure” changed to ERROR status and played with the number of retries, the timeout delays and also made the last retry timeout longer to see what logs I’d get while putting the pc under strain by linking libcef.dll, initial check out Chromium 101 and CEF-101 builds while running our application (on Windows 10). In all cases, the changing of the FATAL log to ERROR status had no detrimental or visual effect except for occasional hiccups.

Trials with only changing the timeout from 4 to 10 seconds have given me perfect results (while keeping the retries at 3). I also changed the FATAL exception to ERROR status for my build. My results with windows under cpu/memory strain that would have previously triggered FATAL multiple times, didn’t trigger it once within 5 hours of strain (while running our regression playlist). All of the frame disconnections ended up properly connected before failure would have been called.

Original comment by Dmitry Azaraev

Sorry, but timeout has nothing reasonable with broken logic. Response time depends only to given system load. Imagine what you overload system to 10x times: every thread will get 10x less time slices (approx), and in sealed cases this may lead to 30x be slower. Just noting what timeouts usually configurable exactly because it’s value reasonable according to system load, but not per se / not constant value.

@magreenblatt magreenblatt added bug Bug report chrome Related to the Chrome runtime labels Mar 7, 2024
@magreenblatt
Copy link
Collaborator Author

The vast majority of crashes are:

[ 00 ] logging::LogMessage::~LogMessage() ( logging.cc:937 )
[ 01 ] static void CefFrameImpl::OnDisconnect(CefFrameImpl::DisconnectReason) ( frame_impl.cc:609 )
[ 02 ] mojo::InterfaceEndpointClient::NotifyError(std::__Cr::optional<mojo::DisconnectReason> const &) ( interface_endpoint_client.cc:739 )
[ 03 ] static void base::internal::Invoker<base::internal::BindState<void (mojo::internal::MultiplexRouter::*)(bool),base::internal::UnretainedWrapper<mojo::internal::MultiplexRouter,base::unretained_traits::MayNotDangle,0>,bool>,void ()>::RunOnce(class base::internal::BindStateBase *) ( bind_internal.h:920 )
[ 04 ] static void base::internal::Invoker<base::internal::BindState<void (mojo::Connector::*)(const char *, unsigned int),base::internal::UnretainedWrapper<mojo::Connector,base::unretained_traits::MayNotDangle,0>,base::internal::UnretainedWrapper<const char,base::unretained_traits::MayNotDangle,0> >,void (unsigned int)>::Run(class base::internal::BindStateBase *, unsigned int) ( bind_internal.h:933 )
[ 05 ] void base::internal::Invoker<base::internal::BindState<void (*)(const base::RepeatingCallback<void (unsigned int)> &, unsigned int, const mojo::HandleSignalsState &),base::RepeatingCallback<void (unsigned int)> >,void (unsigned int, const mojo::HandleSignalsState &)>::Run(class base::internal::BindStateBase *, unsigned int, const struct mojo::HandleSignalsState & const) ( bind_internal.h:933 )
[ 06 ] static void mojo::SimpleWatcher::OnHandleReady(int, unsigned int, const struct mojo::HandleSignalsState & const) ( simple_watcher.cc:278 )

A small minority are:

[ 00 ] logging::LogMessage::~LogMessage() ( logging.cc:938 )
[ 01 ] static void CefFrameImpl::OnDisconnect(CefFrameImpl::DisconnectReason) ( frame_impl.cc:609 )
[ 02 ] static void CefFrameImpl::OnBrowserFrameTimeout() ( frame_impl.cc:529 )
[ 03 ] base::OneShotTimer::FireNow() ( timer.cc:163 )

magreenblatt added a commit that referenced this issue Mar 8, 2024
Introduce different call stacks for different types of disconnects,
and log additional state information in the FATAL message.
@magreenblatt
Copy link
Collaborator Author

magreenblatt commented Mar 8, 2024

It looks like disconnects can occur after browser process shutdown or crash.

For example (at M123):

  1. Run cefclient.exe --vmodule=frame_*=1 --enable-logging=stderr --no-sandbox --enable-chrome-runtime --url=https://www.youtube.com
  2. Close the window
  3. Get the following input
[4720:12084:0308/144027.832:VERBOSE1:frame_host_impl.cc(516)] frame 5-B6DA8473C9D0540FEA38676D78C8420C (main) detached (reason=NEW_MAIN_FRAME, is_connected=1)
[6068:15572:0308/144028.426:VERBOSE1:frame_impl.cc(615)] frame 5-25DB837C3184E63A52DD1739C2CD2EAB disconnected (reason=DETACHED, current_state=CONNECTION_ACKED, FRAME_INVALID)
[6068:15572:0308/144028.426:VERBOSE1:frame_impl.cc(615)] frame 5-B6DA8473C9D0540FEA38676D78C8420C disconnected (reason=BROWSER_FRAME_DISCONNECT, current_state=CONNECTION_ACKED)
[6068:15572:0308/144028.426:VERBOSE1:frame_impl.cc(628)] frame 5-B6DA8473C9D0540FEA38676D78C8420C connection retry scheduled

Repeating the above using a Debug build under system load (for example, while building Chromium) may delay execution sufficiently for the retry to trigger. Also, potentially if the render process is left behind as a zombie (see #3614).

magreenblatt added a commit that referenced this issue Mar 8, 2024
Introduce different call stacks for different types of disconnects,
and log additional state information in the FATAL message.
magreenblatt added a commit that referenced this issue Mar 8, 2024
Introduce different call stacks for different types of disconnects,
and log additional state information in the FATAL message.
@magreenblatt
Copy link
Collaborator Author

Related chromium-dev question about detecting browser process termination from the renderer: https://groups.google.com/a/chromium.org/g/chromium-dev/c/tFJCeTzvsnc/m/lzYhPAfkAAAJ

magreenblatt added a commit that referenced this issue Apr 22, 2024
Split the Alloy runtime into bootstrap and style components. Support
creation of Alloy style browsers and windows with the Chrome runtime.
Chrome runtime (`--enable-chrome-runtime`) + Alloy style
(`--use-alloy-style`) supports Views (`--use-views`), native parent
(`--use-native`) and windowless rendering
(`--off-screen-rendering-enabled`).

Print preview is supported in all cases except with windowless rendering
on all platforms and native parent on MacOS. It is disabled by default
with Alloy style for legacy compatibility. Where supported it can be
enabled or disabled globally using `--[enable|disable]-print-preview` or
configured on a per-RequestContext basis using the
`printing.print_preview_disabled` preference. It also behaves as
expected when triggered via the PDF viewer print button.

Chrome runtime + Alloy style behavior differs from Alloy runtime in the
following significant ways:

- Supports Chrome error pages by default.
- DevTools popups are Chrome style only (cannot be windowless).
- The Alloy extension API will not supported.

Chrome runtime + Alloy style passes all expected Alloy ceftests except
the following:

- `DisplayTest.AutoResize` (Alloy extension API not supported)
- `DownloadTest.*` (Download API not yet supported)
- `ExtensionTest.*` (Alloy extension API not supported)

This change also adds Chrome runtime support for
CefContextMenuHandler::RunContextMenu (see #3293).

This change also explicitly blocks (and doesn't retry) FrameAttached
requests from PDF viewer and print preview excluded frames (see #3664).

Known issues specific to Chrome runtime + Alloy style:
- DevTools popup with windowless rendering doesn't load successfully.
  Use windowed rendering or remote debugging as a workaround.
- Chrome style Window with Alloy style BrowserView (`--use-alloy-style
  --use-chrome-style-window`) does not show Chrome theme changes.

To test:
- Run `ceftests --enable-chrome-runtime --use-alloy-style
       [--use-chrome-style-window] [--use-views|--use-native]
       --gtest_filter=...`
- Run `cefclient --enable-chrome-runtime --use-alloy-style
       [--use-chrome-style-window]
       [--use-views|--use-native|--off-screen-rendering-enabled]`
- Run `cefsimple --enable-chrome-runtime --use-alloy-style [--use-views]`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report chrome Related to the Chrome runtime
Projects
None yet
Development

No branches or pull requests

1 participant