chromium flashes translucent for a while; xpra crashes #3410

Closed
karlkleinpaste opened this issue Jan 4, 2022 · 29 comments
Labels
bug Something isn't working


@karlkleinpaste

Started a couple terminals, a couple widgets, and chromium.
Chromium flashes translucent irregularly, exposing the content of the terminal behind itself.
Sometimes it stops and stays steady for a while, then later resumes translucent flashing.
Eventually xpra crashes. But the underlying dummy X server continues.

To Reproduce

  1. ssh pinkchip xpra start :27 --start-child=mate-terminal
  2. xpra attach ssh://pinkchip/27
  3. wait, observe flashing, wait some more
  4. client xpra disconnects, attempts reconnect, fails, exits
  5. observe existence of server crash dump:
    -rw-r-----+ 1 root root 43087375 Jan 4 15:14 /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fxpra\x20s.5271.24a5c5b9cb654d519acab75c3aaf8493.528495.1641327282000000.zst
  6. xpra's xorg is still running:
ps uax | egrep -i xorg\|xpra
root        1943  0.0  0.0 307464 15188 ?        Ss   Jan01   0:01 /usr/bin/abrt-dump-journal-xorg -fxtD
karl      528559  3.4  2.4 1153260 403644 ?      Ssl  14:38   1:21 /usr/libexec/Xorg -noreset -novtswitch -nolisten tcp +extension GLX +extension RANDR +extension RENDER -auth /home/karl/.Xauthority -logfile /run/user/5271/xpra/27/Xorg.log -configdir /run/user/5271/xpra/27/xorg.conf.d/528495 -config /etc/xpra/xorg.conf -depth 24 :27
karl      528741  0.0  0.0 597956  7908 ?        Ssl  14:38   0:00 ibus-daemon --xim --verbose --replace --panel=disable --desktop=xpra --daemonize

System Information (please complete the following information):
F35 and 4.4-10.r30771 at both ends

xpra.mp4
@karlkleinpaste karlkleinpaste added the bug Something isn't working label Jan 4, 2022
@totaam
Collaborator

totaam commented Jan 5, 2022

Can you reproduce the problem with XPRA_OPAQUE_REGION=0 ?
ie:

xpra start --env=XPRA_OPAQUE_REGION=0 --start=xterm

Crashes are "good" in that gdb should be able to tell us what's wrong.
Can you get a backtrace?

Which chromium is this? ie: How do I install it?

@karlkleinpaste
Author

--env=XPRA_OPAQUE_REGION=0 had no effect. chromium is chromium-freeworld installed from rpmfusion. But the browser flashing is the least of my worries now, as I discuss below.

A crash dump analyzed briefly by coredumpctl dump:
coredumpctl-dump.txt

The problem became rather more peculiar, due to seeing in this coredump that crash occurs in a video codec.

I am able to cause crashes on demand by invoking glxgears. However, the circumstances of failure are very specific. For testing, I am using my main machine, an HP Omen laptop with nvidia 2070 maxq, and a desktop in the next room, an MSI Trident with nvidia 2060.

  • On laptop, I create a session with xterm, and I can connect to it locally (:37), via ssh local self-reference (ssh://localhost/37), and from desktop via ssh reference (ssh://laptop/37), and glxgears invariably works fine in all circumstances: xpra server on laptop never fails.

  • On desktop, I similarly create a session with xterm, and I can connect to it from itself locally and via ssh local self-reference, and things work. But when I connect to the desktop from the laptop with ssh://desktop/37, then glxgears causes immediate crash: xpra server on desktop works when referenced by anything but laptop, in which case it dies.

I wanted to experiment with video encodings to see if I could alter the behavior, but unfortunately none of the "help" invocations work:

$ xpra --encoding=help
xpra: need a mode
$ xpra --encodings=help
xpra: need a mode
$ xpra --video-encoders=help
xpra: need a mode

Nonetheless I tried --encoding=rgb on both desktop server and laptop client, and now it does not crash. A little further experimentation shows me that it's needed only on client side, server side does not matter. I could use a hand in knowing how to get a list of available encodings etc in hope of better selection than just rgb. My concern is that the laptop, as the odd man out in inducing crashes on other machines, is having some bad effect because it advertises capability not handled properly at the servers, and works locally only because it's talking to itself.

Oh, look: Using --encoding=rgb, now I have neither crash nor flash in the original example.

These machines are all maintained in absolute lockstep in terms of their RPM complements: The laptop was generated in F35 first and then its filesystems were carried to other machines with changes only for hostname, IP address, VPN config, and fstab. So the only consequential differences are in the video/graphics hardware supported in each.

totaam added a commit that referenced this issue Jan 5, 2022
@totaam
Collaborator

totaam commented Jan 5, 2022

Thanks!

I wanted to experiment with video encodings to see if I could alter the behavior, but unfortunately none of the "help" invocations work:

$ xpra --encoding=help
xpra: need a mode

It isn't lying! Try a mode! ie:

xpra attach --encoding=help

The encodings are different depending on whether you run a server (eg: xpra start) or a client (ie: xpra attach).

it advertises capability not handled properly at the servers

That's still a server bug: no invalid capability should ever be able to trigger a codec crash.
(and in this case, it isn't even invalid, there's nothing wrong with it at all - see below)

and works locally only because it's talking to itself

Yes.
When running locally, all encodings are automatically bypassed in favour of mmap and rgb.


Looking at your coredump, the problem is unfortunately obvious once again: enc_ffmpeg is crashing because of VAAPI (not for the first time: #3174), it is accessing your GPU and crashing somehow...

ff_vaapi_encode_close
avcodec_open2
__pyx_pf_4xpra_6codecs_10enc_ffmpeg_7encoder_7Encoder_8init_encoder

vaapi is so brittle that I regret ever enabling it by default, even after restricting it to newer versions / kernels in 55be7e7.

The temporary workaround is really simple:

xpra start --video-encoders=all,-ffmpeg ...

This should also work:

xpra start --env=XPRA_VAAPI=0 ...

The next version will have it disabled by default, as it should have always been.

totaam added a commit that referenced this issue Jan 5, 2022
@karlkleinpaste
Author

Harumph.

xpra start --encoding=help
using systemd-run to wrap 'start' server command
Running scope as unit: run-rb021597f499044d3b5c4520bb9a91fac.scope
xpra server supports the following encodings:
(please wait, encoder initialization may take a few seconds)
2022-01-05 11:13:20,678 Error: failed to write pid 1096366 to pidfile '${XPRA_SESSION_DIR}/server.pid':
2022-01-05 11:13:20,678  [Errno 2] No such file or directory: '${XPRA_SESSION_DIR}/server.pid'
2022-01-05 11:13:20,941 Warning: no VAAPI support
2022-01-05 11:13:20,941  vaapi device context not found

nvidia 2070 has no vaapi support at all? Am I missing a necessary lib?

@totaam
Collaborator

totaam commented Jan 6, 2022

Ugh. I'll fix those ugly warnings.

nvidia 2070 has no vaapi support at all? Am I missing a necessary lib?

NVidia 2070 has nvenc support, which is much better / faster than vaapi. And more importantly, it is stable.

@totaam totaam closed this as completed Jan 6, 2022
@karlkleinpaste
Author

Regret to inform you... Running without vaapi is fine -- no crash -- but chromium is still doing the translucent flash thing in the absence of --encoding=rgb.
F35 and 4.4-10.r30790.

@totaam totaam reopened this Jan 6, 2022
@totaam
Collaborator

totaam commented Jan 6, 2022

Using dnf install chromium-browser I am not seeing any flickering.
Do I need to do something specific after launching chromium-browser?

With -d compress, I am seeing lots of h264 and webp frames (running locally but with --no-mmap).

Perhaps you can identify the screen updates that cause this flickering?
Either from the -d compress log output, or with the colours using XPRA_PAINT_BOX=1 client side.

@karlkleinpaste
Author

I don't know if or how much it matters, but I'm using chromium-freeworld, which is somewhat different from chromium-browser. Still from rpmfusion.

I haven't had to do anything specific. Start it (and being configured to resume from where I last quit, it re-opens the 2 tabs I typically keep there), and watch it begin to flash.

@totaam
Collaborator

totaam commented Jan 6, 2022

I've just tried chromium-freeworld and got the same result: no flicker whatsoever.
The only difference is that this system is running Fedora 34, not 35.

Does it happen if you open it with a blank static page? (ie: the google search page)

@karlkleinpaste
Author

xpra-blank-page.mp4

It flashes the portion according to the boundary of the window behind itself, plus the tab bar.

@totaam
Collaborator

totaam commented Jan 6, 2022

How odd!
It must be repainting this area of the window, perhaps xpra is sending alternating low quality and higher quality updates (the -d compress output would show this).
My tests don't trigger any such repaints.

I just remembered something similar: #1438 (comment)

If that's the case, turning off (or on?) opengl client acceleration may resolve the issue if the problem is client side: the non-opengl renderer always goes via RGB.
Or also removing the offending encoding: --encodings=-jpeg or --encodings=-webp or even both: --encodings=-jpeg,-webp.
If the problem is server side then it's more complicated: we need to know if you're using the plain jpeg encoder or nvjpeg. Perhaps nvjpeg doesn't use BT.601 like turbojpeg?


It could also be something quirky like the premultiplied alpha getting unpremultiplied twice, though that's far less likely.
chromium claims to use transparency, then uses an opaque region to disable the transparency it requested... Hence my initial idea of disabling this feature with XPRA_OPAQUE_REGION=0, which could actually make things worse.

@karlkleinpaste
Author

karlkleinpaste commented Jan 6, 2022

Using -d compress gave me no new output. Where/how does this output appear? Is it needed in server startup? I tried it only in client attach, figuring that it would be only the client's decision whether to see the extra detail.

Specifically: xpra -d compress attach ssh://ph/73 --border=orange,1

@karlkleinpaste
Author

I tried all of --encodings=-jpeg --encodings=-webp --encodings=-jpeg,-webp
No improvement with any.

@totaam
Collaborator

totaam commented Jan 7, 2022

Using -d compress gave me no new output.

It is a server flag.
You can also enable it after the server has started using:

xpra control :73 debug enable compress

I tried it only in client attach, figuring that it would be only the client's decision whether to see the extra detail.

These messages go in the server log.

I tried all of ... No improvement with any.

Damn.
Then -d compress will tell us what's happening.
-d damage could potentially also be useful later.

@karlkleinpaste
Author

server.log
Excerpt of log of existing session from the time the client was restarted without --encoding=rgb, then enabled compress.
I'm concerned about "failed to setup video pipeline", but it does not line up with the translucent flashes: it flashes a lot, but there are relatively few such failures.

Watching the flashing again, I was wrong earlier: It is flashing the entire browser window, not just the portion of the immediately-behind window.

Unrelated: Not sure why there are complaints of inability to forward sound. I have yet to try to use sound in any way within xpra.

@totaam
Collaborator

totaam commented Jan 7, 2022

From your log:

Warning: failed to setup video pipeline
 (58, (1, 1), None, 0, 0, None, 'BGRA', (1, 1), 1626, 848, jpeg(BGRA to jpega))

It's trying to setup a pipeline for jpega (jpeg with alpha) so the opaque region stuff didn't kick in or didn't match the full window and so the full browser window is being sent with an alpha channel (which is a complete waste). Perhaps that's right at the beginning as it does send screen updates mostly without the transparency after that. (BGRX vs BGRA)

I see a mix of h264, webp with some jpeg but also the occasional jpega.
There's almost 50 seconds worth of screen updates, do you know which ones correspond to the flickering?

@karlkleinpaste
Author

new.log
A few seconds of flashing.
This is a bit of by-guess-and-by-gosh evaluation, but...

The pattern, as I watched from another terminal, tail'ing server.log and watching the browser flicker, is that translucence happens during the many entries of 'encoder': 'x264' periods, each ending with 1, 2, or 3 instances of 'encoder': 'webp' where it goes non-translucent, and where the log pauses briefly.

I tried --encodings=-h264, which caused almost all of the flashing to stop...but that's almost all, not all. Just much less. But it also caused oodles of this sequence of complaints in the log, dozens and dozens of these:

2022-01-07 13:45:17,108 Warning: failed to setup video pipeline (59, (1, 1), None, 0, 0, None, 'BGRA', (1, 1), 1626, 848, jpeg(BGRA to jpega))
Traceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/xpra/server/window/window_video_source.py", line 1660, in setup_pipeline
    if self.setup_pipeline_option(width, height, src_format, *option):
  File "/usr/lib64/python3.10/site-packages/xpra/server/window/window_video_source.py", line 1747, in setup_pipeline_option
    ve.init_context(encoder_spec.encoding, enc_width, enc_height, enc_in_format, options)
  File "xpra/codecs/jpeg/encoder.pyx", line 171, in xpra.codecs.jpeg.encoder.Encoder.init_context
AssertionError
2022-01-07 13:45:17,108 Error: failed to setup a video pipeline for BGRA at 1626x848
2022-01-07 13:45:17,108  tried the following option
2022-01-07 13:45:17,108  (59, (1, 1), None, 0, 0, None, 'BGRA', (1, 1), 1626, 848, jpeg(BGRA to jpega))
2022-01-07 13:45:17,108 Error: failed to setup a video pipeline for auto encoding with source format BGRA
2022-01-07 13:45:17,108  all encoders: 
2022-01-07 13:45:17,108  supported CSC modes: 
2022-01-07 13:45:17,108  supported encoders: 
2022-01-07 13:45:17,108  encoders CSC modes: 
2022-01-07 13:45:17,117 compress:   9.8ms for 1626x848  pixels at    0,0    for wid=5     using     jpega with ratio   1.1%  ( 5386KB to    60KB), sequence   706, client_options={'quality': 42, 'alpha-offset': 45053, 'encoder': 'xpra.server.window.window_video_source'}, options={'quality': 42, 'speed': 21, 'rgb_formats': ('YUV420P', 'YUV422P', 'YUV444P', 'GBRP', 'BGRA', 'BGRX', 'RGBA', 'RGBX', 'RGB', 'BGR'), 'zlib': True, 'lz4': True, 'content-type': 'browser', 'av-delay': 0, 'window-size': (1626, 848), 'cuda-device-context': None}
2022-01-07 13:45:17,173 compress:   5.5ms for  246x664  pixels at    8,8    for wid=4     using      webp with ratio   3.1%  (  638KB to    20KB), sequence    37, client_options={'rgb_format': 'BGRX', 'quality': 100, 'encoder': 'webp'}, options={'quality': 100, 'speed': 5, 'rgb_formats': ('YUV420P', 'YUV422P', 'YUV444P', 'GBRP', 'BGRX', 'RGBX', 'RGB', 'BGR'), 'zlib': True, 'lz4': True, 'alpha': False, 'content-type': 'picture', 'window-size': (817, 721), 'cuda-device-context': None}
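
Rather than eyeballing the log while tailing it, the pattern described above (runs of one encoding followed by a few frames of another) can be summarised per window. The following is only a minimal sketch: it assumes the `compress:` line layout shown in the excerpts in this thread, and the helper names `parse_compress_line` and `encoding_runs` are made up for illustration, they are not part of xpra.

```python
import re
from itertools import groupby

# Matches the fixed leading part of a "compress:" line, as seen in the
# "-d compress" excerpts in this thread (layout assumed from those samples).
COMPRESS_RE = re.compile(
    r"compress:\s+(?P<ms>[\d.]+)ms for\s+(?P<w>\d+)x(?P<h>\d+)\s+pixels at\s+"
    r"(?P<x>\d+),(?P<y>\d+)\s+for wid=(?P<wid>\d+)\s+using\s+(?P<encoding>\S+)\s+with ratio"
)
# 'rgb_format' (singular) only appears in client_options for some encodings.
RGB_FORMAT_RE = re.compile(r"'rgb_format': '(\w+)'")


def parse_compress_line(line):
    """Return a dict describing one compress line, or None if it is not one."""
    m = COMPRESS_RE.search(line)
    if m is None:
        return None
    fmt = RGB_FORMAT_RE.search(line)
    return {
        "wid": int(m.group("wid")),
        "encoding": m.group("encoding"),
        "size": (int(m.group("w")), int(m.group("h"))),
        "rgb_format": fmt.group(1) if fmt else None,
    }


def encoding_runs(lines, wid):
    """Collapse consecutive frames of one window into (encoding, count) runs."""
    records = [r for r in map(parse_compress_line, lines) if r and r["wid"] == wid]
    return [(enc, len(list(grp))) for enc, grp in groupby(records, key=lambda r: r["encoding"])]
```

Feeding it the lines of the server log and the browser's window id gives a compact list of runs, which makes it easier to see which encoding was in use just before and after each flash.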

@karlkleinpaste
Author

Just for giggles, I tried removing --encoding=rgb and adding --video-encoders=nvenc but it gave no improvement.

@totaam
Collaborator

totaam commented Jan 10, 2022

I am still unable to reproduce any problems on my test systems.

That said, the commits above fix two issues that could have an impact:

  • the failed to setup a video pipeline for BGRA error by adding jpega to the plain jpeg video encoder (this may not help at all)
  • avoid using h264 for pixel formats with an alpha channel (ie: BGRA) - it is possible that the client ended up using BGRA with garbage in the alpha channel when the decompressor actually gives it BGRX.

The problem with this is that jpega is less efficient than x264 and browser windows do not actually use any alpha channel so we really don't want to be spending time encoding it and then wasting bandwidth sending it.
So we should try to figure out why the opaque region detection stuff isn't doing this.
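
To illustrate the second point above, here is a small sketch of why a mismatch between BGRX and BGRA could look like translucency. It is not the xpra renderer: it assumes plain straight-alpha "over" compositing onto an opaque background, and the function names are hypothetical.

```python
def over(src_bgr, alpha, dst_bgr):
    """Blend a straight-alpha source pixel over an opaque destination pixel."""
    return tuple(
        (s * alpha + d * (255 - alpha) + 127) // 255
        for s, d in zip(src_bgr, dst_bgr)
    )


def paint(pixel, dst_bgr, treat_as):
    """Paint one 4-byte pixel, interpreting its fourth byte as padding or alpha."""
    b, g, r, fourth = pixel
    if treat_as == "BGRX":
        # Fourth byte is padding: the pixel is fully opaque whatever it holds.
        return (b, g, r)
    if treat_as == "BGRA":
        # Fourth byte is alpha: any garbage left in it lets the background through.
        return over((b, g, r), fourth, dst_bgr)
    raise ValueError(f"unsupported pixel format: {treat_as}")
```

With a white browser pixel whose fourth byte happens to be 0, `paint(..., "BGRX")` keeps it white, whereas `paint(..., "BGRA")` returns whatever is behind the window, which matches the symptom of seeing the terminal through the browser.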

When I run a browser in xpra and use xprop on its window I see:

_NET_WM_OPAQUE_REGION(CARDINAL) = 0, 0, 2342, 1775

And the xpra server does pick it up:

windows.2.opaque-region=((0, 0, 2342, 1775),)

Which means that the server correctly discards the alpha channel and we can see BGRX instead of BGRA in -d compress as 'rgb_format': 'BGRX':

compress:   0.6ms for   45x808  pixels at    0,256  for wid=2     using      webp with ratio   0.0%  (  142KB to     0KB), \
    sequence   274, client_options={'rgb_format': 'BGRX', 'quality': 100, 'encoder': 'webp', 'flush': 1}, \
    options={'quality': 100, 'speed': 65, 'rgb_formats': ('YUV420P', 'YUV422P', 'YUV444P', 'GBRP', 'BGRA', 'BGRX', 'RGBA', 'RGBX', 'RGB', 'BGR'), 'zlib': True, 'lz4': True, 'content-type': 'browser', 'window-size': (2342, 1775), 'scroll': True, 'cuda-device-context': None}

Somehow, it looks like on your system one of these steps isn't working and you end up with an alpha channel.

Did you try turning opengl on or off in the client to see if that helps with the flickering?

Rant: the opaque region stuff uses an X11 window property, which has no replacement in Wayland... so that's going to get worse as they gradually remove X11 features to even things out with Wayland and claim (lack of) feature parity.


Unrelated: Not sure why there are complaints of inability to forward sound. I have yet to try to use sound in any way within xpra.

It is enabled by default and you should turn it off if you're not using it as it uses CPU and bandwidth.
It also doesn't work at all on Fedora 34+ because they've replaced pulseaudio with pipewire which completely breaks our audio forwarding.

@karlkleinpaste
Author

I expect this will make you unhappy:

$ xprop | grep -i opaque
[ empty output ]
$ xpra info | grep -i opaque
windows.1.opaque-region=((0, 0, 1617, 911),)
windows.3.opaque-region=((0, 0, 181, 153),)
windows.4.opaque-region=((0, 0, 817, 721),)
windows.15.opaque-region=()
windows.18.opaque-region=()

The browser is 18, which I know because I just restarted it and did xpra info before and after restart.

I had neglected to experiment with --opengl before. --opengl=yes has flickering. --opengl=no seems to stop it.

So far, I think that makes either --encoding=rgb or --opengl=no the only ways to stop it completely.
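
For anyone repeating this check across several sessions, the `xpra info` output can be filtered mechanically. This is only a sketch: it assumes the `windows.N.opaque-region=...` line format shown above, and the function names are invented for the example.

```python
import ast
import re

# One line per window, e.g. "windows.18.opaque-region=()"
OPAQUE_RE = re.compile(r"^windows\.(\d+)\.opaque-region=(.*)$")


def opaque_regions(info_text):
    """Map window id -> tuple of (x, y, w, h) rectangles from `xpra info` text."""
    regions = {}
    for line in info_text.splitlines():
        m = OPAQUE_RE.match(line.strip())
        if m:
            # The value is printed as a Python tuple literal, so parse it safely.
            regions[int(m.group(1))] = ast.literal_eval(m.group(2))
    return regions


def windows_without_opaque_region(info_text):
    """Return the ids of windows that report an empty opaque region."""
    return sorted(wid for wid, rects in opaque_regions(info_text).items() if not rects)
```

Applied to the output pasted above, this would single out windows 15 and 18 as the ones still carrying an alpha channel.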

@totaam
Collaborator

totaam commented Jan 10, 2022

I know because I just restarted it and did xpra info before and after restart

This should also work:

xpra info | egrep "opaque|\.title|\.class"

--opengl=no seems to stop it.

Ah! So it's probably a case of alpha channel being discarded.

...the only ways to stop it completely.

Or the fixes above. (new beta builds are on the way)
But as I said, this will be inefficient until we can discard the unused alpha channel.
(and I don't want to just assume that browsers don't actually use it - they should be able to tell us reliably)

@karlkleinpaste
Author

screenshot1
Just installed 4.4-10.r30797 at both ends. Restarted at server, reconnected at client.

The browser is simply stuck in translucent state. Not flickering. Just steady translucent.

A late, possibly bad thought: I use MATE desktop with compiz WM. Any possibility that compiz is not getting along well with the remote browser window?

@totaam
Collaborator

totaam commented Jan 11, 2022

Just installed 4.4-10.r30797 at both ends.

I believe that the changes above were in r30805 or later, server side. There were no client-side fixes.

Restarted at server, reconnected at client.

FYI: xpra upgrade saves a restart.
The client should re-connect automatically when using xpra upgrade.

The browser is simply stuck in translucent state. Not flickering. Just steady translucent.

It's probably using jpega and webp now, that's what I see with -d compress.
Those two encodings handle the alpha channel.


I've just tried it in a Fedora 35 VM and although the opaque region attribute is present when running in Xwayland, it is missing when running in xpra. I was hoping that 4ec525a would help but it doesn't. (it also doesn't show up on Fedora 34 at all - not even in a plain X11 session)
But even then, we're going to have to be more clever about handling opaque regions because there are multiple non-overlapping opaque regions with chromium-freeworld.
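
As a rough idea of what handling multiple rectangles involves, the sketch below checks whether a list of `(x, y, width, height)` rectangles, in the layout `_NET_WM_OPAQUE_REGION` uses, together covers the whole window. It is not the xpra implementation, and `covers_window` is a hypothetical name; it simply cuts the window into cells along every rectangle edge and checks each cell.

```python
def covers_window(rects, width, height):
    """Return True if the (x, y, w, h) rectangles cover the full width x height window."""
    if width <= 0 or height <= 0:
        return False
    # Clip every rectangle to the window and convert to (x0, y0, x1, y1).
    clipped = []
    for x, y, w, h in rects:
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, width), min(y + h, height)
        if x0 < x1 and y0 < y1:
            clipped.append((x0, y0, x1, y1))
    # Every rectangle edge splits the window into a grid of cells.
    xs = sorted({0, width, *(c for r in clipped for c in (r[0], r[2]))})
    ys = sorted({0, height, *(c for r in clipped for c in (r[1], r[3]))})
    # Each cell lies entirely inside or outside any given rectangle,
    # so one uncovered cell means the window is not fully opaque.
    for cx0, cx1 in zip(xs, xs[1:]):
        for cy0, cy1 in zip(ys, ys[1:]):
            if not any(
                x0 <= cx0 and cx1 <= x1 and y0 <= cy0 and cy1 <= y1
                for x0, y0, x1, y1 in clipped
            ):
                return False
    return True
```

Only when this returns True would it be safe to drop the alpha channel for the entire window; partial coverage would need per-region handling.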

That said, I'm still not seeing any transparency issues client side. Perhaps it is an OpenGL rendering problem?
Can you post xpra opengl from your client?
I hope it's not a Wayland session or something funky?
Can you try different clients? ie: an MS Windows client perhaps?


Then there's also another problem with chromium-freeworld in Fedora 35: when starting without a default page, it seems to be spinning wildly, consuming lots of CPU doing nothing useful until I give it an address to display.

@totaam
Collaborator

totaam commented Jan 11, 2022

There are yet more problems on Fedora 35:

  • I get reproducible "crashes" (actually server just exits!?) when mmap is used
  • chromium popup windows (ie: settings) end up in the wrong place and close almost immediately (this looks like a Wayland "feature" as this doesn't happen from X11 or MS Windows clients)

This was referenced Jan 11, 2022
@karlkleinpaste
Author

Yesterday was too busy to look at this. Back at it now.

changes above were in r30805

OK, now running 4.4-r30818 everywhere and have done server xpra upgrade, then reconnected client without --encoding=rgb.

And the browser no longer flickers. Yay?

Is this a final fix? Or is this an intermediate code test configuration?

Client's xpra opengl output: opengl.txt
Bear in mind that this started with client (F35 + nvidia 2070 maxq) inducing server side crashes (nvidia 2060, nvidia 840M), and I'm still using --env=XPRA_VAAPI=0. Do I still need that? Maybe nvidia has done something "wrong" with the 2070 in terms of capability advertisement?

If the current degraded state of Wayland continues, I will stay with X11 essentially forever. If Wayland's failures begin to infect X11, I don't know what I'll do. Maybe retain old copies of relevant RPMs, and hand-install and -maintain them in newer systems?

@totaam
Collaborator

totaam commented Jan 14, 2022

And the browser no longer flickers. Yay?

Excellent.
The only problem is that I'm not quite sure which commit actually fixed it since I never reproduced it!

Is this a final fix? Or is this an intermediate code test configuration?

"If it works, ship it!" (tm)

... and I'm still using --env=XPRA_VAAPI=0. Do I still need that?

No, this is now the default: b3c7c87

Maybe nvidia has done something "wrong" with the 2070 in terms of capability advertisement?

I don't think so, vaapi is just very brittle, with all cards. Just look at the screen sideways and it will crash.

If Wayland's failures begin to infect X11

That's already happening via GTK.

Let's close this ticket since all the issues are fixed.

Thank you very much for your help!

@totaam totaam closed this as completed Jan 14, 2022
@karlkleinpaste
Author

Once again, regret to mention, but my browser flashing problem is back with a vengeance.
I had wondered if perhaps this was a side effect of using VirtualGL, but I tried it with VGL disabled, with no difference in display effect.
https://user-images.githubusercontent.com/7548920/153732206-d855d299-5b21-45d9-8c40-39a7acaad694.mp4

@totaam
Collaborator

totaam commented Feb 17, 2022

This commit should help with that: e2d606d
We won't be sending the alpha channel for browser windows - assuming that the window is correctly detected as content-type=browser.

@karlkleinpaste
Author

Appears to be a good heuristic:

$ xpra info | grep browser | grep -v start-menu
client.window.12.content-type=browser
windows.12.content-type=browser
windows.12.role=browser
