use kitty deflate capability on initial load when using animation protocol #1695

Closed
dankamongmen opened this issue May 29, 2021 · 20 comments · Fixed by #1965
Labels: bitmaps (bitmapped graphics: sixel, kitty, mmap), enhancement (New feature or request), perf (sweet sweet perf)

@dankamongmen
Owner

Kitty can accept PNG data (in byte form, not from a file, meaning this technique can be used remotely) using the following syntax:

<ESC>_Gf=100;<payload><ESC>\

if we are given a PNG in ncvisual_from_file(), it would probably be best to send it this way -- it ought to be well-compressed. with that said, as soon as we make any change to the image (wiping etc), we'd need to send along RGBA unless we intend to rebuild the PNG (we don't). we'd also need to verify that the content is indeed PNG, which we can't determine just based on the filename etc. (and it would be nice to be able to feed a PNG which we received from memory rather than the filesystem).

as an example, rgb-sponge.png expands to 64MiB of RGBA (4096 * 4096 * 4B), but the PNG itself is only 38MiB, roughly 58% of the decoded form. other PNGs are likely to be far better compressed.
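as a concrete sketch of what f=100 transmission looks like on the wire (assuming the base64-encoded payload fits in a single chunk of at most 4096 bytes; larger payloads must be split across m=1/m=0 continuation escapes, and base64_encode() here is a hypothetical helper, not notcurses API):

```c
// minimal sketch: transmit an in-memory PNG via the kitty graphics protocol.
// a=T: transmit and display; f=100: payload is PNG data (dimensions etc.
// come from the PNG itself). base64_encode() is hypothetical.
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

char* base64_encode(const void* buf, size_t len); // hypothetical helper

static int kitty_emit_png(FILE* out, const void* png, size_t pnglen){
  char* b64 = base64_encode(png, pnglen);
  if(b64 == NULL){
    return -1;
  }
  int r = fprintf(out, "\x1b_Ga=T,f=100;%s\x1b\\", b64);
  free(b64);
  return r < 0 ? -1 : 0;
}
```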

dankamongmen added the enhancement (New feature or request), perf (sweet sweet perf), and bitmaps (bitmapped graphics: sixel, kitty, mmap) labels on May 29, 2021
dankamongmen self-assigned this on May 29, 2021
@dankamongmen
Owner Author

worldmap.png is 154KB, but 860 * 475 * 4 == 1.63MB expanded. that's a reduction of over 90%. this is by far the best way to save bandwidth on images loaded as PNG. that means, however, that it won't help on any videos, and i'm not sure how (or even if) we can determine that an image was a png. by the time such a thing would be determined, we'd already have expanded it. and we can't use this with rescaling (though kitty can rescale itself).

under the principle that it's better to spend extra cycles in notcurses to save on transmitted data, perhaps we'd want to convert all frames to PNG upon receipt? this might introduce unacceptable latency, though; pngcrush and optipng are not exactly fast programs. it also looks like we have to base64-encode the PNG data just like RGBA, so there's no advantage on that front. hrmmm.

given that we would need to know that the input is PNG, and that's not made readily available from either of our multimedia engines, and that we'd want to use this method for most visuals if it does indeed work well, i think the operative method would be an on-the-fly PNG encoding of our RGBA data. i don't have the full PNG specification memorized, but i think you can do some cheap RLE that might get us decent savings.

also, PNG admits palette-indexed graphics, which could be very effective compression for images using 256 or fewer colors...
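for scale, the on-the-fly encoding path with libpng would look something like this (a sketch only, with setjmp-based error handling elided, and not the approach ultimately taken below, where kitty's built-in deflate wins out):

```c
// rough sketch: encode an RGBA framebuffer to PNG in memory with libpng.
// illustrative only; names are not notcurses API, and error handling via
// setjmp/png_jmpbuf is omitted for brevity.
#include <png.h>
#include <stdlib.h>
#include <string.h>

struct pngbuf { unsigned char* data; size_t len; };

static void pngbuf_write(png_structp png, png_bytep data, png_size_t len){
  struct pngbuf* pb = png_get_io_ptr(png);
  unsigned char* tmp = realloc(pb->data, pb->len + len);
  if(tmp){
    memcpy(tmp + pb->len, data, len);
    pb->data = tmp;
    pb->len += len;
  }
}

static void pngbuf_flush(png_structp png){ (void)png; }

// rgba: leny rows of lenx 4-byte RGBA pixels, tightly packed
static int rgba_to_png(const unsigned char* rgba, int leny, int lenx,
                       struct pngbuf* pb){
  png_structp png = png_create_write_struct(PNG_LIBPNG_VER_STRING,
                                            NULL, NULL, NULL);
  png_infop info = png_create_info_struct(png);
  png_set_write_fn(png, pb, pngbuf_write, pngbuf_flush);
  png_set_IHDR(png, info, lenx, leny, 8, PNG_COLOR_TYPE_RGBA,
               PNG_INTERLACE_NONE, PNG_COMPRESSION_TYPE_DEFAULT,
               PNG_FILTER_TYPE_DEFAULT);
  png_write_info(png, info);
  for(int y = 0 ; y < leny ; ++y){
    png_write_row(png, rgba + (size_t)y * lenx * 4);
  }
  png_write_end(png, NULL);
  png_destroy_write_struct(&png, &info);
  return 0;
}
```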

@dankamongmen
Owner Author

so if we were to go this route, do we want to rely on multimedia engines to do the encoding, or do we want to add libpng as a base dependency?

good: we could use this fast method even without a multimedia backend, and we wouldn't have to maintain two encode-to-PNG paths
bad: two PNG cores when a multimedia backend is present, and possibly more code (possibly not)

there's also the idea of a third multimedia engine that is only libpng. vomit.

oh, shit -- if we have libpng in notcurses-core, would we then be able to decode png files? hrmmmm. and if so, if we have a multimedia engine, which one would we use for png decoding?

@dankamongmen
Owner Author

Before doing all this, I'd like to do a simple benchmark. Take an existing PNG and build a kitty graphics escape out of it. Then take the equivalent RGBA-based graphic from Notcurses. Cat them both through kitty, and ensure that the former is significantly faster (since we'll be adding an encode step). If not, there's no point in doing this.

@dankamongmen
Owner Author

echo -e '\e_Ga=T,f=100,t=f,s=860,v=475;L2hvbWUvZGFuay9zcmMvZGFua2Ftb25nbWVuL25vdGN1cnNlcy9kYXRhL3dvcmxkbWFwLnBuZw==\e\\' works for sending the png directly (t=f, where the payload is the base64-encoded path to worldmap.png), vs ncplayer -bpixel -k ../data/worldmap.png to generate the RGBA equivalent.
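in other words, the t=f form sends only a base64-encoded path, and kitty itself opens the file; roughly (reusing the hypothetical base64_encode() from the sketch above):

```c
// sketch: with t=f, the payload is a path to a file which kitty (the
// terminal) reads itself, so only ~100 bytes cross the wire regardless of
// image size. s/v give pixel width/height. base64_encode() is hypothetical.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char* base64_encode(const void* buf, size_t len); // hypothetical helper

static int kitty_emit_png_path(FILE* out, const char* path, int pixx, int pixy){
  char* b64 = base64_encode(path, strlen(path));
  if(b64 == NULL){
    return -1;
  }
  int r = fprintf(out, "\x1b_Ga=T,f=100,t=f,s=%d,v=%d;%s\x1b\\", pixx, pixy, b64);
  free(b64);
  return r < 0 ? -1 : 0;
}
```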

@dankamongmen
Owner Author

[screenshot: 2021-06-24-023229_1072x1417_scrot]

kpng is 108 bytes; ncpng is 2183539 bytes (~2.1MiB)

@dankamongmen
Owner Author

kpng:

real	0m0.003s
user	0m0.000s
sys	0m0.003s

real	0m0.003s
user	0m0.001s
sys	0m0.002s

real	0m0.003s
user	0m0.001s
sys	0m0.002s

real	0m0.003s
user	0m0.003s
sys	0m0.000s

real	0m0.003s
user	0m0.001s
sys	0m0.001s

real	0m0.003s
user	0m0.000s
sys	0m0.003s

real	0m0.003s
user	0m0.001s
sys	0m0.001s

real	0m0.003s
user	0m0.001s
sys	0m0.001s

real	0m0.003s
user	0m0.000s
sys	0m0.003s

real	0m0.003s
user	0m0.001s
sys	0m0.001s

ncpng:

real	0m0.024s
user	0m0.000s
sys	0m0.012s

real	0m0.028s
user	0m0.001s
sys	0m0.010s

real	0m0.024s
user	0m0.001s
sys	0m0.011s

real	0m0.036s
user	0m0.001s
sys	0m0.011s

real	0m0.028s
user	0m0.001s
sys	0m0.011s

real	0m0.031s
user	0m0.001s
sys	0m0.012s

real	0m0.034s
user	0m0.000s
sys	0m0.012s

real	0m0.033s
user	0m0.000s
sys	0m0.012s

real	0m0.033s
user	0m0.000s
sys	0m0.015s

real	0m0.034s
user	0m0.000s
sys	0m0.013s

yeah, that's pretty substantial. of course, we're going to be sending PNG data, not a filename, so we need to test with that.

@dankamongmen
Owner Author

this difference was only exacerbated by running remotely

@dankamongmen
Owner Author

[schwarzgerat](0) $ ls -l kpng ncpng png.png 
-rw-r--r-- 1 dank dank     108 2021-06-24 02:32 kpng
-rw-r--r-- 1 dank dank 2183539 2021-06-24 02:29 ncpng
-rw-r--r-- 1 dank dank  206497 2021-06-24 02:41 png.png
[schwarzgerat](0) $ 

png.png is an order of magnitude smaller than ncpng, but three orders of magnitude larger than kpng. they're smaller orders of magnitude, though =]

@dankamongmen
Owner Author

times (sending png.png, i.e. the actual base64-encoded PNG data):

real	0m0.004s
user	0m0.000s
sys	0m0.004s

real	0m0.020s
user	0m0.000s
sys	0m0.005s

real	0m0.017s
user	0m0.000s
sys	0m0.004s

real	0m0.018s
user	0m0.003s
sys	0m0.001s

real	0m0.017s
user	0m0.000s
sys	0m0.004s

real	0m0.017s
user	0m0.000s
sys	0m0.004s

real	0m0.017s
user	0m0.000s
sys	0m0.004s

real	0m0.017s
user	0m0.000s
sys	0m0.004s

real	0m0.017s
user	0m0.000s
sys	0m0.004s

real	0m0.017s
user	0m0.003s
sys	0m0.001s

so a healthy savings over rgba, but definitely slower than providing a file reference.

@dankamongmen
Owner Author

ssh over local wireless:

k.png:

real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s
real	0m0.001s

png.png:

real	0m0.004s
real	0m0.004s
real	0m0.004s
real	0m0.003s
real	0m0.003s
real	0m0.005s
real	0m0.020s
real	0m0.013s
real	0m0.017s
real	0m0.003s

nc.png:

real	0m0.085s
real	0m0.103s
real	0m0.076s
real	0m0.072s
real	0m0.068s
real	0m0.071s
real	0m0.068s
real	0m0.069s
real	0m0.071s
real	0m0.077s

so we retain big wins on remote. how the hell is k.png working remotely?

@dankamongmen
Owner Author

kovidgoyal/kitty#3758

@dankamongmen
Owner Author

so first off, i now know how PNG works, and it's not a super-advanced scheme. there are a few filters, and then things are run through deflate (LZ77 plus Huffman coding). note that this deflate means you have to inflate to do any in-place editing. so i think this is probably not a win for the non-animated case, where we have to edit in place and rewrite the entire image.

for the animated case, we're only writing the entire image a single time; we never edit it in place. from that point on, it's all sending null cells and auxvec-rebuilt cells. so we might as well get the one-time win from deflate, right? just not via PNG, i think. PNG introduces other overheads in both complexity and bandwidth. i'd just as soon use the kitty protocol's built-in deflate and avoid that. so this would only be for the animation case, on the initial load. yeah, let's do it. i bet it'll compress long transparent regions really well, and boost frame rates on e.g. xray.

another idea is setting the rgb of transparent regions to all 0s, so they deflate better, since we don't need to preserve them any further.
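a rough sketch of that initial-load shape, using zlib's one-shot compress2() for clarity (chunked/streaming emission is discussed further below; names here are illustrative, not notcurses API, and the deflated buffer still has to be base64-encoded and emitted with o=z on the first escape):

```c
// sketch: zero the color of fully transparent pixels (we never need it
// again, and runs of identical bytes deflate well), then deflate the whole
// RGBA frame once for the initial kitty transmission (o=z).
#include <zlib.h>
#include <stdlib.h>
#include <string.h>

// rgba: leny * lenx pixels, 4 bytes each, alpha last (RGBA byte order)
static unsigned char* deflate_frame(unsigned char* rgba, int leny, int lenx,
                                    size_t* dlen){
  size_t npixels = (size_t)leny * lenx;
  for(size_t i = 0 ; i < npixels ; ++i){
    if(rgba[4 * i + 3] == 0){       // fully transparent pixel
      memset(rgba + 4 * i, 0, 3);   // zero out R, G, B
    }
  }
  uLongf blen = compressBound(npixels * 4);
  unsigned char* dbuf = malloc(blen);
  if(dbuf == NULL){
    return NULL;
  }
  if(compress2(dbuf, &blen, rgba, npixels * 4, Z_DEFAULT_COMPRESSION) != Z_OK){
    free(dbuf);
    return NULL;
  }
  *dlen = blen;
  return dbuf;  // caller base64s this and emits it in chunks, with o=z
}
```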

dankamongmen changed the title from "feed kitty PNG data directly when possible" to "use kitty deflate capability on initial load when using animation protocol" on Jul 18, 2021
@dankamongmen
Owner Author

just about got this working. hey @kovidgoyal i assume that all chunks have to be deflated if any are, correct? since o=z goes on the initial chunk? i'm just wondering because ideally i'd take whichever was smaller (deflated or original) for each chunk. i'm especially worried about deflate blowing up my 768 pixels (3072 bytes of RGBA, which base64-encodes to exactly 4096 bytes) past 4096 bytes, at which point i blow out a chunk (if i detect this case, i'll probably just split that chunk).

@kovidgoyal

yes all chunks must use the same encoding.

@dankamongmen
Owner Author

hrmm, better would be eliminating the whole idea of 768 pixels == chunk when operating in this mode, and instead just streaming into the zlib automaton, taking up to 4096 bytes from it at a time. yes, that's the way to do it. it just doesn't mesh smoothly with the existing write_kitty_data(), which we need to retain for older Kittys.
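a sketch of that streaming shape, assuming the 4096-byte per-escape cap applies to the base64-encoded payload (hence roughly 3072 raw bytes of deflate output per chunk); base64_encode() is again a hypothetical helper, and error paths are trimmed:

```c
// sketch: feed the RGBA through a z_stream and peel off one escape's worth
// of deflate output at a time, instead of fixing chunks at 768 pixels.
// o=z rides on the first escape only; m=1 marks continuation, m=0 the last.
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>

#define RAWCHUNK 3072  // base64-encodes to 4096 bytes, the assumed payload cap

char* base64_encode(const void* buf, size_t len); // hypothetical helper

static int kitty_emit_deflated(FILE* out, const void* rgba, size_t rawlen,
                               int pixx, int pixy){
  z_stream z = {0};                       // zalloc/zfree/opaque all Z_NULL
  if(deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK){
    return -1;
  }
  z.next_in = (Bytef*)rgba;
  z.avail_in = (uInt)rawlen;              // all input is available up front
  unsigned char chunk[RAWCHUNK];
  int first = 1;
  int zret;
  do{
    z.next_out = chunk;
    z.avail_out = sizeof(chunk);
    zret = deflate(&z, Z_FINISH);
    size_t have = sizeof(chunk) - z.avail_out;
    int more = (zret != Z_STREAM_END);    // further chunks still to come?
    char* b64 = base64_encode(chunk, have);
    if(first){
      fprintf(out, "\x1b_Ga=T,f=32,o=z,s=%d,v=%d,m=%d;%s\x1b\\",
              pixx, pixy, more, b64);
      first = 0;
    }else{
      fprintf(out, "\x1b_Gm=%d;%s\x1b\\", more, b64);
    }
    free(b64);
  }while(zret != Z_STREAM_END);
  deflateEnd(&z);
  return 0;
}
```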

dankamongmen added a commit that referenced this issue Jul 21, 2021
@dankamongmen
Owner Author

ok, got it implemented and working.

good news: it's effective, at least on certain graphics. here's 2X xray before:

973 renders, 184.15ms (140.28µs min, 189.26µs avg, 299.67µs max)
973 rasters, 76.53ms (27.16µs min, 78.65µs avg, 142.69µs max)
973 writes, 30.95s (148.38µs min, 31.81ms avg, 46.46ms max)
2.95GiB (8.80KiB min, 3.10MiB avg, 3.11MiB max)
0 failed renders, 0 failed rasters, 0 refreshes, 0 input errors
RGB emits:elides: def 2765:263302 fg 15023:257427 bg 1595:12447 0 inputs
Cell emits:elides: 280109:13886771 (98.02%) 98.96% 94.49% 88.64%
Bitmap emits:elides: 972:0 (0.00%) 2.94GiB (99.96%) SuM: 972 (99.90%)

and after:

973 renders, 194.68ms (147.87µs min, 200.08µs avg, 322.98µs max)
973 rasters, 79.44ms (47.19µs min, 81.65µs avg, 155.99µs max)
973 writes, 655.70ms (56.35µs min, 673.89µs avg, 2.14ms max)
132.66MiB (3.23KiB min, 139.61KiB avg, 280.71KiB max) 0 inputs
0 failed renders, 0 failed rasters, 0 refreshes, 0 input errors
RGB emits:elides: def 2765:263302 fg 15347:258740 bg 1712:13967
Cell emits:elides: 281746:13885134 (98.01%) 98.96% 94.40% 89.08%
Bitmap emits:elides: 972:0 (0.00%) 131.31MiB (98.98%) SuM: 0 (0.00%)

now admittedly this is a pretty compressible bitmap, but still, a 96% reduction in transmitted bytes is a huge fucking win. the problem lies here:

before:

             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣
 1│    xray│  20.57s│    486│   1.47Gi│   23.6│ 0│ 0│72│  32.38║
 2│    xray│  21.33s│    486│   1.47Gi│   22.8│ 0│ 0│75│  30.00║
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝
              41.90s│    972│   2.95Gi│

after:

             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣
 1│    xray│  29.49s│    486│  66.33Mi│   16.5│ 0│ 0│ 1│  1.06K║
 2│    xray│  29.29s│    486│  66.32Mi│   16.6│ 0│ 0│ 1│  1.03K║
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝
              58.78s│    972│ 132.65Mi│

so yeah, we're writing tremendously less data, but total runtime went from 41.90s to 58.78s (roughly 40% more), and our FPS have dropped correspondingly.

over local wireless

before

             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣
 1│    xray│  65.73s│    486│   1.47Gi│    7.4│ 0│ 0│76│   9.64║
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝
              65.73s│    486│   1.47Gi│

after

             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣
 1│    xray│  34.11s│    486│  66.33Mi│   14.2│ 0│ 0│ 2│ 592.13║
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝
              34.11s│    486│  66.33Mi│

so performance improved in the network case, where bandwidth absolutely dominates delay. i suspect we just have a crappy first implementation, and we can probably speed it up significantly. if so, this ought to be a pretty solid win.

i went ahead and moved to a chunk-at-end scheme, so we're issuing optimal chunks (i.e. each is exactly 4096 bytes until we get to the end). obviously this has memory cost on the order of the bitmap size, since we're buffering up all the deflate output.

@dankamongmen
Owner Author

Samples: 124K of event 'cycles', Event count (approx.): 127054777008                                    
  Children      Self  Command         Shared Object                Symbol
    26.02%    25.98%  notcurses-demo  libswscale.so.5.7.100        [.] yuv2rgba32_full_X_c
-   18.98%    18.70%  notcurses-demo  libnotcurses-core.so.2.3.11  [.] write_kitty_data
     write_kitty_data
+    8.39%     0.00%  notcurses-demo  [unknown]                    [.] 0000000000000000
+    7.82%     7.81%  notcurses-demo  libswscale.so.5.7.100        [.] ff_hscale14to15_X4_ssse3.innerlo
+    4.01%     0.03%  notcurses-demo  libswscale.so.5.7.100        [.] swscale

really not much in terms of zlib, interesting

@dankamongmen
Owner Author

there it is

Samples: 113K of event 'cycles', Event count (approx.): 116203165085
  Children      Self  Command         Shared Object                  Symbol
    28.29%    28.25%  notcurses-demo  libswscale.so.5.7.100          [.] yuv2rgba32_full_X_c
-   16.80%     0.00%  notcurses-demo  [unknown]                      [k] 0000000000000000
   - 0
        7.96% kitty_blit_core
      - 0.99% 0x7f84496ea0b0
           0.98% ff_hscale14to15_X4_ssse3.innerloop
      - 0.76% 0x7f84496ea018
           0.58% rgbaToA_c
-   12.32%    12.30%  notcurses-demo  libnotcurses-core.so.2.3.11    [.] kitty_blit_core
   - 7.95% 0
        kitty_blit_core
     1.72% kitty_blit_core
   - 1.09% 0x24900000000
        kitty_blit_core
-   10.70%    10.68%  notcurses-demo  libz.so.1.2.11                 [.] deflate_slow
     10.14% deflate_slow
-    8.59%     8.58%  notcurses-demo  libswscale.so.5.7.100          [.] ff_hscale14to15_X4_ssse3.inner
   - 4.33% 0
      - 0.98% 0x7f84496ea0b0
           ff_hscale14to15_X4_ssse3.innerloop
   - 4.24% 0x616c665f73777300
        sws_context_to_name
        swscale
        ff_hscale14to15_X4_ssse3.innerloop
-    7.84%     7.83%  notcurses-demo  libz.so.1.2.11                 [.] fill_window
     fill_window
-    6.77%     6.75%  notcurses-demo  libz.so.1.2.11                 [.] longest_match
     longest_match
+    4.34%     0.03%  notcurses-demo  libswscale.so.5.7.100          [.] swscale
+    4.34%     0.00%  notcurses-demo  [unknown]                      [.] 0x616c665f73777300
+    4.34%     0.00%  notcurses-demo  libswscale.so.5.7.100          [.] sws_context_to_name
+    2.82%     2.81%  notcurses-demo  libz.so.1.2.11                 [.] adler32_z
+    2.55%     2.54%  notcurses-demo  libswscale.so.5.7.100          [.] rgbaToA_c
+    2.44%     0.00%  notcurses-demo  libavutil.so.56.51.100         [.] av_default_item_name

@dankamongmen
Owner Author

ok so this is good; it verifies that the slowdown we're seeing is indeed attributable to zlib. let's see what happens if we tighten up our usage thereof. i definitely want to have this.

@dankamongmen
Owner Author

i think we've lost a lot of parallelism in xray recently. once we get that kicked back up, we ought to be able to hide a lot of this work.

i've taken the zlib compression level down to 2 from the default (Z_DEFAULT_COMPRESSION, which maps to level 6) and recovered about half of the lost time when running locally.
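for reference, that's just the level passed at stream setup; level 2 does a much cheaper match search than the default, trading some compression ratio for CPU (sketch, names illustrative):

```c
// sketch: initialize the deflate stream at level 2 instead of the default
// (effectively 6); cheaper match search, somewhat worse ratio.
#include <zlib.h>
#include <string.h>

static int kitty_deflate_init(z_stream* z){
  memset(z, 0, sizeof(*z));            // zalloc/zfree/opaque all Z_NULL
  return deflateInit(z, 2) == Z_OK ? 0 : -1;
}
```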
