Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chafa sixel-encodes AllRGB images substantially faster than notcurses #2573

Closed
dankamongmen opened this issue Jan 23, 2022 · 85 comments
Closed
Assignees
Labels
bug Something isn't working perf sweet sweet perf
Milestone

Comments

@dankamongmen
Copy link
Owner

In the course of stunting on hpjansson/chafa#27, I discovered that chafa 1.8.0 renders AllRGB/balloon.jpg in about half the time of ncplayer on my 3970X. Representative timings:

[schwarzgerat](0) $ time chafa /media/chungus/images/allrgb/balloon.png  -f sixel > /dev/null

real    0m0.420s
user    0m0.744s
sys     0m0.057s
[schwarzgerat](0) $ 

vs

[schwarzgerat](0) $ time ~/src/dankamongmen/notcurses/build-3.0.5/ncplayer /media/chungus/images/allrgb/balloon.png  -sscale -k > /dev/null

real    0m0.774s
user    0m0.720s
sys     0m0.053s
[schwarzgerat](0) $ 

in addition, the chafa output is slightly better -- look at the purple sections, and their borders with the blue sections to their right. we have some weird green pixels in there, unlike chafa.

ncplayer:

2022-01-23-150607_883x891_scrot

chafa:

2022-01-23-150616_880x922_scrot

note that this is on amd64, where chafa AFAIK uses hand-coded SIMD. i'm not sure if that's true on other architectures, but who cares; i tolerate defeat on no platform. well, i suppose there's only one thing to do now.

oh lawd you didnt bring a victim child

THERE CAN BE ONLY ONE

@dankamongmen dankamongmen added bug Something isn't working perf sweet sweet perf labels Jan 23, 2022
@dankamongmen dankamongmen added this to the 4.0.0 milestone Jan 23, 2022
@dankamongmen dankamongmen self-assigned this Jan 23, 2022
@dankamongmen
Copy link
Owner Author

this is a case ripe for threading, which chafa makes use of. i bet tossing some in could catch us up pretty quickly. we'll then want to revalidate on small, simple images to ensure we haven't added much overhead.

@dankamongmen
Copy link
Owner Author

as for the image quality, chafa's using mincut, which is working on a 3-dimensional metric. if we expanded to three dimensions of metric, we'd eliminate this annoying tendency to prefer a close green match. that ought close the quality gap.

dank's comin' for ya

@dankamongmen
Copy link
Owner Author

also, gcc12 introduced autovectorization at O2, from which we might benefit. if we don't, it's probably time to take a look at SIMDizing this shit ourselves via intrinsics. it's always good to have competition! even better to then blow said competition out of the water.

@dankamongmen dankamongmen changed the title chafa encodes AllRGB images substantially faster than notcurses chafa sixel-encodes AllRGB images substantially faster than notcurses Jan 23, 2022
@dankamongmen
Copy link
Owner Author

i'm thinking we ought be able to do a fixed-size three-dimensional pseudometric and eliminate the green preference in constant time. if the difference is beyond first order, it's barely going to show up anyway. there's no need to search the entire space.

right now we have constant time with a single dimension by checking high and lo over the channel sum. expanding that into three dimensions would mean eight bounds instead of two, but all of them are in cache, and you could likely SIMDize the actual comparisons thus eliminating any branches. hell, we could probably do that and improve our performance as it exists simply by reducing two comparisons to a single SIMD instruction and killing the branch. hot damn, pepper your angus!

@dankamongmen
Copy link
Owner Author

looking at a single-instance run over the folder, we get:

real    0m5.382s
user    0m7.287s
sys     0m0.939s

for chafa and

real    0m8.572s
user    0m8.030s
sys     0m0.423s

for notcurses

@dankamongmen
Copy link
Owner Author

so yeah i think a good first step would be...let's see, we get 1024 RGBA pixels to a page. that's a 32x32 chunk or any equivalent rectangle. we'd want contiguous sections in memory to go to the same thread, so it just takes 1024 (or some multiple thereof, N). next thread takes the next N. you can work out the starting pixel in O(1) from your pixel offset (we'll need adapt this a bit for linestrides). threads grab chunks until it's done.

we could spin up some worker threads ahead of time maybe, upon detecting sixel support, so they're ready by the time we actually get a sixel encode request. they can be applied to any sixel encoding, even multiple concurrent ones, so long as job info is present in the work queue. there are three phases for the purposes of a concurrency graph:

  • loading qstates (parallel with random contention)
  • refinement/merging (serialized)
  • building up data table (embarrassingly parallel)

so it would be good to see how much of our time each phase is occupying. if a majority of it is refinement/merging, threads probably aren't going to get the job done (Amdahl's Law and all that). if it's 1 and 3, especially 3, they'll help very much.

i'm inclined to think 8192 or 16384 are good chunk sizes (2048 or 4096 pixels). they don't trounce the entire L1 no matter what uarch properties it has (so long as it's 32KB+). we don't want an abundance of threads--the more threads, the more contention in phase 1.

@dankamongmen
Copy link
Owner Author

then, within a thread, it would be a fine thing to vectorize at least qnode_keys(). i think that would be the most fruitful place to start applying SIMD, anyway.

@dankamongmen
Copy link
Owner Author

shifting to -O3 brought negligible gains, so i don't think autovectorization is doing much for us. looks like we'll have to get in thar ourselves.

@dankamongmen
Copy link
Owner Author

hrmmm, that's not going to be the best way to do it given how extract_color_table() currently works:

  • for each 6y
    • for each x
      • for each of 6 lines
        • do a bunch of crap with the TAM
        • call insert_color()

so we'd need parallelize the TAM crap as well. which i think we can do -- there are no carried dependencies. but we'd need arrange the workers differently; we'd want each one to do a cell at a time, or some multiple thereof.

so if we wanted to do that, i think the thing to do would be to factor out an extract_cell_color_table() which took the current args plus a cell's y/x coordinate within the image. this will update the TAM for the cell (or a NxM rectangle of cells with its origin at the specified cell) and insert all colors encountered (yes, we're sounding dangeously close to mosaics, aren't we?).

but overall this is good -- it means more of the total computation ought be parallelizable. remember, we have to do a lot more stuff than chafa, what with those other planes and such.

alright, let's get to it, you've got your marching orders!

sick of the same old thang

@dankamongmen
Copy link
Owner Author

so the first question -- can we eliminate the sixel-based loop in extract_color_table()? sixels ought not have any relevance here, and it would simplify things tremendously.

@dankamongmen
Copy link
Owner Author

i've restructured extract_color_table() to be cell-based. it's working almost perfectly; we have a weird bit of regression in the [intro] demo, but only following the first move (i.e. the orca is first drawn correctly, but on subsequent draws to the left, she has some cruft above her dorsal fin).

while doing this, i found what i thought to be a bug (lastrow is being set based on it being the last row of a sixel, not a cell), and fixed it -- was i misunderstanding something, and that's the cause of the regression? let's get this ironed out before moving on.

@dankamongmen
Copy link
Owner Author

oh ok lastrow doesn't refer to the last row of a sixel, but the last sixel of the image, duh. and firstpix doesn't refer to the first pixel of the image, but of each cell.

@dankamongmen
Copy link
Owner Author

oh, motherfucker, i bet i had the firstpix=false following a bail on transparency...yeah that's almost certainly it, motherfucker motherfucker

@dankamongmen
Copy link
Owner Author

yep, works perfectly now, w00t w00t

@dankamongmen
Copy link
Owner Author

alright, we now have an extract_color_table() which works on a per-cell basis. in addition to being suitable for parallelism, it's significantly more readable, with a full level less of looping.

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 25, 2022

hrmmm. initial profiling looks...not promising. the majority of our time seems to be spent in sixel_blit_inner() especially write_sixel_payload() (at least for the AllRGB images). which makes sense; that's handling 409640961024, an immense 2^34 operations. ack! extract_color_table(), inlined, primarily shows up as untracked sixel_blit().

(actually, we spend more time in inflate_fast() than anything, presumably due to PNG decoding)

here's a profile for covid19.jpg and aidsrobots.jpg:

+   57.90%    56.29%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload                         ◆
+   20.42%    14.92%  ncplayer  libnotcurses-core.so.3.0.5  [.] sixel_blit                                  ▒
+    6.09%     0.00%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault                             ▒
+    6.09%     0.00%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault                           ▒
+    5.78%     0.08%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault                          ▒
+    5.70%     0.00%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault                              ▒
+    5.70%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault                          ▒
+    5.40%     5.30%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep                              ▒
+    5.30%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_huge_pmd_anonymous_page                  ▒
+    4.77%     0.00%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages                               ▒
+    4.77%     0.10%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist                      ▒
+    4.73%     0.00%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma                             ▒
+    3.42%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe              ▒
+    3.42%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64                               ▒
     2.27%     2.27%  ncplayer  libc-2.33.so                [.] __vfprintf_internal         

and AllRGB:

+   37.88%    37.30%  ncplayer  libz.so.1.2.11              [.] inflate_fast
+   22.40%    21.88%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload
+    8.87%     6.60%  ncplayer  libnotcurses-core.so.3.0.5  [.] sixel_blit
+    4.06%     0.06%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault
+    4.04%     0.01%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault
+    4.01%     0.01%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault
+    3.99%     0.01%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault
+    3.99%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault
+    3.81%     0.00%  ncplayer  [unknown]                   [.] 0000000000000000
+    3.71%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_huge_pmd_anonymous_page
+    3.70%     3.66%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep
+    3.04%     0.01%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages
+    3.03%     0.04%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist
+    3.01%     0.00%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma
+    2.71%     2.65%  ncplayer  libz.so.1.2.11              [.] adler32_z
+    1.83%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe
+    1.83%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64
     1.50%     1.49%  ncplayer  libz.so.1.2.11              [.] inflate
+    1.27%     0.82%  ncplayer  libc-2.33.so                [.] __memmove_avx_unaligned_erms
+    1.19%     0.00%  ncplayer  [kernel.kallsyms]           [k] __irq_exit_rcu
+    1.19%     0.01%  ncplayer  [kernel.kallsyms]           [k] __softirqentry_text_start
+    1.08%     0.00%  ncplayer  libpthread-2.33.so          [.] __libc_write

@dankamongmen
Copy link
Owner Author

so for the former case, infinite ideal parallelism in extract_color_table() can eliminate only 20% of runtime, and in the latter case, it's an even smaller ~9%.

well, good thing we ran that. so the big wins are going to come from parallelizing write_sixel_payload(), where we could knock out ~60% of runtime in the former case and ~22% in the latter.

so, what is write_sixel_payload()? we iterate over

  • six-row sections (a row of sixels) (horizontal width in pixels)
  • for each color
  • until we've handled all six-row sections

the data is stored as vectors for each color, with each vector equal in size to the total number of sixels. cache performance ought be pretty reasonable, then -- we're reading left to right for the duration of a sixel line (one byte per sixel, so 64 pixels == cacheline). worst case is horizontal width congruent to 1 modulo 64, in which case we're doing a cacheline fill of 1.5% efficiency, per color, per sixel row. when width is congruent to 0 modulo 64, cacheline fills are 100% efficient.

we could raise that efficiency by storing by sixel row, and then by color within said sixel row. something to consider.

normally we would expect to make significant use of sixel RLE (not applicable to e.g. the AllRGB images, though). if we admitted dynamic lengths, we could pre-RLE this data. this helps us not at all in the worst case, though.

in the case where we have many colors (and thus the slowest case), it's unlikely that any given color will be in a sixel row. can we not preprocess these away? one bit per color per sixel row, set in the previous step, would allow us to skip the full width for that color. this might be a very nice, very easy pickup. try it. you could place these in their own bitvector for cache friendliness.

so my first thought would be parallelizing on sixel rows. a chunk is then width * colors data, each of size 1B. for 1024 colors, you hit an x86 page at 4 pixels wide (geez, no matter we're going through so many pages -- this is a huge structure!). so a thread has plenty of work. each writes to their own internal buffer, and then you just lay those into the fbuf. ought be perfect parallelism.

alright. between these last two ideas, i think we can get somewhere. beyond that, i think we need to just shrink this elephantine structure down. too big! 819KB per sixel row for 800 pixels across @ 1024 colors -- too mas!

@dankamongmen
Copy link
Owner Author

so if we wanted to add the 1-bit "color absent" signal, that's best done in build_data_table(), no?

@dankamongmen
Copy link
Owner Author

so you need (colors * sixel rows) bits total for this map, right? yeah. initialize it to 0. whenever you load stab->map->data, set map[sixelrow * colors + color to 1. a 1 here means we must inspect the sixel row for the color.

so you know we've really built a three-pass encoder, not a two-pass. can we not integrate these two pieces of shit?

@dankamongmen
Copy link
Owner Author

ok the actionmap proposed above is looking good initially, about a 20% pickup on AllRGB, need flush out a bug though so who knows maybe it actually makes us 30000 times slower

@dankamongmen
Copy link
Owner Author

alright, i have parallelized all this now, and check this out.

3.0.5 allrgb -a

+   28.85%    28.77%  ncplayer         libz.so.1.2.11                          ◆
+   11.97%    11.94%  ncplayer         libnotcurses-core.so.3.0.5              ▒
+    5.92%     0.00%  swapper          [kernel.kallsyms]                       ▒
+    5.92%     0.00%  swapper          [kernel.kallsyms]                       ▒
+    5.92%     0.03%  swapper          [kernel.kallsyms]                       ▒
+    5.22%     0.00%  swapper          [kernel.kallsyms]                       ▒
+    5.22%     0.06%  swapper          [kernel.kallsyms]                       ▒
+    4.77%     3.68%  ncplayer         libnotcurses-core.so.3.0.5              ▒
     3.73%     3.71%  ncplayer         libswscale.so.5.9.100                   ▒
+    3.41%     0.00%  ncplayer         [unknown]                              

3.0.5 allrgb

+   43.18%    43.15%  ncplayer  libz.so.1.2.11              [.] inflate_fast
+   18.23%    18.22%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payl
+    7.09%     5.49%  ncplayer  libnotcurses-core.so.3.0.5  [.] sixel_blit
     5.53%     5.51%  ncplayer  libswscale.so.5.9.100       [.] yuv2rgba32_full_
+    5.05%     0.00%  ncplayer  [unknown]                   [k] 0000000000000000
+    3.35%     0.07%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fau
+    3.32%     0.01%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault
+    3.30%     0.01%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_faul
+    3.27%     0.00%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault
+    3.27%     0.01%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fau
+    3.03%     3.02%  ncplayer  libz.so.1.2.11              [.] adler32_z
+    3.03%     3.02%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep

rebuild-sixels allrgb -a

+   31.32%    31.23%  ncplayer         libz.so.1.2.11
+    5.64%     0.00%  swapper          [kernel.kallsyms]
+    5.64%     0.00%  swapper          [kernel.kallsyms]
+    5.64%     0.03%  swapper          [kernel.kallsyms]
+    4.94%     0.00%  swapper          [kernel.kallsyms]
+    4.94%     0.05%  swapper          [kernel.kallsyms]
+    3.62%     0.00%  ncplayer         [unknown]
+    3.40%     3.39%  xterm            xterm
+    3.07%     0.00%  xterm            [unknown]
+    2.64%     2.64%  ncplayer         libnotcurses-core.so.3.0.5
+    2.37%     2.37%  xterm            xterm
+    2.37%     2.36%  ncplayer         libz.so.1.2.11
+    2.34%     2.34%  ncplayer         libnotcurses-core.so.3.0.5

rebuild-sixels allrgb

+   51.01%    50.95%  ncplayer  libz.so.1.2.11              [.] inflate_fast
+    5.90%     0.00%  ncplayer  [unknown]                   [.] 0000000000000000
+    4.30%     4.29%  ncplayer  libnotcurses-core.so.3.0.5  [.] build_sixel_band
+    3.82%     3.81%  ncplayer  libnotcurses-core.so.3.0.5  [.] insert_color
+    3.80%     3.80%  ncplayer  libz.so.1.2.11              [.] adler32_z
+    2.70%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64
+    2.69%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64
     2.06%     2.06%  ncplayer  libz.so.1.2.11              [.] inflate
+    1.60%     1.33%  ncplayer  libc-2.33.so                [.] __memmove_avx_un

=] =] =] our encoder has practically fallen out of the picture.

@dankamongmen
Copy link
Owner Author

@dnkl @klamonte @kaniini check this shit out. i expect to merge today, huzzah

@dankamongmen
Copy link
Owner Author

@joseluis it has been a long struggle, but finally, Victory

@dnkl
Copy link
Contributor

dnkl commented Feb 5, 2022

Hot damn!

@dankamongmen
Copy link
Owner Author

we have a safe worker engine now. committing!

@AutumnMeowMeow
Copy link

@dankamongmen Congrats so much on your release! An epic story indeed.

(My only regret to nuking the deadname account was the loss of all my reaction emojis in prior threads, including this one. :( )

@dankamongmen dankamongmen modified the milestones: 4.0.0, 3.1.0 Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working perf sweet sweet perf
Projects
None yet
Development

No branches or pull requests

3 participants