Performance regression in scrollback #1316

chrisduerr · 2018-05-17T20:25:56Z

I've just tested the scrollback PR because I wanted to make sure that scrolling is still faster than on master, and it is. However, while testing it I've noticed that there were some significant regressions comparing current master to the scrollback PR.

Test Results:

Test	Master	Scrollback
scrolling	6s	4.4s
alt-screen-random-write	8.5s	12s
scrolling-in-region	4s	13s

These tests are based on https://github.com/jwilm/alacritty.

I'm not sure why I haven't noticed this earlier, but it seems like these regressions have been present ever since the scrollback PR was created. Especially the regression in scrolling-in-region seems quite massive.

These results were captured on X11/i3/compton/amd/mesa.

The text was updated successfully, but these errors were encountered:

chrisduerr · 2018-05-17T23:22:07Z

It appears these regressions were introduced in be89dbf and are system-dependent.

According to jwilm no such performance issues can be measured on X11/i3/no compton/nvidia.

chrisduerr · 2018-05-17T23:45:40Z

It seems this strongly depends on --lines-from-bottom set in the scrolling-in-region benchmark.
Master does not change no matter what is specifed, however the scrollback branch does.

Here's another quick comparison:

Test	Result
Master	4s
Scrollback - 0 lines	6s
Scrollback - 1 line	6.5s
Scrollback - 5 lines	7.5s
Scrollback - 50 lines	22s

Important Note: It seems like using --lines-from-bottom 0 still puts one line at the bottom, so --lines-from-bottom N actually has N + 1 lines at the bottom. As a result the normal scrolling test can not be compared to scrolling-in-region with --lines-from-bottom 0.

However since the changes made in the PR had the goal to increase performance, not decrease it, even with 0 lines or 1 line, it seems like this goal was not reached.

chrisduerr · 2018-05-18T13:21:01Z

I've run callgrind on both master and the scrollback PR testing the scrolling-in-region benchmark. Here are the results:

Master: https://paste.rs/Wgq
Scrollback: https://paste.rs/1Lg

chrisduerr · 2018-05-18T16:16:51Z

Disabling branch by replacing the instruction with if false && region.start == Line(0) { fixes all performance issues with scrolling-in-region. So the culprit is probably in there.

chrisduerr · 2018-05-18T16:52:09Z

Seems like the main issues are and . Seems like just the fundamentals that cost a lot of time here, any ideas how to fix that @jwilm?

Might be possible to use the API that makes use of SIMD optimization, so we could replace the loop that swaps every single line with a single call to swap a range of lines. The only drawback is that this will not be stabilized until Rust 1.27, but that should be in the near future and might be fine.

However swap_with_slice will only work for non-overlapping ranges.

jwilm · 2018-05-30T17:19:17Z

The main problem with an optimization like swap_with_slice is that it doesn't handle the case where part of src/dst wrap through the buffer.

One thing we might consider is having a fast path swap which detects when src/dst don't wrap around underlying buffer, and a more robust swap which handles the general case. The main difference would be that in the robust case, each swap needs to compute new raw indices, whereas a fast path could just compute the limits and loop over the swapping elements.

In my testing, the degraded performance seems to only affect scrolling-in-region with a large number of fixed lines. This doesn't seem to degrade tmux performance at all in the case where there is a pane on top of another, and the top pane is scrolling. In this test, which looks like this

The performance is similar to urxvt.

	scrolling -b 10,000,000
urxvt	5.199s
alacritty	5.154s

Standard Benchmarks

For these benchmarks, I am just doing one run and taking the first result. The jitter from past experience is typically +- 0.2 seconds.

Scrolling-in-region 1

Emulates scrolling in vim or tmux (1 fixed bottom line)

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) scrolling-in-region --lines-from-bottom 1

	time (seconds)
alacritty (scrollback)	4.865
alacritty (master)	12.398
urxvt	4.716

Scrolling-in-region 2

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) scrolling-in-region --lines-from-bottom 2

Emulates scrolling in vim in tmux (2 fixed bottom lines)

	time (seconds)
alacritty (scrollback)	4.908
alacritty (master)	12.306
urxvt	4.766

Scrolling-in-region 50

This is not a common case in practice, and the fact that there's a regression here isn't a big concern, IMO.

time ~/code/vtebench/target/release/vtebench -b 10000000 -w $(tput cols) -h $(tput lines) scrolling-in-region --lines-from-bottom 50

	time (seconds)
alacritty (scrollback)	4.478
alacritty (master)	1.566
urxvt	2.202

Alt-screen random write (monochrome)

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) alt-screen-random-write

	time (seconds)
alacritty (scrollback)	2.564
alacritty (master)	2.484
urxvt	2.336

Alt-screen random write (random colors)

This is going to be the closest to a standard editor workload or tmux with multiple panes in the terminal. Text is written to random positions, and with some random styling. This is essentially what we've been optimizing for since the beginning.

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) -c alt-screen-random-write

	time (seconds)
alacritty (scrollback)	3.475
alacritty (master)	3.305
urxvt	3.682

Summary

There are minor regressions from master in most benchmarks. In the cases that matter like having 1 or 2 fixed lines at the bottom of the screen, scrollback scrolling is much faster than master.

Based on these numbers and the feel of scrollback in real use, I don't see any reason to block merging the branch. Unless of course you're still seeing dramatically different numbers.

chrisduerr · 2018-05-30T17:49:14Z

As discussed on IRC the scrollback branch results in a 60%+ slow down in alt-screen-random-write for me. Since this seems to be alacritty's biggest strength I'd be wary of introducing such a regression, however it's your call to make.

nixpulvis · 2018-05-30T18:09:26Z

@chrisduerr I'm not 100% sure how you and @jwilm are testing performance, but it seems your numbers aren't lining up. In the benchmarking results posted directly above, the numbers for alt screen random writes are very close to master.

chrisduerr · 2018-05-30T18:29:34Z

@nixpulvis I don't think there's any difference in how jwilm and I are testing performance, but rather that we use different systems. I'd love to see the results others get too. From what I can tell the main difference between jwilm's and my system currently are that I'm running amd + mesa and he's running nvidia graphics.

nixpulvis · 2018-05-30T19:04:46Z

Ok, I'll give it a shot on a few of my systems. I might need a little hand holding, I'll just find someone on IRC if I do.

nixpulvis · 2018-06-02T21:58:12Z

I just went through some debugging and performance testing with @chrisduerr on IRC. And we discovered some discrepancy between timing vtebench directly, and with timing cat some-output-of-vtebench. I've rerun all the performance tests from @jwilm's comment above.

The format of this test session is as follows.

=== name of test ===

test command (direct vtebench usage)

test commands (file buffer)

--- master ---

time output (direct vtebench usage)

time output (file buffer)

--- scrollback ---

time output (direct vtebench usage)

time output (file buffer)

I'm lazy and I don't want to format them as beautifully as he did however, sorry.

=== scrolling-in-region-1 ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 1

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 1 > scrolling-in-region-1.test
time cat scrolling-in-region-1.test

--- master ---

0.03user 65.69system 1:05.87elapsed 99%CPU (0avgtext+0avgdata 4108maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 66.42system 1:06.53elapsed 99%CPU (0avgtext+0avgdata 1872maxresident)k
0inputs+0outputs (0major+104minor)pagefaults 0swaps

--- scrollback ---

0.03user 55.04system 0:55.15elapsed 99%CPU (0avgtext+0avgdata 3940maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 54.32system 0:54.37elapsed 99%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+101minor)pagefaults 0swaps

=== scrolling-in-region-2 ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 2

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 2 > scrolling-in-region-2.test
time cat scrolling-in-region-2.test

--- master ---

0.01user 65.61system 1:05.74elapsed 99%CPU (0avgtext+0avgdata 3936maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 65.42system 1:05.51elapsed 99%CPU (0avgtext+0avgdata 1956maxresident)k
0inputs+0outputs (0major+105minor)pagefaults 0swaps

--- scrollback ---

0.02user 55.41system 0:55.52elapsed 99%CPU (0avgtext+0avgdata 3928maxresident)k
0inputs+0outputs (0major+195minor)pagefaults 0swaps

0.00user 54.56system 0:54.61elapsed 99%CPU (0avgtext+0avgdata 1696maxresident)k
0inputs+0outputs (0major+101minor)pagefaults 0swaps

=== scrolling-in-region-50 ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 50

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 50 > scrolling-in-region-50.test
time cat scrolling-in-region-50.test

--- master ---

0.02user 68.02system 1:08.13elapsed 99%CPU (0avgtext+0avgdata 3952maxresident)k
0inputs+0outputs (0major+196minor)pagefaults 0swaps

0.00user 67.96system 1:08.07elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
0inputs+0outputs (0major+107minor)pagefaults 0swaps

--- scrollback ---

0.03user 54.36system 0:54.48elapsed 99%CPU (0avgtext+0avgdata 3940maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 53.30system 0:53.37elapsed 99%CPU (0avgtext+0avgdata 1784maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

=== alt-screen-random-write ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) alt-screen-random-write

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) alt-screen-random-write > alt-screen-random-write.test
time cat alt-screen-random-write.test

--- master ---

22.32user 2.64system 0:27.36elapsed 91%CPU (0avgtext+0avgdata 4056maxresident)k
0inputs+0outputs (0major+193minor)pagefaults 0swaps

0.00user 2.65system 0:05.89elapsed 45%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+102minor)pagefaults 0swaps

--- scrollback ---

22.34user 2.13system 0:26.88elapsed 91%CPU (0avgtext+0avgdata 3992maxresident)k
0inputs+0outputs (0major+193minor)pagefaults 0swaps

0.00user 3.35system 0:10.31elapsed 32%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

=== alt-screen-random-write-color ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) -c alt-screen-random-write

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) -c alt-screen-random-write > alt-screen-random-write-color.test
time cat alt-screen-random-write-color.test

--- master ---

28.69user 2.43system 0:33.86elapsed 91%CPU (0avgtext+0avgdata 4092maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 2.76system 0:06.51elapsed 42%CPU (0avgtext+0avgdata 1868maxresident)k
0inputs+0outputs (0major+102minor)pagefaults 0swaps

--- scrollback ---

29.12user 2.26system 0:34.27elapsed 91%CPU (0avgtext+0avgdata 4004maxresident)k
0inputs+0outputs (0major+194minor)pagefaults 0swaps

0.00user 3.44system 0:09.72elapsed 35%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

This somewhat clearly shows how vtebench itself was a bottleneck for the alt-screen-random-write tests. I'll let @chrisduerr followup if he has anything else to add. It's also worth noting for anyone else following this thread, that these performance numbers are highly dependent on the screen size, and basically all of the configuration options, not to mention the host machine. So it's only relevant to compare the percentage difference between master and scrollback, and the two different methods of running. This should be somewhat obvious.

chrisduerr · 2018-06-02T22:10:29Z

Thanks @nixpulvis for the elaborate breakdown of your results. I'd be interested in seeing your results @jwilm when actually writing to a file first instead of directly timing the output of vtebench.

Based on the results from @nixpulvis it seems like I'm not the only person with these issues, however it would probably still be useful to get some system information from @nixpulvis just to make sure in case @jwilm can't reproduce this on his machine.

It might also be interesting to see some benchmarks of this on macos.

nixpulvis · 2018-06-02T22:19:09Z

Relevant system info. I use X11, running i3. My video card is a GTX 980 running nouveau drivers. Some additional info below:

uname -a
Linux pukak 4.16.9-1-ARCH #1 SMP PREEMPT Thu May 17 02:10:09 UTC 2018 x86_64 GNU/Linux

cat /proc/cpuinfo | grep "model name" | head -n 1
model name	: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz

ghost · 2018-06-02T23:07:51Z

So MacBook Pro 13 inch Iris 550 here, some weird times, discussed with chrisduerr on irc without an idea whats going on.
Tested fullscreen.

cargo run --release -- -b 50000000 -w $(tput cols) -h $(tput lines) -c alt-screen-random-write > out.vte

### scrollback
> time cat out.vte
real    0m49,175s
user    0m0,004s
sys     0m1,171s

> time cat out.vte
real    0m38,315s
user    0m0,003s
sys     0m1,094s

### master
> time cat out.vte
real    0m25,754s
user    0m0,004s
sys     0m1,125s

> time cat out.vte
real    0m24,456s
user    0m0,004s
sys     0m1,098s

### kitty // weird thing is, even Mac terminal was about as fast as this
> time cat out.vte
real    0m3,915s
user    0m0,003s
sys     0m0,401s

> time cat out.vte
real    0m4,143s
user    0m0,003s
sys     0m0,407s

some 70MB plain text file, just if that helps

scrollback
real 0m16,870s
user 0m0,004s
sys 0m2,360s

master
real 0m14,022s
user 0m0,004s
sys 0m2,129s

kitty 
real    0m27,617s
user    0m0,002s
sys     0m2,694s

chrisduerr · 2018-06-02T23:13:31Z

Thanks again for responding to my call for testing this on macos so quickly @monouser7dig.

The results seem to indicate that there was a regression in alacritty (~25s -> ~40s) which seems to be in line with the regressions measured so far (60-100%, but the difference between benchmarks is so big that this metric probably makes little sense here).

However there clearly seems to be more at play here with alacritty losing by a significant margin to both iTerm and Kitty. So not only would I take these results with a grain of salt, I'd also recommend following up in a separate issue if anyone has ideas where this could be coming from. The only thing I have in mind would be disabling transparency, otherwise I'm not sure.

@monouser7dig if you'd like to open an issue in reference to this one to explore performance issues of the current master branch of your macos machine, please feel free to do so. I'm sure there's gonna be something at play here.

ghost · 2018-06-03T08:34:49Z

The only thing I have in mind would be disabling transparency, otherwise I'm not sure.

without transparency (0.97 before)

### scrollback
time cat out.vte
real    0m34,469s
user    0m0,003s
sys     0m0,876s

### master 
time cat out.vte
real    0m24,458s
user    0m0,004s
sys     0m0,949s

chrisduerr · 2018-06-03T09:22:23Z

Transparency is always enabled on master even if opacity is 1.0. I've added a comment on how to disable transparency in #1348 so this issue will not get cluttered with unrelated information.

chrisduerr · 2018-06-08T20:41:15Z

It seems like on the scrollback branch the parser is just taxed a lot more, here's a perf result I've got:

You can see here that with master (right) the distribution of the most expensive methods is very good, however with scrollback (left) there are a few very "hot" and expensive functions.

I'm not going to provide the full perf file because it's too big to upload in a reasonable amount of time, but for the sake of reproducability, here's the command I've used to generate the data:
perf record --call-graph dwarf -- alacritty

And this should allow easy analization of the result from the command line:
perf report -p graph --no-children -i ./perf.data

The `compute_index` method in the `Storage` struct used to normalize indices was responsible for a significant amount of the CPU time spent while running the `alt-screen-random-write` benchmark (~50%). The issue with this relatively simple method was that due to how often the method is executed, the modulo operation was too expensive. However since most of the time when a majority of the buffer changes (high througput), it's in the alternate screen, the modulo operation can be skipped with a simple branch. There might be some more optimization to be done here, however this resolved all measurable issues in the `alt-sceen-random-write` benchmark. The whole `vtebench` suite hasn't been tested, however unless there were some regressions, this fixes alacritty#1316. I will benchmark this a bit more in the near future.

jwilm · 2018-07-06T14:33:43Z

You mentioned from a commit that this issue was resolved; should we close it?

chrisduerr · 2018-07-06T16:01:13Z

I'm not quite sure. There definitely is still a regression from master to scrollback when it comes to alt-screen-random-write. However the obvious stuff has been fixed.

I'd still like to keep track of places where a potential performance improvement could be achieved, but since I haven't been able to figure out how to resolve this and the performance hit is not massive, I wouldn't block scrollback on this.

So I'd either keep this open to track that or maybe open a new issue to indicate that the main issues mentioned in here have been resolved.

jwilm · 2018-09-18T00:49:12Z

I'm going to close this. Our alt-screen performance is already exceptional, and additional investment there may only return marginal improvements.

chrisduerr added the S - scrollback label May 17, 2018

chrisduerr mentioned this issue May 19, 2018

Alacritty is consistently slower than urxvt in XQuartz (macOS) #205

Open

ghost mentioned this issue Jun 3, 2018

crazy long vtebench time on macbook #1348

Closed

jedahan mentioned this issue Jun 10, 2018

Scrollback #1147

Merged

chrisduerr mentioned this issue Jun 11, 2018

Optimize indexing of the grid's raw buffer #1369

Merged

chrisduerr self-assigned this Jun 11, 2018

jwilm closed this as completed Sep 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression in scrollback #1316

Performance regression in scrollback #1316

chrisduerr commented May 17, 2018 •

edited

chrisduerr commented May 17, 2018

chrisduerr commented May 17, 2018

chrisduerr commented May 18, 2018

chrisduerr commented May 18, 2018

chrisduerr commented May 18, 2018 •

edited

jwilm commented May 30, 2018

chrisduerr commented May 30, 2018

nixpulvis commented May 30, 2018

chrisduerr commented May 30, 2018

nixpulvis commented May 30, 2018

nixpulvis commented Jun 2, 2018 •

edited

chrisduerr commented Jun 2, 2018

nixpulvis commented Jun 2, 2018

ghost commented Jun 2, 2018

chrisduerr commented Jun 2, 2018 •

edited

ghost commented Jun 3, 2018 •

edited by ghost

chrisduerr commented Jun 3, 2018

chrisduerr commented Jun 8, 2018

jwilm commented Jul 6, 2018

chrisduerr commented Jul 6, 2018

jwilm commented Sep 18, 2018

Performance regression in scrollback #1316

Performance regression in scrollback #1316

Comments

chrisduerr commented May 17, 2018 • edited

chrisduerr commented May 17, 2018

chrisduerr commented May 17, 2018

chrisduerr commented May 18, 2018

chrisduerr commented May 18, 2018

chrisduerr commented May 18, 2018 • edited

jwilm commented May 30, 2018

Standard Benchmarks

Scrolling-in-region 1

Scrolling-in-region 2

Scrolling-in-region 50

Alt-screen random write (monochrome)

Alt-screen random write (random colors)

Summary

chrisduerr commented May 30, 2018

nixpulvis commented May 30, 2018

chrisduerr commented May 30, 2018

nixpulvis commented May 30, 2018

nixpulvis commented Jun 2, 2018 • edited

chrisduerr commented Jun 2, 2018

nixpulvis commented Jun 2, 2018

ghost commented Jun 2, 2018

chrisduerr commented Jun 2, 2018 • edited

ghost commented Jun 3, 2018 • edited by ghost

chrisduerr commented Jun 3, 2018

chrisduerr commented Jun 8, 2018

jwilm commented Jul 6, 2018

chrisduerr commented Jul 6, 2018

jwilm commented Sep 18, 2018

chrisduerr commented May 17, 2018 •

edited

chrisduerr commented May 18, 2018 •

edited

nixpulvis commented Jun 2, 2018 •

edited

chrisduerr commented Jun 2, 2018 •

edited

ghost commented Jun 3, 2018 •

edited by ghost