Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in scrollback #1316

Closed
chrisduerr opened this issue May 17, 2018 · 21 comments
Closed

Performance regression in scrollback #1316

chrisduerr opened this issue May 17, 2018 · 21 comments
Assignees

Comments

@chrisduerr
Copy link
Member

chrisduerr commented May 17, 2018

I've just tested the scrollback PR because I wanted to make sure that scrolling is still faster than on master, and it is. However, while testing it I've noticed that there were some significant regressions comparing current master to the scrollback PR.

Test Results:

Test Master Scrollback
scrolling 6s 4.4s
alt-screen-random-write 8.5s 12s
scrolling-in-region 4s 13s

These tests are based on https://github.com/jwilm/alacritty.

I'm not sure why I haven't noticed this earlier, but it seems like these regressions have been present ever since the scrollback PR was created. Especially the regression in scrolling-in-region seems quite massive.

These results were captured on X11/i3/compton/amd/mesa.

@chrisduerr
Copy link
Member Author

It appears these regressions were introduced in be89dbf and are system-dependent.

According to jwilm no such performance issues can be measured on X11/i3/no compton/nvidia.

@chrisduerr
Copy link
Member Author

It seems this strongly depends on --lines-from-bottom set in the scrolling-in-region benchmark.
Master does not change no matter what is specifed, however the scrollback branch does.

Here's another quick comparison:

Test Result
Master 4s
Scrollback - 0 lines 6s
Scrollback - 1 line 6.5s
Scrollback - 5 lines 7.5s
Scrollback - 50 lines 22s

Important Note: It seems like using --lines-from-bottom 0 still puts one line at the bottom, so --lines-from-bottom N actually has N + 1 lines at the bottom. As a result the normal scrolling test can not be compared to scrolling-in-region with --lines-from-bottom 0.

However since the changes made in the PR had the goal to increase performance, not decrease it, even with 0 lines or 1 line, it seems like this goal was not reached.

@chrisduerr
Copy link
Member Author

I've run callgrind on both master and the scrollback PR testing the scrolling-in-region benchmark. Here are the results:

Master: https://paste.rs/Wgq
Scrollback: https://paste.rs/1Lg

@chrisduerr
Copy link
Member Author

Disabling this branch by replacing the instruction with if false && region.start == Line(0) { fixes all performance issues with scrolling-in-region. So the culprit is probably in there.

@chrisduerr
Copy link
Member Author

chrisduerr commented May 18, 2018

Seems like the main issues are this and this. Seems like just the fundamentals that cost a lot of time here, any ideas how to fix that @jwilm?

Might be possible to use the swap_with_slice API that makes use of SIMD optimization, so we could replace the loop that swaps every single line with a single call to swap a range of lines. The only drawback is that this will not be stabilized until Rust 1.27, but that should be in the near future and might be fine.

However swap_with_slice will only work for non-overlapping ranges.

@jwilm
Copy link
Contributor

jwilm commented May 30, 2018

The main problem with an optimization like swap_with_slice is that it doesn't handle the case where part of src/dst wrap through the buffer.

One thing we might consider is having a fast path swap which detects when src/dst don't wrap around underlying buffer, and a more robust swap which handles the general case. The main difference would be that in the robust case, each swap needs to compute new raw indices, whereas a fast path could just compute the limits and loop over the swapping elements.

In my testing, the degraded performance seems to only affect scrolling-in-region with a large number of fixed lines. This doesn't seem to degrade tmux performance at all in the case where there is a pane on top of another, and the top pane is scrolling. In this test, which looks like this

image

The performance is similar to urxvt.

scrolling -b 10,000,000
urxvt 5.199s
alacritty 5.154s

Standard Benchmarks

For these benchmarks, I am just doing one run and taking the first result. The jitter from past experience is typically +- 0.2 seconds.

Scrolling-in-region 1

Emulates scrolling in vim or tmux (1 fixed bottom line)

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) scrolling-in-region --lines-from-bottom 1
time (seconds)
alacritty (scrollback) 4.865
alacritty (master) 12.398
urxvt 4.716

Scrolling-in-region 2

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) scrolling-in-region --lines-from-bottom 2

Emulates scrolling in vim in tmux (2 fixed bottom lines)

time (seconds)
alacritty (scrollback) 4.908
alacritty (master) 12.306
urxvt 4.766

Scrolling-in-region 50

This is not a common case in practice, and the fact that there's a regression here isn't a big concern, IMO.

time ~/code/vtebench/target/release/vtebench -b 10000000 -w $(tput cols) -h $(tput lines) scrolling-in-region --lines-from-bottom 50
time (seconds)
alacritty (scrollback) 4.478
alacritty (master) 1.566
urxvt 2.202

Alt-screen random write (monochrome)

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) alt-screen-random-write
time (seconds)
alacritty (scrollback) 2.564
alacritty (master) 2.484
urxvt 2.336

Alt-screen random write (random colors)

This is going to be the closest to a standard editor workload or tmux with multiple panes in the terminal. Text is written to random positions, and with some random styling. This is essentially what we've been optimizing for since the beginning.

time ~/code/vtebench/target/release/vtebench -b 50000000 -w $(tput cols) -h $(tput lines) -c alt-screen-random-write
time (seconds)
alacritty (scrollback) 3.475
alacritty (master) 3.305
urxvt 3.682

Summary

There are minor regressions from master in most benchmarks. In the cases that matter like having 1 or 2 fixed lines at the bottom of the screen, scrollback scrolling is much faster than master.

Based on these numbers and the feel of scrollback in real use, I don't see any reason to block merging the branch. Unless of course you're still seeing dramatically different numbers.

@chrisduerr
Copy link
Member Author

As discussed on IRC the scrollback branch results in a 60%+ slow down in alt-screen-random-write for me. Since this seems to be alacritty's biggest strength I'd be wary of introducing such a regression, however it's your call to make.

@nixpulvis
Copy link
Contributor

@chrisduerr I'm not 100% sure how you and @jwilm are testing performance, but it seems your numbers aren't lining up. In the benchmarking results posted directly above, the numbers for alt screen random writes are very close to master.

@chrisduerr
Copy link
Member Author

@nixpulvis I don't think there's any difference in how jwilm and I are testing performance, but rather that we use different systems. I'd love to see the results others get too. From what I can tell the main difference between jwilm's and my system currently are that I'm running amd + mesa and he's running nvidia graphics.

@nixpulvis
Copy link
Contributor

Ok, I'll give it a shot on a few of my systems. I might need a little hand holding, I'll just find someone on IRC if I do.

@nixpulvis
Copy link
Contributor

nixpulvis commented Jun 2, 2018

I just went through some debugging and performance testing with @chrisduerr on IRC. And we discovered some discrepancy between timing vtebench directly, and with timing cat some-output-of-vtebench. I've rerun all the performance tests from @jwilm's comment above.

The format of this test session is as follows.

=== name of test ===

test command (direct vtebench usage)

test commands (file buffer)

--- master ---

time output (direct vtebench usage)

time output (file buffer)

--- scrollback ---

time output (direct vtebench usage)

time output (file buffer)

I'm lazy and I don't want to format them as beautifully as he did however, sorry.

=== scrolling-in-region-1 ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 1

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 1 > scrolling-in-region-1.test
time cat scrolling-in-region-1.test

--- master ---

0.03user 65.69system 1:05.87elapsed 99%CPU (0avgtext+0avgdata 4108maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 66.42system 1:06.53elapsed 99%CPU (0avgtext+0avgdata 1872maxresident)k
0inputs+0outputs (0major+104minor)pagefaults 0swaps

--- scrollback ---

0.03user 55.04system 0:55.15elapsed 99%CPU (0avgtext+0avgdata 3940maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 54.32system 0:54.37elapsed 99%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+101minor)pagefaults 0swaps

=== scrolling-in-region-2 ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 2

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 2 > scrolling-in-region-2.test
time cat scrolling-in-region-2.test

--- master ---

0.01user 65.61system 1:05.74elapsed 99%CPU (0avgtext+0avgdata 3936maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 65.42system 1:05.51elapsed 99%CPU (0avgtext+0avgdata 1956maxresident)k
0inputs+0outputs (0major+105minor)pagefaults 0swaps

--- scrollback ---

0.02user 55.41system 0:55.52elapsed 99%CPU (0avgtext+0avgdata 3928maxresident)k
0inputs+0outputs (0major+195minor)pagefaults 0swaps

0.00user 54.56system 0:54.61elapsed 99%CPU (0avgtext+0avgdata 1696maxresident)k
0inputs+0outputs (0major+101minor)pagefaults 0swaps

=== scrolling-in-region-50 ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 50

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) scrolling-in-region --lines-from-bottom 50 > scrolling-in-region-50.test
time cat scrolling-in-region-50.test

--- master ---

0.02user 68.02system 1:08.13elapsed 99%CPU (0avgtext+0avgdata 3952maxresident)k
0inputs+0outputs (0major+196minor)pagefaults 0swaps

0.00user 67.96system 1:08.07elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
0inputs+0outputs (0major+107minor)pagefaults 0swaps

--- scrollback ---

0.03user 54.36system 0:54.48elapsed 99%CPU (0avgtext+0avgdata 3940maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 53.30system 0:53.37elapsed 99%CPU (0avgtext+0avgdata 1784maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

=== alt-screen-random-write ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) alt-screen-random-write

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) alt-screen-random-write > alt-screen-random-write.test
time cat alt-screen-random-write.test

--- master ---

22.32user 2.64system 0:27.36elapsed 91%CPU (0avgtext+0avgdata 4056maxresident)k
0inputs+0outputs (0major+193minor)pagefaults 0swaps

0.00user 2.65system 0:05.89elapsed 45%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+102minor)pagefaults 0swaps

--- scrollback ---

22.34user 2.13system 0:26.88elapsed 91%CPU (0avgtext+0avgdata 3992maxresident)k
0inputs+0outputs (0major+193minor)pagefaults 0swaps

0.00user 3.35system 0:10.31elapsed 32%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

=== alt-screen-random-write-color ===

time ~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) -c alt-screen-random-write

~/Code/vtebench/target/release/vtebench --term xterm -b 500000000 -w (tput cols) -h (tput lines) -c alt-screen-random-write > alt-screen-random-write-color.test
time cat alt-screen-random-write-color.test

--- master ---

28.69user 2.43system 0:33.86elapsed 91%CPU (0avgtext+0avgdata 4092maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

0.00user 2.76system 0:06.51elapsed 42%CPU (0avgtext+0avgdata 1868maxresident)k
0inputs+0outputs (0major+102minor)pagefaults 0swaps

--- scrollback ---

29.12user 2.26system 0:34.27elapsed 91%CPU (0avgtext+0avgdata 4004maxresident)k
0inputs+0outputs (0major+194minor)pagefaults 0swaps

0.00user 3.44system 0:09.72elapsed 35%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

This somewhat clearly shows how vtebench itself was a bottleneck for the alt-screen-random-write tests. I'll let @chrisduerr followup if he has anything else to add. It's also worth noting for anyone else following this thread, that these performance numbers are highly dependent on the screen size, and basically all of the configuration options, not to mention the host machine. So it's only relevant to compare the percentage difference between master and scrollback, and the two different methods of running. This should be somewhat obvious.

@chrisduerr
Copy link
Member Author

Thanks @nixpulvis for the elaborate breakdown of your results. I'd be interested in seeing your results @jwilm when actually writing to a file first instead of directly timing the output of vtebench.

Based on the results from @nixpulvis it seems like I'm not the only person with these issues, however it would probably still be useful to get some system information from @nixpulvis just to make sure in case @jwilm can't reproduce this on his machine.

It might also be interesting to see some benchmarks of this on macos.

@nixpulvis
Copy link
Contributor

Relevant system info. I use X11, running i3. My video card is a GTX 980 running nouveau drivers. Some additional info below:

uname -a
Linux pukak 4.16.9-1-ARCH #1 SMP PREEMPT Thu May 17 02:10:09 UTC 2018 x86_64 GNU/Linux

cat /proc/cpuinfo | grep "model name" | head -n 1
model name	: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz

@ghost
Copy link

ghost commented Jun 2, 2018

So MacBook Pro 13 inch Iris 550 here, some weird times, discussed with chrisduerr on irc without an idea whats going on.
Tested fullscreen.

cargo run --release -- -b 50000000 -w $(tput cols) -h $(tput lines) -c alt-screen-random-write > out.vte

### scrollback
> time cat out.vte
real    0m49,175s
user    0m0,004s
sys     0m1,171s

> time cat out.vte
real    0m38,315s
user    0m0,003s
sys     0m1,094s

### master
> time cat out.vte
real    0m25,754s
user    0m0,004s
sys     0m1,125s

> time cat out.vte
real    0m24,456s
user    0m0,004s
sys     0m1,098s

### kitty // weird thing is, even Mac terminal was about as fast as this
> time cat out.vte
real    0m3,915s
user    0m0,003s
sys     0m0,401s

> time cat out.vte
real    0m4,143s
user    0m0,003s
sys     0m0,407s

some 70MB plain text file, just if that helps

scrollback
real 0m16,870s
user 0m0,004s
sys 0m2,360s

master
real 0m14,022s
user 0m0,004s
sys 0m2,129s

kitty 
real    0m27,617s
user    0m0,002s
sys     0m2,694s

@chrisduerr
Copy link
Member Author

chrisduerr commented Jun 2, 2018

Thanks again for responding to my call for testing this on macos so quickly @monouser7dig.

The results seem to indicate that there was a regression in alacritty (~25s -> ~40s) which seems to be in line with the regressions measured so far (60-100%, but the difference between benchmarks is so big that this metric probably makes little sense here).

However there clearly seems to be more at play here with alacritty losing by a significant margin to both iTerm and Kitty. So not only would I take these results with a grain of salt, I'd also recommend following up in a separate issue if anyone has ideas where this could be coming from. The only thing I have in mind would be disabling transparency, otherwise I'm not sure.

@monouser7dig if you'd like to open an issue in reference to this one to explore performance issues of the current master branch of your macos machine, please feel free to do so. I'm sure there's gonna be something at play here.

@ghost
Copy link

ghost commented Jun 3, 2018

The only thing I have in mind would be disabling transparency, otherwise I'm not sure.

without transparency (0.97 before)

### scrollback
time cat out.vte
real    0m34,469s
user    0m0,003s
sys     0m0,876s
### master 
time cat out.vte
real    0m24,458s
user    0m0,004s
sys     0m0,949s

@chrisduerr
Copy link
Member Author

Transparency is always enabled on master even if opacity is 1.0. I've added a comment on how to disable transparency in #1348 so this issue will not get cluttered with unrelated information.

@chrisduerr
Copy link
Member Author

It seems like on the scrollback branch the parser is just taxed a lot more, here's a perf result I've got:

You can see here that with master (right) the distribution of the most expensive methods is very good, however with scrollback (left) there are a few very "hot" and expensive functions.

I'm not going to provide the full perf file because it's too big to upload in a reasonable amount of time, but for the sake of reproducability, here's the command I've used to generate the data:
perf record --call-graph dwarf -- alacritty

And this should allow easy analization of the result from the command line:
perf report -p graph --no-children -i ./perf.data

@jedahan jedahan mentioned this issue Jun 10, 2018
chrisduerr added a commit to chrisduerr/alacritty that referenced this issue Jun 11, 2018
The `compute_index` method in the `Storage` struct used to normalize
indices was responsible for a significant amount of the CPU time spent
while running the `alt-screen-random-write` benchmark (~50%).

The issue with this relatively simple method was that due to how often
the method is executed, the modulo operation was too expensive. However
since most of the time when a majority of the buffer changes (high
througput), it's in the alternate screen, the modulo operation can be
skipped with a simple branch. There might be some more optimization to
be done here, however this resolved all measurable issues in the
`alt-sceen-random-write` benchmark.

The whole `vtebench` suite hasn't been tested, however unless there were
some regressions, this fixes alacritty#1316. I will benchmark this a bit more in
the near future.
@chrisduerr chrisduerr self-assigned this Jun 11, 2018
@jwilm
Copy link
Contributor

jwilm commented Jul 6, 2018

You mentioned from a commit that this issue was resolved; should we close it?

@chrisduerr
Copy link
Member Author

I'm not quite sure. There definitely is still a regression from master to scrollback when it comes to alt-screen-random-write. However the obvious stuff has been fixed.

I'd still like to keep track of places where a potential performance improvement could be achieved, but since I haven't been able to figure out how to resolve this and the performance hit is not massive, I wouldn't block scrollback on this.

So I'd either keep this open to track that or maybe open a new issue to indicate that the main issues mentioned in here have been resolved.

@jwilm
Copy link
Contributor

jwilm commented Sep 18, 2018

I'm going to close this. Our alt-screen performance is already exceptional, and additional investment there may only return marginal improvements.

@jwilm jwilm closed this as completed Sep 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants