Architecture for less iterations over the library #1112

Open · southerntofu opened this issue Jul 31, 2020 · 12 comments

@southerntofu (Contributor)
This is not exactly a bug report, but while investigating site::load() I realized that the pages/sections collections get iterated over many times. For a simple build, it iterates over:

  • sections (loop, l. 174)
  • pages (loop, l. 182)
  • sections and pages (library.check_for_path_collisions(), l. 197)
  • pages (library.populate_taxonomies(), l. 205)
  • sections and pages (self.populate_sections(), l. 207; 5 times for sections and 2 for pages because of other calls inside this function)
  • sections and pages (self.render_markdown(), l. 208; 2 times for pages)
  • sections and pages (check_internal_links_with_anchors(), l. 212)
  • sections and pages (check_external_links(), l. 215, only in check mode)

Without check mode, that's a total of 9 iterations over sections and 8 iterations over pages, and that only counts iterations after the library is built. There may be some compiler magic behind the scenes that fuses some of these iterations, but my intuition is that there are performance improvements worth investigating here.

I think some of the redundant iterations come from the library lacking structure because it is built from a glob pattern (children pages have to be reunited with their sections, and the content hierarchy rebuilt afterwards). Maybe a recursive descent into the content folder could help here?
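For illustration, here is a minimal sketch of what fusing several of those per-page passes into a single loop could look like. The types and function names are hypothetical stand-ins, not Zola's actual API:

```rust
use std::collections::HashSet;

// Illustrative stand-ins for Zola's real types.
struct Page {
    path: String,
    raw_content: String,
}

struct Library {
    pages: Vec<Page>,
}

// One pass over the pages instead of one full pass per concern.
fn single_pass(library: &mut Library) -> Result<(), String> {
    let mut seen_paths = HashSet::new();
    for page in &mut library.pages {
        // 1. Path collision check (previously its own loop).
        if !seen_paths.insert(page.path.clone()) {
            return Err(format!("path collision: {}", page.path));
        }
        // 2. Taxonomy population, markdown rendering, link collection, etc.
        //    would each become a per-page step here instead of a separate
        //    library-wide loop.
        let _ = &page.raw_content;
    }
    Ok(())
}
```

The caveat is that some steps genuinely need the whole library populated first (e.g. rendering that resolves links to other pages), so not every pass can be fused this way.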

@Keats commented Jul 31, 2020 (Collaborator)

We can probably trim that number down, but I don't think it will make a significant difference in speed.
The main speed bottleneck is serialising the content to JSON for Tera, which is a much bigger overhead than iterating. The other bottleneck is syntax highlighting, because it essentially runs many, many regexes and roughly doubles the build time when enabled.

Of course, if we can improve that without making the code harder to understand, I would be very interested.

@southerntofu (Contributor, Author)

> The main speed bottleneck is serialising the content to JSON for Tera

What's the best way to get good data on this? I tried running the benchmarks, but they don't really help me understand the bottlenecks. So I ran flamegraph on building medium-blog and this was the result. I've never done profiling before, so I'm not entirely sure what's happening here, but rayon appears to take up quite a lot of space in there.

[flamegraph: zola_flame]

> Of course, if we can improve that without making the code harder to understand, I would be very interested.

It's entirely possible, but it will take some time. Let's not hurry too much ;). In the meantime, are benchmarks run periodically? It would be nice to have benchmarks on every commit/PR. Is it easy to achieve with the current pipelines?

@Keats commented Aug 1, 2020 (Collaborator)

Try disabling rayon (RAYON_NUM_THREADS=1 as an env var should work) before profiling to get rid of all the rayon stuff in the profile. I haven't profiled Zola for a while though; I'll need to do a small write-up of how I do it for reference.
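As an aside, the same effect can be forced from code. This is a small sketch using rayon's global thread pool builder (a real rayon API, though wiring it into Zola's main this way is hypothetical); it is handy when profiling so the worker threads don't dominate the flamegraph:

```rust
fn main() {
    // Equivalent to RAYON_NUM_THREADS=1: a single-threaded global pool.
    // Must run before any parallel iterator is used.
    rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build_global()
        .expect("failed to build global rayon pool");

    // ... run the build being profiled ...
}
```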

> In the meantime, are benchmarks run periodically?

Not really. The generated benchmarks are just there as a sanity check before making big changes and to give a rough idea of the speed. It would be nice to have them run on CI though.

@southerntofu (Contributor, Author)

With 1 thread, for medium-blog:

[flamegraph: flamegraph_medium_1thread]

For huge-blog (still 1 thread):

[flamegraph: flamegraph_huge_1thread]

As we can see, the bigger the site, the bigger the share of the profile taken up by rayon/Vec operations. This is confirmed by timing the build of both sites and computing the average number of pages rendered per second: the medium site built 250 pages in 0.56s (446 pages/s), while the huge site built 10000 pages in 265s (38 pages/s), more than 10 times slower per page.

I may be wrong, but I don't think serialization is the bottleneck here, because it should have linear complexity. AFAIK good map/vector algorithms operate in O(n log n), but running many such operations over the whole library could put us closer to O(n² log n). I'll investigate this some more.

@Keats commented Aug 5, 2020 (Collaborator)

> I may be wrong, but I don't think serialization is the bottleneck here, because it should have linear complexity

It's possible! I haven't benchmarked in a long time, and things may have changed since I last did.

@southerntofu (Contributor, Author)

So after running some more benchmarks, I believe the numbers I had were completely wrong, skewed by swapping. Disabling swap (sudo swapoff -a) and re-running the benchmarks produced much better and very linear results. So there may be optimizations to make here, but not remotely as many as I thought, my bad!

I wrote a simple script which runs 3 passes for each number of sections between MIN and MAX (incrementing by INC) and prints the average build time over those 3 passes. Each section contains 10 pages. For example, ./grow.sh 10 100 10 will print the average build times for 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 sections (which is 10 times as many articles).

I'm happy with it as a benchmarking tool. I'm just not sure it should be included in the repo, because it would add bash as a dependency for running benchmarks (there are a few bashisms in the script). Maybe gen.py and grow.sh should be rewritten in Rust? Or in plain, portable shell?
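If a Rust rewrite is the way to go, here is a rough sketch of what the grow.sh side could look like. The site-generation step (gen.py's job) is elided, and the paths and arguments are illustrative; only the `zola build` command itself is real:

```rust
use std::process::Command;
use std::time::Instant;

fn main() {
    // Mirrors `./grow.sh 10 100 10`: 10 to 100 sections, step 10.
    let (min, max, inc) = (10usize, 100, 10);
    for sections in (min..=max).step_by(inc) {
        // A real version would regenerate the test site here with
        // `sections` sections of 10 pages each (what gen.py does).
        let mut total = 0.0;
        for _ in 0..3 {
            let start = Instant::now();
            let status = Command::new("zola")
                .arg("build")
                .status()
                .expect("failed to run zola");
            assert!(status.success(), "zola build failed");
            total += start.elapsed().as_secs_f64();
        }
        println!("{} sections: avg {:.3}s over 3 passes", sections, total / 3.0);
    }
}
```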

Here's the benchmark data for further reference. Times are in seconds, averaged over 3 passes and rounded to 3 decimals:

Pages built | Multithread | Single thread
--- | --- | ---
100 | 0.477 | 0.796
200 | 0.105 | 0.182
300 | 0.155 | 0.330
400 | 0.203 | 0.358
500 | 0.255 | 0.446
600 | 0.318 | 0.537
700 | 0.353 | 0.627
800 | 0.445 | 0.759
900 | 0.460 | 0.803
1000 | 0.504 | 0.883
1100 | 0.578 | 1.016
1200 | 0.606 | 1.133
1300 | 0.661 | 1.200
1400 | 0.741 | 1.200
1500 | 0.755 | 1.333
1600 | 0.855 | 1.467
1700 | 1.044 | 1.500
1800 | 0.913 | 1.600
1900 | 0.972 | 1.700
2000 | 1.033 | 1.833
2100 | 1.167 | 1.867
2200 | 1.167 | 2.067
2300 | 1.167 | 2.133
2400 | 1.233 | 2.200
2500 | 1.267 | 2.300
2600 | 1.333 | 2.400
2700 | 1.367 | 2.600
2800 | 1.500 | 2.500
2900 | 1.533 | 2.667
3000 | 1.567 | 2.800
3100 | 1.667 | 2.833
3200 | 1.667 | 2.867
3300 | 1.800 | 3.033
3400 | 1.767 | 3.067
3500 | 1.867 | 3.167
3600 | 1.967 | 3.200
3700 | 2.000 | 3.267
3800 | 2.033 | 3.467
3900 | 2.133 | 3.600
4000 | 2.167 | 3.667
4100 | 2.233 | 3.767
4200 | 2.133 | 3.833
4300 | 2.200 | 3.933
4400 | 2.467 | 4.033
4500 | 2.400 | 4.233
4600 | 2.400 | 4.167
4700 | 2.500 | 4.367
4800 | 2.567 | 4.967
4900 | 2.633 | 4.733
5000 | 2.833 | 4.700
5100 | 2.800 | 4.567
5200 | 2.933 | 5.000
5300 | 2.933 | 4.867
5400 | 2.833 | 5.167
5500 | 2.967 | 5.267
5600 | 3.033 | 5.167
5700 | 3.133 | 5.000
5800 | 3.267 | 5.100
5900 | 3.300 | 5.467
6000 | 3.167 | 5.467
6100 | 3.200 | 5.633
6200 | 3.367 | 5.733
6300 | 3.467 | 5.800
6400 | 3.500 | 5.800
6500 | 3.733 | 6.067
6600 | 3.667 | 5.967
6700 | 3.667 | 5.967
6800 | 3.700 | 6.100
6900 | 3.767 | 6.033
7000 | 3.900 | 6.200
7100 | 3.900 | 6.567
7200 | 4.067 | 6.400
7300 | 4.200 | 6.900
7400 | 4.033 | 7.000
7500 | 4.133 | 7.033
7600 | 4.333 | 7.233
7700 | 4.233 | 7.133
7800 | 4.600 | 7.733
7900 | 4.167 | 7.433
8000 | 4.700 | 7.300
8100 | 4.500 | 7.467
8200 | 4.967 | 7.700
8300 | 5.033 | 7.600
8400 | 5.067 | 7.767
8500 | 5.100 | 7.767
8600 | 5.233 | 7.967
8700 | 5.433 | 8.233
8800 | 5.433 | 8.633
8900 | 5.233 | 8.300
9000 | 5.467 | 8.433
9100 | 5.400 | 8.233
9200 | 5.533 | 8.667
9300 | 5.467 | 8.567
9400 | 5.533 | 8.900
9500 | 5.433 | 8.633
9600 | 5.600 | 8.767
9700 | 5.667 | 9.000
9800 | 5.933 | 8.900
9900 | 6.067 | 8.833
10000 | 5.900 | 9.033

@Keats commented Aug 7, 2020 (Collaborator)

Some bash scripts would be fine, but I would rely on hyperfine just to have warmup and everything built in.
Something to consider if you want to benchmark Zola itself: disable syntax highlighting. It probably accounts for around half of those runtimes.
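For reference, in Zola versions from around that time syntax highlighting is controlled by a top-level flag in config.toml (in later releases it moved under a [markdown] section), so a benchmark site would want something like:

```toml
# config.toml: keep syntax highlighting off when benchmarking Zola itself
highlight_code = false
```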

@savente93 (Contributor)

Already addressed somewhat in #1218. Looking through the code, I think it would be non-trivial but fairly doable to pull the path collision checks and taxonomy population into the bigger loop to save a few iterations. This would mean moving some of the checking work to the insertion stage, which may or may not be faster; I'm not sure.
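To make the insertion-stage idea concrete, here is a minimal sketch; `Library` and `insert_page` are illustrative stand-ins rather than Zola's real types:

```rust
use std::collections::HashMap;

// Map each output path to the source file that claimed it, and check
// collisions as pages are inserted instead of in a later full pass.
#[derive(Default)]
struct Library {
    claimed_paths: HashMap<String, String>,
}

impl Library {
    fn insert_page(&mut self, output_path: &str, source: &str) -> Result<(), String> {
        if let Some(existing) = self.claimed_paths.get(output_path) {
            return Err(format!(
                "{} and {} both output to {}",
                existing, source, output_path
            ));
        }
        self.claimed_paths
            .insert(output_path.to_string(), source.to_string());
        Ok(())
    }
}
```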

Everything needs to be populated before we can start rendering, so that will almost certainly have to be a separate loop, though I'm willing to be corrected on that. Since rendering needs to be its own loop, the best we can do after that is combine the internal and external link checking. That would bring us down to 3 loops in the code, though I'm not sure how many iterations that would actually leave us with.

I could look at it if I have some time in the next few weeks, but I wanted to check first how much demand there still is for this, and whether there are any objections to any of the things I mentioned.

@Keats commented Dec 19, 2020 (Collaborator)

Performance improvements are always welcome!
It would be great to have the perf breakdown details from @southerntofu or someone else analysing a Zola build. I wouldn't be surprised if there were still some low-hanging fruit.

@savente93 (Contributor)

So I have a few things about this issue that are worth discussing:

  1. The first two iterations mentioned by OP (l. 174 & l. 182) were already unified in "Make sections draftable" #1218.
  2. "move translation population and path collision checking to insert stage" #1272 pulls the path collision check into that loop as well. Additionally, I unified a few loops in the populate_sections function, which might also help.

I'm not 100% sure since the program flow is a bit hard to understand at times, but if we are willing to keep a few more index-like objects in the library (like reverse_aliases) then we could unify much more, for example:

  1. populate_taxonomies could effectively also be moved to the insertion stage
  2. if we were willing to keep a similar index for weights/dates, we could perform an insertion sort on the pages instead of sorting them after the fact, which seems like a big win (see the sketch after this list)
  3. As I understand it, the only reason the link checking happens after the rendering stage is that the anchors are only constructed during rendering. With this more in-place architecture, I think it would be possible to move the anchor construction to the insertion stage and unify those loops.
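Here is a rough sketch of the kind of index point 2 describes, kept sorted at insertion time; `PageKey` stands in for the slotmap key the library actually uses, and the index itself is hypothetical:

```rust
// Hypothetical stand-in for the slotmap key type used by the library.
type PageKey = usize;

// (weight, key) pairs kept sorted by weight at all times.
#[derive(Default)]
struct SortedByWeight {
    entries: Vec<(usize, PageKey)>,
}

impl SortedByWeight {
    fn insert(&mut self, weight: usize, key: PageKey) {
        // O(log n) to find the slot, O(n) to shift elements: cheap for a
        // single insert, and the index never needs a separate sort pass,
        // which is also what an incremental `zola serve` could exploit.
        let pos = self
            .entries
            .binary_search_by(|(w, _)| w.cmp(&weight))
            .unwrap_or_else(|p| p);
        self.entries.insert(pos, (weight, key));
    }
}
```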

The thing is that these changes will require more memory, since we have to keep those indexes alive for the whole build. Personally, I think this is fine, but I didn't want to put time into a PR that might be rejected for that reason, so I want to check here first. One upside of this idea is that serve might gain a lot of performance from an architecture like this, since we wouldn't have to rebuild all of the structures every time an update is made (which I think is what happens at the moment? I'm willing to be corrected on this). What would the chances be of such a PR getting merged?

I don't think it's possible to unify beyond that unless we find a better way of lazily loading page & section DefaultKeys, but that would be quite the operation, and I'm not comfortable doing that on my own.

@savente93 (Contributor)

@Keats, thoughts? Would that be an acceptable tradeoff?

@Keats commented Jan 6, 2021 (Collaborator)

I think 1 can be done, but I believe rendering is by far the biggest part of the runtime, so while it's nice to optimise insertion, it won't be as impactful as, for example, making taxonomy rendering faster (by far the slowest part of Zola at the moment, I think).
For 2, I don't think sorting is that much of an overhead - it should be pretty fast.
For 3, I'm not 100% sure I understand: how can you know if an anchor is valid before being sure you've processed everything?
