perf: allocate Normalize buffers dynamically, store in AT#803

Open
jodavies wants to merge 1 commit into form-dev:master from jodavies:user-normsize

Conversation

@jodavies
Collaborator

@jodavies jodavies commented Feb 26, 2026

Normalize is always a large contributor to FORM's run time. Profiling reveals that the large stack allocations in this function are costly: since NORMSIZE is 1000, they total ~90KB.

I tried running the usual benchmarks with NORMSIZE set to 100, which is sufficient for these tests, and the performance difference is rather large.

Since we can't just reduce NORMSIZE without (in principle) breaking user scripts which have very complicated terms, I experimented a bit:

  • using dynamic allocations slows things down significantly
  • using permanent arrays in, say, AT doesn't work: I think Normalize is called recursively
  • using part of the WorkSpace, as commentary already suggested, works in principle but breaks various test suite tests which have a tight workspace constraint

I decided that one option is to just make NORMSIZE a user-controlled parameter, with default 1000. Then nothing will break, and users can experiment with making it smaller to speed up their scripts. This PR implements this by adding a "NormSize" setup parameter and using Variable Length Arrays (VLAs) in Normalize. VLAs are mandatory in C99 but optional in C11.
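As a rough illustration of the VLA approach described above, here is a minimal sketch. It is not FORM's Normalize: `norm_size` stands in for the value that a "NormSize" setup parameter would supply, and `partition_term` is a hypothetical function invented for this example. It only shows that the scratch arrays stay on the stack while their length is chosen at run time.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __STDC_NO_VLA__
#error "This sketch relies on VLAs, which are optional in C11"
#endif

/* Hypothetical stand-in for the value read from a "NormSize" setup parameter. */
static size_t norm_size = 1000;

/* Copy the even entries of `term` to the front of `out` and the odd entries
 * behind them. Returns the number of even entries, or -1 if the term is
 * longer than the scratch buffers. `out` must hold at least `len` entries. */
long partition_term(const int *term, size_t len, int *out)
{
    assert(norm_size > 0);            /* a zero-length VLA is undefined behaviour */
    if (len > norm_size) return -1;   /* refuse instead of overrunning the stack */

    /* Variable Length Arrays: sized at run time, still allocated on the stack. */
    int evens[norm_size];
    int odds[norm_size];
    size_t n_even = 0, n_odd = 0;

    for (size_t i = 0; i < len; i++) {
        if (term[i] % 2 == 0) evens[n_even++] = term[i];
        else                  odds[n_odd++]  = term[i];
    }
    for (size_t i = 0; i < n_even; i++) out[i] = evens[i];
    for (size_t i = 0; i < n_odd; i++)  out[n_even + i] = odds[i];
    return (long)n_even;
}
```

Lowering `norm_size` shrinks the stack frame of every call, which is the knob the setup parameter was meant to expose.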

What I don't yet understand is that I now get the same performance improvement WITHOUT reducing the value from 1000.

Here are the numbers. Can someone try to reproduce this?

| Benchmark | Speedup w.r.t. v5.0.0 |
|---|---|
| chromatic | 1.10 ± 0.01 |
| color | 1.13 ± 0.01 |
| fmft | 1.10 ± 0.01 |
| forcer | 1.04 ± 0.02 |
| forcer-exp | 1.11 ± 0.01 |
| mass-fact | 1.21 ± 0.09 |
| mbox1l | 1.02 ± 0.03 |
| minceex | 1.10 ± 0.02 |
| mincer | 1.25 ± 0.08 |
| sort-disk | 1.58 ± 0.01 |
| sort-large | 2.03 ± 0.07 |
| sort-small | 1.83 ± 0.02 |
| trace | 1.33 ± 0.01 |

The github runners seem a bit dodgy recently... I think the CI should pass.

@jodavies
Collaborator Author

jodavies commented Feb 26, 2026

The performance improvement is due to "stack clash protection". Compiling with -fno-stack-clash-protection gives the same improvement. (Additionally adding -fno-stack-protector gains perhaps another percent or two.)

The way it is implemented is to touch each 4k page in the function's stack. This leads to a lot of L1 cache misses -- my CPU has an 8-way associative cache. In the VLA case, stack clash protection is also active, however it touches each 4k page of each array in separate loops, compared to the whole 90k stack in a single loop. Since NORMSIZE is 1000, the base addresses of the arrays are not separated by a power of 2, and so it manages much better L1 usage. If, in the VLA case, I make NORMSIZE 1024, the performance gain disappears.
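The aliasing argument can be made concrete with a simplified model of a set-associative cache. The geometry below (32 KiB, 8-way, 64-byte lines, hence 64 sets) is an assumption for illustration, not a measurement of the CPU in question, and the function only counts which sets a sequence of evenly spaced addresses maps to. It does not reproduce GCC's actual probe sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache geometry: 32 KiB, 8-way, 64-byte lines -> 64 sets. */
enum {
    LINE_BYTES  = 64,
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    NUM_SETS    = CACHE_BYTES / (LINE_BYTES * WAYS)
};

/* Cache set that the line containing `addr` maps to in this model. */
unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Number of distinct sets touched by `count` accesses that start at `base`
 * and are spaced `stride` bytes apart. */
unsigned distinct_sets(uintptr_t base, size_t stride, unsigned count)
{
    /* The model assumes a power-of-two number of sets. */
    assert(NUM_SETS > 0 && (NUM_SETS & (NUM_SETS - 1)) == 0);

    unsigned char seen[NUM_SETS] = {0};
    unsigned n = 0;
    for (unsigned i = 0; i < count; i++) {
        unsigned s = set_index(base + (uintptr_t)i * stride);
        if (!seen[s]) {
            seen[s] = 1;
            n++;
        }
    }
    return n;
}
```

In this model, 22 probes at a 4096-byte stride (roughly one per page of a 90 KB frame) all land in a single set, which has room for only 8 lines. The same number of accesses at a 4000-byte stride spread over 22 different sets.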

So: should we just compile by default with -fno-stack-clash-protection -fno-stack-protector? I don't know whether these protections provide anything security-wise, given that FORM has #write and #system anyway.

I will also investigate further using the WorkSpace to hold this data.

@vermaseren
Collaborator

vermaseren commented Feb 26, 2026 via email

@jodavies
Collaborator Author

When you programmed this, "stack clash protection" did not exist ;) GCC added it in 2018 (8.0).

You can get the same performance improvement by allocating arrays in AT for Normalize to use. You need to allocate, I think, twice the required size. As far as I can tell, the only way to have a recursive Normalize call in the current test suite and benchmarks is Normalize -> ExpandRat -> Normalize. I think this is much simpler than messing around with the WorkSpace, since many of the functions which Normalize calls use the WorkSpace themselves.

So, which seems to be the better solution? Both have similar performance:

  • Compile by default with -fno-stack-clash-protection. This is simple and doesn't change FORM's behaviour/code at all.
  • Allocate arrays in AT. This then implies a maximum recursion depth for Normalize. The advantage is that buffer overruns in these arrays will be caught by valgrind. Currently if one overruns a buffer, there is silent corruption of the rest of the stack.

@vermaseren
Collaborator

vermaseren commented Feb 26, 2026 via email

Large stack allocations have a performance implication due to stack clash
protection. Dynamically allocate the buffers needed by Normalize in AT,
instead. This leads to a large performance improvement.
@jodavies jodavies changed the title from "WIP perf: make NORMSIZE a user-controlled setup parameter" to "perf: allocate Normalize buffers dynamically, store in AT" on Feb 27, 2026
@jodavies
Collaborator Author

Here is the next iteration. Pointers to Normalize buffers live in a struct, and the space is allocated dynamically such that valgrind can catch errors in their use. I removed the user control of NORMSIZE again.

One set of buffers is allocated at startup (per thread), and another is allocated if necessary.

For now, I made it so that debugging builds Terminate if Normalize is called with more than two recursions, which we don't expect to happen.
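The scheme described in this comment might look roughly like the sketch below. All names (`NormBuffers`, `norm_buffers_acquire`, and so on) are hypothetical and the real struct in AT will differ. The sketch keeps one heap-allocated buffer per recursion level, allocates the first at startup and the second on demand, and returns NULL beyond the depth limit, where the PR's debugging builds call Terminate instead.

```c
#include <assert.h>
#include <stdlib.h>

#define NORM_SIZE      1000  /* mirrors NORMSIZE from the discussion */
#define NORM_MAX_DEPTH 2     /* Normalize -> ExpandRat -> Normalize */

/* Hypothetical per-thread container; the real layout in AT may differ. */
typedef struct {
    int *scratch[NORM_MAX_DEPTH];  /* one buffer per recursion level */
    int  depth;                    /* number of levels currently in use */
} NormBuffers;

/* Allocate the first level eagerly, as would happen at thread startup. */
int norm_buffers_init(NormBuffers *nb)
{
    for (int i = 0; i < NORM_MAX_DEPTH; i++) nb->scratch[i] = NULL;
    nb->depth = 0;
    nb->scratch[0] = malloc(NORM_SIZE * sizeof *nb->scratch[0]);
    return nb->scratch[0] != NULL ? 0 : -1;
}

/* Hand out the buffer for the next recursion level, allocating it on first
 * use. Returns NULL if the depth limit is exceeded or allocation fails. */
int *norm_buffers_acquire(NormBuffers *nb)
{
    if (nb->depth >= NORM_MAX_DEPTH) return NULL;
    if (nb->scratch[nb->depth] == NULL) {
        nb->scratch[nb->depth] = malloc(NORM_SIZE * sizeof *nb->scratch[0]);
        if (nb->scratch[nb->depth] == NULL) return NULL;
    }
    return nb->scratch[nb->depth++];
}

/* Give back the most recently acquired level; the memory is kept for reuse. */
void norm_buffers_release(NormBuffers *nb)
{
    assert(nb->depth > 0);
    nb->depth--;
}

/* Free every level that was ever allocated. */
void norm_buffers_free(NormBuffers *nb)
{
    for (int i = 0; i < NORM_MAX_DEPTH; i++) {
        free(nb->scratch[i]);
        nb->scratch[i] = NULL;
    }
    nb->depth = 0;
}
```

Because each level is a separate heap block rather than part of one stack frame, a tool such as valgrind can flag writes past the end of a buffer.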

The performance numbers are more-or-less the same as the first comment.

@coveralls

coveralls commented Feb 28, 2026

Coverage Status

coverage: 58.035% (-0.008%) from 58.043%
when pulling 070aaaf on jodavies:user-normsize
into c134010 on form-dev:master.
