perf: improve parallelism of SortBotMerge#831
Open
jodavies wants to merge 2 commits into
Open
Conversation
Currently, the sortbot stage of tform sorting does not achieve good parallelism. Primarily, this is because the "sortblock" size has grown with default SmallSize and LargeSize adjustments, such that in many cases the sortbot levels do not run in parallel at all, because each thread's output fits in a single block. This commit adjusts the logic for filling and unlocking the blocks such that sortbot threads can start work as soon as possible: - Only put complete terms into the blocks, no splitting over the blocks. - Track the number of terms in each block, and use this when reading the data to determine when a block is complete, rather than waiting for a term to overlap the "stop" pointer. - When filling blocks (in PutToMaster), if a certain (small) minimum number of terms has been written, probe if a reading thread is waiting on the current block by attempting to lock the previous block. If a thread is waiting (so the lock fails), unlock the current block immediately and write the term into the next. With these changes, the "zeroth" block is no longer necessary since the terms never span two blocks. Also reduce the MINIMUMNUMBEROFTERMS parameter from 10 to 2, so that the small+large buffer does not scale (so much) with large MaxTermSize and many threads.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This change comes from discussion and benchmarking efforts in collaboration with @AnaIPereira. See the commit message for a full description.
On my 12-core 7900X, the results are as follows. @AnaIPereira is finishing up some testing on higher core-count sytems, to make sure we have a good value for
MINWRITENUMBEROFTERMSand that the adjusted, lower value ofMINIMUMNUMBEROFTERMSdoes not affect performance.Speedup w.r.t. master:
chromaticcolorfmftforcer-expforcerhyperformmass-factmbox1lminceexmincermzv-dmsort-disksort-largesort-smalltrace