
Add DeviceRadixSort #82

Merged: 6 commits merged into aras-p:main on Feb 19, 2024

Conversation

b0nes164 (Contributor) commented Dec 9, 2023

Add DeviceRadixSort (MIT License). Keep FidelityFX as a fallback, but modify host CPU code to use DeviceRadixSort. DeviceRadixSort still untested on AMD hardware. See issue #48.

Add DeviceRadixSort (MIT License). Keep FidelityFX as a fallback, but modify host CPU code to use DeviceRadixSort.
aras-p (Owner) commented Dec 9, 2023

Hmm, testing it out on a Mac (M1 Max), and with every frame it seems to add more & more black results, as if it makes splats disappear (which would probably mean that some of the key payload data gets overwritten with zero or something).

[Screenshot 2023-12-09 at 17:30:59]

b0nes164 (Contributor, Author) commented Dec 10, 2023

What is the wave-width of the M1? I found a bug affecting warp-widths < 32.

Edit:
Reworked counting of bits less than a lane from shifting to using a mask. I suspect that DX12 and Metal handle shifting `1<<n` for `n > 31` differently, resulting in incorrect behavior in the Metal case.

Reworked counting of bits less than a lane from shifting to using a mask. I suspect that DX12 and Metal handle shifting `1<<n` for `n > 31` differently, resulting in incorrect behavior in the Metal case.
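For illustration, a hedged sketch of the shift-vs-mask difference described above (not the PR's exact code; the lane-mask names are placeholders):

```hlsl
// Building a "lanes at or below me" mask with a shift computes 1u << 32 on
// lane 31. DXIL effectively masks shift amounts to the low 5 bits, while
// Metal follows C-style rules where a shift by 32 or more is undefined, so
// the cross-compiled result can differ.
uint laneMaskLeShift = (1u << (WaveGetLaneIndex() + 1)) - 1;   // problematic on lane 31

// Mask-based form: the shift amount stays within [0, 31] for every lane.
uint laneMaskLeMask = 0xFFFFFFFFu >> (31 - WaveGetLaneIndex());

// Typical WLMS use: how many lanes at or below me hold the same digit
// (assumes a wave width of at most 32, so only ballot.x matters).
uint4 ballot = WaveActiveBallot(true);   // placeholder predicate
uint rank = countbits(ballot.x & laneMaskLeMask);
```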
aras-p (Owner) commented Dec 10, 2023

@b0nes164 the pushed fix does not help (and FWIW Apple M1 has warp width of 32). Wild guess: is the shader relying on D3D behavior of out-of-bounds buffer reads returning zeroes? Metal (and Vulkan, for that matter) does not have that guarantee, out of bounds buffer reads are undefined (and out of bounds buffer writes are too, and can hang/crash the system - whereas on D3D they are guaranteed to be ignored).

b0nes164 (Contributor, Author) commented Dec 11, 2023

No out-of-bounds hacks; input sizes that are not perfect multiples of the partition tile size are handled by a specific routine. In my mind there is either an issue resulting from the cross compilation, or there is a race condition that I am not accounting for and that does not present on my hardware.

Based on some light reading of the Metal language specification, it seems like there could be an issue with the Metal equivalent of WaveActiveBallot() because it returns a wrapper type rather than a uint4, but I have zero experience with Metal or cross compiling or how Unity specifically does the cross compilation. I also don't have a Mac, so I can't troubleshoot the code either.

I'm still trying to think of possible fixes, but for now, this will have to be on hold.

b0nes164 (Contributor, Author) commented:

@aras-p, this might be a bit of a reach, but would it be possible to send you a standalone version of the sort to collect debug information, then send the results to me? Because LSD radix sorting performs a constant amount of work regardless of the input, the bug should reproduce in a standalone environment.

aras-p (Owner) commented Dec 27, 2023

@b0nes164 I do have a separate "testbed" project for GPU sorting codes, and I just plugged yours from this PR in there. At least on a Mac here, DeviceRadixSort works fine for smaller array sizes, and starts to have occasional failures at 10000 items and up, and "quite many" failures at sizes of several millions, i.e.

[Screenshot 2023-12-27 at 10:21:27]

The actual errors (i.e. how many elements are not correctly sorted, and where they are located) are different from run to run though.

If you want to send me something else to test here, I'd be happy to!

b0nes164 (Contributor, Author) commented Jan 11, 2024

@aras-p apologies for the delay, but here is the debugging routine for the sort. The limited testing you did was actually very useful because it rules out issues with the inter-threadblock prefix sum. My running theory is that for whatever reason on the M1, stale values are occasionally being pulled from shared memory, causing non-catastrophic errors in the downsweep kernel. According to the HLSL specification:

[I]n Direct3D 10 there is no synchronization of threads when writing to groupshared, so this means that each thread is limited to a single location in an array for writing.

Although we are using DX12, I am unaware of any improved guarantees on shared memory synchronization without barriers. My best guess is that on my hardware, shared memory accesses are kept coherent at the warp level, guaranteeing that fresh values from within the warp are being pulled from shared memory. Because this scenario only occurs in the "full partition" portion of the code, it supports your testing results, where the bug only presents in sorts of input size 7680 or greater.

The testing procedure:

  1. Attach the dispatcher to an empty game object.
  2. Attach the compute shader to the script.
  3. Run the scene; upon completion, "Test Complete" will be printed in the debug log (should take less than 30s).
  4. Zip the resulting text files in the Unity project and post them back to me here.

Both files can be found under "MetalDebug" as public gists at:
https://gist.github.com/b0nes164

aras-p (Owner) commented Jan 11, 2024

My best guess is that on my hardware, shared memory accesses are kept coherent on a warp level, guaranteeing that fresh values from within the warp are being pulled from shared memory

@b0nes164 "Something along those lines" was also my guess, i.e. some sort of coherency or memory model assumptions that happen to work out fine on your GPU architecture, but not necessarily on others.

Anyway, here are the test results I got on M1 Max. I had to add `#pragma require wavebasic` to the compute shader btw, since without it Unity's shader compiler targets too old a Metal version and errors out about wave ops.

MetalDebugOutput.zip

b0nes164 (Contributor, Author) commented:

Interesting, there was indeed an issue with the inter-threadblock prefix sum; the offending lines are here:

```hlsl
g_scan[circularLaneShift + waveOffset] = g_scan[WaveGetLaneIndex() + waveOffset] +
    WavePrefixSum(g_scan[WaveGetLaneIndex() + waveOffset]);
b_passHist[i + offset] = (WaveGetLaneIndex() ? g_scan[WaveGetLaneIndex() + waveOffset] : 0) + aggregate;
```

Fortunately this line can be fixed without adding an additional barrier. Although the downsweep kernel came out clean, I will be adding additional barriers as a safety measure.
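A hedged sketch of what the register-based rewrite can look like (illustrative, not necessarily the exact committed fix; names follow the snippet above):

```hlsl
// Reading the shared value once into a register and using the exclusive
// WavePrefixSum result directly removes the second read of g_scan, so no
// lane depends on another lane's freshly written shared-memory value.
const uint value = g_scan[WaveGetLaneIndex() + waveOffset];
const uint exclusive = WavePrefixSum(value);                 // exclusive prefix within the wave

g_scan[circularLaneShift + waveOffset] = value + exclusive;  // inclusive sum, stored circularly shifted as before
b_passHist[i + offset] = exclusive + aggregate;              // replaces the lane-0-special-cased re-read
```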

Fix scan kernel pulling stale values from shared memory by moving operations into registers.

Fix bug affecting wave sizes != 32 where the preprocessor macro incorrectly defined the size of the downsweep histograms.

Fix potential (currently non-presenting) bug where the final barrier in the partial partition routine could exhibit undefined behaviour.
b0nes164 (Contributor, Author) commented Jan 11, 2024

Alright, the fix for several bugs has been pushed! Overall the performance didn't suffer too badly: about a ~5% reduction on extremely large inputs, and a negligible reduction on the lower end. I've been having trouble activating a Unity license on my AMD computer, so the algorithm is still untested on that front. I plan to make a headless DX12 implementation to circumvent the license issue, but it might be a couple more weeks until I get the chance to do it.

aras-p (Owner) commented Jan 11, 2024

Nice! That seems to work correctly now here. On Bicycle 3DGS scene at "Medium" quality, total render frame time goes 23.3ms -> 22.0ms on M1 Max. I'll do some more testing and merge this request sometime later (maybe once you or someone does testing on AMD / Intel). Thanks!

```hlsl
if (bits == 1)
{
    for (int wavePart = 0; wavePart < waveParts; ++wavePart)
        g_localHist[key >> e_radixShift & RADIX_MASK] += countbits(waveFlags[wavePart]);
```


Wouldn't it be faster to do the reduce of countbits first and then do a single update to g_localHist?

b0nes164 (Contributor, Author) commented Jan 21, 2024

It could be, but my concern in this case is register count. Assuming the HLSL compiler returns a result similar to CUDA, we are sitting right on the fence where registers start cutting into occupancy. Furthermore, this portion of the routine is only invoked for the last partition and has an extremely negligible impact on the performance of the algorithm, whereas using more registers here might push the compiler to allocate more registers overall, which could severely impact the entire algorithm.
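For reference, a minimal sketch of the suggested reduce-first variant (names follow the excerpt above; not adopted in the PR, per the register-pressure concern):

```hlsl
if (bits == 1)
{
    // Accumulate the popcounts into a register first,
    // then touch groupshared memory once.
    uint total = 0;
    for (int wavePart = 0; wavePart < waveParts; ++wavePart)
        total += countbits(waveFlags[wavePart]);
    g_localHist[key >> e_radixShift & RADIX_MASK] += total;
}
```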

raphlinus commented:
I reviewed this for a work-in-progress survey of GPU sorting techniques. Overall it looks impressive, though I did find a couple of correctness concerns.

First, there is no wave barrier between offset = g_localHist[...] (line 334) and g_localHist[...] += ... (line 338). See gpuweb/gpuweb#4437 for a more in-depth discussion of this issue, which I believe motivates adding a subgroup (wave) barrier to WebGPU. HLSL does not have this intrinsic, though it's likely that most GPUs execute waves in lockstep, so the barrier is implicit. Even so, when the HLSL gets translated to Metal or Vulkan (both of which do have subgroup barriers), it is at least technically a data race. Using a full GroupMemoryBarrierWithGroupSync would satisfy correctness concerns with some performance loss. On M1 Max, in my (very similar) codebase, I observed about a 2% hit, but it may be larger on other GPUs. I'd certainly be interested in more data on this.
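A minimal sketch of the conservative fix, assuming group-uniform control flow at that point (the variable names and placement are illustrative, not the PR's actual lines):

```hlsl
uint offset = g_localHist[digit];      // the read
GroupMemoryBarrierWithGroupSync();     // stands in for the wave barrier HLSL lacks
g_localHist[digit] += digitCount;      // the read-modify-write
```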

In an earlier draft of that survey, I complained that the wave size assigned to one compute shader may not match that of another compiled on the same GPU (especially Intel), so the probe strategy is invalid. That's still true, but I think it's not a real worry, as wave size is guaranteed to be between 4 and 128 inclusive on all GPUs. This is documented in WaveGetLaneCount, and these limits are the same as Vulkan's. So I believe testing wave size at all is a no-op.

I'm impressed that this algorithm is portable to different wave sizes, but I think this comes at a performance cost. My experiment was specialized to 32, and I was able to extract some gains:

  • wave ballot accumulation is a single uint, so no need to iterate over wavePart
  • WLMS accumulation can happen on all waves in parallel, no need to sequentialize by wave

Of course, the downside to the latter is that the histogram array needs storage proportional to digit size * number of waves. I was only able to get good results with 4 bit digits, so it's possible that doing fewer passes will outweigh the concerns here.

The other concern is of course that support for specializing by wave size is extremely spotty in GPU infrastructure; really only Vulkan 1.3 does it properly. I'm pushing for WebGPU to get this right; not sure how Unity handles it.

aras-p (Owner) commented Jan 21, 2024

not sure how Unity handles it

Starting with Unity 2023 it seems that they made it easier to target the same shader for different wave sizes, but the docs are not exactly clear on how to use it (this gaussian splatting project currently targets Unity 2022 because it's the current LTS version, though).

Starting with 2023 it looks like you can use #pragma multi_compile to compile your compute shader into several variants using the UNITY_DEVICE_SUPPORTS_WAVE_8 .. UNITY_DEVICE_SUPPORTS_WAVE_128 special defines, and then Unity will pick the right one at runtime (docs page). Before that, you could compile into several variants, but would have to pick the right one manually, which is somewhat more cumbersome.
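A hedged sketch of how that variant setup could look in the compute shader (keyword names as listed above; exact pragma usage is per the Unity docs page mentioned, and this PR does not use it since it targets Unity 2022):

```hlsl
#pragma multi_compile _ UNITY_DEVICE_SUPPORTS_WAVE_8 UNITY_DEVICE_SUPPORTS_WAVE_16 UNITY_DEVICE_SUPPORTS_WAVE_32 UNITY_DEVICE_SUPPORTS_WAVE_64 UNITY_DEVICE_SUPPORTS_WAVE_128

// Pick a compile-time wave size per variant, with a conservative fallback.
#if defined(UNITY_DEVICE_SUPPORTS_WAVE_128)
    #define WAVE_SIZE 128
#elif defined(UNITY_DEVICE_SUPPORTS_WAVE_64)
    #define WAVE_SIZE 64
#elif defined(UNITY_DEVICE_SUPPORTS_WAVE_32)
    #define WAVE_SIZE 32
#elif defined(UNITY_DEVICE_SUPPORTS_WAVE_16)
    #define WAVE_SIZE 16
#elif defined(UNITY_DEVICE_SUPPORTS_WAVE_8)
    #define WAVE_SIZE 8
#else
    #define WAVE_SIZE 32   // unknown at compile time; fall back to the portable path
#endif
```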

```hlsl
//Each wave clears its own portion of shared memory
{
    const int waveHistEnd = (WAVE_INDEX + 1) * RADIX;
    for (int i = WaveGetLaneIndex() + WAVE_INDEX * RADIX; i < waveHistEnd; i += WaveGetLaneCount())
```


I'm really having trouble understanding how this will perform with wave sizes less than 32. In particular, with a wave size of 16, WAVE_INDEX will be 0..32, and the range of locations cleared will be 0..8192. But g_localHist is sized to 7680. The situation is much worse for a wave size of 8.

By my logic, the size of g_localHist should be RADIX * DS_THREADS / WaveGetLaneCount(), and I do not understand at all where this 7680 is coming from. Of course the problem with that is that WaveGetLaneCount() is not known at compile time.
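For concreteness, assuming the 512-thread downsweep group implied by the 0..32 wave-index range above and RADIX = 256, that formula gives $256 \times 512 / 16 = 8192$ entries at a wave size of 16 and $256 \times 512 / 8 = 16384$ entries at a wave size of 8, both exceeding the 7680 entries g_localHist actually provides.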

b0nes164 (Contributor, Author) commented Jan 21, 2024

Yeah, no, this is definitely an issue. The partition size of 7680 was simply the largest partition tile size I could choose that would also fit into HLSL's 32 KB shared memory limit during the scattering.

It seems like multi-compiling might be the only option to cover all the partition sizes; see the comment below on why I was trying to avoid specialized shader variants.

b0nes164 (Contributor, Author) commented Jan 21, 2024

First, there is no wave barrier between offset = g_localHist[...] (line 334) and g_local_hist[...] += ... (line 338). See gpuweb/gpuweb#4437 for a more in-depth discussion of this issue, which I believe motivates adding subgroup (wave) barrier to WebGPU. HLSL does not have this intrinsic, though it's likely that most GPUs execute waves in lockstep so the barrier is implicit.

There should definitely be a barrier there, but as you mentioned, there is no subgroup level barrier available in HLSL. I have a mind to change that portion of the code entirely because of these issues.

wave ballot accumulation is a single uint, so no need to iterate over wavePart
WLMS accumulation can happen on all waves in parallel, no need to sequentialize by wave

Of course, the downside to the latter is that the histogram array needs storage proportional to digit size * number of waves. I was only able to get good results with 4 bit digits, so it's possible that doing fewer passes will outweigh the concerns here.

As I mentioned in the comment, we only use sequential WLMS in the final partition as a way of easily dealing with non-standard-size partitions. Because it is only invoked once per binning pass, the performance difference is negligible. (For example, for an input of size 2 million with a partition size of 7680, there are 261 partitions, and only one is processed sequentially.)

In an earlier draft of that survey, I complained that wave size assigned to one compute shader may not be constant on another, compiled on the same GPU (especially Intel), so the probe strategy is invalid.

I did not know that, and will change this routine then. As Aras mentioned, I didn't find any documentation on Unity exposing the functionality to grab the adapter description as you would in Vulkan or DX12.

I'm impressed that this algorithm is portable to different wave sizes, but I think this comes at a performance cost.

I thought this too at first, the primary downsides being less loop unrolling and more control logic being necessary, but after implementing it here, I found the difference in performance to be negligible on my GPU. More testing is definitely required, but as Aras mentioned, the solution in Unity is essentially to maintain a specialized version of the code for every possible wave size, which hurts maintainability and cuts into compile times for the shader.

raphlinus commented:

A quick comment: I now see that I misread the code and that the "last partition" is its own specialized thing. So I withdraw my assertion that sequentializing in this case has a meaningful performance impact.

Getting the wave sizes is a big problem across the GPU ecosystem, and WebGPU is struggling with it. My gut feeling is that a production implementation will need specialized versions when you do know the wave size, and a fallback implementation when you don't. But perhaps we will figure out how to make WLMS wave-size portable.

b0nes164 (Contributor, Author) commented:

As Raph pointed out, there are currently issues with the static allocation of shared memory, and as Aras pointed out, it does not seem like Unity 2022 has support for appropriately querying the GPU for the wave/warp/subgroup size of the device. So I will assume that #pragma multi_compile is off the table.

The constraint we are working under is that each warp needs a histogram whose size equals the number of radix digit bins, 256. As warps get smaller, we need more shared memory to accommodate the warp histograms. Shared memory has the added drawback that it must be statically allocated (at least in HLSL). Fortunately, there is a solution that should allow us to support waves down to size 8, with no changes to occupancy or thread count:

  1. We combine g_waveHist and g_localHist into one contiguous block of memory, giving us a total memory size of 32 KB.
  2. We then delay pulling the global histogram values into shared memory until after the WLMS is performed, allowing us to use the entire memory block for WLMS.
  3. Because we use a fixed partition size, we know that the total count for any given digit cannot exceed the size of the partition tile. Thus we can use uint16_t instead of uint32_t to store our digit accumulations, effectively doubling the capacity of the shared memory during WLMS.

This may be messy in HLSL, because it does not have reinterpret_cast, but it is worth a shot.
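A hedged sketch of the bit-packing idea in step 3 above (names and sizing are illustrative, not the PR's): two 16-bit per-wave digit counters share one 32-bit groupshared word, which works because no digit count within a 7680-element tile can reach $2^{16}$, so the low half can never carry into the high half.

```hlsl
#define RADIX     256   // digit bins per 8-bit pass
#define MAX_WAVES 64    // e.g. a 512-thread group at the minimum wave width of 8

groupshared uint g_packedHist[RADIX * MAX_WAVES / 2];

void AddDigitCount(uint digit, uint waveIndex, uint count)
{
    const uint word  = (waveIndex >> 1) * RADIX + digit;   // two waves per word
    const uint value = count << ((waveIndex & 1) * 16);    // even wave: low half, odd wave: high half
    InterlockedAdd(g_packedHist[word], value);             // atomic, since paired waves share the word
}

uint ReadDigitCount(uint digit, uint waveIndex)
{
    const uint word = (waveIndex >> 1) * RADIX + digit;
    return (g_packedHist[word] >> ((waveIndex & 1) * 16)) & 0xFFFFu;
}
```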

Upsweep kernel:
Change the threadgroup size from 64 to 128 to accommodate larger minimum waves.

Fix a potential race condition caused by circular shifting during the exclusive prefix sum for wave widths greater than or equal to 16.

Add a fused Brent-Kung scan to support wave widths less than 16.

Scan kernel:
Remove wave size polling. Change to fixed threadblock dispatch sizes.

Update the exclusive scan to accommodate the fixed threadblock dispatch.

Add support for wave widths less than sixteen.

Downsweep kernel:
Move loading of the pass and local histograms to later in the kernel to allow full use of the shared memory during WLMS.

Fix a potential race condition by adding a second barrier during the WLMS.

Fix a potential race condition caused by circular shifting during the exclusive prefix sum across reduced wave histograms.

Add support for wave width 8, with wave width 4 planned.

Add bitpacking to maximize shared memory utilization during WLMS for wave width 4.

Fix a potential race condition during the partial routine by adding a second barrier during WLMS.
b0nes164 (Contributor, Author) commented Jan 28, 2024

Some general notes about progress so far:

This is a preliminary commit to show the general direction of the algorithm. Each kernel now contains sections to handle different wave sizes, which is done logically by using WaveGetLaneCount(). In later versions this should be done with the compiler using UNITY_DEVICE_SUPPORTS_WAVE_XXX.

Each kernel contains a portion of code to support "big" waves, 16 and above, and "small" waves, below 16. We choose 16 as the dividing size because waves of width 16 and above can easily perform the local prefix sums required for an 8-bit radix pass, $16^{2} = 256$, whereas waves of width 8 and below are too small and require a more specialized approach, namely a wave-width-based Brent-Kung scan with a fused "upsweep" and "downsweep."
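As a rough illustration of why a wave of 16 lanes comfortably covers a 256-bin local prefix sum (a hedged, single-wave sketch with placeholder names; the actual kernels are more involved and handle multiple waves):

```hlsl
groupshared uint g_bins[256];

void ScanBinsWith16Lanes()   // assumes exactly 16 active lanes
{
    const uint lane = WaveGetLaneIndex();

    // Pass 1: serial inclusive scan of this lane's own 16-bin chunk.
    uint sum = 0;
    for (uint i = 0; i < 16; ++i)
    {
        sum += g_bins[lane * 16 + i];
        g_bins[lane * 16 + i] = sum;
    }

    // Pass 2: exclusive prefix sum of the 16 chunk totals across the wave.
    const uint chunkOffset = WavePrefixSum(sum);

    // Pass 3: add each chunk's offset back to its 16 bins.
    for (uint i = 0; i < 16; ++i)
        g_bins[lane * 16 + i] += chunkOffset;
}
```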

Although the code is still untested on wave widths other than 32, I have completed my DX12 implementation and am beginning testing in earnest. In particular, DX12 has access to the [WaveSize(<numLanes>)] attribute as part of shader model 6.6, which allows the user to force specific wave widths on kernels, which will be extremely useful for testing.
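For example, a hedged sketch of forcing a wave width for testing under SM 6.6 (entry-point name and sizes are illustrative; requires compiling as cs_6_6 and driver support for the requested width):

```hlsl
[WaveSize(16)]
[numthreads(256, 1, 1)]
void Downsweep(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    // kernel body elided
}
```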

This version of the code only supports down to a wave width of 8, but a wave width of 4 is very achievable simply by performing the WLMS serially. I will be adding this soon, and I expect it to still be a significant performance win over losing occupancy to shared memory constraints.

I have increased the threadgroup sizes of the Upsweep and Scan to 128 to make the kernels more amenable to the maximum possible wave-width. I also plan to reduce the threadgroup size of the Downsweep kernel to 256 to accommodate hardware with lower warp hosting capacity.

Upsweep Kernel:

Fix bug with prefix sum for wave size less than 16, where aggregate was not being calculated properly.

Fix bug with prefix sum for wave size less than 16, where downsweep was off by one.

Scan Kernel:

Fix bug with prefix sum for wave size less than 16, where aggregate was not being calculated properly.

Fix bug with prefix sum for wave size less than 16, where downsweep was off by one.

Downsweep Kernel:

Add support for wave size 4.

Split key loading and scattering into wave widths less than 16 and greater than or equal to 16.

Fix bug where waveflag was improperly initialized for waves less than 16.
b0nes164 (Contributor, Author) commented Jan 31, 2024

Test coverage results:

| Adapter | Wave Size | Test Results |
| --- | --- | --- |
| WARP | [4, 4] | Passing |
| Intel HD Graphics 620 | [16, 16] | Passing |
| Nvidia RTX 2080 Super | [32, 32] | Passing |
| AMD 7900 XT | allowed to vary in [32, 64] | Passing |
| AMD 7900 XT | locked to 64 | Passing |

| Algorithm version | $2^{28}$ uniform random 32-bit key-value pairs per second, DX12, 2080 Super |
| --- | --- |
| Dec 8th commit | $4.15 \times 10^{9}$ |
| Current | $4.17 \times 10^{9}$ |

Interestingly, shuffling things around and moving some values into registers increased performance slightly, despite the significantly larger number of barriers. Performance on higher-end cards could be better: lowering the partition size cuts into cache-line efficiency during the scatter, but that is unavoidable without compiler variants.

I am now confident that this version of the sort is ready to be merged.

b0nes164 (Contributor, Author) commented:

I have one last update in the works, which slightly improves performance, and refactors the code for readability. I plan to commit this in the next few days.

Removed all preprocessor macro functions.

Refactor some variable names, and cut line length to 100.

Unroll WLMS loop.

Use laneMaskLt instead of laneMaskLe for WLMS.

Remove redundant logic in scan kernel.

Fix silent bug during WLT16 exclusive prefix sum.
b0nes164 (Contributor, Author) commented:

Test coverage results:

| Adapter | Wave Size | Test Results |
| --- | --- | --- |
| WARP | [4, 4] | Passing |
| Intel HD Graphics 620 | [16, 16] | Passing |
| Nvidia RTX 2080 Super | [32, 32] | Passing |
| AMD 7900 XT | allowed to vary in [32, 64] | Passing |
| AMD 7900 XT | locked to 64 | Passing |

| Algorithm version | $2^{28}$ uniform random 32-bit key-value pairs per second, DX12 standalone, 2080 Super |
| --- | --- |
| Dec 8th commit | $4.15 \times 10^{9}$ |
| Jan 30th commit | $4.17 \times 10^{9}$ |
| Current | $4.44 \times 10^{9}$ |

aras-p (Owner) commented Feb 19, 2024

@b0nes164 very nice! Do you have some sort of "high level outline" of the algorithm used here? Like, is this similar to OneSweep at this point, or different in some way?

aras-p merged commit a1febce into aras-p:main on Feb 19, 2024
b0nes164 (Contributor, Author) commented Feb 19, 2024

@aras-p

Do you have some sort of "high level outline" of the algorithm used here?

Not yet, but it is on my to do list.

[I]s this similar to OneSweep at this point, or different in some way?

Yes, this is almost exactly the same as OneSweep except for one key difference: Reduce-Then-Scan (RTS) is used in place of Chained-Scan-with-Decoupled-Lookback (CSDL) to perform the inter-threadblock prefix sum over digit counts.

During a counting sort each key is serially dependent on the counts of preceding keys, and on a GPU these key counts somehow have to be communicated between the threadblocks during the sort. RTS and CSDL are two approaches to solving this problem. The issue is that, in my experience, CSDL does not play nicely outside of CUDA. This is because Nvidia and AMD have different ways of performing the inter-threadblock communication that CSDL requires. According to Raph, Nvidia relies on forward-thread-progress guarantees, whereas AMD uses something called Global Data Share (see section 1.2.2.2 of the RDNA 3 ISA), which is dedicated low-latency but globally coherent device-level memory. To the best of my knowledge, neither of these hardware-specific features is exposed in the public D3D12 API, which makes CSDL and OneSweep impractical.
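For readers unfamiliar with the two schemes, a rough outline of the RTS structure used here versus OneSweep's chained scan (a hedged summary in comments, not the project's actual dispatch code):

```hlsl
// Reduce-Then-Scan, per 8-bit digit pass (4 passes for 32-bit keys):
//   1. Upsweep:   each threadblock counts digit occurrences in its partition
//                 and writes its 256-bin histogram to global memory.
//   2. Scan:      a small dispatch prefix-sums the per-block histograms,
//                 producing every block's global scatter offsets.
//   3. Downsweep: each threadblock re-reads its partition, ranks keys with
//                 WLMS, and scatters keys and payloads to their destinations.
//
// CSDL / OneSweep instead fuses this into one pass per digit by having each
// block look back at its predecessors' published aggregates, which leans on
// forward-progress or globally coherent scratch behavior that is hard to
// reach from the public D3D12 API.
```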

That being said, I know from talking to a D3D12 dev that Global Data Share is exposed internally, and is used for what I'm guessing is D3D12's mesh shader implementation, where CSDL would be used to compact culled primitives.

Also, apologies, but I have one last update for DeviceRadixSort to fix a silent bug that slipped past my testing.

jacemiller commented:
Thank you to aras-p and b0nes164 for this effort!
