Device Clusterization Reorganization, main branch (2024.04.17.) #545
Conversation
I'm still stumped. 😦 According to NSight Systems, the new code does not execute kernels or memory copies any more slowly. If anything, it is even faster than the current code. 🤔

[NSight Systems kernel timing comparison]

Notice how the CCL kernel became faster with the updates. 🤔 Remember that, while I do introduce an additional algorithm with this re-shuffling, the number of kernels being executed remains the same! So, since I just could not understand how the throughput test could become slower while the kernels seemingly become faster, I ran the applications through VTune as well. And there is indeed a smoking gun there. 🤔 The current code achieves the following CPU occupancy during the throughput test:

[VTune CPU occupancy histogram, current code]

While this PR's code does this:

[VTune CPU occupancy histogram, this PR's code]

(The big spike at 1 thread is due to the initialization in both cases.) However, I can't figure out what is causing this. 😦 So if anybody has any good idea, I'm all ears. The "threading analysis" of VTune is just not revealing anything to me at the moment...
I don't know why you got rid of the later part of clusterization, which counts the measurements. (I also don't know why you removed the sorting part. It will need to be rolled back if you encounter weird results from track finding once everything is in a single piece.)
Force-pushed from 6e07118 to 35f10fa
Out of curiosity, have you checked whether you are getting the same results from the CPU and CUDA code?
So... fun! I have now bumped into the same issue that I believe is giving us grief with clusterizing the ODD cells on a GPU. It seems that the code meant to order cells during CSV reading is not doing its job. 😦 I was getting afraid that there would be nothing left to debug... 😮💨
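A quick way to check that suspicion is a sketch like the following; the `cell` type and the exact sort key are assumptions here about what the downstream clusterization expects, not the actual traccc types:

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Illustrative stand-in for a cell read from CSV.
struct cell {
    std::uint64_t module_id;
    int channel0, channel1;
};

// Check the ordering that the reader is supposed to guarantee: cells
// grouped by module, then sorted row-major within each module.
bool cells_are_ordered(const std::vector<cell>& cells) {
    return std::is_sorted(
        cells.begin(), cells.end(), [](const cell& a, const cell& b) {
            return std::tie(a.module_id, a.channel1, a.channel0) <
                   std::tie(b.module_id, b.channel1, b.channel0);
        });
}
```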
Force-pushed from 53e0b50 to ac95998
Never mind, we don't get two birds with just the one stone... 😦 I just hadn't set up the new assertion correctly in the code...
On the ODD front, since I'm now trying to see if I can make the device clusterization work for that in this PR, I now see this:

[assertion failure output]

(This is one of the reasons that I added the many new assertions to the code: to try to understand this lingering issue.) Any quick ideas for how we could end up with a cell that has more than 8 neighbours? 😕 (I'll try to figure this out myself as well, but if somebody already has an idea, that would help...)
Could it be possible - although it would be weird - that the same cell has multiple activations in the same input?
Aha!

[shell output listing duplicate cells]

So there are some duplicate cells in the input data! In fact, some are repeated up to 12 times:

[shell output with duplicate counts]
That was one of my guesses. But the following is now also failing for me. 😕

```diff
diff --git a/io/src/read_geometry.cpp b/io/src/read_geometry.cpp
index c88e8fe..f52ce52 100644
--- a/io/src/read_geometry.cpp
+++ b/io/src/read_geometry.cpp
@@ -19,6 +19,7 @@
 #include <vecmem/memory/host_memory_resource.hpp>

 // System include(s).
+#include <cassert>
 #include <stdexcept>

 namespace {
@@ -50,6 +51,7 @@ read_json_geometry(std::string_view filename) {
         barcode_map = std::make_unique<
             std::map<std::uint64_t, detray::geometry::barcode>>();
         for (const auto& surface : detector.surfaces()) {
+            assert(barcode_map->find(surface.source) == barcode_map->end());
             (*barcode_map)[surface.source] = surface.barcode();
         }
     }
```

I.e. the same surface identifier shows up multiple times from Detray. Could it be that we end up merging modules in some weird way? Though this is probably just a red herring. At least for the clusterization...
Since you seem to have some "shell magic" at the ready, could you remove the duplicates from the files and send me the updated TGZ? I'll upload it as a new version.
Let me doctor up some deduplication code and produce some deduplicated files.
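Something along these lines, perhaps; a minimal sketch, assuming the illustrative `csv_cell` struct below (the field names and the deduplication key are assumptions, not the actual traccc types):

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Illustrative stand-in for a CSV cell row.
struct csv_cell {
    std::uint64_t geometry_id;
    int channel0, channel1;
    float timestamp, value;
};

// Sort the cells by their identifying fields and drop exact repeats,
// keeping the first occurrence of each.
void deduplicate(std::vector<csv_cell>& cells) {
    const auto key = [](const csv_cell& c) {
        return std::tie(c.geometry_id, c.channel0, c.channel1, c.timestamp);
    };
    std::sort(cells.begin(), cells.end(),
              [&](const csv_cell& a, const csv_cell& b) {
                  return key(a) < key(b);
              });
    cells.erase(std::unique(cells.begin(), cells.end(),
                            [&](const csv_cell& a, const csv_cell& b) {
                                return key(a) == key(b);
                            }),
                cells.end());
}
```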
It's also worth understanding how this happens, because I can't say it makes much sense to me. Is there any physical reason why we would get a read-out for one pixel twice? Or 12 times?
Interestingly, the full TML dataset also seems to have some duplicated hits, although the number is much smaller than it is for the ODD files. It could be that this just succeeded by chance: e.g. there was never a duplicated hit with 8 other hits in the neighbourhood, so there was some buffer in the 8-length array to compensate.
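To spell out that failure mode, here is a simplified illustration (assumed types, not the actual `reduce_problem_cell` code) of how duplicated input cells can overflow a fixed 8-slot neighbour array: on a clean pixel grid a cell has at most 8 distinct neighbours, so duplicated entries are the only way past the bound.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>
#include <vector>

// Simplified stand-in for a traccc cell.
struct cell { int channel0, channel1; };

// 8-connectivity: the two cells touch, but are not the same pixel.
bool is_adjacent(const cell& a, const cell& b) {
    const int d0 = std::abs(a.channel0 - b.channel0);
    const int d1 = std::abs(a.channel1 - b.channel1);
    return (d0 <= 1) && (d1 <= 1) && ((d0 + d1) > 0);
}

// Collect the neighbours of cells[i] into a fixed 8-slot array. A clean
// input can never yield more than 8 matches; a duplicated neighbour cell
// makes the count exceed 8 and trips the assertion.
std::array<std::size_t, 8> find_neighbours(const std::vector<cell>& cells,
                                           std::size_t i) {
    std::array<std::size_t, 8> result{};
    std::size_t count = 0;
    for (std::size_t j = 0; j < cells.size(); ++j) {
        if (j != i && is_adjacent(cells[i], cells[j])) {
            assert(count < 8);
            result[count++] = j;
        }
    }
    return result;
}
```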
There must be some issue in the Acts "digitization" code. 🤔 This behaviour sounds like a plausible outcome if some mistake is made there. (Remember, the simulation just tells us how much energy was deposited exactly where in the detector volumes. We then need to turn those energy deposits into the information that the real hardware would've read out. I.e. we need to "digitize" the simulated data, as we call it in ATLAS.) Pinging @asalzburger for info. 😉
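For illustration, this is the kind of per-channel merging that digitization is expected to perform; the types and the accumulation scheme here are assumptions for the sketch, not the actual Acts code. If a digitization step skips this merging, duplicate cells are the natural outcome.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Illustrative types, not the Acts ones.
struct deposit { std::uint64_t module; int channel0, channel1; float energy; };
struct cell    { std::uint64_t module; int channel0, channel1; float value;  };

// Accumulate every simulated energy deposit that falls into the same
// read-out channel, so that each channel produces at most one cell.
std::vector<cell> digitize(const std::vector<deposit>& deposits) {
    std::map<std::tuple<std::uint64_t, int, int>, float> accumulated;
    for (const auto& d : deposits) {
        accumulated[{d.module, d.channel0, d.channel1}] += d.energy;
    }
    std::vector<cell> cells;
    cells.reserve(accumulated.size());
    for (const auto& [key, value] : accumulated) {
        cells.push_back({std::get<0>(key), std::get<1>(key),
                         std::get<2>(key), value});
    }
    return cells;
}
```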
Made traccc::cuda::clusterization_algorithm and traccc::sycl::clusterization_algorithm both output measurement containers, and introduced traccc::cuda::spacepoint_formation_algorithm and traccc::sycl::spacepoint_formation_algorithm for turning those measurements into spacepoints. At the same time, modified the shared clusterization code a little, simplifying how traccc::device::ccl_kernel fills its output container and making sure that functions from the core traccc library are re-used wherever possible. Implemented the setting of unique identifiers on the measurements, meant for the ambiguity resolution algorithm.
Instead of asking the CUDA runtime for it on every event.
Force-pushed from ac95998 to 4bb7d10
Requiring the users to provide the function with vector views instead.
Ouch, I guess we could be seeing a Geant4-versus-Fatras simulation issue. While Fatras is guaranteed to produce one step per module (i.e. the particle only intersects the module once), Geant4 physics can split a segment into several steps when some fancy physics processes happen, e.g. delta rays. If we don't catch those before the clusterizer, we might produce duplicate cells?!?
So... Stephen, since I'd like to sort this out so that we can move on to all the rest that we still need to do: how do you feel about the changes? The very last one, where I made the code go from using bare pointers to the shared memory blocks to using vecmem vectors, did cause a small performance drop (O(1%)). Which I personally am okay with, given the debugging benefits. (In debug mode the code is more talkative about where it encountered an issue.) But I'm open to a discussion about this. In any case, I want to sort out the addition of the new data file in a separate PR. And then of course I'll want to move on to looking at the full ODD tracking chain with CUDA. 😉
I think all the issues are resolved. I don't think a 1% performance drop warrants any debate; we can get this in, no problem.
After #543, this is an even more ambitious PR for splitting `traccc::cuda::clusterization_algorithm` and `traccc::sycl::clusterization_algorithm` in two, (re-)introducing the `traccc::cuda::spacepoint_formation_algorithm` and `traccc::sycl::spacepoint_formation_algorithm` algorithms.

The biggest change is in `traccc::device::ccl_kernel`. Since that function now receives a resizable measurement buffer, it no longer deals with figuring out the index of the measurement that it should fill. It just uses vecmem's built-in ability for doing this sort of thing.
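For illustration, a minimal sketch of that vecmem pattern, assuming a recent vecmem version, a simplified `measurement` struct, and a toy kernel rather than the actual traccc code:

```cuda
#include <vecmem/containers/data/vector_buffer.hpp>
#include <vecmem/containers/device_vector.hpp>
#include <vecmem/memory/cuda/device_memory_resource.hpp>
#include <vecmem/utils/cuda/copy.hpp>

// Simplified stand-in for traccc::measurement.
struct measurement { float local0, local1; };

__global__ void fill_measurements(
    vecmem::data::vector_view<measurement> out_view) {

    // The device vector wraps the resizable buffer. push_back() bumps the
    // buffer's size counter atomically, so no thread needs to compute its
    // own output index.
    vecmem::device_vector<measurement> out(out_view);
    if (threadIdx.x % 2 == 0) { // stand-in for "this thread found a cluster"
        out.push_back({1.f, 2.f});
    }
}

int main() {
    vecmem::cuda::device_memory_resource mr;
    vecmem::cuda::copy copy;

    // A zero-sized buffer with a fixed maximum capacity.
    vecmem::data::vector_buffer<measurement> buffer(
        1024u, mr, vecmem::data::buffer_type::resizable);
    copy.setup(buffer)->wait();

    fill_measurements<<<1, 256>>>(buffer);
    cudaDeviceSynchronize();
    return 0;
}
```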
I also:

- Removed `traccc::cuda::experimental::clusterization_algorithm` and `traccc::sycl::experimental::clusterization_algorithm`. The updated algorithms now behave in the way that those "experimental" algorithms were meant to.
- Kept `traccc::cuda::experimental::spacepoint_formation` and `traccc::sycl::experimental::spacepoint_formation` around for now, since the Detray usage in those is still something that I'll want to steal from. 😉

With all of this, I simplified the device code in a few places a bit. So I wanted to see in a little more detail what happens to the performance of the code with these updates. Unfortunately the answer is a bit complicated... With the multi-threaded throughput tests, using 4 CPU threads, I see the following:

[multi-threaded throughput numbers]
It indeed looks far from great. 😦 However, when I do the same with the single-threaded throughput tests, I get:

[single-threaded throughput numbers]
The CUDA numbers here look all good, but SYCL is still not great. 😟 And I don't fully understand why.

At first I was convinced that the simplifications I made in `traccc::device::ccl_kernel` were responsible for the slowdown. But not really: even when I went back to the previous logic for how positions in the output container are chosen, the throughput numbers didn't change much.

I'll just give up for today... But let's discuss this tomorrow...