Device Clusterization Reorganization, main branch (2024.04.17.) #545
Conversation
I'm still stumped. 😦 According to NSight Systems, the new code does not execute kernels or memory copies any more slowly. If anything, it is even faster than the current code. 🤔

[NSight Systems kernel timing comparison]

Notice how the CCL kernel became faster with the updates. 🤔 Remember that, while I do introduce an additional algorithm with this re-shuffling, the number of kernels being executed remains the same! So, since I just could not understand how the throughput test could become slower while the kernels seemingly become faster, I ran the applications through VTune as well. And there is indeed a smoking gun there. 🤔 The current code achieves the following CPU occupancy during the throughput test:

[VTune CPU occupancy histogram, current code]

While this PR's code does this:

[VTune CPU occupancy histogram, this PR's code]

(The big spike at 1 thread is due to the initialization in both cases.) However, I can't figure out what is causing this. 😦 So if anybody has any good idea, I'm all ears. The "threading analysis" of VTune is just not revealing anything to me at the moment...
I don't know why you got rid of the later part of clusterization, which counts the measurements. (I also don't know why you removed the sorting part. It will need to be rolled back if you encounter weird results from track finding once everything is in a single piece.)
Force-pushed from 6e07118 to 35f10fa
Out of curiosity, have you checked whether you are getting the same results from the CPU and CUDA code?
So... fun! I have now bumped into the same issue that I believe is giving us grief with clusterizing the ODD cells on a GPU. It seems that the code meant to order cells during CSV reading is not doing its job. 😦 I was getting afraid that there would be nothing left to debug... 😮💨
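A quick way to check that suspicion is a sketch like the following; the `cell` type and the exact sort key are assumptions here about what the downstream clusterization expects, not the actual traccc types:

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Illustrative stand-in for a cell read from CSV.
struct cell {
    std::uint64_t module_id;
    int channel0, channel1;
};

// Check the ordering that the reader is supposed to guarantee: cells
// grouped by module, then sorted row-major within each module.
bool cells_are_ordered(const std::vector<cell>& cells) {
    return std::is_sorted(
        cells.begin(), cells.end(), [](const cell& a, const cell& b) {
            return std::tie(a.module_id, a.channel1, a.channel0) <
                   std::tie(b.module_id, b.channel1, b.channel0);
        });
}
```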
Force-pushed from 53e0b50 to ac95998
Never mind, we don't get two birds with just the one stone... 😦 I just hadn't set up the new assertion correctly in the code...
On the ODD front, since I'm now trying to see if I can make the device clusterization work for that in this PR, I now see this:

[assertion failure output]

(This is one of the reasons that I added the many new assertions to the code: to try to understand this lingering issue.) Any quick ideas for how we could end up with a cell that has more than 8 neighbours? 😕 (I'll try to figure this out myself as well, but if somebody already has an idea, that would help...)
Could it be possible - although it would be weird - that the same cell has multiple activations in the same input?
Aha!

[shell output listing duplicate cells]

So there are some duplicate cells in the input data! In fact, some are repeated up to 12 times:

[shell output with duplicate counts]
That was one of my guesses. But the following is now also failing for me. 😕

```diff
diff --git a/io/src/read_geometry.cpp b/io/src/read_geometry.cpp
index c88e8fe..f52ce52 100644
--- a/io/src/read_geometry.cpp
+++ b/io/src/read_geometry.cpp
@@ -19,6 +19,7 @@
 #include <vecmem/memory/host_memory_resource.hpp>

 // System include(s).
+#include <cassert>
 #include <stdexcept>

 namespace {
@@ -50,6 +51,7 @@ read_json_geometry(std::string_view filename) {
         barcode_map = std::make_unique<
             std::map<std::uint64_t, detray::geometry::barcode>>();
         for (const auto& surface : detector.surfaces()) {
+            assert(barcode_map->find(surface.source) == barcode_map->end());
             (*barcode_map)[surface.source] = surface.barcode();
         }
     }
```

I.e. the same surface identifier shows up multiple times from Detray. Could it be that we end up merging modules in some weird way? Though this is probably just a red herring. At least for the clusterization...
Since you seem to have some "shell magic" at the ready, could you remove the duplicates from the files and send me the updated TGZ? I'll upload it as a new version.
Let me doctor up some deduplication code and produce some deduplicated files.
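Something along these lines, perhaps; a minimal sketch, assuming the illustrative `csv_cell` struct below (the field names and the deduplication key are assumptions, not the actual traccc types):

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Illustrative stand-in for a CSV cell row.
struct csv_cell {
    std::uint64_t geometry_id;
    int channel0, channel1;
    float timestamp, value;
};

// Sort the cells by their identifying fields and drop exact repeats,
// keeping the first occurrence of each.
void deduplicate(std::vector<csv_cell>& cells) {
    const auto key = [](const csv_cell& c) {
        return std::tie(c.geometry_id, c.channel0, c.channel1, c.timestamp);
    };
    std::sort(cells.begin(), cells.end(),
              [&](const csv_cell& a, const csv_cell& b) {
                  return key(a) < key(b);
              });
    cells.erase(std::unique(cells.begin(), cells.end(),
                            [&](const csv_cell& a, const csv_cell& b) {
                                return key(a) == key(b);
                            }),
                cells.end());
}
```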
It's also worth understanding how this happens, because I can't say it makes much sense to me. Is there any physical reason why we would get a read-out for one pixel twice? Or 12 times?
Interestingly, the full TML dataset also seems to have some duplicated hits, although the number is much smaller than it is for the ODD files. It could be that this just succeeded by chance: e.g. there was never a duplicated hit with 8 other hits in the neighbourhood, so there was some buffer in the 8-length array to compensate.
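To spell out that failure mode, here is a simplified illustration (assumed types, not the actual `reduce_problem_cell` code) of how duplicated input cells can overflow a fixed 8-slot neighbour array: on a clean pixel grid a cell has at most 8 distinct neighbours, so duplicated entries are the only way past the bound.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>
#include <vector>

// Simplified stand-in for a traccc cell.
struct cell { int channel0, channel1; };

// 8-connectivity: the two cells touch, but are not the same pixel.
bool is_adjacent(const cell& a, const cell& b) {
    const int d0 = std::abs(a.channel0 - b.channel0);
    const int d1 = std::abs(a.channel1 - b.channel1);
    return (d0 <= 1) && (d1 <= 1) && ((d0 + d1) > 0);
}

// Collect the neighbours of cells[i] into a fixed 8-slot array. A clean
// input can never yield more than 8 matches; a duplicated neighbour cell
// makes the count exceed 8 and trips the assertion.
std::array<std::size_t, 8> find_neighbours(const std::vector<cell>& cells,
                                           std::size_t i) {
    std::array<std::size_t, 8> result{};
    std::size_t count = 0;
    for (std::size_t j = 0; j < cells.size(); ++j) {
        if (j != i && is_adjacent(cells[i], cells[j])) {
            assert(count < 8);
            result[count++] = j;
        }
    }
    return result;
}
```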
There must be some issue in the Acts "digitization" code. 🤔 This behaviour sounds like a plausible outcome if some mistake is made there. (Remember, the simulation just tells us how much energy was deposited exactly where in the detector volumes. We then need to turn those energy deposits into the information that the real hardware would've read out. I.e. we need to "digitize" the simulated data, as we call it in ATLAS.) Pinging @asalzburger for info. 😉
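For illustration, this is the kind of per-channel merging that digitization is expected to perform; the types and the accumulation scheme here are assumptions for the sketch, not the actual Acts code. If a digitization step skips this merging, duplicate cells are the natural outcome.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Illustrative types, not the Acts ones.
struct deposit { std::uint64_t module; int channel0, channel1; float energy; };
struct cell    { std::uint64_t module; int channel0, channel1; float value;  };

// Accumulate every simulated energy deposit that falls into the same
// read-out channel, so that each channel produces at most one cell.
std::vector<cell> digitize(const std::vector<deposit>& deposits) {
    std::map<std::tuple<std::uint64_t, int, int>, float> accumulated;
    for (const auto& d : deposits) {
        accumulated[{d.module, d.channel0, d.channel1}] += d.energy;
    }
    std::vector<cell> cells;
    cells.reserve(accumulated.size());
    for (const auto& [key, value] : accumulated) {
        cells.push_back({std::get<0>(key), std::get<1>(key),
                         std::get<2>(key), value});
    }
    return cells;
}
```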
Made traccc::cuda::clusterization_algorithm and traccc::sycl::clusterization_algorithm both output measurement containers, and introduced traccc::cuda::spacepoint_formation_algorithm and traccc::sycl::spacepoint_formation_algorithm for turning those measurements into spacepoints. At the same time, modified the shared clusterization code a little, simplifying how traccc::device::ccl_kernel fills its output container and making sure that functions from the core traccc library are re-used wherever possible. Implemented the setting of unique identifiers on the measurements, meant for the ambiguity resolution algorithm.
Instead of asking the CUDA runtime for it on every event.
Force-pushed from ac95998 to 4bb7d10
Requiring the users to provide the function with vector views instead.
Ouch, I guess we could be seeing a Geant4-versus-Fatras simulation issue. While Fatras is guaranteed to produce one step per module (i.e. the particle only intersects the module once), Geant4 physics can split a segment into several steps when some fancy physics processes happen, e.g. delta rays. If we don't catch those before the clusterizer, we might produce duplicate cells?!?
So... Stephen, since I'd like to sort this out so that we can move on to all the rest that we still need to do: how do you feel about the changes? The very last one, where I made the code go from using bare pointers to the shared memory blocks to using vecmem vectors, did cause a small performance drop (O(1%)). Which I personally am okay with, given the debugging benefits. (In debug mode the code is more talkative about where it encountered an issue.) But I'm open to a discussion about this. In any case, I want to sort out the addition of the new data file in a separate PR. And then of course I'll want to move on to looking at the full ODD tracking chain with CUDA. 😉
I think all the issues are resolved. I don't think a 1% performance drop warrants any debate; we can get this in, no problem.
After #543, this is an even more ambitious PR for splitting `traccc::cuda::clusterization_algorithm` and `traccc::sycl::clusterization_algorithm` in two, (re-)introducing the `traccc::cuda::spacepoint_formation_algorithm` and `traccc::sycl::spacepoint_formation_algorithm` algorithms.

The biggest change is in `traccc::device::ccl_kernel`. Since that function now receives a resizable measurement buffer, it no longer deals with figuring out the index of the measurement that it should fill. It just uses vecmem's built-in ability for doing this sort of thing.
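For illustration, a minimal sketch of that vecmem pattern, assuming a recent vecmem version, a simplified `measurement` struct, and a toy kernel rather than the actual traccc code:

```cuda
#include <vecmem/containers/data/vector_buffer.hpp>
#include <vecmem/containers/device_vector.hpp>
#include <vecmem/memory/cuda/device_memory_resource.hpp>
#include <vecmem/utils/cuda/copy.hpp>

// Simplified stand-in for traccc::measurement.
struct measurement { float local0, local1; };

__global__ void fill_measurements(
    vecmem::data::vector_view<measurement> out_view) {

    // The device vector wraps the resizable buffer. push_back() bumps the
    // buffer's size counter atomically, so no thread needs to compute its
    // own output index.
    vecmem::device_vector<measurement> out(out_view);
    if (threadIdx.x % 2 == 0) { // stand-in for "this thread found a cluster"
        out.push_back({1.f, 2.f});
    }
}

int main() {
    vecmem::cuda::device_memory_resource mr;
    vecmem::cuda::copy copy;

    // A zero-sized buffer with a fixed maximum capacity.
    vecmem::data::vector_buffer<measurement> buffer(
        1024u, mr, vecmem::data::buffer_type::resizable);
    copy.setup(buffer)->wait();

    fill_measurements<<<1, 256>>>(buffer);
    cudaDeviceSynchronize();
    return 0;
}
```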
I also:

- Removed `traccc::cuda::experimental::clusterization_algorithm` and `traccc::sycl::experimental::clusterization_algorithm`. The updated algorithms now behave in the way that those "experimental" algorithms were meant to.
- Kept `traccc::cuda::experimental::spacepoint_formation` and `traccc::sycl::experimental::spacepoint_formation` around for now, since the Detray usage in those is still something that I'll want to steal from. 😉

With all of this, I simplified the device code in a few places a bit. So I wanted to see in a little more detail what happens to the performance of the code with these updates. Unfortunately the answer is a bit complicated... With the multi-threaded throughput tests, using 4 CPU threads, I see the following:

[multi-threaded throughput numbers]
It indeed looks far from great. 😦 However, when I do the same with the single-threaded throughput tests, I get:

[single-threaded throughput numbers]
The CUDA numbers here look all good, but SYCL is still not great. 😟 And I don't fully understand why.

At first I was convinced that the simplifications I made in `traccc::device::ccl_kernel` were responsible for the slowdown. But not really: even when I went back to the previous logic for how positions in the output container are chosen, the throughput numbers didn't change much.

I'll just give up for today... But let's discuss this tomorrow...