Detect and remove all duplicate pixels #38946
Conversation
type bugfix
enable gpu
please test
urgent
@VinInn @AdrianoDee FYI
+heterogeneous
A new Pull Request was created by @fwyzard (Andrea Bocci) for master. It involves the following packages:
@jpata, @civanch, @clacaputo, @mdhildreth can you please review it and eventually sign? Thanks. cms-bot commands are listed here
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d29d53/26624/summary.html
GPU Comparison Summary:
Comparison Summary:
+1
+heterogeneous
@clacaputo @jpata as this is something we would like to deploy online sooner rather than later, could you let me know if you have any concerns about it, whether you think we should involve the DPG directly, etc.?
No concerns from my side, just busy with other stuff. I'm going to sign it.
+reconstruction
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)
Sure, no problem!
Thanks.
+1
@fwyzard Could you explain why this way is more efficient than the one proposed by @VinInn here? I can see that Vincenzo's way accesses the global memory (the …). Is your way more efficient because the global memory access is sequential (even though it's done twice)?
hi @nothingface0,
Shared memory is local to the processor, like the L1 cache, so it is much faster than global memory.
I totally understood the use of shared memory in your approach, but I'm still a bit confused, since, for example, Vincenzo's approach does not need an extra array for storing occurrences. So the shared memory is not used for existing data but for extra data.

To make my question clearer:

Vincenzo's approach: each thread does a pairwise comparison of the coordinates of the pixel it is assigned with the coordinates of every other pixel, i.e. the first thread compares pixel …

Your approach: use some extra (shared) memory to store occurrences of each pixel. To do that, you …

This means that you access the global memory at most 2 × …

TL;DR: is this statement correct? Sorry for insisting, I'm just trying to understand this statement:
|
IIUC, the approach used by Vincenzo compares every pixel with every other pixel, so the complexity of the algorithm grows with the square of the number of pixels in a module, i.e. it is O(n²). The approach I used reads every pixel a fixed number of times, so the complexity grows linearly with the number of pixels in a module, i.e. it is O(n). The exact number of operations and memory accesses is not exactly n² or n, but those are the orders of the leading terms in the two cases. My approach does make use of an extra memory buffer, which normally would add a large cost; being able to keep that buffer in shared memory makes the cost acceptable. If you feel like it, it could be interesting to measure the impact of using a buffer in global memory instead (with a byte per pixel, without all the bitwise operations).
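As a rough serial illustration of the linear-time idea: one pass marks how many times each (x, y) occurs, a second pass drops every pixel whose coordinates occurred more than once. In the CUDA kernel the occurrence marks live in a shared-memory bit buffer updated with atomics; here an `unordered_map` stands in for that buffer, and the types are hypothetical, not the real CMSSW ones.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative pixel type; not the real CMSSW representation.
struct Pixel { uint16_t x, y; };

// Counting-based duplicate removal: pass 1 records occurrences per
// coordinate (the shared-memory buffer's role in the kernel), pass 2
// keeps only pixels seen exactly once. Each pixel is read a fixed
// number of times, so the cost is O(n), not O(n^2).
std::vector<Pixel> removeDuplicatesCounting(const std::vector<Pixel>& pixels) {
  auto key = [](const Pixel& p) { return (uint32_t(p.x) << 16) | p.y; };
  std::unordered_map<uint32_t, int> count;  // stand-in for the shared buffer
  for (const auto& p : pixels)              // pass 1: mark occurrences
    ++count[key(p)];
  std::vector<Pixel> out;
  for (const auto& p : pixels)              // pass 2: keep unique pixels only
    if (count[key(p)] == 1)
      out.push_back(p);
  return out;
}
```

The trade-off matches the comment above: the extra buffer costs memory, but because each coordinate is touched a constant number of times, the leading term grows linearly with the module occupancy.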
Right, there's the complexity of the approach, too. Thanks!
I might ask you again how to run profiling using patatrack 😅
PR description:
Detect and remove all duplicate pixels, after unpacking each pixel module but before running the clustering.
Use shared memory for inter-thread communication and to speed up marking and detecting the duplicates.
PR validation:
Running the online pixel reconstruction and the full HLT menu on GPU over non-problematic events shows only a moderate slowdown:

atomicOR (ce8a57b)
atomicCAS (2522012)

If this PR will be backported please specify to which release cycle the backport is meant for:
To be backported to 12.4.x for data taking (#38947).