Detect and remove all duplicate pixels #38946

fwyzard · 2022-08-02T15:57:51Z

PR description:

Detect and remove all duplicate pixels, after unpacking each pixel module but before running the clustering.
Use shared memory for inter-thread communication and to speed up marking and detecting the duplicates.

PR validation:

Running the online pixel reconstruction and the full HLT menu on GPU over non-problematic events shows only a moderate slowdown:

| reconstruction | pixel tracking | full HLT |
|---|---|---|
| no duplicate removal | 1566 ± 16 ev/s (--) | 873 ± 4 ev/s (--) |
| duplicate removal with `atomicOr` (`ce8a57b`) | 1530 ± 17 ev/s (-2.3%) | 872 ± 4 ev/s (-0.2%) |
| duplicate removal with `atomicCAS` (`2522012`) | 1519 ± 14 ev/s (-3.0%) | 869 ± 2 ev/s (-0.4%) |

If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 12.4.x for data taking (#38947).

fwyzard · 2022-08-02T15:58:12Z

type bugfix

fwyzard · 2022-08-02T15:58:18Z

enable gpu

fwyzard · 2022-08-02T15:58:22Z

please test

fwyzard · 2022-08-02T15:58:54Z

urgent

fwyzard · 2022-08-02T16:03:10Z

@VinInn @AdrianoDee FYI

fwyzard · 2022-08-02T16:03:24Z

+heterogeneous

cmsbuild · 2022-08-02T16:03:41Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38946/31388

This PR adds an extra 40KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h modified in PR(s): [DO NOT MERGE] remove duplicate pixels #37359, [DO NOT MERGE] Detect and remove duplicate pixels #38934
- File HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h modified in PR(s): Use cooperative groups to populate Associations (Histograms) in Pixel Patatrack #35713
- File RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu modified in PR(s): [DO NOT MERGE] remove duplicate pixels #37359, Tracker Traits and Enabling Phase2 for Inner Tracker Reconstruction on GPU #38761, Various fixes for pixel clustering on GPU #38920, [DO NOT MERGE] Detect and remove duplicate pixels #38934
- File RecoLocalTracker/SiPixelClusterizer/plugins/gpuClusterChargeCut.h modified in PR(s): [DO NOT MERGE] remove duplicate pixels #37359, Tracker Traits and Enabling Phase2 for Inner Tracker Reconstruction on GPU #38761, [DO NOT MERGE] Detect and remove duplicate pixels #38934
- File RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h modified in PR(s): [DO NOT MERGE] remove duplicate pixels #37359, Tracker Traits and Enabling Phase2 for Inner Tracker Reconstruction on GPU #38761, [DO NOT MERGE] Detect and remove duplicate pixels #38934, Use cooperative groups to populate Associations (Histograms) in Pixel Patatrack #35713
- File RecoLocalTracker/SiPixelClusterizer/test/gpuClustering_t.h modified in PR(s): [DO NOT MERGE] remove duplicate pixels #37359, Tracker Traits and Enabling Phase2 for Inner Tracker Reconstruction on GPU #38761, [DO NOT MERGE] Detect and remove duplicate pixels #38934

cmsbuild · 2022-08-02T16:04:05Z

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

DataFormats/SiPixelDigi (simulation)
HeterogeneousCore/CUDAUtilities (heterogeneous)
RecoLocalTracker/SiPixelClusterizer (reconstruction)

@jpata, @civanch, @clacaputo, @mdhildreth can you please review it and eventually sign? Thanks.
@mtosi, @VourMa, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

fwyzard · 2022-08-02T16:09:54Z

Note: #37559 should be the equivalent implementation for the CPU-only reconstruction.

While the HLT can move forward without it (since we are running exclusively on GPUs), #37559 should be validated and backported to 12.4.x to keep the two implementations coherent.

cmsbuild · 2022-08-03T18:01:26Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d29d53/26624/summary.html
COMMIT: 2522012
CMSSW: CMSSW_12_5_X_2022-08-03-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38946/26624/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
Reco comparison had 3 failed jobs
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19876
DQMHistoTests: Total failures: 8
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 19868
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: found differences in 2 / 3 workflows

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 6 differences found in the comparisons
DQMHistoTests: Total files compared: 51
DQMHistoTests: Total histograms compared: 3691510
DQMHistoTests: Total failures: 13
DQMHistoTests: Total nulls: 1
DQMHistoTests: Total successes: 3691474
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: -0.004 KiB( 50 files compared)
DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
Checked 212 log files, 49 edm output root files, 51 DQM output files
TriggerResults: no differences found

civanch · 2022-08-03T20:18:05Z

+1

fwyzard · 2022-08-04T08:11:11Z

+heterogeneous

fwyzard · 2022-08-04T09:26:16Z

@clacaputo @jpata as this is something we would like to deploy online sooner rather than later, could you let me know if you have any concerns about it, or if you think we should involve the DPG directly, etc.?

clacaputo · 2022-08-04T10:08:30Z

> @clacaputo @jpata as this is something we would like to deploy online sooner rather than later, could you let me know if you have any concerns about it, or if you think we should involve the DPG directly, etc.?

No concerns from my side, just busy with other stuff. I'm going to sign it

clacaputo · 2022-08-04T10:10:13Z

+reconstruction

cmsbuild · 2022-08-04T10:10:37Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

fwyzard · 2022-08-04T10:22:19Z

> No concerns from my side, just busy with other stuff.

Sure, no problem!

> I'm going to sign it

Thanks.

qliphy · 2022-08-04T14:00:52Z

+1

nothingface0 · 2022-08-11T10:22:26Z

@fwyzard Could you explain why this way is more efficient than the one proposed by @VinInn here?

I can see that Vincenzo's way accesses the global memory (the x and y arrays) multiple times, and in the worst case those accesses are numElements words apart. Does this mean that there will probably be many cache misses?

Is your way more efficient because the global memory access is sequential (even though it's done twice)?

fwyzard · 2022-08-11T11:32:06Z

hi @nothingface0,
the main difference is that this approach is using shared memory:

cmssw/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h, line 105 in c797f7f:

    __shared__ uint32_t status[pixelStatusSize];  // packed words array used to store the PixelStatus of each pixel

Shared memory is local to the processor, like the L1 cache, so it is much faster than global memory.
The limitation is that only about 48 - 64 kB are available, so, for example, we cannot use the same approach for Phase-2, where the pixel modules are much larger.
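To make the size constraint concrete, here is a rough back-of-the-envelope sketch. It assumes a module of 160 x 416 pixels and 2 status bits per pixel; neither number is stated in this thread, and `statusBytes` is a hypothetical helper, not something from CMSSW.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed module geometry (not stated in this thread): 160 x 416 pixels.
constexpr std::size_t kModuleSizeX = 160;
constexpr std::size_t kModuleSizeY = 416;
constexpr std::size_t kPixels = kModuleSizeX * kModuleSizeY;  // 66560 pixels

// Hypothetical helper: bytes needed to store `bitsPerPixel` bits of status for
// each of `pixels` pixels, rounded up to whole 32-bit words.
constexpr std::size_t statusBytes(std::size_t pixels, std::size_t bitsPerPixel) {
  constexpr std::size_t bitsPerWord = 8 * sizeof(std::uint32_t);
  return (pixels * bitsPerPixel + bitsPerWord - 1) / bitsPerWord * sizeof(std::uint32_t);
}

// Lower and upper end of the 48 - 64 kB range quoted above.
constexpr std::size_t kSharedLow = 48 * 1024;
constexpr std::size_t kSharedHigh = 64 * 1024;

// Packed status, 2 bits per pixel (assumed): 16640 bytes, fits comfortably.
static_assert(statusBytes(kPixels, 2) == 16640, "2 bits per pixel");
static_assert(statusBytes(kPixels, 2) < kSharedLow, "packed status fits in 48 kB");

// One byte per pixel: 66560 bytes, more than even 64 kB.
static_assert(statusBytes(kPixels, 8) == 66560, "1 byte per pixel");
static_assert(statusBytes(kPixels, 8) > kSharedHigh, "byte-per-pixel status does not fit");
```

Under these assumptions the packing is what makes the buffer fit; the same function can be evaluated for a larger module to see where the budget runs out.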

nothingface0 · 2022-08-11T12:56:26Z

I totally understood the use of shared memory in your approach, but I'm still a bit confused, since, for example, Vincenzo's approach does not need an extra array for storing occurrences. So it's not like the shared memory is used for existing data but for extra data.

To make my question clearer:

Vincenzo's approach: Each thread does a pairwise comparison of the coordinates of the pixel it is assigned with the coordinates of every other pixel, i.e. the first thread compares pixel 0 with 1, then 0 with 2, and so on up to 0 with msize-1. These pixels can be far apart in memory, which might lead to cache misses, and this happens in parallel with every other thread doing its own comparisons. In total, this needs on the order of "msize choose 2" memory/cache accesses, whose pattern is not predictable.

Your approach: Use some extra (shared) memory to store occurrences of each pixel. To do that, you:

1. Sweep once over the x and y arrays sequentially (i.e. predictable for the GPU) to count occurrences of each pixel (`pixelStatus::promote(status, x[i], y[i]);`)
2. Sweep a second time sequentially over the same arrays to check if they've been marked duplicates (`if (pixelStatus::isDuplicate(status, x[i], y[i]))`)

This means that you access the global memory at most 2 x msize times predictably, meaning better use of the available cache.
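For reference, the two sweeps described above can be sketched on the host roughly as follows. This is not the code from the PR: the names in `sketch`, the 160 x 416 module size and the 2-bit encoding are assumptions, and `std::atomic` with a compare-and-swap loop stands in for the GPU atomics.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {  // hypothetical names, not the CMSSW pixelStatus API

  constexpr std::uint32_t kSizeX = 160;  // assumed module width in pixels
  constexpr std::uint32_t kSizeY = 416;  // assumed module height in pixels

  // Assumed encoding: 2 bits per pixel, 16 pixels packed in each 32-bit word.
  enum Status : std::uint32_t { kEmpty = 0x0, kFound = 0x1, kDuplicate = 0x3 };
  constexpr std::uint32_t kBits = 2;
  constexpr std::uint32_t kMask = (1u << kBits) - 1;
  constexpr std::uint32_t kPerWord = 32 / kBits;
  constexpr std::uint32_t kWords = kSizeX * kSizeY / kPerWord;

  inline std::uint32_t wordIndex(std::uint16_t x, std::uint16_t y) { return (kSizeX * y + x) / kPerWord; }
  inline std::uint32_t bitShift(std::uint16_t x, std::uint16_t y) { return ((kSizeX * y + x) % kPerWord) * kBits; }

  // First sweep: empty -> found -> duplicate. The CAS loop retries if another
  // thread changed a different pixel stored in the same word in the meantime.
  inline void promote(std::atomic<std::uint32_t>* status, std::uint16_t x, std::uint16_t y) {
    std::atomic<std::uint32_t>& word = status[wordIndex(x, y)];
    const std::uint32_t shift = bitShift(x, y);
    std::uint32_t oldWord = word.load();
    for (;;) {
      const std::uint32_t oldStatus = (oldWord >> shift) & kMask;
      if (oldStatus == kDuplicate)
        return;  // already marked, nothing to do
      const std::uint32_t newStatus = (oldStatus == kEmpty) ? kFound : kDuplicate;
      const std::uint32_t newWord = oldWord | (newStatus << shift);
      if (word.compare_exchange_weak(oldWord, newWord))
        return;  // on failure oldWord holds the current value and we retry
    }
  }

  // Second sweep: read back the status of a pixel.
  inline bool isDuplicate(const std::atomic<std::uint32_t>* status, std::uint16_t x, std::uint16_t y) {
    return ((status[wordIndex(x, y)].load() >> bitShift(x, y)) & kMask) == kDuplicate;
  }

  // Flags every pixel whose (x, y) occurs more than once; each input array is
  // read once per sweep. Assumes x[i] < kSizeX and y[i] < kSizeY.
  std::vector<bool> markDuplicates(const std::vector<std::uint16_t>& x, const std::vector<std::uint16_t>& y) {
    std::vector<std::atomic<std::uint32_t>> status(kWords);
    for (auto& word : status)
      word.store(0);
    for (std::size_t i = 0; i < x.size(); ++i)
      promote(status.data(), x[i], y[i]);
    std::vector<bool> duplicate(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
      duplicate[i] = isDuplicate(status.data(), x[i], y[i]);
    return duplicate;
  }

}  // namespace sketch
```

In this sketch all copies of a repeated pixel end up flagged, not just the second and later ones, because the flag is read back only after the first sweep has finished.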

TL;DR: Is this statement correct?:
Your approach is more efficient, not because of shared memory usage (since Vincenzo's method simply does not need an extra array), but because its global memory access patterns are more predictable and make better use of the GPU cache.

Sorry for insisting, I'm just trying to understand this statement:

> to be clear. this solution is NOT computational-sustainable.

fwyzard · 2022-08-11T13:12:15Z

IIUC, the approach used by Vincenzo compares every pixel with every other pixel, so the complexity of the algorithm grows with the square of the number of pixels in a module, i.e. is O(N²).

The approach I used reads every pixel a fixed number of times, so the complexity grows linearly with the number of pixels in a module, i.e. is O(N).

The exact number of operations and memory accesses is not exactly N² or N, but those are the orders of the leading terms in the two cases.
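As a rough illustration of the two leading terms, here is a generic all-pairs check with a comparison counter next to the read count of a two-sweep scheme. This is a host-side sketch with hypothetical function names, not Vincenzo's code and not the code in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic all-pairs duplicate check: compares every pixel with every later
// pixel, fills `dup`, and returns the number of coordinate comparisons made.
std::size_t pairwiseComparisons(const std::vector<std::uint16_t>& x,
                                const std::vector<std::uint16_t>& y,
                                std::vector<bool>& dup) {
  const std::size_t n = x.size();
  dup.assign(n, false);
  std::size_t comparisons = 0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      ++comparisons;
      if (x[i] == x[j] && y[i] == y[j]) {
        dup[i] = true;
        dup[j] = true;
      }
    }
  }
  return comparisons;
}

// Leading terms of the two approaches for n pixels in a module.
constexpr std::size_t pairCount(std::size_t n) { return n * (n - 1) / 2; }  // O(N^2)
constexpr std::size_t twoSweepReads(std::size_t n) { return 2 * n; }        // O(N)
```

For example, with 1000 pixels in a module `pairCount` gives 499500 comparisons, while `twoSweepReads` gives 2000 reads of each coordinate array.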

My approach does make use of an extra memory buffer, which normally would add a large cost; being able to keep that buffer in shared memory makes the cost acceptable. If you feel like it, it could be interesting to measure the impact of using a buffer in global memory instead (with a byte per pixel, without all the bitwise operations).
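The byte-per-pixel variant mentioned above could look roughly like the sketch below. It is a sequential host-side illustration of the data layout only: the function name and parameters are hypothetical, and on a GPU the counter update would have to be made safe against concurrent threads, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte per pixel, no bit packing: count[i] saturates at 2 ("seen more than once").
// Assumes x[i] < sizeX and y[i] < sizeY for every input pixel.
std::vector<bool> markDuplicatesBytePerPixel(const std::vector<std::uint16_t>& x,
                                             const std::vector<std::uint16_t>& y,
                                             std::size_t sizeX,
                                             std::size_t sizeY) {
  std::vector<std::uint8_t> count(sizeX * sizeY, 0);
  // First sweep: count occurrences of each pixel, saturating at 2.
  for (std::size_t i = 0; i < x.size(); ++i) {
    std::uint8_t& c = count[sizeX * y[i] + x[i]];
    if (c < 2)
      ++c;
  }
  // Second sweep: flag every pixel whose coordinates were seen more than once.
  std::vector<bool> duplicate(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    duplicate[i] = (count[sizeX * y[i] + x[i]] == 2);
  return duplicate;
}
```

The logic is simpler than the packed version, at the price of a buffer four times larger than one using 2 bits per pixel.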

nothingface0 · 2022-08-11T13:18:09Z

Right, there's the complexity of the approach, too. Thanks!

> measure the impact of using a buffer in global memory instead

I might ask you again on how to run profiling using patatrack 😅

fwyzard added 2 commits August 2, 2022 17:52

Implement atomicAnd and atomicOr in cudaCompat

aee2e95

Detect and remove duplicate pixels

ce8a57b

cmsbuild added this to the CMSSW_12_5_X milestone Aug 2, 2022

fwyzard mentioned this pull request Aug 2, 2022

Detect and remove all duplicate pixels [12.4.x] #38947

Merged

cmsbuild added code-checks-pending heterogeneous-pending orp-pending pending-signatures reconstruction-pending simulation-pending tests-pending labels Aug 2, 2022

cmsbuild added bug-fix tests-started and removed tests-pending labels Aug 2, 2022

cmsbuild added the urgent label Aug 2, 2022

cmsbuild added heterogeneous-approved and removed heterogeneous-pending labels Aug 2, 2022

cmsbuild added code-checks-approved and removed code-checks-pending labels Aug 2, 2022

fwyzard mentioned this pull request Aug 2, 2022

Remove duplicate pixels #37559

Merged

fwyzard mentioned this pull request Aug 2, 2022

[DO NOT MERGE] Detect and remove duplicate pixels #38934

Closed

cmsbuild added tests-approved and removed tests-started labels Aug 3, 2022

cmsbuild added simulation-approved and removed simulation-pending labels Aug 3, 2022

cmsbuild added heterogeneous-approved and removed heterogeneous-pending labels Aug 4, 2022

cmsbuild added fully-signed and removed reconstruction-pending pending-signatures labels Aug 4, 2022

cmsbuild added the reconstruction-approved label Aug 4, 2022

cmsbuild added orp-approved and removed orp-pending labels Aug 4, 2022

cmsbuild merged commit efa7f41 into cms-sw:master Aug 4, 2022

fwyzard deleted the gpu_duplicate_pixel_removal branch August 5, 2022 15:13

fwyzard restored the gpu_duplicate_pixel_removal branch August 5, 2022 15:14

AdrianoDee mentioned this pull request Aug 24, 2022

Tracker Traits and Enabling Phase2 for Inner Tracker Reconstruction on GPU #38761

Merged

8 tasks
